6. WSDM 2013:Rome, Italy

Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013. ACM 【DBLP Link

Paper Num: 95 || Session Num: 16

Keynote address 1

1. The virtual lab.

Paper Link】 【Pages】:1-2

【Authors】: Duncan J. Watts

【Abstract】: The Internet and the Web have transformed society, spawning new industries, altering social and cultural practices, and challenging long-accepted notions of individual privacy, intellectual property, and national security. In this talk, I argue that social science is also being transformed. In particular, I describe how crowd sourcing sites like Amazon's Mechanical Turk are increasingly being used by researchers to create "virtual labs" in which they can conduct behavioral experiments on a scale and speed that would have been hard to imagine just a decade ago. To illustrate the point, I describe some recent experiments that showcase the advantages of virtual over traditional physical labs, as well as some of the limitations. I then discuss how this relatively new experimental capability may unfold in the near future, along with some implications for social and behavioral science.

【Keywords】: online experimental social science

Social networks and information dynamics 6

2. Information-theoretic measures of influence based on content dynamics.

Paper Link】 【Pages】:3-12

【Authors】: Greg Ver Steeg ; Aram Galstyan

【Abstract】: The fundamental building block of social influence is for one person to elicit a response in another. Researchers measuring a "response" in social media typically depend either on detailed models of human behavior or on platform-specific cues such as re-tweets, hash tags, URLs, or mentions. Most content on social networks is difficult to model because the modes and motivation of human expression are diverse and incompletely understood. We introduce content transfer, an information-theoretic measure with a predictive interpretation that directly quantifies the strength of the effect of one user's content on another's in a model-free way. Estimating this measure is made possible by combining recent advances in non-parametric entropy estimation with increasingly sophisticated tools for content representation. We demonstrate on Twitter data collected for thousands of users that content transfer is able to capture non-trivial, predictive relationships even for pairs of users not linked in the follower or mention graph. We suggest that this measure makes large quantities of previously under-utilized social media content accessible to rigorous statistical causal analysis.

【Keywords】: causality; entropy; influence; social networks

3. Characterizing and curating conversation threads: expansion, focus, volume, re-entry.

Paper Link】 【Pages】:13-22

【Authors】: Lars Backstrom ; Jon M. Kleinberg ; Lillian Lee ; Cristian Danescu-Niculescu-Mizil

【Abstract】: Discussion threads form a central part of the experience on many Web sites, including social networking sites such as Facebook and Google Plus and knowledge creation sites such as Wikipedia. To help users manage the challenge of allocating their attention among the discussions that are relevant to them, there has been a growing need for the algorithmic curation of on-line conversations --- the development of automated methods to select a subset of discussions to present to a user. Here we consider two key sub-problems inherent in conversational curation: length prediction --- predicting the number of comments a discussion thread will receive --- and the novel task of re-entry prediction --- predicting whether a user who has participated in a thread will later contribute another comment to it. The first of these sub-problems arises in estimating how interesting a thread is, in the sense of generating a lot of conversation; the second can help determine whether users should be kept notified of the progress of a thread to which they have already contributed. We develop and evaluate a range of approaches for these tasks, based on an analysis of the network structure and arrival pattern among the participants, as well as a novel dichotomy in the structure of long threads. We find that for both tasks, learning-based approaches using these sources of information.

【Keywords】: comment threads; conversations; facebook; feed ranking; likes; on-line communities; recommendation; social networks; user-generated content; wikipedia

4. Structure and dynamics of information pathways in online media.

Paper Link】 【Pages】:23-32

【Authors】: Manuel Gomez-Rodriguez ; Jure Leskovec ; Bernhard Schölkopf

【Abstract】: Diffusion of information, spread of rumors and infectious diseases are all instances of stochastic processes that occur over the edges of an underlying network. Many times networks over which contagions spread are unobserved, and such networks are often dynamic and change over time. In this paper, we investigate the problem of inferring dynamic networks based on information diffusion data. We assume there is an unobserved dynamic network that changes over time, while we observe the results of a dynamic process spreading over the edges of the network. The task then is to infer the edges and the dynamics of the underlying network. We develop an on-line algorithm that relies on stochastic convex optimization to efficiently solve the dynamic network inference problem. We apply our algorithm to information diffusion among 3.3 million mainstream media and blog sites and experiment with more than 179 million different pieces of information spreading over the network in a one year period. We study the evolution of information pathways in the online media space and find interesting insights. Information pathways for general recurrent topics are more stable across time than for on-going news events. Clusters of news media sites and blogs often emerge and vanish in matter of days for on-going news events. Major social movements and events involving civil population, such as the Libyan'{}s civil war or Syria'{}s uprise, lead to an increased amount of information pathways among blogs as well as in the overall increase in the network centrality of blogs and social media sites.

【Keywords】: blogs; information cascades; meme-tracking; networks of diffusion; news media; social networks

5. Cascade-based community detection.

Paper Link】 【Pages】:33-42

【Authors】: Nicola Barbieri ; Francesco Bonchi ; Giuseppe Manco

【Abstract】: Given a directed social graph and a set of past informa- tion cascades observed over the graph, we study the novel problem of detecting modules of the graph (communities of nodes), that also explain the cascades. Our key observation is that both information propagation and social ties forma- tion in a social network can be explained according to the same latent factor, which ultimately guide a user behavior within the network. Based on this observation, we propose the Community-Cascade Network (CCN) model, a stochas- tic mixture membership generative model that can fit, at the same time, the social graph and the observed set of cas- cades. Our model produces overlapping communities and for each node, its level of authority and passive interest in each community it belongs. For learning the parameters of the CCN model, we devise a Generalized Expectation Maximization procedure. We then apply our model to real-world social networks and in- formation cascades: the results witness the validity of the proposed CCN model, providing useful insights on its signif- icance for analyzing social behavior.

【Keywords】: community detection; information cascades; social networks

6. Patent partner recommendation in enterprise social networks.

Paper Link】 【Pages】:43-52

【Authors】: Sen Wu ; Jimeng Sun ; Jie Tang

【Abstract】: It is often challenging to incorporate users' interactions into a recommendation framework in an online model. In this paper, we propose a novel interactive learning framework to formulate the problem of recommending patent partners into a factor graph model. The framework involves three phases: 1) candidate generation, where we identify the potential set of collaborators; 2) candidate refinement, where a factor graph model is used to adjust the candidate rankings; 3) interactive learning method to efficiently update the existing recommendation model based on inventors' feedback. We evaluate our proposed model on large enterprise patent networks. Experimental results demonstrate that the recommendation accuracy of the proposed model significantly outperforms several baselines methods using content similarity, collaborative filtering and SVM-Rank. We also demonstrate the effectiveness and efficiency of the interactive learning, which performs almost as well as offline re-training, but with only 1 percent of the running time.

【Keywords】: cross collaboration; predictive model; social network

7. Exploiting homophily effect for trust prediction.

Paper Link】 【Pages】:53-62

【Authors】: Jiliang Tang ; Huiji Gao ; Xia Hu ; Huan Liu

【Abstract】: Trust plays a crucial role for online users who seek reliable information. However, in reality, user-specified trust relations are very sparse, i.e., a tiny number of pairs of users with trust relations are buried in a disproportionately large number of pairs without trust relations, making trust prediction a daunting task. As an important social concept, however, trust has received growing attention and interest. Social theories are developed for understanding trust. Homophily is one of the most important theories that explain why trust relations are established. Exploiting the homophily effect for trust prediction provides challenges and opportunities. In this paper, we embark on the challenges to investigate the trust prediction problem with the homophily effect. First, we delineate how it differs from existing approaches to trust prediction in an unsupervised setting. Next, we formulate the new trust prediction problem into an optimization problem integrated with homophily, empirically evaluate our approach on two datasets from real-world product review sites, and compare with representative algorithms to gain a deep understanding of the role of homophily in trust prediction.

【Keywords】: homophily effect; homophily regularization; social correlation; trust network; trust prediction

Searching and ranking 14

8. Efficient and effective retrieval using selective pruning.

Paper Link】 【Pages】:63-72

【Authors】: Nicola Tonellotto ; Craig Macdonald ; Iadh Ounis

【Abstract】: Retrieval can be made more efficient by deploying dynamic pruning strategies such as WAND, which do not degrade effectiveness up to a given rank. It is possible to increase the efficiency of such techniques by pruning more 'aggressively'. However, this may reduce effectiveness. In this work, we propose a novel selective framework that determines the appropriate amount of pruning aggressiveness on a per-query basis, thereby increasing overall efficiency without significantly reducing overall effectiveness. We postulate two hypotheses about the queries that should be pruned more aggressively, which generate two approaches within our framework, based on query performance predictors and query efficiency predictors, respectively. We thoroughly experiment to ascertain the efficiency and effectiveness impacts of the proposed approaches, as part of a search engine deploying state-of-the-art learning to rank techniques. Our results on 50 million documents of the TREC ClueWeb09 collection show that by using query efficiency predictors to target inefficient queries, we observe that a 36% reduction in mean response time and a 50% reduction of the response times experienced by the slowest 10% of queries can be achieved while still ensuring effectiveness.

【Keywords】: dynamic pruning; efficient & effective search engines; learning to rank; query efficiency prediction

9. Document selection for tiered indexing in commerce search.

Paper Link】 【Pages】:73-82

【Authors】: Debmalya Panigrahi ; Sreenivas Gollapudi

【Abstract】: A search engine aims to return a set of relevant documents in response to a query, while minimizing the response time. This has led to the use of a tiered index, where the search engine maintains a small cache of documents that can serve a large fraction of queries. We give a novel algorithm for the selection of documents in a tiered index for commerce search (i.e. users searching for products on the web) that effectively exploits the superior structural characteristics of commerce search queries. This is in sharp contrast to previous approaches to tiered indexing that were aimed at general web search where queries are typically unstructured. We theoretically analyze our algorithms and give performance guarantees even in worst-case scenarios. We then complement and strengthen our theoretical claims by performing exhaustive experiments on real-world commerce search data, and show that our algorithm outperforms state-of-the-art tiered indexing techniques that were developed for general web search.

【Keywords】: structured search; tiered indexing

10. Quasi-succinct indices.

Paper Link】 【Pages】:83-92

【Authors】: Sebastiano Vigna

【Abstract】: Compressed inverted indices in use today are based on the idea of gap compression: documents pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes which represent smaller gaps using less bits. Additional data such as counts and positions is stored using similar techniques. A large body of research has been built in the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings which increase the efficiency of instantaneous codes. This paper proposes to represent an index using a different architecture based on quasi-succinct representation of monotone sequences. We show that, besides being theoretically elegant and simple, the new index provides expected constant-time operations, space savings, and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.

【Keywords】: compressed indices; succinct data structures

11. Identifying users' topical tasks in web search.

Paper Link】 【Pages】:93-102

【Authors】: Wen Hua ; Yangqiu Song ; Haixun Wang ; Xiaofang Zhou

【Abstract】: A search task represents an atomic information need of a user in web search. Tasks consist of queries and their reformulations, and identifying tasks is important for search engines since they provide valuable information for determining user satisfaction with search results, predicting user search intent, and suggesting queries to the user. Traditional approaches to identifying tasks exploit either temporal or lexical features of queries. However, many query refinements are topical, which means that a query and its refinements may not be similar on the lexical level. Furthermore, multiple tasks in the same search session may interleave, which means we cannot simply order the searches by their timestamps and divide the session into multiple tasks. Thus, in order to identify tasks correctly, we need to be able to compare two queries at the semantic level. In this paper, we use a knowledgebase known as Probase to infer the conceptual meanings of queries, and automatically identify the topical query refinements in the tasks. Experimental results on real search log data demonstrate that Probase can indeed help estimate the topical affinity between queries, and thus enable us to merge queries that are topically related but dissimilar at the lexical level.

【Keywords】: conceptualization; interleaved task; probase; task identification

Paper Link】 【Pages】:103-112

【Authors】: Alexandra Chouldechova ; David Mease

【Abstract】: The query-document relevance judgments used in web search engine evaluation are traditionally provided by human assessors who have no particular association with the specific queries selected for the evaluation. Most commonly, queries are randomly sampled from search logs and in turn randomly assigned to the human assessors. In this paper, we consider a very different approach in which we instead ask the human assessors to provide their own queries from their recent search experiences. Using these queries as our sample, we compare the relevance judgments from the "owners" of the queries to the relevance judgments of the non-owners. We conduct experiments which reveal that query ownership has a substantial and beneficial impact on the accuracy of relevance judgments. In particular, we observe that owners are more consistently able to distinguish a higher quality set of search results from a lower quality set in a blind comparison. The implication for web search evaluation is that query owners provide more valuable relevance judgments than non-owners, presumably due to the background knowledge associated with their queries. We quantify the benefit of using owner assessments versus non-owner assessments in terms of sample size reduction. We also touch on some of the practical challenges associated with using query owners as assessors.

【Keywords】: evaluation; experiment design; search engines; user queries

13. Optimizing top-k document retrieval strategies for block-max indexes.

Paper Link】 【Pages】:113-122

【Authors】: Constantinos Dimopoulos ; Sergey Nepomnyachiy ; Torsten Suel

【Abstract】: Large web search engines use significant hardware and energy resources to process hundreds of millions of queries each day, and a lot of research has focused on how to improve query processing efficiency. One general class of optimizations called early termination techniques is used in all major engines, and essentially involves computing top results without an exhaustive traversal and scoring of all potentially relevant index entries. Recent work in [9,7] proposed several early termination algorithms for disjunctive top-k query processing, based on a new augmented index structure called Block-Max Index that enables aggressive skipping in the index. In this paper, we build on this work by studying new algorithms and optimizations for Block-Max indexes that achieve significant performance gains over the work in [9,7]. We start by implementing and comparing Block-Max oriented algorithms based on the well-known Maxscore and WAND approaches. Then we study how to build better Block-Max index structures and design better index-traversal strategies, resulting in new algorithms that achieve a factor of 2 speed-up over the best results in [9] with acceptable space overheads. We also describe and evaluate a hierarchical algorithm for a new recursive Block-Max index structure.

【Keywords】: block-max inverted index; docid-oriented block-max index; early termination; top-k query processing

14. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.

Paper Link】 【Pages】:123-132

【Authors】: Alex Deng ; Ya Xu ; Ron Kohavi ; Toby Walker

【Abstract】: Online controlled experiments are at the heart of making data-driven decisions at a diverse set of companies, including Amazon, eBay, Facebook, Google, Microsoft, Yahoo, and Zynga. Small differences in key metrics, on the order of fractions of a percent, may have very significant business implications. At Bing it is not uncommon to see experiments that impact annual revenue by millions of dollars, even tens of millions of dollars, either positively or negatively. With thousands of experiments being run annually, improving the sensitivity of experiments allows for more precise assessment of value, or equivalently running the experiments on smaller populations (supporting more experiments) or for shorter durations (improving the feedback cycle and agility). We propose an approach (CUPED) that utilizes data from the pre-experiment period to reduce metric variability and hence achieve better sensitivity. This technique is applicable to a wide variety of key business metrics, and it is practical and easy to implement. The results on Bing's experimentation system are very successful: we can reduce variance by about 50%, effectively achieving the same statistical power with only half of the users, or half the duration.

【Keywords】: a/b testing; controlled experiment; power; pre-experiment; search quality evaluation; sensitivity; variance

Paper Link】 【Pages】:133-142

【Authors】: Youngho Kim ; Ahmed Hassan Awadallah ; Ryen W. White ; Yi-Min Wang

【Abstract】: Understanding the characteristics of queries where a search engine is failing is important for improving engine performance. Previous work largely relies on user-interaction features (e.g., clickthrough statistics) to identify such underperforming queries. However, relying on interaction behavior means that searchers need to become dissatisfied and need to exhibit that in their search behavior, by which point it may be too late to help them. In this paper, we propose a method to generate underperforming query identification rules instantly using topical and lexical attributes. The method first generates query attributes using sources such as topics, concepts (entities), and keywords in queries. Then, association rules are learned by exploiting the FP-growth algorithm and decision trees using underperforming query examples. We develop a query classification model capable of accurately estimating dissatisfaction using the generated rules, and demonstrate significant performance gains over state-of-the-art query performance prediction models.

【Keywords】: underperforming query analysis; user dissatisfaction

16. NCDawareRank: a novel ranking method that exploits the decomposable structure of the web.

Paper Link】 【Pages】:143-152

【Authors】: Athanasios N. Nikolakopoulos ; John D. Garofalakis

【Abstract】: Research about the topological characteristics of the hyperlink graph has shown that Web possesses a nested block structure, indicative of its innate hierarchical organization. This crucial observation opens the way for new approaches that can usefully regard Web as a Nearly Completely Decomposable(NCD) system; In recent years, such approaches gave birth to various efficient methods and algorithms that exploit NCD from a computational point of view and manage to considerably accelerate the extraction of the PageRank vector. However, very little have been done towards the qualitative exploitation of NCD. In this paper we propose NCDawareRank, a novel ranking method that uses the intuition behind NCD to generalize and refine PageRank. NCDawareRank considers both the link structure and the hierarchical nature of the Web in a way that preserves the mathematically attractive characteristics of PageRank and at the same time manages to successfully resolve many of its known problems, including Web Spamming Susceptibility and Biased Ranking of Newly Emerging Pages. Experimental results show that NCDawareRank is more resistant to direct manipulation, alleviates the problems caused by the sparseness of the link graph and assigns more reasonable ranking scores to newly added pages, while maintaining the ability to be easily implemented on a large-scale and in a computationally efficient manner.

【Keywords】: link analysis; link spamming; ncdawarerank; near complete decomposability; pagerank; ranking; sparsity

17. Rank quantization.

Paper Link】 【Pages】:153-162

【Authors】: Ravi Kumar ; Ronny Lempel ; Roy Schwartz ; Sergei Vassilvitskii

【Abstract】: We study the problem of aggregating and summarizing partial orders, on a large scale. Our motivation is two-fold: to discover elements at similar preference levels and to reduce the number of bits needed to store an element's position in a full ranking.We proceed in two steps: first, we find a total order by linearizing the rankings induced by the multiple partial orders and removing potentially inconsistent pairwise preferences. Next, given a total order, we introduce and formalize the rank quantization problem, which intuitively aims to bucketize the total order in a manner that mostly preserves the relations appearing in the partial orders. We show an exact quadratic-time quantization algorithm, as well as a greedy 2/3-approximation algorithm whose running is substantially faster on sparse instances. As an application, we aggregate rankings of top-10 search results over millions of search engine queries, approximately reproducing and then efficiently encoding the underlying static ranks used by the engine. We evaluate the performance of our algorithms on a web dataset of 12 million(2^{23.5}) unique pages and show that we can quantize the pages' static ranks using as few as eight bits, with only a minor degradation in search quality.

【Keywords】: bucketization of static ranks; rank linearization

18. Time-sensitive web image ranking and retrieval via dynamic multi-task regression.

Paper Link】 【Pages】:163-172

【Authors】: Gunhee Kim ; Eric P. Xing

【Abstract】: In this paper, we investigate a time-sensitive image retrieval problem, in which given a query keyword, a query time point, and optionally user information, we retrieve the most relevant and temporally suitable images from the database. Inspired by recently emerging interests on query dynamics in information retrieval research, our time-sensitive image retrieval algorithm can infer users' implicit search intent better and provide more engaging and diverse search results according to temporal trends of Web user photos. We model observed image streams as instances of multivariate point processes represented by several different descriptors, and develop a regularized multi-task regression framework that automatically selects and learns stochastic parametric models to solve the relations between image occurrence probabilities and various temporal factors that influence them. Using Flickr datasets of more than seven million images of 30 topics, our experimental results show that the proposed algorithm is more successful in time-sensitive image retrieval than other candidate methods, including ranking SVM, a PageRank-based image ranking, and a generative temporal topic model.

【Keywords】: multi-task regression; point process; web image retrieval

19. Absence time and user engagement: evaluating ranking functions.

Paper Link】 【Pages】:173-182

【Authors】: Georges Dupret ; Mounia Lalmas

【Abstract】: In the online industry, user engagement is measured with various engagement metrics used to assess users' depth of engagement with a website. Widely-used metrics include clickthrough rates, page views and dwell time. Relying solely on these metrics can lead to contradictory if not erroneous conclusions regarding user engagement. In this paper, we propose the time between two user visits, or the absence time, to measure user engagement. Our assumption is that if users find a website interesting, engaging or useful, they will return to it sooner -a reflection of their engagement with the site -than if this is not the case. This assumption has the advantage of being simple and intuitive and applicable to a large number of settings. As a case study, we use a community Q&A website, and compare the behaviour of users exposed to six functions used to rank past answers, both in terms of traditional metrics and absence time. We use Survival Analysis to show the relation between absence time and other engagement metrics. We demonstrate that the absence time leads to coherent, interpretable results and helps to better understand other metrics commonly used to evaluate user engagement in search.

【Keywords】: clickthrough data; metrics; ranking evaluation; search engine; time between visits; user engagement; usinterleaving

20. Reusing historical interaction data for faster online learning to rank for IR.

Paper Link】 【Pages】:183-192

【Authors】: Katja Hofmann ; Anne Schuth ; Shimon Whiteson ; Maarten de Rijke

【Abstract】: Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning" search engines that can automatically adjust to their users. With the large amount of e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge. In this paper we investigate whether and how previously collected (historical) interaction data can be used to speed up learning in online learning to rank for IR. We devise the first two methods that can utilize historical data (1) to make feedback available during learning more reliable and (2) to preselect candidate ranking functions to be evaluated in interactions with users of the retrieval system. We evaluate both approaches on 9 learning to rank data sets and find that historical data can speed up learning, leading to substantially and significantly higher online performance. In particular, our pre-selection method proves highly effective at compensating for noise in user feedback. Our results show that historical data can be used to make online learning to rank for IR much more effective than previously possible, especially when feedback is noisy.

【Keywords】: information retrieval; interleaved comparison; learning to rank

21. Pairwise ranking aggregation in a crowdsourced setting.

Paper Link】 【Pages】:193-202

【Authors】: Xi Chen ; Paul N. Bennett ; Kevyn Collins-Thompson ; Eric Horvitz

【Abstract】: Inferring rankings over elements of a set of objects, such as documents or images, is a key learning problem for such important applications as Web search and recommender systems. Crowdsourcing services provide an inexpensive and efficient means to acquire preferences over objects via labeling by sets of annotators. We propose a new model to predict a gold-standard ranking that hinges on combining pairwise comparisons via crowdsourcing. In contrast to traditional ranking aggregation methods, the approach learns about and folds into consideration the quality of contributions of each annotator. In addition, we minimize the cost of assessment by introducing a generalization of the traditional active learning scenario to jointly select the annotator and pair to assess while taking into account the annotator quality, the uncertainty over ordering of the pair, and the current model uncertainty. We formalize this as an active learning strategy that incorporates an exploration-exploitation tradeoff and implement it using an efficient online Bayesian updating scheme. Using simulated and real-world data, we demonstrate that the active learning strategy achieves significant reductions in labeling cost while maintaining accuracy.

【Keywords】: crowdsourcing; pairwise preference; ranking

Large-scale data and social networks 4

22. Optimizing parallel algorithms for all pairs similarity search.

Paper Link】 【Pages】:203-212

【Authors】: Maha Ahmed Alabduljalil ; Xun Tang ; Tao Yang

【Abstract】: All pairs similarity search is used in many web search and data mining applications. Previous work has used comparison filtering, inverted indexing, and parallel accumulation of partial intermediate results to expedite its execution. However, shuffling intermediate results can incur significant communication overhead as data scales up. This paper studies a scalable two-step approach called Partition-based Similarity Search (PSS) which incorporates several optimization techniques. First, PSS uses a static partitioning algorithm that places dissimilar vectors into different groups and balance the comparison workload with a circular assignment. Second, PSS executes comparison tasks in parallel, each using a hybrid data structure that combines the advantages of forward and inverted indexing. Our evaluation results show that the proposed approach leads to an early elimination of unnecessary I/O and data communication while sustaining parallel efficiency. As a result, it improves performance by an order of magnitude when dealing with large datasets.

【Keywords】: filtering; load balancing; parallel similarity search; partitioning

23. Bursty subgraphs in social networks.

Paper Link】 【Pages】:213-222

【Authors】: Milad Eftekhar ; Nick Koudas ; Yashar Ganjali

【Abstract】: Data available through social media and content sharing platforms present opportunities for analysis and mining. In the context of social networks, it is interesting to formalize and locate bursts of activities amongst users, related to a particular event and to report sets of socially connected users participating in such bursts. Such collections present new opportunities for understanding social events, and render new ways of online marketing. In this paper, we model social information using two conceptualized graph models. The first one (the action graph) provides a detailed model of all activities of all users while the second one (the holistic graph) provides an aggregate view on each user in the social media. We also propose two models to define the notion of "burst". The first model (intrinsic burst model) takes the intrinsic characteristics of each user into account to recognize the bursty behaviors; while the second model (social burst model) considers neighbors' influences when identifying bursts. We provide two linear algorithms to detect bursts based on the proposed models. These algorithms have been extensively evaluated on a month of full Twitter dataset certifying the practicality of our approach. A detailed qualitative study of our techniques is also presented.

【Keywords】: bursty subgraphs; information burst; social networks; twitter

24. Sharding social networks.

Paper Link】 【Pages】:223-232

【Authors】: Quang Duong ; Sharad Goel ; Jake M. Hofman ; Sergei Vassilvitskii

【Abstract】: Online social networking platforms regularly support hundreds of millions of users, who in aggregate generate substantially more data than can be stored on any single physical server. As such, user data are distributed, or sharded, across many machines. A key requirement in this setting is rapid retrieval not only of a given user's information, but also of all data associated with his or her social contacts, suggesting that one should consider the topology of the social network in selecting a sharding policy. In this paper we formalize the problem of efficiently sharding large social network databases, and evaluate several sharding strategies, both analytically and empirically. We find that random sharding---the de facto standard---results in provably poor performance even when frequently accessed nodes are replicated to many shards. By contrast, we demonstrate that one can substantially reduce querying costs by identifying and assigning tightly knit communities to shards. In particular, our theoretical analysis motivates a novel, scalable sharding algorithm that outperforms both random and location-based sharding schemes.

【Keywords】: community detection; sharding; social networks

25. Arrival and departure dynamics in social networks.

Paper Link】 【Pages】:233-242

【Authors】: Shaomei Wu ; Atish Das Sarma ; Alex Fabrikant ; Silvio Lattanzi ; Andrew Tomkins

【Abstract】: In this paper, we consider the natural arrival and departure of users in a social network, and ask whether the dynamics of arrival, which have been studied in some depth, also explain the dynamics of departure, which are not as well studied. Through study of the DBLP co-authorship network and a large online social network, we show that the dynamics of departure behave differently from the dynamics of formation. In particular, the probability of departure of a user with few friends may be understood most accurately as a function of the raw number of friends who are active. For users with more friends, however, the probability of departure is best predicted by the overall fraction of the user's neighborhood that is active, independent of size. We then study global properties of the sub-graphs induced by active and inactive users, and show that active users tend to belong to a core that is densifying and is significantly denser than the inactive users. Further, the inactive set of users exhibit a higher density and lower conductance than the degree distribution alone can explain. These two aspects suggest that nodes at the fringe are more likely to depart and subsequent departure are correlated among neighboring nodes in tightly-knit communities.

【Keywords】: social graph analysis; social networks; user engagement

Keynote address 1

26. Three findings regarding privacy online.

Paper Link】 【Pages】:243-244

【Authors】: Catherine Tucker

【Abstract】: The Internet now enables firms to collect detailed and potentially intrusive data about their customers both easily and cheaply. I discuss three empirical results related to customer privacy-protection that is enacted in response to this change. 1) Privacy protection that focuses on obtaining consent appears to lead to less effective advertising. This is based on results from extensive empirical analysis into how effectiveness of online advertising changed in response to the implementation of the E-Privacy Directive in Europe. 2) Privacy protection which gives direct control over customers' privacy appears to enhance economic outcomes. This is based on a detailed case study of customer responsiveness to different forms of advertising in the wake of a change in Facebook privacy policies. 3) Restricting the length of time that potentially private data is stored appears to have little economic impact. This is based on empirical analysis of aggregate search behavior following a change in the length of time search engines were told they could store search engine query data in Europe.

【Keywords】: economics; privacy online; search

Best paper award 1

27. Optimized interleaving for online retrieval evaluation.

Paper Link】 【Pages】:245-254

【Authors】: Filip Radlinski ; Nick Craswell

【Abstract】: Interleaving is an online evaluation technique for comparing the relative quality of information retrieval functions by combining their result lists and tracking clicks. A sequence of such algorithms have been proposed, each being shown to address problems in earlier algorithms. In this paper, we formalize and generalize this process, while introducing a formal model: We identify a set of desirable properties for interleaving, then show that an interleaving algorithm can be obtained as the solution to an optimization problem within those constraints. Our approach makes explicit the parameters of the algorithm, as well as assumptions about user behavior. Further, we show that our approach leads to an unbiased and more efficient interleaving algorithm than any previous approach, using a novel log-based analysis of user search behavior.

【Keywords】: evaluation; interleaving; web search

Web and usage mining 6

28. Mining the web to predict future events.

Paper Link】 【Pages】:255-264

【Authors】: Kira Radinsky ; Eric Horvitz

【Abstract】: We describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. We consider the examples of identifying significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world. We provide details of methods and studies, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. We evaluate the predictive power of the approach on real-world events withheld from the system.

【Keywords】: future prediction; news prediction; web knowledge for future prediction

29. Studying inter-national mobility through IP geolocation.

Paper Link】 【Pages】:265-274

【Authors】: Bogdan State ; Ingmar Weber ; Emilio Zagheni

【Abstract】: The increasing ubiquity of Internet use has opened up new avenues in the study of human mobility. Easily-obtainable geolocation data resulting from repeated logins to the same website offer the possibility of observing long-term patterns of mobility for a large number of individuals. We use data on the geographic locations from where over 100 million anonymized users log into Yahoo!~services to generate the first global map of short- and medium-term mobility flows. We develop a protocol to identify anonymized users who, over a one-year period, had spent more than 3 months in a different country from their stated country of residence ("migrants"), and users who spent less than a month in another country ("tourists"). We compute aggregate estimates of migration propensities between countries, as inferred from a user's location over the observed period. Geolocation data allow us to characterize also the pendularity of migration flows -- i.e., the extent to which migrants travel back and forth between their countries of origin and destination. We use data regarding visa regimes, colonial ties, geographic location and economic development to predict migration and tourism flows. Our analysis shows the persistence of traditional migration patterns as well as the emergence of new routes. Migrations tend to be more pendular between countries that are close to each other. We observe particularly high levels of pendularity within the European Economic Area, even after we control for distance and visa regimes. The dataset, methodology and results presented have important implications for the travel industry, as well as for several disciplines in social sciences, including geography, demography and the sociology of networks.

【Keywords】: ip addresses; migrations; mobility; tourism

30. From machu_picchu to "rafting the urubamba river": anticipating information needs via the entity-query graph.

Paper Link】 【Pages】:275-284

【Authors】: Ilaria Bordino ; Gianmarco De Francisci Morales ; Ingmar Weber ; Francesco Bonchi

【Abstract】: We study the problem of anticipating user search needs, based on their browsing activity. Given the current web page p that a user is visiting we want to recommend a small and diverse set of search queries that are relevant to the content of p, but also non-obvious and serendipitous. We introduce a novel method that is based on the content of the page visited, rather than on past browsing patterns as in previous literature. Our content-based approach can be used even for previously unseen pages. We represent the topics of a page by the set of Wikipedia entities extracted from it. To obtain useful query suggestions for these entities, we exploit a novel graph model that we call EQGraph (Entity-Query Graph), containing entities, queries, and transitions between entities, between queries, as well as from entities to queries. We perform Personalized PageRank computation on such a graph to expand the set of entities extracted from a page into a richer set of entities, and to associate these entities with relevant query suggestions. We develop an efficient implementation to deal with large graph instances and suggest queries from a large and diverse pool. We perform a user study that shows that our method produces relevant and interesting recommendations, and outperforms an alternative method based on reverse IR.

【Keywords】: entity extraction; implicit search; query suggestions; serendipity

Paper Link】 【Pages】:285-294

【Authors】: Carsten Eickhoff ; Kevyn Collins-Thompson ; Paul N. Bennett ; Susan T. Dumais

【Abstract】: Most research in Web search personalization models users as static or slowly evolving entities with a given set of preferences defined by their past behavior. However, recent publications as well as empirical evidence suggest that for a significant number of search sessions, users diverge from their regular search profiles in order to satisfy atypical, limited-duration information needs. In this work, we conduct a large-scale inspection of real-life search sessions to further understand this scenario. Subsequently, we design an automatic means of detecting and supporting such atypical sessions. We demonstrate significant improvements over state-of-the-art Web search personalization techniques by accounting for the typicality of search sessions. The proposed method is evaluated based on Web-scale search session data spanning several months of user activity.

【Keywords】: adaptive interfaces; domain expertise; personalized search; user modeling

Paper Link】 【Pages】:295-304

【Authors】: Nadav Golbandi ; Liran Katzir ; Yehuda Koren ; Ronny Lempel

【Abstract】: The massive volume of queries submitted to major Web search engines reflects human interest at a global scale. While the popularity of many search queries is stable over time or fluctuates with periodic regularity, some queries experience a sudden and ephemeral rise in popularity that is unexplained by their past volumes. Typically the popularity surge is precipitated by some real-life event in the news cycle. Such queries form what are known as search trends. All major search engines, using query log analysis and other signals, invest in detecting such trends. The goal is to surface trends accurately, with low latency relative to the actual event that sparked the trend. This work formally defines precision, recall and latency metrics related to top-k search trend detection. Then, observing that many trend detection algorithms rely on query counts, we develop a linear auto-regression model to predict future query counts. Subsequently, we tap the predicted counts to expedite search trend detection by plugging them into an existing trend detection scheme. Experimenting with query logs from a major Web search engine, we report both the stand-alone accuracy of our query count predictions, as well as the task-oriented effects of the prediction on the emitted trends. We show an average reduction in trend detection latency of roughly twenty minutes, with a negligible impact on the precision and recall metrics.

【Keywords】: query count prediction; search trend detection

33. News recommendation via hypergraph learning: encapsulation of user behavior and news content.

Paper Link】 【Pages】:305-314

【Authors】: Lei Li ; Tao Li

【Abstract】: Personalized news recommender systems have gained increasing attention in recent years. Within a news reading community, the implicit correlations among news readers, news articles, topics and named entities, e.g., what types of named entities in articles are preferred by users, and why users like the articles, could be valuable for building an effective news recommender. In this paper, we propose a novel news personalization framework by mining such correlations. We use hypergraph to model various high-order relations among different objects in news data, and formulate news recommendation as a ranking problem on fine-grained hypergraphs. In addition, by transductive inference, our proposed algorithm is capable of effectively handling the so-called cold-start problem. Extensive experiments on a data set collected from various news websites have demonstrated the effectiveness of our proposed algorithm.

【Keywords】: hypergraph learning; named entity; news recommendation; personalization; transductive inference

Web mining, prediction, and recommendation 14

34. Group sparse topical coding: from code to topic.

Paper Link】 【Pages】:315-324

【Authors】: Lu Bai ; Jiafeng Guo ; Yanyan Lan ; Xueqi Cheng

【Abstract】: Learning low dimensional representations of text corpora is critical in many content analysis and data mining applications. It is even more desired and challenging to learn a sparse representation in practice for large scale text modeling. However, traditional probabilistic topic models (PTM) lack a mechanism to directly control the posterior sparsity of the inferred representations; While the emerged non-probabilistic models (NPM) can explicitly control sparsity using sparse constraint like l_1 norm, they convey different limitations in latent representations. To address the existing problems, we propose a novel non-probabilistic topic model for discovering sparse latent representations of large text corpora, referred as group sparse topical coding (GSTC). Our model enjoys both the merits of the PTMs and NPMs. On one hand, GSTC can naturally derive document-level admixture proportions in topic simplex like PTMs, which is useful for semantic analysis, classification or retrieval. On the other hand, GSTC can directly control the sparsity of the inferred representations with group lasso by relaxing the normalization constraint. Moreover, the relaxed non-probabilistic GSTC can be effectively learned using coordinate descent method. Experimental results on benchmark datasets show that GSTC can discover meaningful compact latent representations of documents, and improve the document classification accuracy and time efficiency.

【Keywords】: document representation; group lasso; sparse coding; topic model

35. TYPiMatch: type-specific unsupervised learning of keys and key values for heterogeneous web data integration.

Paper Link】 【Pages】:325-334

【Authors】: Yongtao Ma ; Thanh Tran

【Abstract】: Instance matching and blocking, a preprocessing step used for selecting candidate matches, require determining the most representative attributes of instances called keys, based on which similarities between instances are computed. We show that for the problem of learning blocking keys and key values, both generic techniques that do not exploit type information and supervised learning techniques optimized for one single predefined type of instances do not perform well on heterogeneous Web data capturing instances for which the predefined type is too general. That is, they actually belong to some subtypes that are not explicitly specified in the data. We propose an unsupervised approach for learning these subtypes and the subtype-specific blocking keys and key values. Compared to state-of-the-art supervised and unsupervised learning approaches that are optimized for one single type, our approach improves efficiency as well as result quality. In particular, we show that the proposed strategy of learning subtype-specific blocking keys and key values improves both blocking and instance matching results.

【Keywords】: heterogeneous web data; information integration; unsupervised learning

36. Robust query rewriting using anchor data.

Paper Link】 【Pages】:335-344

【Authors】: Nick Craswell ; Bodo Billerbeck ; Dennis Fetterly ; Marc Najork

【Abstract】: Query rewriting algorithms can be used as a form of query expansion, by combining the user's original query with automatically generated rewrites. Rewriting algorithms bring linguistic datasets to bear without the need for iterative relevance feedback, but most studies of rewriting have used proprietary datasets such as large-scale search logs. By contrast this paper uses readily available data, particularly ClueWeb09 link text with over 1.2 billion anchor phrases, to generate rewrites. To avoid overfitting, our initial analysis is performed using Million Query Track queries, leading us to identify three algorithms which perform well. We then test the algorithms on Web and newswire data. Results show good properties in terms of robustness and early precision.

【Keywords】: anchor text; query rewriting

37. Wiki3C: exploiting wikipedia for context-aware concept categorization.

Paper Link】 【Pages】:345-354

【Authors】: Peng Jiang ; Huiman Hou ; Lijiang Chen ; Shimin Chen ; Conglei Yao ; Chengkai Li ; Min Wang

【Abstract】: Wikipedia is an important human generated knowledge base containing over 21 million articles organized by millions of categories. In this paper, we exploit Wikipedia for a new task of text mining: Context-aware Concept Categorization. In the task, we focus on categorizing concepts according to their context. We exploit article link feature and category structure in Wikipedia, followed by introducing Wiki3C, an unsupervised and domain independent concept categorization approach based on context. In the approach, we investigate two strategies to select and filter Wikipedia articles for the category representation. Besides, a probabilistic model is employed to compute the semantic relatedness between two concepts in Wikipedia. Experimental evaluation using manually labeled ground truth shows that our proposed Wiki3C can achieve a noticeable improvement over the baselines without considering contextual information.

【Keywords】: context-aware concept categorization; text mining; wikipedia

38. Crawling deep web entity pages.

Paper Link】 【Pages】:355-364

【Authors】: Yeye He ; Dong Xin ; Venkatesh Ganti ; Sriram Rajaraman ; Nirav Shah

【Abstract】: Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.

【Keywords】: deep-web crawl; entities; web data

39. Using early view patterns to predict the popularity of youtube videos.

Paper Link】 【Pages】:365-374

【Authors】: Henrique Pinto ; Jussara M. Almeida ; Marcos André Gonçalves

【Abstract】: Predicting Web content popularity is an important task for supporting the design and evaluation of a wide range of systems, from targeted advertising to effective search and recommendation services. We here present two simple models for predicting the future popularity of Web content based on historical information given by early popularity measures. Our approach is validated on datasets consisting of videos from the widely used YouTube video-sharing portal. Our experimental results show that, compared to a state-of-the-art baseline model, our proposed models lead to significant decreases in relative squared errors, reaching up to 20% reduction on average, and larger reductions (of up to 71%) for videos that experience a high peak in popularity in their early days followed by a sharp decrease in popularity.

【Keywords】: popularity prediction; regression models; youtube

40. Geo topic model: joint modeling of user's activity area and interests for location recommendation.

Paper Link】 【Pages】:375-384

【Authors】: Takeshi Kurashima ; Tomoharu Iwata ; Takahide Hoshide ; Noriko Takaya ; Ko Fujimura

【Abstract】: This paper proposes a method that analyzes the location log data of multiple users to recommend locations to be visited. The method uses our new topic model, called Geo Topic Model, that can jointly estimate both the user's interests and activity area hosting the user's home, office and other personal places. By explicitly modeling geographical features of locations and users, the user's interests in other features of locations, which we call latent topics, can be inferred effectively. The topic interests estimated by our model 1) lead to high accuracy in predicting visit behavior as driven by personal interests, 2) make possible the generation of recommendations when the user is in an unfamiliar area (e.g. sightseeing), and 3) enable the recommender system to suggest an interpretable representation of the user profile that can be customized by the user. Experiments are conducted using real location logs of landmark and restaurant visits to evaluate the recommendation performance of the proposed method in terms of the accuracy of predicting visit selections. We also show that our model can estimate latent features of locations such as art, nature and atmosphere as latent topics, and describe each user's preference based on them.

【Keywords】: location recommendation; topic model

41. Latent factor models with additive and hierarchically-smoothed user preferences.

Paper Link】 【Pages】:385-394

【Authors】: Amr Ahmed ; Bhargav Kanagal ; Sandeep Pandey ; Vanja Josifovski ; Lluis Garcia Pueyo ; Jeffrey Yuan

【Abstract】: Items in recommender systems are usually associated with annotated attributes: for e.g., brand and price for products; agency for news articles, etc. Such attributes are highly informative and must be exploited for accurate recommendation. While learning a user preference model over these attributes can result in an interpretable recommender system and can hands the cold start problem, it suffers from two major drawbacks: data sparsity and the inability to model random effects. On the other hand, latent-factor collaborative filtering models have shown great promise in recommender systems; however, its performance on rare items is poor. In this paper we propose a novel model LFUM, which provides the advantages of both of the above models. We learn user preferences (over the attributes) using a personalized Bayesian hierarchical model that uses a combination(additive model) of a globally learned preference model along with user-specific preferences. To combat data-sparsity, we smooth these preferences over the item-taxonomy using an efficient forward-filtering and backward-smoothing inference algorithm. Our inference algorithms can handle both discrete attributes (e.g., item brands) and continuous attributes (e.g., item prices). We combine the user preferences with the latent-factor models and train the resulting collaborative filtering system end-to-end using the successful BPR ranking algorithm. In our extensive experimental analysis, we show that our proposed model outperforms several commonly used baselines and we carry out an ablation study showing the benefits of each component of our model.

【Keywords】: inference; latent variable models; recomcollaborative filtering; recomfactor models; recommendation

42. App recommendation: a contest between satisfaction and temptation.

Paper Link】 【Pages】:395-404

【Authors】: Peifeng Yin ; Ping Luo ; Wang-Chien Lee ; Min Wang

【Abstract】: Due to the huge and still rapidly growing number of mobile applications (apps), it becomes necessary to provide users an app recommendation service. Different from conventional item recommendation where the user interest is the primary factor, app recommendation also needs to consider factors that invoke a user to replace an old app (if she already has one) with a new app. In this work we propose an Actual- Tempting model that captures such factors in the decision process of mobile app adoption. The model assumes that each owned app has an actual satisfactory value and a new app under consideration has a tempting value. The former stands for the real satisfactory value the owned app brings to the user while the latter represents the estimated value the new app may seemingly have. We argue that the process of app adoption therefore is a contest between the owned apps' actual values and the candidate app's tempting value. Via the extensive experiments we show that the AT model performs significantly better than the conventional recommendation techniques such as collaborative filtering and content-based recommendation. Furthermore, the best recommendation performance is achieved when the AT model is combined with them.

【Keywords】: app recommendation; contest; smartphone apps

43. Threading machine generated email.

Paper Link】 【Pages】:405-414

【Authors】: Nir Ailon ; Zohar Shay Karnin ; Edo Liberty ; Yoelle Maarek

【Abstract】: Viewing email messages as parts of a sequence or a thread is a convenient way to quickly understand their context. Current threading techniques rely on purely syntactic methods, matching sender information, subject line, and reply/forward prefixes. As such, they are mostly limited to personal conversations. In contrast, machine-generated email, which amount, as per our experiments, to more than 60% of the overall email traffic, requires a different kind of threading that should reflect how a sequence of emails is caused by a few related user actions. For example, purchasing goods from an online store will result in a receipt or a confirmation message, which may be followed, possibly after a few days, by a shipment notification message from an express shipping service. In today's mail systems, they will not be a part of the same thread, while we believe they should. In this paper, we focus on this type of threading that we coin "causal threading". We demonstrate that, by analyzing recurring patterns over hundreds of millions of mail users, we can infer a causality relation between these two individual messages. In addition, by observing multiple causal relations over common messages, we can generate "causal threads" over a sequence of messages. The four key stages of our approach consist of: (1) identifying messages that are instances of the same email type or "template" (generated by the same machine process on the sender side) (2) building a causal graph, in which nodes correspond to email templates and edges indicate potential causal relations (3) learning a causal relation prediction function, and (4) automatically "threading" the incoming email stream. We present detailed experimental results obtained by analyzing the inboxes of 12.5 million Yahoo! Mail users, who voluntarily opted-in for such research. Supervised editorial judgments show that we can identify more than 70% (recall rate) of all "causal threads" at a precision level of 90%. In addition, for a search scenario we show that we achieve a precision close to 80% at 90% recall. We believe that supporting causal threads in email clients opens new grounds for improving both email search and browsing experiences.

【Keywords】: algorithms; email threading; emamodels; frequent sets and patterns; user experience

44. Predicting content change on the web.

Paper Link】 【Pages】:415-424

【Authors】: Kira Radinsky ; Paul N. Bennett

【Abstract】: Accurate prediction of changing web page content improves a variety of retrieval and web related components. For example, given such a prediction algorithm one can both design a better crawling strategy that only recrawls pages when necessary as well as a proactive mechanism for personalization that pushes content associated with user revisitation directly to the user. While many techniques for modeling change have focused simply on past change frequency, our work goes beyond that by additionally studying the usefulness in page change prediction of: the page's content; the degree and relationship among the prediction page's observed changes; the relatedness to other pages and the similarity in the types of changes they undergo. We present an expert prediction framework that incorporates the information from these other signals more effectively than standard ensemble or basic relational learning techniques. In an empirical analysis, we find that using page content as well as related pages significantly improves prediction accuracy and compare it to common approaches. We present numerous similarity metrics to identify related pages and focus specifically on measures of temporal content similarity. We observe that the different metrics yield related pages that are qualitatively different in nature and have different effects on the prediction performance.

【Keywords】: web change prediction; web crawling; web dynamics

45. Triplex transfer learning: exploiting both shared and distinct concepts for text classification.

Paper Link】 【Pages】:425-434

【Authors】: Fuzhen Zhuang ; Ping Luo ; Changying Du ; Qing He ; Zhongzhi Shi

【Abstract】: Transfer learning focuses on the learning scenarios when the test data from target domains and the training data from source domains are drawn from similar but different data distribution with respect to the raw features. Some recent studies argued that the high-level concepts (e.g. word clusters) can help model the data distribution difference, and thus are more appropriate for classification. Specifically, these methods assume that all the data domains have the same set of shared concepts, which are used as the bridge for knowledge transfer. However, besides these shared concepts each domain may have its own distinct concepts. To address this point, we propose a general transfer learning framework based on non-negative matrix tri-factorization which allows to explore both shared and distinct concepts among all the domains simultaneously. Since this model provides more flexibility in fitting the data it may lead to better classification accuracy. To solve the proposed optimization problem we develop an iterative algorithm and also theoretically analyze its convergence. Finally, extensive experiments show the significant superiority of our model over the baseline methods. In particular, we show that our method works much better in the more challenging tasks when distinct concepts may exist.

【Keywords】: common concept; distinct concept; distribution mismatch; non-negative matrix tri-factorization; triplex transfer learning

46. Have you done anything like that?: predicting performance using inter-category reputation.

Paper Link】 【Pages】:435-444

【Authors】: Marios Kokkodis ; Panagiotis G. Ipeirotis

【Abstract】: Online labor markets such as oDesk and Amazon Mechanical Turk have been growing in importance over the last few years. In these markets, employers post tasks on which remote contractors work and deliver the product of their work. As in most online marketplaces, reputation mechanisms play a very important role in facilitating transactions, since they instill trust and are often predictive of the future satisfaction of the employer. However, labor markets are usually highly heterogeneous in terms of available task categories; in such scenarios, past performance may not be a representative signal of future performance. To account for this heterogeneity, in our work, we build models that predict the performance of a worker based on prior, category-specific feedback. Our models assume that each worker has a category-specific quality, which is latent and not directly observable; what is observable, though, is the set of feedback ratings of the worker and of other contractors with similar work histories. Based on this information, we build a multi-level, hierarchical scheme that deals effectively with the data sparseness, which is inherent in many cases of interest (i.e., contractors with relatively brief work histories). We evaluate our models on a large corpus of real transactional data from oDesk, an online labor market with hundreds of millions of dollars in transaction volume. Our results show an improved accuracy of up to 47% compared to the existing baseline.

【Keywords】: bayesian modeling; online labor markets; reputation; task performance prediction

47. Learning multiple-question decision trees for cold-start recommendation.

Paper Link】 【Pages】:445-454

【Authors】: Mingxuan Sun ; Fuxin Li ; Joonseok Lee ; Ke Zhou ; Guy Lebanon ; Hongyuan Zha

【Abstract】: For cold-start recommendation, it is important to rapidly profile new users and generate a good initial set of recommendations through an interview process --- users should be queried adaptively in a sequential fashion, and multiple items should be offered for opinion solicitation at each trial. In this work, we propose a novel algorithm that learns to conduct the interview process guided by a decision tree with multiple questions at each split. The splits, represented as sparse weight vectors, are learned through an L_1-constrained optimization framework. The users are directed to child nodes according to the inner product of their responses and the corresponding weight vector. More importantly, to account for the variety of responses coming to a node, a linear regressor is learned within each node using all the previously obtained answers as input to predict item ratings. A user study, preliminary but first in its kind in cold-start recommendation, is conducted to explore the efficient number and format of questions being asked in a recommendation survey to minimize user cognitive efforts. Quantitative experimental validations also show that the proposed algorithm outperforms state-of-the-art approaches in terms of both the prediction accuracy and user cognitive efforts.

【Keywords】: cold-start problem; collaborative filtering; decision tree; recommender systems

Learning and modeling 5

48. Online multi-modal distance learning for scalable multimedia retrieval.

Paper Link】 【Pages】:455-464

【Authors】: Hao Xia ; Pengcheng Wu ; Steven C. H. Hoi

【Abstract】: In many real-word scenarios, e.g., multimedia applications, data often originates from multiple heterogeneous sources or are represented by diverse types of representation, which is often referred to as "multi-modal data". The definition of distance between any two objects/items on multi-modal data is a key challenge encountered by many real-world applications, including multimedia retrieval. In this paper, we present a novel online learning framework for learning distance functions on multi-modal data through the combination of multiple kernels. In order to attack large-scale multimedia applications, we propose Online Multi-modal Distance Learning (OMDL) algorithms, which are significantly more efficient and scalable than the state-of-the-art techniques. We conducted an extensive set of experiments on multi-modal image retrieval applications, in which encouraging results validate the efficacy of the proposed technique.

【Keywords】: graph laplacian; multi-modal distance; multimedia retrieval; online learning

49. Unsupervised graph-based topic labelling using dbpedia.

Paper Link】 【Pages】:465-474

【Authors】: Ioana Hulpus ; Conor Hayes ; Marcel Karnstedt ; Derek Greene

【Abstract】: Automated topic labelling brings benefits for users aiming at analysing and understanding document collections, as well as for search engines targetting at the linkage between groups of words and their inherent topics. Current approaches to achieve this suffer in quality, but we argue their performances might be improved by setting the focus on the structure in the data. Building upon research for concept disambiguation and linking to DBpedia, we are taking a novel approach to topic labelling by making use of structured data exposed by DBpedia. We start from the hypothesis that words co-occuring in text likely refer to concepts that belong closely together in the DBpedia graph. Using graph centrality measures, we show that we are able to identify the concepts that best represent the topics. We comparatively evaluate our graph-based approach and the standard text-based approach, on topics extracted from three corpora, based on results gathered in a crowd-sourcing experiment. Our research shows that graph-based analysis of DBpedia can achieve better results for topic labelling in terms of both precision and topic coverage.

【Keywords】: dbpedia; graph centrality measures; latent dirichlet allocation; topic labelling

50. Estimating content concreteness for finding comprehensible documents.

Paper Link】 【Pages】:475-484

【Authors】: Shinya Tanaka ; Adam Jatowt ; Makoto P. Kato ; Katsumi Tanaka

【Abstract】: Document comprehensibility is one of key factors determining document quality and, in result, user's satisfaction. Relevant web pages are of little utility if they are incomprehensible or impose too much cognitive burden on readers. Traditional measures of text difficulty focus often on syntactic factors of text such as sentence length, word length, syllable count, or they utilize fixed list of common terms. However, document comprehensibility depends on many factors, of which concreteness and the ease of concept visualization are crucial ones. In this paper, we first propose a method for predicting the concreteness of terms using SVM regression. We then extend it to calculating document concreteness level. The experimental results indicate satisfactory accuracy in estimating both term and document concreteness as well as demonstrate positive correlation between the document concreteness and comprehensibility. Our ultimate goal is to enable comprehension-driven search, which will return both relevant and comprehensible results.

【Keywords】: abstractness; comprehensibility; concreteness; information retrieval; readability

Paper Link】 【Pages】:485-494

【Authors】: Pradipto Das ; Rohini K. Srihari ; Jason J. Corso

【Abstract】: Documents containing video and text are becoming more and more widespread and yet content analysis of those documents depends primarily on the text. Although automated discovery of semantically related words from text improves free text query understanding, translating videos into text summaries facilitates better video search particularly in the absence of accompanying text. In this paper, we propose a multimedia topic modeling framework suitable for providing a basis for automatically discovering and translating semantically related words obtained from textual metadata of multimedia documents to semantically related videos or frames from videos. The framework jointly models video and text and is flexible enough to handle different types of document features in their constituent domains such as discrete and real valued features from videos representing actions, objects, colors and scenes as well as discrete features from text. Our proposed models show much better fit to the multimedia data in terms of held-out data log likelihoods. For a given query video, our models translate low level vision features into bag of keyword summaries which can be further translated using simple natural language generation techniques into human readable paragraphs. We quantitatively compare the results of video to bag of words translation against a state-of-the-art baseline object recognition model from computer vision. We show that text translations from multimodal topic models vastly outperform the baseline on a multimedia dataset downloaded from the Internet.

【Keywords】: multimedia topic models; video to text summarization

Paper Link】 【Pages】:495-504

【Authors】: Jing Liu ; Fan Zhang ; Xinying Song ; Young-In Song ; Chin-Yew Lin ; Hsiao-Wuen Hon

【Abstract】: In this paper, we consider the problem of linking users across multiple online communities. Specifically, we focus on the alias-disambiguation step of this user linking task, which is meant to differentiate users with the same usernames. We start quantitatively analyzing the importance of the alias-disambiguation step by conducting a survey on 153 volunteers and an experimental analysis on a large dataset of About.me (75,472 users). The analysis shows that the alias-disambiguation solution can address a major part of the user linking problem in terms of the coverage of true pairwise decisions (46.8%). To the best of our knowledge, this is the first study on human behaviors with regards to the usages of online usernames. We then cast the alias-disambiguation step as a pairwise classification problem and propose a novel unsupervised approach. The key idea of our approach is to automatically label training instances based on two observations: (a) rare usernames are likely owned by a single natural person, e.g. pennystar88 as a positive instance; (b) common usernames are likely owned by different natural persons, e.g. tank as a negative instance. We propose using the n-gram probabilities of usernames to estimate the rareness or commonness of usernames. Moreover, these two observations are verified by using the dataset of Yahoo! Answers. The empirical evaluations on 53 forums verify: (a) the effectiveness of the classifiers with the automatically generated training data and (b) that the rareness and commonness of usernames can help user linking. We also analyze the cases where the classifiers fail.

【Keywords】: pairwise classification; social network; user linking

Keynote address 1

53. Big data, lifelong machine learning and transfer learning.

Paper Link】 【Pages】:505-506

【Authors】: Qiang Yang

【Abstract】: A major challenge in today's world is the Big Data problem, which manifests itself in Web and Mobile domains as rapidly changing and heterogeneous data streams. A data-mining system must be able to cope with the influx of changing data in a continual manner. This calls for Lifelong Machine Learning, which in contrast to the traditional one-shot learning, should be able to identify the learning tasks at hand and adapt to the learning problems in a sustainable manner. A foundation for lifelong machine learning is transfer learning, whereby knowledge gained in a related but different domain may be transferred to benefit learning for a current task. To make effective transfer learning, it is important to maintain a continual and sustainable channel in the life time of a user in which the data are annotated. In this talk, I outline the lifelong machine learning situations, give several examples of transfer learning and applications for lifelong machine learning, and discuss cases of successful extraction of data annotations to meet the Big Data challenge.

【Keywords】: big data; lifelong machine learning; transfer learning

Best student paper award 1

54. Balanced label propagation for partitioning massive graphs.

Paper Link】 【Pages】:507-516

【Authors】: Johan Ugander ; Lars Backstrom

【Abstract】: Partitioning graphs at scale is a key challenge for any application that involves distributing a graph across disks, machines, or data centers. Graph partitioning is a very well studied problem with a rich literature, but existing algorithms typically can not scale to billions of edges, or can not provide guarantees about partition sizes. In this work we introduce an efficient algorithm, balanced label propagation, for precisely partitioning massive graphs while greedily maximizing edge locality, the number of edges that are assigned to the same shard of a partition. By combining the computational efficiency of label propagation --- where nodes are iteratively relabeled to the same 'label' as the plurality of their graph neighbors --- with the guarantees of constrained optimization --- guiding the propagation by a linear program constraining the partition sizes --- our algorithm makes it practically possible to partition graphs with billions of edges. Our algorithm is motivated by the challenge of performing graph predictions in a distributed system. Because this requires assigning each node in a graph to a physical machine with memory limitations, it is critically necessary to ensure the resulting partition shards do not overload any single machine. We evaluate our algorithm for its partitioning performance on the Facebook social graph, and also study its performance when partitioning Facebook's 'People You May Know' service (PYMK), the distributed system responsible for the feature extraction and ranking of the friends-of-friends of all active Facebook users. In a live deployment, we observed average query times and average network traffic levels that were 50.5% and 37.1% (respectively) when compared to the previous naive random sharding.

【Keywords】: graph clustering; graph partitioning; label propagation; social networks

Social media 6

55. GOP primary season on twitter: "popular" political sentiment in social media.

Paper Link】 【Pages】:517-526

【Authors】: Yelena Mejova ; Padmini Srinivasan ; Bob Boynton

【Abstract】: As mainstream news media and political campaigns start to pay attention to the political discourse online, a systematic analysis of political speech in social media becomes more critical. What exactly do people say on these sites, and how useful is this data in estimating political popularity? In this study we examine Twitter discussions surrounding seven US Republican politicians who were running for the US Presidential nomination in 2011. We show this largely negative rhetoric to be laced with sarcasm and humor and dominated by a small portion of users. Furthermore, we show that using out-of-the-box classification tools results in a poor performance, and instead develop a highly optimized multi-stage approach designed for general-purpose political sentiment classification. Finally, we compare the change in sentiment detected in our dataset before and after 19 Republican debates, concluding that, at least in this case, the Twitter political chatter is not indicative of national political polls.

【Keywords】: political discourse; sentiment analysis; social media

56. Towards Twitter context summarization with user influence models.

Paper Link】 【Pages】:527-536

【Authors】: Yi Chang ; Xuanhui Wang ; Qiaozhu Mei ; Yan Liu

【Abstract】: Twitter has become one of the most popular platforms for users to share information in real time. However, as an individual tweet is short and lacks sufficient contextual information, users cannot effectively understand or consume information on Twitter, which can either make users less engaged or even detached from using Twitter. In order to provide informative context to a Twitter user, we propose the task of Twitter context summarization, which generates a succinct summary from a large but noisy Twitter context tree. Traditional summarization techniques only consider text information, which is insufficient for Twitter context summarization task, since text information on Twitter is very sparse. Given that there are rich user interactions in Twitter, we thus study how to improve summarization methods by leveraging such signals. In particular, we study how user influence models, which project user interaction information onto a Twitter context tree, can help Twitter context summarization within a supervised learning framework. To evaluate our methods, we construct a data set by asking human editors to manually select the most informative tweets as a summary. Our experimental results based on this editorial data set show that Twitter context summarization is a promising research topic and pairwise user influence signals can significantly improve the task performance.

【Keywords】: summarization; twitter context tree; user influence model

57. Exploiting social relations for sentiment analysis in microblogging.

Paper Link】 【Pages】:537-546

【Authors】: Xia Hu ; Lei Tang ; Jiliang Tang ; Huan Liu

【Abstract】: Microblogging, like Twitter and Sina Weibo, has become a popular platform of human expressions, through which users can easily produce content on breaking news, public events, or products. The massive amount of microblogging data is a useful and timely source that carries mass sentiment and opinions on various topics. Existing sentiment analysis approaches often assume that texts are independent and identically distributed (i.i.d.), usually focusing on building a sophisticated feature space to handle noisy and short texts, without taking advantage of the fact that the microblogs are networked data. Inspired by the social sciences findings that sentiment consistency and emotional contagion are observed in social networks, we investigate whether social relations can help sentiment analysis by proposing a Sociological Approach to handling Noisy and short Texts (SANT) for sentiment classification. In particular, we present a mathematical optimization formulation that incorporates the sentiment consistency and emotional contagion theories into the supervised learning process; and utilize sparse learning to tackle noisy texts in microblogging. An empirical study of two real-world Twitter datasets shows the superior performance of our framework in handling noisy and short tweets.

【Keywords】: microblogging; noisy text; sentiment analysis; sentiment classification; short text; social context; social correlation; twitter

58. Connecting comments and tags: improved modeling of social tagging systems.

Paper Link】 【Pages】:547-556

【Authors】: Dawei Yin ; Shengbo Guo ; Boris Chidlovskii ; Brian D. Davison ; Cédric Archambeau ; Guillaume Bouchard

【Abstract】: Collaborative tagging systems are now deployed extensively to help users share and organize resources. Tag prediction and recommendation can simplify and streamline the user experience, and by modeling user preferences, predictive accuracy can be significantly improved. However, previous methods typically model user behavior based only on a log of prior tags, neglecting other behaviors and information in social tagging systems, e.g., commenting on items and connecting with other users. On the other hand, little is known about the connection and correlations among these behaviors and contexts in social tagging systems. In this paper, we investigate improved modeling for predictive social tagging systems. Our explanatory analyses demonstrate three significant challenges: coupled high order interaction, data sparsity and cold start on items. We tackle these problems by using a generalized latent factor model and fully Bayesian treatment. To evaluate performance, we test on two real-world data sets from Flickr and Bibsonomy. Our experiments on these data sets show that to achieve best predictive performance, it is necessary to employ a fully Bayesian treatment in modeling high order relations in social tagging system. Our methods noticeably outperform state-of-the-art approaches.

【Keywords】: co-factorization; personalized comment recommendation; personalized tag prediction; social tagging; tag recommendation

59. Co-factorization machines: modeling user interests and predicting individual decisions in Twitter.

Paper Link】 【Pages】:557-566

【Authors】: Liangjie Hong ; Aziz S. Doumith ; Brian D. Davison

【Abstract】: Users of popular services like Twitter and Facebook are often simultaneously overwhelmed with the amount of information delivered via their social connections and miss out on much content that they might have liked to see, even though it was distributed outside of their social circle. Both issues serve as difficulties to the users and drawbacks to the services. Social media service providers can benefit from understanding user interests and how they interact with the service, potentially predicting their behaviors in the future. In this paper, we address the problem of simultaneously predicting user decisions and modeling users' interests in social media by analyzing rich information gathered from Twitter. The task differs from conventional recommender systems as the cold-start problem is ubiquitous, and rich features, including textual content, need to be considered. We build predictive models for user decisions in Twitter by proposing Co-Factorization Machines (CoFM), an extension of a state-of-the-art recommendation model, to handle multiple aspects of the dataset at the same time. Additionally, we discuss and compare ranking-based loss functions in the context of recommender systems, providing the first view of how they vary from each other and perform in real tasks. We explore an extensive set of features and conduct experiments on a real-world dataset, concluding that CoFM with ranking-based loss functions is superior to state-of-the-art methods and yields interpretable latent factors.

【Keywords】: collaborative filtering; latent factor models; recommender systems; twitter

60. Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning.

Paper Link】 【Pages】:567-576

【Authors】: Lidong Bing ; Wai Lam ; Tak-Lam Wong

【Abstract】: We develop a new framework to achieve the goal of Wikipedia entity expansion and attribute extraction from the Web. Our framework takes a few existing entities that are automatically collected from a particular Wikipedia category as seed input and explores their attribute infoboxes to obtain clues for the discovery of more entities for this category and the attribute content of the newly discovered entities. One characteristic of our framework is to conduct discovery and extraction from desirable semi-structured data record sets which are automatically collected from the Web. A semi-supervised learning model with Conditional Random Fields is developed to deal with the issues of extraction learning and limited number of labeled examples derived from the seed entities. We make use of a proximate record graph to guide the semi-supervised learning process. The graph captures alignment similarity among data records. Then the semi-supervised learning process can leverage the unlabeled data in the record set by controlling the label regularization under the guidance of the proximate record graph. Extensive experiments on different domains have been conducted to demonstrate its superiority for discovering new entities and extracting attribute content.

【Keywords】: entity expansion;; information extraction; proximate record graph; semi-supervised learning

Mining of web and social data 12

61. Retweet or not?: personalized tweet re-ranking.

Paper Link】 【Pages】:577-586

【Authors】: Wei Feng ; Jianyong Wang

【Abstract】: With Twitter being widely used around the world, users are facing enormous new tweets every day. Tweets are ranked in chronological order regardless of their potential interestedness. Users have to scan through pages of tweets to find useful information. Thus more personalized ranking scheme is needed to filter the overwhelmed information. Since retweet history reveals users' personal preference for tweets, we study how to learn a predictive model to rank the tweets according to their probability of being retweeted. In this way, users can find interesting tweets in a short time. To model the retweet behavior, we build a graph made up of three types of nodes: users, publishers and tweets. To incorporate all sources of information like users' profile, tweet quality, interaction history, etc, nodes and edges are represented by feature vectors. All these feature vectors are mapped to node weights and edge weights. Based on the graph, we propose a feature-aware factorization model to re-rank the tweets, which unifies the linear discriminative model and the low-rank factorization model seamlessly. Finally, we conducted extensive experiments on a real dataset crawled from Twitter. Experimental results show the effectiveness of our model.

【Keywords】: recommender system; social media

62. Overlapping community detection at scale: a nonnegative matrix factorization approach.

Paper Link】 【Pages】:587-596

【Authors】: Jaewon Yang ; Jure Leskovec

【Abstract】: Network communities represent basic structures for understanding the organization of real-world networks. A community (also referred to as a module or a cluster) is typically thought of as a group of nodes with more connections amongst its members than between its members and the remainder of the network. Communities in networks also overlap as nodes belong to multiple clusters at once. Due to the difficulties in evaluating the detected communities and the lack of scalable algorithms, the task of overlapping community detection in large networks largely remains an open problem. In this paper we present BIGCLAM (Cluster Affiliation Model for Big Networks), an overlapping community detection method that scales to large networks of millions of nodes and edges. We build on a novel observation that overlaps between communities are densely connected. This is in sharp contrast with present community detection methods which implicitly assume that overlaps between communities are sparsely connected and thus cannot properly extract overlapping communities in networks. In this paper, we develop a model-based community detection algorithm that can detect densely overlapping, hierarchically nested as well as non-overlapping communities in massive networks. We evaluate our algorithm on 6 large social, collaboration and information networks with ground-truth community information. Experiments show state of the art performance both in terms of the quality of detected communities as well as in speed and scalability of our algorithm.

【Keywords】: matrix factorization; network communities; overlapping community detection

63. LaFT-tree: perceiving the expansion trace of one's circle of friends in online social networks.

Paper Link】 【Pages】:597-606

【Authors】: Jun Zhang ; Chaokun Wang ; Jianmin Wang ; Philip S. Yu

【Abstract】: Many patterns have been discovered to explain and analyze how people make friends. Among them is the triadic closure, supported by the principle of the transitivity of friendship, which means for an individual the friends of her friend are more likely to become her new friends. However, people's motivations under this principle haven't been well studied, and it's still unknown that how this principle works in diverse situations. In this paper, we try to study this principle deeply based on the behavior modeling. We study how one expands her egocentric network via her friends, also called intermediaries, based on the transitivity of friendship. We propose LaFT-Tree, a tree-based representation of friendship formation inspired from triadic closure. LaFT-Tree provides a hierarchical view of the flat structure of one's egocentric network by visualizing the expansion trace of one's egocentric network. We model people's friend-making behaviors using LaFT-LDA, a generative model for LaFT-Tree learning. The proposed model is evaluated on both synthetic and real-world social networks and experimental results demonstrate the effectiveness of LaFT-LDA for LaFT-Tree inference. We also present some interesting applications of the LaFT-Tree, showing that our model can be generalized and benefit other social network analysis tasks.

【Keywords】: intermediary inference; laft-tree; link formation; link prediction; social networks; transitivity of friendship; triadic closure

64. A peek into the future: predicting the evolution of popularity in user generated content.

Paper Link】 【Pages】:607-616

【Authors】: Mohamed Ahmed ; Stella Spagna ; Felipe Huici ; Saverio Niccolini

【Abstract】: Content popularity prediction finds application in many areas, including media advertising, content caching, movie revenue estimation, traffic management and macro-economic trends forecasting, to name a few. However, predicting this popularity is difficult due to, among others, the effects of external phenomena, the influence of context such as locality and relevance to users,and the difficulty of forecasting information cascades. In this paper we identify patterns of temporal evolution that are generalisable to distinct types of data, and show that we can (1) accurately classify content based on the evolution of its popularity over time and (2) predict the value of the content's future popularity. We verify the generality of our method by testing it on YouTube, Digg and Vimeo data sets and find our results to outperform the K-Means baseline when classifying the behaviour of content and the linear regression baseline when predicting its popularity.

【Keywords】: social media; time series clustering

65. Online community detection in social sensing.

Paper Link】 【Pages】:617-626

【Authors】: Guo-Jun Qi ; Charu C. Aggarwal ; Thomas S. Huang

【Abstract】: The proliferation of location and GPS data streams which are collected in a wide variety of participatory sensing applications has created numerous possibilities for analysis of the underlying patterns of activity. Typically, the spatio-temporal patterns arising from such activity can be analyzed in order to determine the latent community structure in the underlying data. In this paper, we will examine the problem of online community detection from the location data collected from such social sensing applications in real time. Such data brings numerous challenges associated with it, in that they can be of a relatively large scale, and can be extremely noisy from the perspective of both data representation and analysis. Furthermore, the community structure in the underlying data cannot be directly inferred from the shape of the underlying trajectories, since a considerable amount of variation may exist in terms of trajectories of individuals belonging to the same community. In this paper, we will design online algorithms for community detection in social sensing applications. Our algorithm uses a robust and efficiently updateable model with the use of Gibbs sampling, and we will show its effectiveness and efficiency for social sensing applications.

【Keywords】: social sensing

66. Distinguishing topical and social groups based on common identity and bond theory.

Paper Link】 【Pages】:627-636

【Authors】: Przemyslaw A. Grabowicz ; Luca Maria Aiello ; Víctor M. Eguíluz ; Alejandro Jaimes

【Abstract】: Social groups play a crucial role in social media platforms because they form the basis for user participation and engagement. Groups are created explicitly by members of the community, but also form organically as members interact. Due to their importance, they have been studied widely (e.g., community detection, evolution, activity, etc.). One of the key questions for understanding how such groups evolve is whether there are different types of groups and how they differ. In Sociology, theories have been proposed to help explain how such groups form. In particular, the common identity and common bond theory states that people join groups based on identity (i.e., interest in the topics discussed) or bond attachment (i.e., social relationships). The theory has been applied qualitatively to small groups to classify them as either topical or social. We use the identity and bond theory to define a set of features to classify groups into those two categories. Using a dataset from Flickr, we extract user-defined groups and automatically-detected groups, obtained from a community detection algorithm. We discuss the process of manual labeling of groups into social or topical and present results of predicting the group label based on the defined features. We directly validate the predictions of the theory showing that the metrics are able to forecast the group type with high accuracy. In addition, we present a comparison between declared and detected groups along topicality and sociality dimensions.

【Keywords】: bond theory; flickr; groups; identity theory; social media

67. Modeling the impact of lifestyle on health at scale.

Paper Link】 【Pages】:637-646

【Authors】: Adam Sadilek ; Henry A. Kautz

【Abstract】: Research in computational epidemiology to date has concentrated on estimating summary statistics of populations and simulated scenarios of disease outbreaks. Detailed studies have been limited to small domains, as scaling the methods involved poses considerable challenges. By contrast, we model the associations of a large collection of social and environmental factors with the health of particular individuals. Instead of relying on surveys, we apply scalable machine learning techniques to noisy data mined from online social media and infer the health state of any given person in an automated way. We show that the learned patterns can be subsequently leveraged in descriptive as well as predictive fine-grained models of human health. Using a unified statistical model, we quantify the impact of social status, exposure to pollution, interpersonal interactions, and other important lifestyle factors on one's health. Our model explains more than 54% of the variance in people's health (as estimated from their online communication), and predicts the future health status of individuals with 91% accuracy. Our methods complement traditional studies in life sciences, as they enable us to perform large-scale and timely measurement, inference, and prediction of previously elusive factors that affect our everyday lives.

【Keywords】: computational epidemiology; geo-temporal modeling; machine learning; online social networks; ubiquitous computing

68. Collective inference for network data with copula latent markov networks.

Paper Link】 【Pages】:647-656

【Authors】: Rongjing Xiang ; Jennifer Neville

【Abstract】: The popularity of online social networks and social media has increased the amount of linked data available in Web domains. Relational and Gaussian Markov networks have both been applied successfully for classification in these relational settings. However, since Gaussian Markov networks model joint distributions over continuous label space, it is difficult to use them to reason about uncertainty in discrete labels. On the other hand, relational Markov networks model probability distributions over discrete label space, but since they condition on the graph structure, the marginal probability for an instance will vary based on the structure of the subnetwork observed around the instance. This implies that the marginals will not be identical across instances and can sometimes result in poor prediction performance. In this work, we propose a novel latent relational model based on copulas which allows use to make predictions in a discrete label space while ensuring identical marginals and at the same time incorporating some desirable properties of modeling relational dependencies in a continuous space. While copulas have recently been used for descriptive modeling, they have not been used for collective classification in large scale network data and the associated conditional inference problem has not been considered before. We develop an approximate inference algorithm, and demonstrate empirically that our proposed Copula Latent Markov Network models based on approximate inference outperform a number of competing relational classification models over a range of real-world relational classification tasks.

【Keywords】: collective inference; relational learning

69. Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships.

Paper Link】 【Pages】:657-666

【Authors】: Yanhua Li ; Wei Chen ; Yajun Wang ; Zhi-Li Zhang

【Abstract】: Influence diffusion and influence maximization in large-scale online social networks (OSNs) have been extensively studied because of their impacts on enabling effective online viral marketing. Existing studies focus on social networks with only friendship relations, whereas the foe or enemy relations that commonly exist in many OSNs, e.g., Epinions and Slashdot, are completely ignored. In this paper, we make the first attempt to investigate the influence diffusion and influence maximization in OSNs with both friend and foe relations, which are modeled using positive and negative edges on signed networks. In particular, we extend the classic voter model to signed networks and analyze the dynamics of influence diffusion of two opposite opinions. We first provide systematic characterization of both short-term and long-term dynamics of influence diffusion in this model, and illustrate that the steady state behaviors of the dynamics depend on three types of graph structures, which we refer to as balanced graphs, anti-balanced graphs, and strictly unbalanced graphs. We then apply our results to solve the influence maximization problem and develop efficient algorithms to select initial seeds of one opinion that maximize either its short-term influence coverage or long-term steady state influence coverage. Extensive simulation results on both synthetic and real-world networks, such as Epinions and Slashdot, confirm our theoretical analysis on influence diffusion dynamics, and demonstrate that our influence maximization algorithms perform consistently better than other heuristic algorithms.

【Keywords】: influence maximization; signed social networks; voter model

70. Modeling dynamic behavior in large evolving graphs.

Paper Link】 【Pages】:667-676

【Authors】: Ryan A. Rossi ; Brian Gallagher ; Jennifer Neville ; Keith Henderson

【Abstract】: Given a large time-evolving graph, how can we model and characterize the temporal behaviors of individual nodes (and network states)? How can we model the behavioral transition patterns of nodes? We propose a temporal behavior model that captures the "roles" of nodes in the graph and how they evolve over time. The proposed dynamic behavioral mixed-membership model (DBMM) is scalable, fully automatic (no user-defined parameters), non-parametric/data-driven (no specific functional form or parameterization), interpretable (identifies explainable patterns), and flexible (applicable to dynamic and streaming networks). Moreover, the interpretable behavioral roles are generalizable and computationally efficient. We applied our model for (a) identifying patterns and trends of nodes and network states based on the temporal behavior, (b) predicting future structural changes, and (c) detecting unusual temporal behavior transitions. The experiments demonstrate the scalability, flexibility, and effectiveness of our model for identifying interesting patterns, detecting unusual structural transitions, and predicting the future structural changes of the network and individual nodes.

【Keywords】: dynamic mixed-membership models; dynamic network analysis; dynamic roles; graph mining; scalable models

71. On the streaming complexity of computing local clustering coefficients.

Paper Link】 【Pages】:677-686

【Authors】: Konstantin Kutzkov ; Rasmus Pagh

【Abstract】: Due to a large number of applications, the problem of estimating the number of triangles in graphs revealed as a stream of edges, and the closely related problem of estimating the graph's clustering coefficient, have received considerable attention in the last decade. Both efficient algorithms and impossibility results have shed light on the computational complexity of the problem. Motivated by applications in Web mining, Becchetti et al.~presented new algorithms for the estimation of the local number of triangles, i.e., the number of triangles incident to individual vertices. The algorithms are shown, both theoretically and experimentally, to efficiently handle the problem. However, at least two passes over the data are needed and thus the algorithms are not suitable for real streaming scenarios. In the present work, we consider the problem of estimating the clustering coefficient of individual vertices in a graph over n vertices revealed as a stream of m edges. As a first result we show that any one pass randomized streaming algorithm that can distinguish a graph with no triangles from a graph having a vertex of degree d with clustering coefficient > 1/2 must use Ω(m/d) bits of space in expectation. Our second result is a new randomized one pass algorithm estimating the local clustering coefficient of each vertex with degree at least d. The space requirement of our algorithm is within a logarithmic factor of the lower bound, thus our approach is close to optimal. We also extend the algorithm to local triangle counting and report experimental results on its performance on real-life graphs.

【Keywords】: graph algorithms; local clustering coefficient; streaming; triangle counting

72. Learning query and document similarities from click-through bipartite graph with metadata.

Paper Link】 【Pages】:687-696

【Authors】: Wei Wu ; Hang Li ; Jun Xu

【Abstract】: We consider learning query and document similarities from a click-through bipartite graph with metadata on the nodes. The metadata contains multiple types of features of queries and documents. We aim to leverage both the click-through bipartite graph and the features to learn query-document, document-document, and query-query similarities. The challenges include how to model and learn the similarity functions based on the graph data. We propose solving the problems in a principled way. Specifically, we use two different linear mappings to project the queries and documents in two different feature spaces into the same latent space, and take the dot product in the latent space as their similarity. Query-query and document-document similarities can also be naturally defined as dot products in the latent space. We formalize the learning of similarity functions as learning of the mappings that maximize the similarities of the observed query-document pairs on the enriched click-through bipartite graph. When queries and documents have multiple types of features, the similarity function is defined as a linear combination of multiple similarity functions, each based on one type of features. We further solve the learning problem by using a new technique called Multi-view Partial Least Squares (M-PLS). The advantages include the global optimum which can be obtained through Singular Value Decomposition (SVD) and the capability of finding high quality similar queries. We conducted large scale experiments on enterprise search data and web search data. The experimental results on relevance ranking and similar query finding demonstrate that the proposed method works significantly better than the baseline methods.

【Keywords】: click-through; multi-view partial least squares; similarity learning

Paper Link】 【Pages】:697-706

【Authors】: Chinmay Karande ; Aranyak Mehta ; Ramakrishnan Srikant

【Abstract】: Search engine ad auctions typically have a significant fraction of advertisers who are budget constrained, i.e., if allowed to participate in every auction that they bid on, they would spend more than their budget. This yields an important problem: selecting the ad auctions which these advertisers participate, in order to optimize different system objectives such as the return on investment for advertisers, and the quality of ads shown to users. We present a system and algorithms for optimizing budget constrained spend. The system is designed be deployed in a large search engine, with hundreds of thousands of advertisers, millions of searches per hour, and with the query stream being only partially predictable. We have validated the system design by implementing it in the Google ads serving system and running experiments on live traffic. We have also compared our algorithm to previous work that casts this problem as a large linear programming problem limited to popular queries, and show that our algorithms yield substantially better results.

【Keywords】: budget optimization; online advertising

Paper Link】 【Pages】:707-716

【Authors】: Yu Wang ; Xiao Huang ; Ryen W. White

【Abstract】: Web searchers frequently transition from desktop computers and laptops to mobile devices, and vice versa. Little is known about the nature of cross-device search tasks, yet they represent an important opportunity for search engines to help their users, especially those on the target (post-switch) device. For example, the search engine could save the current session and re-instate it post switch, or it could capitalize on down-time between devices to proactively re-trieve content on behalf of the searcher. In this paper, we present a log-based study to define and characterize cross-device search be-havior and predict the resumption of cross-device tasks. Using data from a large commercial search engine, we show that there are dis-cernible and noteworthy patterns of search behavior associated with device transitions. We also develop learned models for predicting task resumption on the target device using behavioral, topical, geo-spatial, and temporal features. Our findings show that our models can attain strong prediction accuracy and have direct implications for the development of tools to help people search more effectively in a multi-device world.

【Keywords】: cross-device search; personalization; search tasks; slow search

75. Learning to rank for spatiotemporal search.

Paper Link】 【Pages】:717-726

【Authors】: Blake Shaw ; Jon Shea ; Siddhartha Sinha ; Andrew Hogue

【Abstract】: In this article we consider the problem of mapping a noisy estimate of a user's current location to a semantically meaningful point of interest, such as a home, restaurant, or store. Despite the poor accuracy of GPS on current mobile devices and the relatively high density of places in urban areas, it is possible to predict a user's location with considerable precision by explicitly modeling both places and users and by combining a variety of signals about a user's current context. Places are often simply modeled as a single latitude and longitude when in fact they are complex entities existing in both space and time and shaped by the millions of people that interact with them. Similarly, models of users reveal complex but predictable patterns of mobility that can be exploited for this task. We propose a novel spatial search algorithm that infers a user's location by combining aggregate signals mined from billions of foursquare check-ins with real-time contextual information. We evaluate a variety of techniques and demonstrate that machine learning algorithms for ranking and spatiotemporal models of places and users offer significant improvement over common methods for location search based on distance and popularity.

【Keywords】: data mining; geocoding; foursquare; human mobility; information retrieval; learn to rank; spatiotemporal models; location data; machine learning; mobile devices; spatial search

76. Maguro, a system for indexing and searching over very large text collections.

Paper Link】 【Pages】:727-736

【Authors】: Knut Magne Risvik ; Trishul M. Chilimbi ; Henry Tan ; Karthik Kalyanaraman ; Chris Anderson

【Abstract】: Maguro is a system for efficiently searching very large collections of text content of up to 1 trillion documents at low cost. Search engines span across content that is very dynamic and highly augmented with metadata to the tail content of the web. A long tail distribution of content calls for different trade-offs in the design space for good efficiency across the entire index range. Maguro is designed for the long tail of content with less dynamics and less metadata, but very good cost efficiency. Maguro is part of the serving stack in Bing and allows us to scale the index significantly better.

【Keywords】: index serving; scalability

Doctoral consortium 6

Paper Link】 【Pages】:737-740

【Authors】: Riccardo Colini-Baldeschi

【Abstract】: Sponsored search auctions are used to allocate ad slots to advertisers. The standard mechanism for sponsored search auctions is the Generalized-Second-Price (GSP) auction. Even if GSP seems to be established, a lot of open problems remain in the area and many significant researches have been done in the recent years. My research proposal is focusing in some specific aspects of the sponsored search auctions like revenue maximization and the design of mechanisms that obtain some form of social efficiency. In this paper we start from a brief history of the sponsored search auctions and then we present the formal model. In the last two sections I will introduce the obtained results and some open problems.

【Keywords】: algorithmic game theory; auctions; mechanism design

Paper Link】 【Pages】:741-746

【Authors】: Flavio Figueiredo

【Abstract】: User generated content (UGC) has emerged as the predominant form of media publishing on the Web 2.0. Motivated by the large adoption of video sharing on the Web 2.0, the objective of our work is to understand and predict popularity trends (e.g, will a video be viral?) and hits (e.g, how may views will a video receive?) of user generated videos. Such knowledge is paramount to the effective design of various services including content distribution and advertising. Thus, in this paper we formalize the problem of predicting trends and hits in user generated videos. Also, we describe our research methodology on approaching this problem. To the best of knowledge, our work is novel in focusing on the problem of predicting popularity trends complementary to hits. Moreover, we intend on evaluating efficacy of our results not only based on common statistical error metrics, but also on the possible online advertising revenues our predictions can generate. After describing our proposal, we here summarize our latest findings regarding (1) uncovering common popularity trends; (2) measuring associations between UGC features and popularity trends; and (3) assessing the effectiveness of models for predicting popularity trends.

【Keywords】: popularity; trends; ugc; video

79. On the quest of discovering cultural trails in social media.

Paper Link】 【Pages】:747-752

【Authors】: Ruth Olimpia Garcia Gavilanes

【Abstract】: With the constant increasing reach of the Web and in particular of Social Media, people create and share content that harbors information about habits, norms, preferences and values. Consequently, studying how culture influences users in online social media has increased the interest of several sectors such as the advertising industry, search engines and corporations. As a consequence, anthropological and computational models need to interact and complement each other to better target these new demands. Recently, several studies have analyzed culture from large-scale data but not many took into consideration the cultural models proposed by anthropological theory. By carrying out several experiments on large-scale data from the Web, we propose to combine theoretical concepts of culture with information technology techniques to process, analyze, model and interpret data from the Web. We plan to discover synergies between traditional social studies of culture and those derived from our experiments.

【Keywords】: cross-cultural differences; sentiment analysis; social media; social networks

80. Towards web-scale structured web data extraction.

Paper Link】 【Pages】:753-758

【Authors】: Tomas Grigalis

【Abstract】: In this paper we present an ongoing PhD research on unsupervised and domain-independent structured data extraction from the Web. We propose a novel method to extract structured data records from template-generated Web pages. The method is based on clustering visually similar Web page elements by exploiting their visual formatting and HTML structural features. Tag paths of clustered Web page elements are then employed to derive extraction rules. These rules, called wrappers, can be later reused on thousands of same template-generated Web pages. This opens the possibility for the proposed method to be deployed in Web-Scale structured data extraction systems.

【Keywords】: structured data; web data extraction; wrapper induction

81. Building user profiles to improve user experience in recommender systems.

Paper Link】 【Pages】:759-764

【Authors】: Anísio Lacerda ; Nivio Ziviani

【Abstract】: Recommender systems are quickly becoming ubiquitous in many Web applications, including e-commerce, social media channels, content providers, among others. These systems act as an enabling mechanism designed to overcome the information overload problem by improving browsing and consumption experience. Crucial to the performance of a recommender system is the accuracy of the user profiles used to represent the interests of the users. In this proposal, we analyze three different aspects of user profiling: (i) selecting the most informative events from the interaction between users and the system, (ii) combining different recommendation algorithms to (iii) including trust-aware information in user profiles to improve the accuracy of recommender systems.

【Keywords】: diversity; novelty; recommender systems

82. Web usage mining for enhancing search-result delivery and helping users to find interesting web content.

Paper Link】 【Pages】:765-770

【Authors】: Ida Mele

【Abstract】: Web usage mining is the application of data mining techniques to the data generated by the interactions of users with web servers. This kind of data, stored in server logs, represents a valuable source of information, which can be exploited to optimize the document-retrieval task, or to better understand, and thus, satisfy user needs. Our research focuses on two important issues: improving search-engine performance through static caching of search results, and helping users to find interesting web pages by recommending news articles and blog posts. Concerning the static caching of search results, we present the query covering approach. The general idea is to populate the cache with those documents that contribute to the result pages of a large number of queries, as opposed to caching the top documents of most frequent queries. For the recommendation of web pages, we present a graph-based approach, which leverages the user-browsing logs to identify early adopters. These users discover interesting content before others, and monitoring their activity we can find web pages to recommend.

【Keywords】: caching; recommendation; web usage mining

Abstracts, workshops, etc. 13

83. Advanced graph mining for community evaluation in social networks and the web.

Paper Link】 【Pages】:771-772

【Authors】: Christos Giatsidis ; Fragkiskos D. Malliaros ; Michalis Vazirgiannis

【Abstract】: Graphs constitute a dominant data structure and appear essentially in all forms of information. Examples are the Web graph, numerous social networks, protein interaction networks, terms dependency graphs and network topologies. The main features of these graphs are their huge volume and rate of change. Presumably, there is important hidden knowledge in the macroscopic topology and features of these graphs. A cornerstone issue here is the detection and evaluation of communities -- bearing multiple and diverse semantics. The tutorial reports the basic models of graph structures for undirected, directed and signed graphs and their properties. Next we offer a thorough review of fundamental methods for graph clustering and community detection, on both undirected and directed graphs. Then we survey community evaluation measures, including both the individual node based ones as well as those that take into account aggregate properties of communities. A special mention is made on approaches that capitalize on the concept of degeneracy (k-cores and extensions), as a novel means of community detection and evaluation. We justify the above foundational framework with applications on citation graphs, trust networks and protein graphs.

【Keywords】: community detection; community structure; graph mining; social network analysis

84. Anomaly, event, and fraud detection in large network datasets.

Paper Link】 【Pages】:773-774

【Authors】: Leman Akoglu ; Christos Faloutsos

【Abstract】: Detecting anomalies and events in data is a vital task, with numerous applications in security, finance, health care, law enforcement, and many others. While many techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, novel technology has been developed for abnormality detection in graph data. The goal of this tutorial is to provide a general, comprehensive overview of the state-of-the-art methods for anomaly, event, and fraud detection in data represented as graphs. As a key contribution, we provide a thorough exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms, categorized under various settings: unsupervised vs.(semi-)supervised, for static vs. dynamic data. We focus on the scalability and effectiveness aspects of the methods, and highlight results on crucial real-world applications, including accounting fraud and opinion spam detection.

【Keywords】: anomaly detection; event detection; fraud; graph mining

85. Models and algorithms for social influence analysis.

Paper Link】 【Pages】:775-776

【Authors】: Jimeng Sun ; Jie Tang

【Abstract】: Social influence is the behavioral change of a person because of the perceived relationship with other people, organizations and society in general. Social influence has been a widely accepted phenomenon in social networks for decades. Many applications have been built based around the implicit notation of social influence between people, such as marketing, advertisement and recommendations. With the exponential growth of online social network services such as Facebook and Twitter, social influence can for the first time be measured over a large population. In this tutorial, we survey the research on social influence analysis with a focus on the computational aspects. First, we introduce how to verify the existence of social influence in various social networks. Second, we present computational models for quantifying social influence. Third, we describe how social influence can help real applications. In particular, we will focus on opinion leader finding and influence maximization for viral marketing. Finally, we apply the selected algorithms of social influence analysis on different social network data, such as twitter, arnetminer data, weibo, and slashdot forum.

【Keywords】: influence maximization; social influence; social network

86. Data-driven political science.

Paper Link】 【Pages】:777-778

【Authors】: Ingmar Weber ; Ana-Maria Popescu ; Marco Pennacchiotti

【Abstract】: The tutorial will summarize the state-of-the art in the growing area of computational political science. Like many others, this research domain is being revolutionized by the availability of open, big data and the increasing reach and importance of social media. The surging interest on the part of the academic community is matched by intense efforts on the part of political campaigns to use online data in order to learn how to best disseminate information and reach the right potential donors or voters. In this context, a tutorial can summarize existing methods in a fascinating, high-interest area and allow participants with diverse backgrounds to get inspiration from the methods and problems studied. The tutorial will feature seminal research concerning (i) political polarization, (ii) election prediction and polling, and (iii) political campaigning and influence propagation. The goal is not only to familiarize attendees with ideas from related conferences such as WWW, ICWSM or CIKM, but also to present ideas and quantitative methods closer to political science such as Poole's and Rosenthal's NOMINATE score for a politician's political orientation.

【Keywords】: big data; election prediction; political science

87. Exploring structure and content on the web: extraction and integration of the semi-structured web.

Paper Link】 【Pages】:779-780

【Authors】: Tim Weninger ; Jiawei Han

【Abstract】: In this tutorial we view the World Wide Web as a type of massive, decentralized database. At present, this "Web database" is presented in a manner largely devoid of any consistent meaning or schema. That is not to say that Web-data lacks an underlying organization; in fact, most Web content is generated from an underlying schema-bound, or otherwise structured database. Information extraction is generally concerned with the reconciliation of unstructured or semi-structured Web content with the neatly structured database paradigm. With this Web-database in hand, researchers and practitioners have recently begun developing mechanisms which return structured results in response to an unstructured query. These new developments are a product of (1) record, list and table extraction from large numbers of semi-structured Web pages, (2) integration of these disparate extraction results into a consistent form, and (3) analysis of the newly extracted and integrated Web data. Among the many fruits of this line of work is the ability for semi-structured Web data to enhance the search capabilities of a schema-bound database. Alternatively, structured database records have also been used to augment Web page collections typically used by Web search engines. We will cover several key technologies, and principles explored so far in the area of Web information extraction, search and exploration.

【Keywords】: information extraction; information integration; semi-structured data

88. Temporal web dynamics and its application to information retrieval.

Paper Link】 【Pages】:781-782

【Authors】: Kira Radinsky ; Fernando Diaz ; Susan T. Dumais ; Milad Shokouhi ; Anlei Dong ; Yi Chang

【Abstract】: The World Wide Web is highly dynamic and is constantly evolving to cover the latest information about the physical and social updates in the world. At the same time, the changes in web contents are entangled with new information needs and time-sensitive user interactions with information sources. To address these temporal information needs effectively, it is essential for the search engines to model web dynamics and understand the changes in user behavior over time that are caused by them. In this full-day tutorial, we focus on modeling time-sensitive content on the web, and discuss the state-of-the-art approaches for integrating temporal signals in web search. We address many of the related research topics including those associated with searching dynamic collections, defining time-sensitive relevance, understanding user query behavior over time, and investigating the mains reasons behind content changes. We also cover algorithms, architectures, evaluation methodologies and metrics for time-aware search, and discuss the latest breakthroughs and open challenges, both from the algorithmic and the architectural perspectives.

【Keywords】: web dynamics

89. (big) usage data in web search.

Paper Link】 【Pages】:783-784

【Authors】: Ricardo A. Baeza-Yates ; Yoelle Maarek

【Abstract】: Web Search, which takes its root in the mature field of information retrieval, evolved tremendously over the last 15 years. The field encountered its first revolution when it started to deal with huge amounts of Web pages. Then, a major step was accomplished when engines started to consider the structure of the Web graph and leveraged link analysis in both crawling and ranking. Finally, a more discrete, but no less critical step, was made when search engines started to monitor and exploit the numerous (mostly implicit) signals provided by users while interacting with the search engine. In this tutorial we focus on this "revolution" of large scale usage data. In the first part of this tutorial, we focus on usage data, which typically refers to any type of information provided by the user while interacting with the search engine. It comes first under its raw form as a set of individual signals, but is typically mined after multiple signals have been aggregated and linked to the same interaction event. The two major types of such data are (1) query streams, which include the query string that the user issued, together with the time-stamp of the query, a user identifier, possibly the IP of the machine on which the browser runs, and (2) click data, which include the reference to the element the user clicked on the page together with the timestamp, user identifier, possibly IP, the rank of the link if it is a result, etc. Exploiting usage data under its multiple forms brought an unprecedented wealth of implicit information to Web Search. We discuss in the second part of this tutorial some of the key Web search applications that it made possible. One such example is the query spelling correction feature embodied now in all search engines. In fact, after years of very sophisticated spell checking research, simply counting similar queries at a small edit distance would in most cases surface the most popular spelling as the correct one, a beautiful and simple demonstration of the wisdom of crowds principle.

【Keywords】: user interaction; web retrieval

90. Econometric analysis and digital marketing: how tomeasure the effectiveness of an ad.

Paper Link】 【Pages】:785-786

【Authors】: Ayman Farahat ; James Shanahan

【Abstract】: Over the past 18 years online advertising has grown to a $70 billion industry worldwide annually. Despite this impressive growth, online advertising faces many (and some would say traditional) challenges including how to measure the efficiency or the potential loss of sales caused by the inefficient use of advertising dollars. Consequently, it is vital to measure, maximize, and benchmark the efficiency of advertising media expenditures. This tutorial introduces the field of econometrics as a means of measuring the effectiveness of digital marketing. Econometrics is a field that extends and applies statistical methods to the analysis of economic phenomena. In that vein, econometrics goes beyond traditional statistics and explicitly recognizes the complexities of human behavior. Consider for example the impact of deep discounts on survival of restaurants. Struggling businesses are more likely to offer these deep discounts and eventually fail. A naïve application of statistical techniques will overestimate the impact of deep discounts on business survival. In this case, the discounts are an endogenous variable as compared to an exogenous variable. This type of specification error highlights why we need a deeper look at the variables that go into statistical models. Econometrics addresses these and other issues in a formal and rigorous manner.

【Keywords】: advertising; big data; econometrics; marketing

Paper Link】 【Pages】:787-788

【Authors】: Pavel Serdyukov ; Georges Dupret ; Nick Craswell

【Abstract】: WSCD 2013 is the third workshop on Web Search Click Data, following WSCD 2009 and WSCD 2012. It is a forum for new research relating to Web search usage logs and for discussing desirable properties of publicly released search log datasets. Research relating to search logs has been hampered by the limited availability of click datasets. This workshop comes with a new dataset based on logged user search behaviour and an accompanying challenge to predict switches between search engines within a given search session.

【Keywords】: query logs; search engine switching

92. 3rd workshop on context-awareness in retrieval and recommendation.

Paper Link】 【Pages】:789-790

【Authors】: Matthias Böhmer ; Ernesto William De Luca ; Alan Said ; Jaime Teevan

【Abstract】: Context-aware information is widely available in various ways and is becoming more and more important for enhancing retrieval performance and recommendation results. The current main issue to cope with is not only recommending or retrieving the most relevant items and content, but defining them ad hoc. Other relevant issues include personalizing and adapting the information and the way it is displayed to the user's current situation and interests. Ubiquitous computing further provides new means for capturing user feedback on items and providing information.

【Keywords】: context-awareness; information retrieval; recommender systems

93. Workshop on large-scale and distributed systems for information retrieval (LSDS-IR 2013).

Paper Link】 【Pages】:791-792

【Authors】: Nicola Tonellotto ; Craig Macdonald ; Ismail Sengör Altingövde

【Abstract】: The LSDS-IR'13 workshop aims to bring together both information retrieval practitioners from industry, as well as academic researchers concerned with efficient and distributed IR systems. The workshop also welcomes contributions that propose different ways of leveraging diversity and multiplicity of resources available in distributed systems. The main goal of the workshop is to attract people from industry and academia to present and discuss ideas, problems and results in efficiency of large scale and distributed information retrieval systems, and to foster their participation to the WSDM conference.

【Keywords】: distribscalability; distributed search; efficiency; large-scale systems

94. Workshop on semantic personalized information management (SPIM'13).

Paper Link】 【Pages】:793-794

【Authors】: Till Plumbaum ; Ernesto William De Luca ; Aldo Gangemi ; Michael Hausenblas

【Abstract】: The SPIM workshop focuses especially on people that are working on the social or semantic Web, machine learning, user modeling, recommender systems, information retrieval, semantic interaction, or their combination. The goal is to bring together researchers and practitioners to initiating discussions on the different requirements and challenges coming with the social and semantic Web for personalized information retrieval systems. The workshop aims at improving the exchange of ideas between the different research communities and practitioners involved in the research on semantic personalized information management.

【Keywords】: information management; information retrieval; ontology; personalization; semantic web

Paper Link】 【Pages】:795-796

【Authors】: Vanessa Murdock ; Charles L. A. Clarke ; Jaap Kamps ; Jussi Karlgren

【Abstract】: Adult content is pervasive on the Web, has been a driving factor in the adoption of the Internet medium. It is responsible for a significant fraction of traffic and revenues, yet rarely attracts attention in research. We propose that the research questions surrounding adult content access behaviors are unique, and we believe interesting and valuable research in this area can be done ethically. The workshop on Search and Exploration of X-Rated Information (SEXI) addresses these issues for information access tasks related to adult content.

【Keywords】: adult content