4. WSDM 2011: Hong Kong, China

Proceedings of the Fourth International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12, 2011. ACM 【DBLP Link】

Paper Num: 91 || Session Num: 17

Tutorial AM1 and AM2 2

1. Crowdsourcing 101: putting the WSDM of crowds to work for you.

【Paper Link】 【Pages】:1-2

【Authors】: Omar Alonso ; Matthew Lease

【Abstract】: Crowdsourcing has emerged in recent years as an exciting new avenue for leveraging the tremendous potential and resources of today's digitally-connected, diverse, distributed workforce. Generally speaking, crowdsourcing describes outsourcing of tasks to a large group of people instead of assigning such tasks to an in-house employee or contractor. Crowdsourcing platforms such as Amazon Mechanical Turk and CrowdFlower have gained particular attention as active online marketplaces for reaching and tapping into this vast and still largely under-utilized workforce. Crowdsourcing offers intriguing new opportunities for accomplishing different kinds of tasks or achieving broader participation than previously possible, as well as completing standard tasks more accurately in less time and at lower cost. Unlocking the potential of crowdsourcing in practice, however, requires a tripartite understanding of principles, platforms, and best practices. This tutorial will introduce the opportunities and challenges of crowdsourcing while discussing these three issues. This will provide attendees with a basic foundation to begin applying crowdsourcing in the context of their own particular tasks.

【Keywords】: crowdsourcing; human computation

2. Introduction to display advertising: a half-day tutorial.

【Paper Link】 【Pages】:3-4

【Authors】: Andrei Z. Broder ; Vanja Josifovski ; Jayavel Shanmugasundaram

【Abstract】: Display advertising is one of the two major advertising channels on the web (in addition to search advertising). Display advertising on the Web is usually done by graphical ads placed on the publishers' Web pages. There is no explicit user query, and the ad selection is performed based on the page where the ad is placed (contextual targeting) or user's past activities (behavioral targeting). In both cases, sophisticated text analysis and learning algorithms are needed to provide relevant ads to the user. In this tutorial we will overview the display advertising marketplace, and technologies that power the display advertising platforms.

【Keywords】: computational advertising; display advertising; guaranteed delivery; non-guaranteed delivery; targeting

Workshop 1 and 2 2

3. Crowdsourcing for search and data mining (CSDM 2011).

【Paper Link】 【Pages】:5-6

【Authors】: Vitor R. Carvalho ; Matthew Lease ; Emine Yilmaz

【Abstract】: The advent of crowdsourcing is revolutionizing data annotation, evaluation, and other traditionally labor-intensive manual processes by dramatically reducing the time, cost, and effort involved. This in turn is driving a disruptive shift in search and data mining methodology in several areas. Evaluation: the Cranfield paradigm for search evaluation requires manually assessing document relevance to search queries; recent work on stochastic evaluation has reduced but not removed this need for manual assessment. Supervised learning: while traditional costs associated with data annotation have driven recent machine learning work (e.g., learning to rank) toward greater use of unsupervised and semi-supervised methods, the emergence of crowdsourcing has made labeled data far easier to acquire, thereby driving a potential resurgence in fully-supervised methods. Applications: crowdsourcing has introduced exciting new opportunities to integrate human labor into automated systems: handling difficult cases where automation fails, exploiting the breadth of backgrounds, geographic dispersion, real-time response, etc.

【Keywords】: crowdsourcing; data mining; search

4. User modeling for web applications.

【Paper Link】 【Pages】:7-8

【Authors】: David Carmel ; Vanja Josifovski ; Yoelle Maarek

【Abstract】: Users have taken an increasingly central role on the Web. Their role is both explicit (as they become more savvy, they have more expectations, and new interactive features keep appearing) and implicit (their actions are monitored at various levels of granularity for various needs, from live traffic evaluation to usage data mining for improving ranking, spelling, etc.). In a few years, most Web applications will have the ability to successfully adapt to both the explicit and implicit needs and tastes of their users. Such adaptation requires the ability to model the user's personal goals, interests, preferences and knowledge, and to apply this model while users interact with various applications. While adaptive applications that are based on user modeling have attracted the attention of multiple communities, from AI to UI, there is no forum that specifically focuses on user modeling and adaptive applications in the Web domain. This workshop will focus on user modeling and the usage of these models in Web applications. The emphasis of the workshop will be on modeling techniques that scale for the Web. User modeling might be based on explicit and implicit user feedback gathered from a variety of sources such as sign-on information, clickthrough data, users' previous queries, social networks, purchases, and real-world activity. Adaptive Web-based applications include search personalization, advertisement targeting, recommendation systems, social networks, on-line shopping, etc.

【Keywords】: user modeling; web

Tutorial PM1 and PM2 2

5. Exploiting statistical and relational information on the web and in social media.

【Paper Link】 【Pages】:9-10

【Authors】: Lise Getoor ; Lilyana Mihalkova

【Abstract】: The popularity of Web 2.0, characterized by a proliferation of social media sites, and Web 3.0, with more richly semantically annotated objects and relationships, brings to light a variety of important prediction, ranking, and extraction tasks. The input to these tasks is often best seen as a (noisy) multi-relational graph, such as the click graph, defined by user interactions with Web sites; and the social graph, defined by friendships and affiliations on social media sites. This tutorial will provide an overview of statistical relational learning and inference techniques, motivating and illustrating them using web and social media applications. We will start by briefly surveying some of the sources of statistical and relational information on the web and in social media and will then dedicate most of the tutorial time to an introduction to representations and techniques for learning and reasoning with multi-relational information, viewing them through the lens of web and social media domains. We will end with a discussion of current trends and related fields, such as privacy in social networks.

【Keywords】: graph analysis; probabilistic inference; query log data; social media; statistical relational learning

6. Web retrieval: the role of users.

【Paper Link】 【Pages】:11-12

【Authors】: Ricardo A. Baeza-Yates ; Yoelle Maarek

【Abstract】: Web retrieval methods have evolved through three major steps in the last decade or so. They started from standard document-centric IR in the early days of the Web, then made a major step forward by leveraging the structure of the Web, using link analysis techniques for both crawling and ranking. A more recent, no less important but perhaps more discreet, step forward has been to bring the user into this equation in two ways: (1) implicitly, through the analysis of usage data captured by query logs, and session and click information in general, the goal being to improve ranking as well as to measure users' happiness and engagement; (2) explicitly, by offering novel interactive features, the goal here being to better answer users' needs. In this tutorial, we will cover the user-related challenges associated with the implicit and explicit role of users in Web retrieval. We will review and discuss challenges associated with two types of activities, namely: Usage data analysis and metrics: It is critical to monitor how users interact with Web retrieval systems, as this implicit relevance feedback aggregated at a large scale can approximate quite accurately the level of success of a given feature. Here we have to consider not only click statistics but also the time spent on a page, the number of actions per session, etc. User interaction: Given the intrinsic problems posed by the Web, the key challenge for the user is to conceive a good query, one that leads to a manageable and relevant answer. The retrieval system must complete search requests fast and give back relevant results, even for poorly formulated queries. Web retrieval engines thus interact with the user at two key stages, each associated with its own challenges: (1) Expressing a query: Human beings have needs or tasks to accomplish, which are frequently not easy to express as 'queries'. Queries are just a reflection of human needs and are thus, by definition, imperfect. The issue here is for the engine both to assist the user in reflecting this need and to capture the intent behind the query even if the information is incomplete or poorly expressed. (2) Interpreting and using results: Even if the user is able to perfectly express a query, the answer might be split over thousands or millions of Web pages or not exist at all. In this context, numerous questions need to be addressed. Examples include: How do we handle a large answer? How do we select or maybe synthesize the documents that really are of interest to the user? Even in the case of a single document candidate, the document itself could be large. How do we browse such documents efficiently? How do we help the user take advantage of the results, possibly combining them with applications to perform the task that drove the query? The goal of this tutorial is to teach the key principles and technologies behind the activities and challenges briefly outlined above, bring new understanding and insights to the attendees, and hopefully foster future research. A previous version of this tutorial was offered at the ACM SIGIR conference.

【Keywords】: user interaction; web retrieval

Keynote 1 1

7. Mining billion-node graphs: patterns, generators and tools.

【Paper Link】 【Pages】:13-14

【Authors】: Christos Faloutsos

【Abstract】: What do graphs look like? How do they evolve over time? How do we handle a graph with a billion nodes? We present a comprehensive list of static and temporal laws, and some recent observations on real graphs (e.g., "eigenSpokes"). For generators, we describe some recent ones that naturally match all of the known properties of real graphs. Finally, for tools, we present "oddBall" for discovering anomalies and patterns, as well as an overview of the PEGASUS system, which is designed for handling billion-node graphs and runs on top of the Hadoop system.

【Keywords】: large scale graphs; oddball; pegasus

8. Who uses web search for what: and how?

【Paper Link】 【Pages】:15-24

【Authors】: Ingmar Weber ; Alejandro Jaimes

【Abstract】: We analyze a large query log of 2.3 million anonymous registered users from a web-scale U.S. search engine in order to jointly analyze their on-line behavior in terms of who they might be (demographics), what they search for (query topics), and how they search (session analysis). We examine basic demographics from registration information provided by the users, augmented with U.S. census data, analyze basic session statistics, classify queries into types (navigational, informational, transactional) based on click entropy, classify queries into topic categories, and cluster users based on the queries they issued. We then examine the resulting clusters in terms of demographics and search behavior. Our analysis of the data suggests that there are important differences in search behavior across different demographic groups in terms of the topics they search for, and how they search (e.g., white conservatives are those likely to have voted Republican, mostly white males, who search for business-, home-, and gardening-related topics; Baby Boomers tend to be primarily interested in Finance and a large fraction of their sessions consist of simple navigational queries related to online banking, etc.). Finally, we examine regional search differences, which seem to correlate with differences in local industries (e.g., gambling-related queries are highest in Las Vegas and lowest in Salt Lake City; searches related to actors are about three times higher in L.A. than in any other region).

【Keywords】: demographics; query logs; session analysis; topic classification
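
The click-entropy criterion used above to type queries is compact enough to sketch directly. A minimal illustration, assuming a toy list of clicked URLs per query; the example data and the navigational/informational reading of the values are illustrative, not the paper's:

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Shannon entropy of the distribution of URLs clicked for one query.
    Low entropy: most users click the same result (navigational);
    high entropy: clicks spread over many results (informational/transactional)."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative data: a navigational query vs. an ambiguous one.
print(click_entropy(["facebook.com"] * 9 + ["m.facebook.com"]))               # ~0.47
print(click_entropy(["cars.com", "wikipedia.org", "jaguar.com", "zoo.org"]))  # 2.0
```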

9. Personalizing web search using long term browsing history.

【Paper Link】 【Pages】:25-34

【Authors】: Nicolaas Matthijs ; Filip Radlinski

【Abstract】: Personalizing web search results has long been recognized as an avenue to greatly improve the search experience. We present a personalization approach that builds a user interest profile using users' complete browsing behavior, then uses this model to rerank web results. We show that using a combination of content and previously visited websites provides effective personalization. We extend previous work by proposing a number of techniques for filtering previously viewed content that greatly improve the user model used for personalization. Our approaches are compared to previous work in offline experiments and are evaluated against unpersonalized web search in large scale online tests. Large improvements are found in both cases.

【Keywords】: alterego; browsing history; evaluation; interleaving; personalized web search; ranking; user profile

10. #TwitterSearch: a comparison of microblog search and web search.

【Paper Link】 【Pages】:35-44

【Authors】: Jaime Teevan ; Daniel Ramage ; Meredith Ringel Morris

【Abstract】: Social networking Web sites are not just places to maintain relationships; they can also be valuable information sources. However, little is known about how and why people search socially-generated content. In this paper we explore search behavior on the popular microblogging/social networking site Twitter. Using analysis of large-scale query logs and supplemental qualitative data, we observe that people search Twitter to find temporally relevant information (e.g., breaking news, real-time content, and popular trends) and information related to people (e.g., content directed at the searcher, information about people of interest, and general sentiment and opinion). Twitter queries are shorter, more popular, and less likely to evolve as part of a session than Web queries. It appears people repeat Twitter queries to monitor the associated search results, while changing and developing Web queries to learn about a topic. The results returned from the different corpora support these different uses, with Twitter results including more social chatter and social events, and Web results containing more basic facts and navigational content. We discuss the implications of these findings for the design of next-generation Web search tools that incorporate social media.

【Keywords】: microblogging; q&a; social media; social search; web search

11. Identifying topical authorities in microblogs.

【Paper Link】 【Pages】:45-54

【Authors】: Aditya Pal ; Scott Counts

【Abstract】: Content in microblogging systems such as Twitter is produced by tens to hundreds of millions of users. This diversity is a notable strength, but also presents the challenge of finding the most interesting and authoritative authors for any given topic. To address this, we first propose a set of features for characterizing social media authors, including both nodal and topical metrics. We then show how probabilistic clustering over this feature space, followed by a within-cluster ranking procedure, can yield a final list of top authors for a given topic. We present results across several topics, along with results from a user study confirming that our method finds authors who are significantly more interesting and authoritative than those resulting from several baseline conditions. Additionally our algorithm is computationally feasible in near real-time scenarios making it an attractive alternative for capturing the rapidly changing dynamics of microblogs.

【Keywords】: authority; clustering; microblogging; ranking; twitter

12. Correcting for missing data in information cascades.

【Paper Link】 【Pages】:55-64

【Authors】: Eldar Sadikov ; Montserrat Medina ; Jure Leskovec ; Hector Garcia-Molina

【Abstract】: Transmission of infectious diseases, propagation of information, and spread of ideas and influence through social networks are all examples of diffusion. In such cases we say that a contagion spreads through the network, a process that can be modeled by a cascade graph. Studying cascades and network diffusion is challenging due to missing data. Even a single missing observation in a sequence of propagation events can significantly alter our inferences about the diffusion process. We address the problem of missing data in information cascades. Specifically, given only a fraction C' of the complete cascade C, our goal is to estimate the properties of the complete cascade C, such as its size or depth. To estimate the properties of C, we first formulate the k-tree model of cascades and analytically study its properties in the face of missing data. We then propose a numerical method that, given a cascade model and an observed cascade C', can estimate properties of the complete cascade C. We evaluate our methodology using information propagation cascades in the Twitter network (70 million nodes and 2 billion edges), as well as information cascades arising in the blogosphere. Our experiments show that the k-tree model is an effective tool to study the effects of missing data in cascades. Most importantly, we show that our method (and the k-tree model) can accurately estimate properties of the complete cascade C even when 90% of the data is missing.

【Keywords】: analytical models; balanced trees; blogs; information cascades; information diffusion; missing data; numerical methods; sampling; social networks; twitter
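
The missing-data effect the paper quantifies can be seen in a toy simulation: nodes of a complete k-ary cascade are observed independently with probability p, and a naive inverse-p correction recovers the complete size. This sketch is an assumption-laden stand-in; the paper's estimator works from the k-tree model's analytical properties rather than assuming the sampling rate p is known:

```python
import random

def ktree_cascade_size(k, depth):
    """Number of nodes in a complete k-ary tree cascade of the given depth."""
    return sum(k ** d for d in range(depth + 1))

def observe(size, p, rng):
    """Each propagation event is recorded independently with probability p."""
    return sum(1 for _ in range(size) if rng.random() < p)

rng = random.Random(0)
true_size = ktree_cascade_size(k=3, depth=7)       # 3280 nodes
observed = observe(true_size, p=0.1, rng=rng)      # 90% of the data missing
print(true_size, observed, round(observed / 0.1))  # naive inverse-p estimate
```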

13. Everyone's an influencer: quantifying influence on twitter.

【Paper Link】 【Pages】:65-74

【Authors】: Eytan Bakshy ; Jake M. Hofman ; Winter A. Mason ; Duncan J. Watts

【Abstract】: In this paper we investigate the attributes and relative influence of 1.6M Twitter users by tracking 74 million diffusion events that took place on the Twitter follower graph over a two month interval in 2009. Unsurprisingly, we find that the largest cascades tend to be generated by users who have been influential in the past and who have a large number of followers. We also find that URLs that were rated more interesting and/or elicited more positive feelings by workers on Mechanical Turk were more likely to spread. In spite of these intuitive results, however, we find that predictions of which particular user or URL will generate large cascades are relatively unreliable. We conclude, therefore, that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potential influencers, thereby capturing average effects. Finally, we consider a family of hypothetical marketing strategies, defined by the relative cost of identifying versus compensating potential "influencers." We find that although under some circumstances, the most influential users are also the most cost-effective, under a wide range of plausible assumptions the most cost-effective performance can be realized using "ordinary influencers"---individuals who exert average or even less-than-average influence.

【Keywords】: communication networks; diffusion; influence; twitter; word of mouth marketing

14. A comparative analysis of cascade measures for novelty and diversity.

【Paper Link】 【Pages】:75-84

【Authors】: Charles L. A. Clarke ; Nick Craswell ; Ian Soboroff ; Azin Ashkan

【Abstract】: Traditional editorial effectiveness measures, such as nDCG, remain standard for Web search evaluation. Unfortunately, these traditional measures can inappropriately reward redundant information and can fail to reflect the broad range of user needs that can underlie a Web query. To address these deficiencies, several researchers have recently proposed effectiveness measures for novelty and diversity. Many of these measures are based on simple cascade models of user behavior, which operate by considering the relationship between successive elements of a result list. The properties of these measures are still poorly understood, and it is not clear from prior research that they work as intended. In this paper we examine the properties and performance of cascade measures with the goal of validating them as tools for measuring effectiveness. We explore their commonalities and differences, placing them in a unified framework; we discuss their theoretical difficulties and limitations, and compare the measures experimentally, contrasting them against traditional measures and against other approaches to measuring novelty. Data collected by the TREC 2009 Web Track is used as the basis for our experimental comparison. Our results indicate that these measures reward systems that achieve a balance between novelty and overall precision in their result lists, as intended. Nonetheless, other measures provide insights not captured by the cascade measures, and we suggest that future evaluation efforts continue to report a variety of measures.

【Keywords】: diversity; effectiveness measures; novelty
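
As a concrete member of the cascade family discussed above, expected reciprocal rank (ERR) models a user who scans down the list and stops with a probability tied to each result's relevance grade. A minimal sketch; the grade-to-stop-probability mapping follows the common ERR convention and is illustrative here, not the paper's experimental setup:

```python
def err(grades, max_grade=3):
    """Expected reciprocal rank: a cascade measure where the user stops
    at rank r with probability proportional to the result's grade."""
    p_continue, score = 1.0, 0.0
    for rank, g in enumerate(grades, start=1):
        stop = (2 ** g - 1) / 2 ** max_grade  # stop probability at this rank
        score += p_continue * stop / rank
        p_continue *= 1 - stop
    return score

# A redundant list of perfect results gains little after the first hit.
print(err([3, 3, 3]))  # ~0.93
print(err([3, 0, 0]))  # 0.875
```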

15. Understanding and predicting personal navigation.

【Paper Link】 【Pages】:85-94

【Authors】: Jaime Teevan ; Daniel J. Liebling ; Gayathri Ravichandran Geetha

【Abstract】: This paper presents an algorithm that predicts with very high accuracy which Web search result a user will click for one sixth of all Web queries. Prediction is done via a straightforward form of personalization that takes advantage of the fact that people often use search engines to re-find previously viewed resources. In our approach, an individual's past navigational behavior is identified via query log analysis and used to forecast identical future navigational behavior by the same individual. We compare the potential value of personal navigation with general navigation identified using aggregate user behavior. Although consistent navigational behavior across users can be useful for identifying a subset of navigational queries, different people often use the same queries to navigate to different resources. This is true even for queries comprised of unambiguous company names or URLs and typically thought of as navigational. We build an understanding of what personal navigation looks like, and identify ways to improve its coverage and accuracy by taking advantage of people's consistency over time and across groups of individuals.

【Keywords】: navigation; personal navigation; personalized search; query intent; query log analysis; re-finding; web search
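
At its core, personal navigation is a lookup from a user's past query-click pairs to a predicted re-click. A minimal sketch, assuming a simplified log format; the paper's method adds the coverage and accuracy refinements described above:

```python
from collections import defaultdict

# (user, query) -> URL of the user's most recent click for that query.
last_click = defaultdict(lambda: None)

def observe_click(user, query, url):
    last_click[(user, query)] = url

def predict_click(user, query):
    """Personal navigation: predict that a re-issued query re-finds
    the same resource the same user clicked last time."""
    return last_click[(user, query)]

observe_click("u1", "ny times", "nytimes.com")
print(predict_click("u1", "ny times"))  # nytimes.com
```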

16. Quality-biased ranking of web documents.

【Paper Link】 【Pages】:95-104

【Authors】: Michael Bendersky ; W. Bruce Croft ; Yanlei Diao

【Abstract】: Many existing retrieval approaches do not take into account the content quality of the retrieved documents, although link-based measures such as PageRank are commonly used as a form of document prior. In this paper, we present the quality-biased ranking method that promotes documents containing high-quality content, and penalizes low-quality documents. The quality of the document content can be determined by its readability, layout and ease-of-navigation, among other factors. Accordingly, instead of using a single estimate for document quality, we consider multiple content-based features that are directly integrated into a state-of-the-art retrieval method. These content-based features are easy to compute, store and retrieve, even for large web collections. We use several query sets and web collections to empirically evaluate the performance of our quality-biased retrieval method. In each case, our method consistently improves by a large margin the retrieval performance of text-based and link-based retrieval methods that do not take into account the quality of the document content.

【Keywords】: document quality; quality-biased ranking
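
The integration described above amounts to combining a text-based retrieval score with per-document quality features. A minimal sketch under that reading; the feature names and weights are invented for illustration, not the paper's learned values:

```python
def quality_biased_score(text_score, features, weights):
    """Combine a text retrieval score (e.g., a query-likelihood log score)
    with content-quality features such as readability or visible-text ratio.
    Feature names and weights here are illustrative assumptions."""
    return text_score + sum(weights[f] * v for f, v in features.items())

doc_features = {"readability": 0.8, "visible_text_ratio": 0.6, "table_density": 0.1}
weights = {"readability": 1.5, "visible_text_ratio": 0.7, "table_density": -1.0}
print(quality_biased_score(text_score=-4.2, features=doc_features, weights=weights))
```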

17. Ranking from pairs and triplets: information quality, evaluation methods and query complexity.

【Paper Link】 【Pages】:105-114

【Authors】: Kira Radinsky ; Nir Ailon

【Abstract】: Obtaining judgments from human raters is a vital part of search engine evaluation design. Today, a discrepancy exists between judgment acquisition from raters (training phase) and use of the responses for retrieval evaluation (evaluation phase). This discrepancy is due to the inconsistency between the representation of the information in both phases. During training, raters are requested to provide a relevance score for an individual result in the context of a query, whereas the evaluation is performed on ordered lists of search results, with the results' relative position (compared to other results) taken into account. As an alternative to the practice of learning to rank using relevance judgments for individual search results, more and more focus has recently been diverted to the theory and practice of learning from answers to combinatorial questions about sets of search results. That is, users, during training, are asked to rank small sets (typically pairs). Human rater responses to questions about the relevance of individual results are first compared to their responses to questions about the relevance of pairs of results. We empirically show that neither type of response can be deduced from the other, and that the added context created when results are shown together changes the raters' evaluation process. Since pairwise judgments are directly related to ranking, we conclude they are more accurate for that purpose. We go beyond pairs to show that triplets do not contain significantly more information than pairs for the purpose of measuring statistical preference. These two results establish good stability properties of pairwise comparisons for the purpose of learning to rank. We further analyze different scenarios, in which results of varying quality are added as "decoys". A recurring source of worry in papers focusing on pairwise comparison is the quadratic number of pairs in a set of results. Which preferences do we choose to solicit from paid raters? Can we provably eliminate a quadratic cost? We employ results from statistical learning theory to show that the quadratic cost can be provably eliminated in certain cases. More precisely, we show that in order to obtain a ranking in which each element is an average of O(n/C) positions away from its position in the optimal ranking, one needs to sample O(nC^2) pairs uniformly at random, for any C > 0. We also present an active learning algorithm which samples the pairs adaptively, and conjecture that it provides additional improvement.

【Keywords】: ranking evaluation; ranking from pairs; relevance feedback

Keynote 2 1

18. Bing dialog model: intent, knowledge and user interaction.

【Paper Link】 【Pages】:115-116

【Authors】: Harry Shum

【Abstract】: The decade-old Internet search outcomes, manifested in the form of "ten blue links," are no longer sufficient for Internet users. Many studies have shown that when users are ushered off the conventional search result pages through blue links, their needs are often partially met at best, in a "hit-or-miss" fashion. To tackle this challenge, we have designed Bing, Microsoft's decision engine, to not just navigate users to a landing page through a blue link but to continue engaging with users to clarify intent and facilitate task completion. Underlying this new paradigm is the Bing Dialog Model, which consists of three building blocks: an indexing system that comprehensively collects information from the web and systematically harvests knowledge, an intent model that statistically infers user intent and predicts the next action, and an interaction model that elicits user intent through mathematically optimized presentations of web information and domain knowledge that match user needs. In this talk, I'll describe the Bing Dialog Model in detail and demonstrate it in action through some of the innovative features introduced since the launch of www.Bing.com.

【Keywords】: bing dialog model; intent; knowledge; user interaction

19. We feel fine and searching the emotional web.

【Paper Link】 【Pages】:117-126

【Authors】: Sepandar D. Kamvar ; Jonathan Harris

【Abstract】: We present We Feel Fine, an emotional search engine and web-based artwork whose mission is to collect the world's emotions to help people better understand themselves and others. We Feel Fine continuously crawls blogs, microblogs, and social networking sites, extracting sentences that include the words "I feel" or "I am feeling", as well as the gender, age, and location of the people authoring those sentences. The We Feel Fine search interface allows users to search or browse over the resulting sentence-level index, asking questions such as "How did young people in Ohio feel when Obama was elected?" While most research in sentiment analysis focuses on algorithms for extraction and classification of sentiment about given topics, we focus instead on building an interface that provides an engaging means of qualitative exploration of emotional data, and a flexible data collection and serving architecture that enables an ecosystem of data analysis applications. We use our observations on the usage of We Feel Fine to suggest a class of visualizations called Experiential Data Visualization, which focus on immersive item-level interaction with data. We also discuss the implications of such visualizations for crowdsourcing qualitative research in the social sciences.

【Keywords】: search; sentiment analysis; social media
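
The crawler's extraction step can be sketched with a simple pattern match for the anchor phrases named above; the regular expression is an illustrative approximation of the system's actual extractor:

```python
import re

# Matches from an "I feel" / "I am feeling" anchor phrase to the end of
# the sentence (the next ., !, or ?).
FEELING = re.compile(r"\b[Ii] (?:feel|am feeling)\b[^.!?]*[.!?]")

def extract_feelings(post):
    """Pull out the clauses We Feel Fine indexes from a blog post."""
    return FEELING.findall(post)

post = ("Long day at work. I feel exhausted but strangely happy! "
        "Tomorrow is another day. I am feeling hopeful about it.")
print(extract_feelings(post))
```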

20. Efficient indexing of repeated n-grams.

【Paper Link】 【Pages】:127-136

【Authors】: Samuel Huston ; Alistair Moffat ; W. Bruce Croft

【Abstract】: The identification of repeated n-gram phrases in text has many practical applications, including authorship attribution, text reuse identification, and plagiarism detection. We consider methods for finding the repeated n-grams in text corpora, with emphasis on techniques that can be effectively scaled across a cluster of processors to handle very large amounts of text. We compare our proposed method to existing techniques using the 1.5 TB TREC ClueWeb-B text collection, using both single-processor and multi-processor approaches. The experiments show that our method offers an important tradeoff between speed and temporary storage space, and provides an alternative to previous approaches that scales almost linearly in the length of the sequence, is largely independent of n, and provides a uniform workload balance across the set of available processors.

【Keywords】: hash filter; n-gram; repeated phrase; scalability; text reuse
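
For orientation, the single-machine baseline that such methods scale up is a plain n-gram count; the paper's contribution lies in the distributed, storage-bounded version. A toy sketch of that baseline:

```python
from collections import Counter

def repeated_ngrams(tokens, n, min_count=2):
    """Exhaustively count n-grams and keep those that repeat. This is the
    in-memory baseline; the paper's methods trade speed for temporary
    storage and spread the work across a cluster of processors."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {g: c for g, c in counts.items() if c >= min_count}

text = "to be or not to be that is the question to be or not".split()
print(repeated_ngrams(text, n=4))  # {('to', 'be', 'or', 'not'): 2}
```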

Paper Link】 【Pages】:137-146

【Authors】: Shuai Ding ; Josh Attenberg ; Ricardo A. Baeza-Yates ; Torsten Suel

【Abstract】: Large web search engines are now processing billions of queries per day. Most of these queries are interactive in nature, requiring a response in fractions of a second. However, there are also a number of important scenarios where large batches of queries are submitted for various web mining and system optimization tasks that do not require an immediate response. Given the significant cost of executing search queries over billions of web pages, it is a natural question to ask if such batches of queries can be more efficiently executed than interactive queries. In this paper, we motivate and discuss the problem of batch query processing in search engines, identify basic mechanisms for improving the performance of such queries, and provide a preliminary experimental evaluation of the proposed techniques. Our conclusion is that significant cost reductions are possible by using specialized mechanisms for executing batch queries in Web search engines.

【Keywords】: batch query processing; query processing; result cache updates; web search

22. Detecting duplicate web documents using clickthrough data.

【Paper Link】 【Pages】:147-156

【Authors】: Filip Radlinski ; Paul N. Bennett ; Emine Yilmaz

【Abstract】: The web contains many duplicate and near-duplicate documents. Given that user satisfaction is negatively affected by redundant information in search results, a significant amount of research has been devoted to developing duplicate detection algorithms. However, most such algorithms rely solely on document content to detect duplication, ignoring the fact that a primary goal of duplicate detection is to identify documents that contain redundant information with respect to a particular user query. Similarly, although query-dependent result diversification algorithms compute a query-dependent ranking, they tend to do so on the basis of a query-independent content similarity score. In this paper, we bridge the gap between query-dependent redundancy and query-independent duplication by showing how user click behavior following a query provides evidence about the relative novelty of web documents. While most previous work on interpreting user clicks on search results has assumed that they reflect just result relevance, we show that clicks also provide information about duplication between web documents since users consider search results in the context of previously seen documents. Moreover, we find that duplication explains a substantial amount of presentation bias observed in clicking behavior. We identify three distinct types of redundancy that commonly occur on the web and show how click data can be used to detect these different types.

【Keywords】: clickthrough; duplication; redundancy; utility; web search

23. KMV-peer: a robust and adaptive peer-selection algorithm.

【Paper Link】 【Pages】:157-166

【Authors】: Yosi Mass ; Yehoshua Sagiv ; Michal Shmueli-Scheuer

【Abstract】: The problem of fully decentralized search over many collections is considered. The objective is to approximate the results of centralized search (namely, using a central index) while controlling the communication cost and involving only a small number of collections. The proposed solution is couched in a peer-to-peer (P2P) network, but can also be applied in other setups. Peers publish per-term summaries of their collections. Specifically, for each term, the range of document scores is divided into intervals; and for each interval, a KMV (K Minimal Values) synopsis of its documents is created. A new peer-selection algorithm uses the KMV synopses and two scoring functions in order to adaptively rank the peers, according to the relevance of their documents to a given query. The proposed method achieves high-quality results while meeting the above criteria of efficiency. In particular, experiments are done on two large, real-world datasets; one is blogs and the other is web data. These experiments show that the algorithm outperforms the state-of-the-art approaches and is robust over different collections, various scoring functions and multi-term queries.

【Keywords】: kmv; p2p search; top-k
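
A KMV (K Minimal Values) synopsis keeps only the k smallest hash values of a set, from which the set's cardinality can be estimated. A minimal sketch of the synopsis and the standard (k-1)/v_k distinct-count estimator; the paper's per-term, per-score-interval bookkeeping and peer-ranking logic are omitted:

```python
import hashlib

def kmv_synopsis(items, k):
    """K Minimal Values synopsis: the k smallest hashes, mapped into [0, 1)."""
    hashes = sorted({int(hashlib.sha1(x.encode()).hexdigest(), 16) / 2 ** 160
                     for x in items})
    return hashes[:k]

def estimate_distinct(synopsis, k):
    """Standard KMV distinct-count estimate: (k - 1) / (k-th minimum hash)."""
    return (k - 1) / synopsis[k - 1]

docs = [f"doc{i}" for i in range(10_000)]
syn = kmv_synopsis(docs, k=256)
print(round(estimate_distinct(syn, k=256)))  # close to 10000 (error ~1/sqrt(k))
```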

24. Understanding temporal query dynamics.

【Paper Link】 【Pages】:167-176

【Authors】: Anagha Kulkarni ; Jaime Teevan ; Krysta Marie Svore ; Susan T. Dumais

【Abstract】: Web search is strongly influenced by time. The queries people issue change over time, with some queries occasionally spiking in popularity (e.g., earthquake) and others remaining relatively constant (e.g., youtube). The documents indexed by the search engine also change, with some documents always being about a particular query (e.g., the Wikipedia page on earthquakes is about the query earthquake) and others being about the query only at a particular point in time (e.g., the New York Times is only about earthquakes following a major seismic activity). The relationship between documents and queries can also change as people's intent changes (e.g., people sought different content for the query earthquake before the Haitian earthquake than they did after). In this paper, we explore how queries, their associated documents, and the query intent change over the course of 10 weeks by analyzing query log data, a daily Web crawl, and periodic human relevance judgments. We identify several interesting features by which changes to query popularity can be classified, and show that presence of these features, when accompanied by changes in result content, can be a good indicator of change in query intent.

【Keywords】: query dynamics

25. Patterns of temporal variation in online media.

【Paper Link】 【Pages】:177-186

【Authors】: Jaewon Yang ; Jure Leskovec

【Abstract】: Online content exhibits rich temporal dynamics, and diverse real-time user-generated content further intensifies this process. However, temporal patterns by which online content grows and fades over time, and by which different pieces of content compete for attention, remain largely unexplored. We study temporal patterns associated with online content and how the content's popularity grows and fades over time. The attention that content receives on the Web varies depending on many factors and occurs on very different time scales and at different resolutions. In order to uncover the temporal dynamics of online content we formulate a time series clustering problem using a similarity metric that is invariant to scaling and shifting. We develop the K-Spectral Centroid (K-SC) clustering algorithm that effectively finds cluster centroids with our similarity measure. By applying an adaptive wavelet-based incremental approach to clustering, we scale K-SC to large data sets. We demonstrate our approach on two massive datasets: a set of 580 million Tweets, and a set of 170 million blog posts and news media articles. We find that K-SC outperforms the K-means clustering algorithm in finding distinct shapes of time series. Our analysis shows that there are six main temporal shapes of attention of online content. We also present a simple model that reliably predicts the shape of attention by using information about only a small number of participants. Our analyses offer insight into common temporal patterns of the content on the Web and broaden the understanding of the dynamics of human attention.

【Keywords】: social media; time series clustering
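
The scale- and shift-invariant similarity at the heart of K-SC can be sketched directly: for each candidate shift of one series, the optimal scaling has a closed form, and the distance takes the best shift. A minimal sketch with an assumed shift range; K-SC itself then alternates between assigning series to clusters under this distance and recomputing spectral centroids:

```python
import numpy as np

def ksc_distance(x, y, max_shift=5):
    """Time-series distance invariant to scaling and (bounded) shifting:
    for each circular shift q of y, use the closed-form optimal scaling
    alpha = x.y / ||y||^2, and keep the best-matching shift."""
    best = np.inf
    for q in range(-max_shift, max_shift + 1):
        ys = np.roll(y, q)
        alpha = x.dot(ys) / ys.dot(ys)
        best = min(best, np.linalg.norm(x - alpha * ys) / np.linalg.norm(x))
    return best

spike = np.array([0, 1, 5, 2, 1, 0, 0, 0], float)
same_shape = np.roll(3 * spike, 2)      # scaled and shifted copy
print(ksc_distance(spike, same_shape))  # ~0.0: identical attention shape
```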

26. Using graded-relevance metrics for evaluating community QA answer selection.

【Paper Link】 【Pages】:187-196

【Authors】: Tetsuya Sakai ; Daisuke Ishikawa ; Noriko Kando ; Yohei Seki ; Kazuko Kuriyama ; Chin-Yew Lin

【Abstract】: Community Question Answering (CQA) sites such as Yahoo! Answers have emerged as rich knowledge resources for information seekers. However, answers posted to CQA sites can be irrelevant, incomplete, redundant, incorrect, biased, ill-formed or even abusive. Hence, automatic selection of "good" answers for a given posted question is a practical research problem that will help us manage the quality of accumulated knowledge. One way to evaluate answer selection systems for CQA would be to use the Best Answers (BAs) that are readily available from the CQA sites. However, BAs may be biased, and even if they are not, there may be other good answers besides BAs. To remedy these two problems, we propose system evaluation methods that involve multiple answer assessors and graded-relevance information retrieval metrics. Our main findings from experiments using the NTCIR-8 CQA task data are that, using our evaluation methods, (a) we can detect many substantial differences between systems that would have been overlooked by BA-based evaluation; and (b) we can better identify hard questions (i.e. those that are handled poorly by many systems and therefore require focused investigation) compared to BA-based evaluation. We therefore argue that our approach is useful for building effective CQA answer selection systems despite the cost of manual answer assessments.

【Keywords】: best answers; community question answering; evaluation; graded relevance; ntcir; test collections

27. Mining social images with distance metric learning for automated image tagging.

【Paper Link】 【Pages】:197-206

【Authors】: Pengcheng Wu ; Steven C. H. Hoi ; Peilin Zhao ; Ying He

【Abstract】: With the popularity of various social media applications, massive social images associated with high quality tags have been made available in many social media web sites nowadays. Mining social images on the web has become an emerging important research topic in web search and data mining. In this paper, we propose a machine learning framework for mining social images and investigate its application to automated image tagging. To effectively discover knowledge from social images that are often associated with multimodal contents (including visual images and textual tags), we propose a novel Unified Distance Metric Learning (UDML) scheme, which not only exploits both visual and textual contents of social images, but also effectively unifies both inductive and transductive metric learning techniques in a systematic learning framework. We further develop an efficient stochastic gradient descent algorithm for solving the UDML optimization task and prove the convergence of the algorithm. By applying the proposed technique to the automated image tagging task in our experiments, we demonstrate that our technique is empirically effective and promising for mining social images towards some real applications.

【Keywords】: automated image tagging; distance metric learning; inductive learning; social images; transductive learning

Oral session 7: web mining 4

28. Dynamic relationship and event discovery.

【Paper Link】 【Pages】:207-216

【Authors】: Anish Das Sarma ; Alpa Jain ; Cong Yu

【Abstract】: This paper studies the problem of dynamic relationship and event discovery. A large body of previous work on relation extraction focuses on discovering predefined and static relationships between entities. In contrast, we aim to identify temporally defined (e.g., co-bursting) relationships that are not predefined by an existing schema, and we identify the underlying time constrained events that lead to these relationships. The key challenges in identifying such events include discovering and verifying dynamic connections among entities, and consolidating binary dynamic connections into events consisting of a set of entities that are connected at a given time period. We formalize this problem and introduce an efficient end-to-end pipeline as a solution. In particular, we introduce two formal notions, global temporal constraint cluster and local temporal constraint cluster, for detecting dynamic events. We further design efficient algorithms for discovering such events from a large graph of dynamic relationships. Finally, detailed experiments on real data show the effectiveness of our proposed solution.

【Keywords】: dynamic events; entity relationships; event discovery

29. Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited.

【Paper Link】 【Pages】:217-226

【Authors】: Rahul Gupta ; Sunita Sarawagi

【Abstract】: We consider the problem of jointly training structured models for extraction from multiple web sources whose records enjoy partial content overlap. This has important applications in open-domain extraction, e.g. a user materializing a table of interest from multiple relevant unstructured sources; or a site like Freebase augmenting an incomplete relation by extracting more rows from web sources. Such applications require extraction over arbitrary domains, so one cannot use a pre-trained extractor or demand a huge labeled dataset. We propose to overcome this lack of supervision by using content overlap across the related web sources. Existing methods of exploiting overlap have been developed under settings that do not generalize easily to the scale and diversity of overlap seen on Web sources. We present an agreement-based learning framework that jointly trains the models by biasing them to agree on the agreement regions, i.e. shared text segments. We present alternatives within our framework to trade-off tractability, robustness to noise, and extent of agreement enforced; and propose a scheme of partitioning agreement regions that leads to efficient training while maximizing overall accuracy. Further, we present a principled scheme to discover low-noise agreement regions in unlabeled data across multiple sources. Through extensive experiments over 58 different extraction domains, we establish that our framework provides significant boosts over uncoupled training, and scores over alternatives such as collective inference, staged training, and multi-view learning.

【Keywords】: collective training; graphical models; information extraction

30. Scalable knowledge harvesting with high precision and high recall.

【Paper Link】 【Pages】:227-236

【Authors】: Ndapandula Nakashole ; Martin Theobald ; Gerhard Weikum

【Abstract】: Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data. This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of n-gram itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypotheses space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that can parallelize all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment. We substantially outperform these prior results in terms of recall, with the same precision, while having low run-times.

【Keywords】: information extraction; knowledge harvesting; scalability

31. Mining named entities with temporally correlated bursts from multilingual web news streams.

【Paper Link】 【Pages】:237-246

【Authors】: Alexander Kotov ; ChengXiang Zhai ; Richard Sproat

【Abstract】: In this work, we study a new text mining problem of discovering named entities with temporally correlated bursts of mention counts in multiple multilingual Web news streams. Mining such entities has many interesting and important applications, such as identifying latent events that attract the attention of on-line media in different countries, and discovering valuable linguistic knowledge in the form of transliterations. While mining "bursty" terms in a single text stream has been studied before, the problem of detecting terms with temporally correlated bursts in multilingual Web streams raises two new challenges: (i) correlated terms in multiple streams may have bursts that are of different orders of magnitude in their intensity and (ii) bursts of correlated terms may be separated by time gaps. We propose a two-stage method for mining items with temporally correlated bursts from multiple data streams, which addresses both challenges. In the first stage of the method, the temporal behavior of different entities is normalized by modeling them with the Markov-Modulated Poisson Process. In the second stage, a dynamic programming algorithm is used to discover correlated bursts of different items that can potentially be separated by time gaps. We evaluated our method with the task of discovering transliterations of named entities from multilingual Web news streams. Experimental results indicate that our method can not only effectively discover named entities with correlated bursts in multilingual Web news streams, but also outperforms two state-of-the-art baseline methods for unsupervised discovery of transliterations in static text collections.

【Keywords】: correlated burst detection; dynamic programming; probabilistic modeling; text streams

32. Dynamic ranked retrieval.

【Paper Link】 【Pages】:247-256

【Authors】: Christina Brandt ; Thorsten Joachims ; Yisong Yue ; Jacob Bank

【Abstract】: We present a theoretically well-founded retrieval model for dynamically generating rankings based on interactive user feedback. Unlike conventional rankings that remain static after the query was issued, dynamic rankings allow and anticipate user activity, thus providing a way to combine the otherwise contradictory goals of result diversification and high recall. We develop a decision-theoretic framework to guide the design and evaluation of algorithms for this interactive retrieval setting. Furthermore, we propose two dynamic ranking algorithms, both of which are computationally efficient. We prove that these algorithms provide retrieval performance that is guaranteed to be at least as good as the optimal static ranking algorithm. In empirical evaluations, dynamic ranking shows substantial improvements in retrieval performance over conventional static rankings.

【Keywords】: decision theory; diversified retrieval; relevance feedback

33. Optimizing two-dimensional search results presentation.

【Paper Link】 【Pages】:257-266

【Authors】: Flavio Chierichetti ; Ravi Kumar ; Prabhakar Raghavan

【Abstract】: Classic search engine results are presented as an ordered list of documents and the problem of presentation trivially reduces to ordering documents by their scores. This is because users scan a list presentation from top to bottom. This leads to natural list optimization measures such as the discounted cumulative gain (DCG) and the rank-biased precision (RBP). Increasingly, search engines are using two-dimensional results presentations; image and shopping search results are long-standing examples. The simplistic heuristic used in practice is to place images by row-major order in the matrix presentation. However, a variety of evidence suggests that users' scan of pages is not in this matrix order. In this paper we (1) view users' scan of a results page as a Markov chain, which yields DCG and RBP as special cases for linear lists; (2) formulate, study, and develop solutions for the problem of inferring the Markov chain from click logs; (3) from these inferred Markov chains, empirically validate folklore phenomena (e.g., the "golden triangle" of user scans in two dimensions); and (4) develop and experimentally compare algorithms for optimizing user utility in matrix presentations. The theory and algorithms extend naturally beyond matrix presentations.

【Keywords】: image search; markov chain; page layout; user scan model
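
Under the Markov-chain scan model described above, the expected utility of a presentation follows from the chain's visit probabilities. A toy sketch on a 2x2 grid; the transition probabilities and relevance values are invented for illustration, and inferring the chain from click logs (the paper's second contribution) is not shown:

```python
import numpy as np

def expected_utility(start, transition, relevance):
    """Expected relevance collected by a user whose scan over result cells
    follows a Markov chain. Rows of `transition` sum to <= 1; the deficit
    is the probability of abandoning. Visit probabilities are the geometric
    series sum_t (P^T)^t s, i.e., the solution of (I - P^T) v = s."""
    visits = np.linalg.solve(np.eye(len(start)) - transition.T, start)
    return visits.dot(relevance)

# Cells ordered [top-left, top-right, bottom-left, bottom-right].
P = np.array([[0, .5, .3, 0],   # from top-left: go right or down, else stop
              [0, 0, 0, .4],
              [0, 0, 0, .4],
              [0, 0, 0, 0]])
start = np.array([1.0, 0, 0, 0])  # scans start at the top-left cell
rel = np.array([.9, .2, .5, .1])
print(expected_utility(start, P, rel))  # 1.182 for this toy layout
```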

34. Result enrichment in commerce search using browse trails.

【Paper Link】 【Pages】:267-276

【Authors】: Debmalya Panigrahi ; Sreenivas Gollapudi

【Abstract】: Commerce search engines have become popular in recent years, as users increasingly search for (and buy) products on the web. In response to a user query, they surface links to products in their catalog (or index) that match the requirements specified in the query. Often, few or no products in the catalog match the user query exactly, and the search engine is forced to return a set of products that partially match the query. This paper considers the problem of choosing a set of products in response to a user query, so as to ensure maximum user satisfaction. We call this the result enrichment problem in commerce search. The challenge in result enrichment is two-fold: the search engine needs to estimate the extent to which a user genuinely cares about an attribute that she has specified in a query; then, it must display products in the catalog that match the user requirement on the important attributes, but have a similar but possibly non-identical value on the less important ones. To this end, we propose a technique for measuring the importance of individual attribute values and the similarity between different values of an attribute. A novelty of our approach is that we use entire browse trails, rather than just clickthrough rates, in this estimation algorithm. We develop a model for this problem, propose an algorithm to solve it, and support our theoretical findings via experiments conducted on actual user data.

【Keywords】: streaming algorithms; structured search

35. Identifying task-based sessions in search engine query logs.

【Paper Link】 【Pages】:277-286

【Authors】: Claudio Lucchese ; Salvatore Orlando ; Raffaele Perego ; Fabrizio Silvestri ; Gabriele Tolomei

【Abstract】: The research challenge addressed in this paper is to devise effective techniques for identifying task-based sessions, i.e. sets of possibly non-contiguous queries issued by the user of a Web Search Engine for carrying out a given task. In order to evaluate and compare different approaches, we built, by means of a manual labeling process, a ground-truth where the queries of a given query log have been grouped into tasks. Our analysis of this ground-truth shows that users tend to perform more than one task at the same time, since about 75% of the submitted queries involve a multi-tasking activity. We formally define the Task-based Session Discovery Problem (TSDP) as the problem of best approximating the manually annotated tasks, and we propose several variants of well-known clustering algorithms, as well as a novel efficient heuristic algorithm, specifically tuned for solving the TSDP. These algorithms also exploit the collaborative knowledge collected by Wiktionary and Wikipedia for detecting query pairs that are not similar from a lexical content point of view, but are actually semantically related. The proposed algorithms have been evaluated on the above ground-truth, and are shown to perform better than state-of-the-art approaches, because they effectively take into account the multi-tasking behavior of users.

【Keywords】: query clustering; query log analysis; query log session detection; task-based session; user search intent

36. Recommender systems with social regularization.

【Paper Link】 【Pages】:287-296

【Authors】: Hao Ma ; Dengyong Zhou ; Chao Liu ; Michael R. Lyu ; Irwin King

【Abstract】: Although Recommender Systems have been comprehensively analyzed in the past decade, the study of social-based recommender systems has only recently begun. In this paper, aiming at providing a general method for improving recommender systems by incorporating social network information, we propose a matrix factorization framework with social regularization. The contributions of this paper are four-fold: (1) We elaborate how social network information can benefit recommender systems; (2) We interpret the differences between social-based recommender systems and trust-aware recommender systems; (3) We coin the term Social Regularization to represent the social constraints on recommender systems, and we systematically illustrate how to design a matrix factorization objective function with social regularization; and (4) The proposed method is quite general, which can be easily extended to incorporate other contextual information, like social tags, etc. The empirical analysis on two large datasets demonstrates that our approaches outperform other state-of-the-art methods.

【Keywords】: collaborative filtering; matrix factorization; recommender systems; social network; social regularization
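
A sketch of one stochastic-gradient update consistent with the social-regularization idea described above: squared rating error plus L2 penalties, plus a term pulling a user's latent factors toward those of similar friends. The hyperparameters, data layout, and similarity values are illustrative assumptions, not the paper's:

```python
import numpy as np

def sgd_step(U, V, i, j, r_ij, friends, sim, lr=0.01, lam=0.1, beta=0.1):
    """One SGD update for the objective: (r_ij - U_i.V_j)^2 + lam * L2 terms
    + beta * sum over friends f of sim(i, f) * ||U_i - U_f||^2."""
    err = r_ij - U[i].dot(V[j])
    social = sum(sim[(i, f)] * (U[i] - U[f]) for f in friends.get(i, []))
    u_i = U[i].copy()
    U[i] += lr * (err * V[j] - lam * U[i] - beta * social)
    V[j] += lr * (err * u_i - lam * V[j])

rng = np.random.default_rng(0)
U, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4))  # 3 users, 5 items
friends, sim = {0: [1]}, {(0, 1): 0.8}                   # user 0 is tied to user 1
sgd_step(U, V, i=0, j=2, r_ij=4.0, friends=friends, sim=sim)
print(U[0])
```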

37. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms.

【Paper Link】 【Pages】:297-306

【Authors】: Lihong Li ; Wei Chu ; John Langford ; Xuanhui Wang

【Abstract】: Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult, and modeling bias is usually unavoidably introduced. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Different from simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show the accuracy and effectiveness of our offline evaluation method.

【Keywords】: benchmark dataset; contextual bandit; multi-armed bandit; offline evaluation; recommendation
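
The replay methodology itself is compact: for logs collected under uniformly random arm selection, keep exactly the events where the evaluated policy agrees with the logged action and average their rewards. A minimal sketch, with an invented toy log format:

```python
def replay_evaluate(policy, log):
    """Unbiased offline replay for logs collected under uniformly random
    arm selection: retain an event only when the evaluated policy picks
    the arm that was actually shown, and average the retained rewards."""
    total, matched = 0.0, 0
    for context, shown_arm, reward in log:
        if policy(context) == shown_arm:
            total += reward
            matched += 1
    return total / matched if matched else 0.0

# Toy log of (context, displayed article, click) triples.
log = [({"hour": 9}, "a1", 1), ({"hour": 9}, "a2", 0), ({"hour": 22}, "a1", 0)]
print(replay_evaluate(lambda ctx: "a1", log))  # 0.5 over the two matched events
```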

38. Efficient online ad serving in a display advertising exchange.

Paper Link】 【Pages】:307-316

【Authors】: Kevin J. Lang ; Joaquin Delgado ; Dongming Jiang ; Bhaskar Ghosh ; Shirshanka Das ; Amita Gajewar ; Swaroop Jagadish ; Arathi Seshan ; Chavdar Botev ; Michael Bindeberger-Ortega ; Sunil Nagaraj ; Raymie Stata

【Abstract】: We introduce and formalize a novel constrained path optimization problem that is at the heart of the real-time ad serving task in the Yahoo! (formerly RightMedia) Display Advertising Exchange. In the Exchange, the ad server's task for each display opportunity is to compute, with low latency, an optimal valid path through a directed graph representing the business arrangements between the hundreds of thousands of business entities that are participating in the Exchange. These entities include not only publishers and advertisers, but also intermediate entities called "ad networks" which have delegated their ad serving responsibilities to the Exchange. Path optimality is determined by the payment to the publisher, and is affected by an advertiser's bid and also by the revenue-sharing agreements between the entities in the chosen path leading back to the publisher. Path validity is determined by constraints which focus on the following three issues: 1) suitability of the opportunity's web page and its publisher, 2) suitability of the user who is currently viewing that web page, and 3) suitability of a candidate ad and its advertiser. Because the Exchange's constrained path optimization task is novel, there are no published algorithms for it. This paper describes two different algorithms that have both been successfully used in the actual Yahoo! ad server. The first algorithm has the advantage of being extremely simple, while the second is more robust thanks to its polynomial worst-case running time. In both cases, meeting latency caps has required that the basic algorithms be improved by optimizations; we describe a candidate ordering scheme and a pre-computation scheme that have both been effective in reducing latency in the real ad serving system, which serves over ten billion ad calls per day.

【Keywords】: ad selection; ad serving; graph algorithms; online advertising

39. Trend analysis model: trend consists of temporal words, topics, and timestamps.

Paper Link】 【Pages】:317-326

【Authors】: Noriaki Kawamae

【Abstract】: This paper presents a topic model that identifies interpretable low-dimensional components in time-stamped data for capturing the evolution of trends. Unlike other models for time-stamped data, our proposal, the trend analysis model (TAM), focuses on the difference between temporal words and other words in each document to detect topic evolution over time. TAM introduces a latent trend class variable into each document and a latent switch variable into each token for handling these differences. The trend class has a probability distribution over temporal words, topics, and a continuous distribution over time, where each topic is responsible for generating words. The switch variable uses a document-specific probability distribution to judge from which variable each token's word is generated. Accordingly, TAM can explain which topic co-occurrence pattern will appear at any given time, and represents documents of similar content and timestamp as sharing the same trend class. Therefore, TAM projects them on a latent space of trend dimensionality and allows us to predict the temporal evolution of words and topics in document collections. Experiments on various data sets show that the proposed model can capture interpretable low-dimensional sets of topics and timestamps, offers advantages over previous models, and is useful as a generative model in the analysis of the evolution of trends.

【Keywords】: bayesian hierarchical model; graphical models; latent variable modeling; timestamp; timestamped data; topic modeling; trend analysis

Poster session 1 26

40. Topical semantics of twitter links.

Paper Link】 【Pages】:327-336

【Authors】: Michael J. Welch ; Uri Schonfeld ; Dan He ; Junghoo Cho

【Abstract】: Twitter, a micro-blogging platform with an estimated 20 million unique monthly visitors and over 100 million registered users, offers an abundance of rich, structured data at a rate exceeding 600 tweets per second. Recent efforts to leverage this social data to rank users by quality and topical relevance have largely focused on the "follow" relationship. Twitter's data offers additional implicit relationships between users, however, such as "retweets" and "mentions". In this paper we investigate the semantics of the follow and retweet relationships. Specifically, we show that the transitivity of topical relevance is better preserved over retweet links, and that retweeting a user is a significantly stronger indicator of topical interest than following him. We demonstrate these properties by ranking users with two variants of the PageRank algorithm; one based on the follows sub-graph and one based on the implicit retweet sub-graph. We perform a user study to assess the topical relevance of the resulting top-ranked users.

【Keywords】: link semantics; modeling; ranking; twitter; web graph
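
For illustration, ranking users by either relationship only changes the edge set handed to PageRank. Below is a minimal power-iteration PageRank over an adjacency dictionary; the graph format and parameters are assumptions, not the authors' code.

```python
def pagerank(out_links, d=0.85, iters=50):
    """Power-iteration PageRank over {user: [users they link to]};
    nodes without out-links spread their mass uniformly."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            share = rank[u] / (len(targets) or n)
            for v in (targets or nodes):
                nxt[v] += d * share
        rank = nxt
    return rank

# follow_rank  = pagerank(follow_graph)   # follows sub-graph variant
# retweet_rank = pagerank(retweet_graph)  # retweet sub-graph variant
```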

41. Evaluating the visual quality of web pages using a computational aesthetic approach.

Paper Link】 【Pages】:337-346

【Authors】: Ou Wu ; Yunfei Chen ; Bing Li ; Weiming Hu

【Abstract】: Current Web mining explores useful and valuable information (content) online for users. However, there is scant research on the overall visual aspect of Web pages, even though visual elements such as aesthetics significantly influence user experience. A beautiful and well-laid-out Web page greatly facilitates users' access and enhances their browsing experience. We use "visual quality (VisQ)" to denote the aesthetics of Web pages. In this paper, a computational aesthetics approach is proposed to learn the evaluation model for the visual quality of Web pages. First, a Web page layout extraction algorithm (V-LBE) is introduced to partition a Web page into major layout blocks. Then, regarding a Web page as a semi-structured image, features (e.g., layout, visual complexity, colorfulness) known to significantly affect the visual quality of a Web page are extracted to construct a feature vector. We present a multi-cost-sensitive learning method for visual quality classification and a multi-value regression for visual quality score assignment. Our experiments compare the extracted features and conclude that a Web page's layout visual features (LV) and text visual features (TV) are the primary factors affecting its visual quality. The performance of the learned visual quality classifier is close to that of some human assessors. The learned regression function also achieves promising results.

【Keywords】: aesthetics; learning; semi-structured image; visual quality; web mining

42. Clustering product features for opinion mining.

Paper Link】 【Pages】:347-354

【Authors】: Zhongwu Zhai ; Bing Liu ; Hua Xu ; Peifa Jia

【Abstract】: In sentiment analysis of product reviews, one important problem is to produce a summary of opinions based on product features/attributes (also called aspects). However, for the same feature, people can express it with many different words or phrases. To produce a useful summary, these words and phrases, which are domain synonyms, need to be grouped under the same feature group. Although several methods have been proposed to extract product features from reviews, limited work has been done on clustering or grouping of synonym features. This paper focuses on this task. Classic methods for solving this problem are based on unsupervised learning using some form of distributional similarity. However, we found that these methods do not perform well. We then model the task as a semi-supervised learning problem. Lexical characteristics of the problem are exploited to automatically identify some labeled examples. Empirical evaluation shows that the proposed method outperforms existing state-of-the-art methods by a large margin.

【Keywords】: opinion mining; product feature grouping

43. Materializing multi-relational databases from the web using taxonomic queries.

Paper Link】 【Pages】:355-364

【Authors】: Matthew Michelson ; Sofus A. Macskassy ; Steven Minton ; Lise Getoor

【Abstract】: Recently, much attention has been given to extracting tables from Web data. In this problem, the column definitions and tuples (such as which "company" is headquartered in which "city") are extracted from Web text, structured Web data such as lists, or results of querying the deep Web, creating the table of interest. In this paper, we examine the problem of extracting and discovering multiple tables in a given domain, generating a truly multi-relational database as output. Beyond discovering the relations that define single tables, our approach discovers and leverages "within column" set membership relations, and discovers relations across the extracted tables (e.g., joins). By leveraging within-column relations our method can extract table instances that are ambiguous or rare, and by discovering joins, our method generates truly multi-relational output. Further, our approach uses taxonomic queries to bootstrap the extraction, rather than the more traditional "seed instances." Creating seeds often requires more domain knowledge than taxonomic queries, and previous work has shown that extraction methods may be sensitive to which input seeds they are given. We test our approach on two real-world domains: NBA basketball and cancer information. Our results demonstrate that our approach generates databases of relevant tables from disparate Web information, and discovers the relations between them. Further, we show that by leveraging the "within column" relation our approach can identify a significant number of relevant tuples that would otherwise be difficult to extract.

【Keywords】: discovering multi-relational data; multirelational data

44. What blogs tell us about websites: a demographics study.

Paper Link】 【Pages】:365-374

【Authors】: Matthew Michelson ; Sofus A. Macskassy

【Abstract】: One challenge for content providers on the Web is determining who consumes their content. For instance, online newspapers want to know who is reading their articles. Previous approaches have tried to determine such audience demographics by placing cookies on users' systems, or by directly asking consumers (e.g., through surveys). The first approach may make users uncomfortable, and the second is not scalable. In this paper we focus on determining the demographics of a Website's audience by analyzing the blogs that link to the Website. We analyze both the text of the blogs and the connectivity of the blog network to determine demographics such as whether a person "is married" or "has pets." Presumably bloggers linking to sites also consume the content of those sites. Therefore, the discovered demographics for the bloggers can be used as a proxy set of demographics for a subset of the Website's consumers. We demonstrate that in many cases we can infer sub-audiences for a site from these demographics. Further, this feasibility demonstrates that very specific demographics for sites can be generated as we improve the methods for determining them (e.g., finding people who play video games). In our study we analyze blogs from more than 590,000 bloggers, collected over a six-month period, that link to more than 488,000 distinct, external websites.

【Keywords】: audience discovery; discovery of demographics

45. Cross lingual text classification by mining multilingual topics from wikipedia.

Paper Link】 【Pages】:375-384

【Authors】: Xiaochuan Ni ; Jian-Tao Sun ; Jian Hu ; Zheng Chen

【Abstract】: This paper investigates how to effectively perform cross-lingual text classification by leveraging a large-scale multilingual knowledge base, Wikipedia. Based on the observation that each Wikipedia concept is described by documents in different languages, we adapt existing topic modeling algorithms to mine multilingual topics from this knowledge base. The extracted topics have multiple types of representations, with each type corresponding to one language. In this work, we regard such topics extracted from Wikipedia documents as universal-topics, since each topic corresponds to the same semantic information across different languages. Thus new documents of different languages can be represented in a space spanned by a group of universal-topics. We use these universal-topics to perform cross-lingual text classification. Given training data labeled for one language, we can train a text classifier to classify documents of another language by mapping all documents of both languages into the universal-topic space. This approach does not require any additional linguistic resources, such as bilingual dictionaries, machine translation tools, or labeled data for the target language. The evaluation results indicate that our topic modeling approach is effective for building cross-lingual text classifiers.

【Keywords】: cross lingual text classification; multilingual; topic modeling; universal-topics; wikipedia

46. Multi-dimensional mining of large-scale search logs: a topic-concept cube approach.

Paper Link】 【Pages】:385-394

【Authors】: Dongyeop Kang ; Daxin Jiang ; Jian Pei ; Zhen Liao ; Xiaohui Sun ; Ho-Jin Choi

【Abstract】: In addition to search queries and the corresponding clickthrough information, search engine logs record multidimensional information about user search activities, such as search time, location, vertical, and search device. Multidimensional mining of search logs can provide novel insights and useful knowledge for both search engine users and developers. In this paper, we describe our topic-concept cube project, which addresses the business need of supporting multidimensional mining of search logs effectively and efficiently. We answer two challenges. First, search queries and click-through data are well recognized to be sparse, and thus have to be aggregated properly for effective analysis. Second, there is often a gap between the topic hierarchies in multidimensional aggregate analysis and the queries in search logs. To address those challenges, we develop a novel topic-concept model that learns a hierarchy of concepts and topics automatically from search logs. Enabled by the topic-concept model, we construct a topic-concept cube that supports online multidimensional mining of search log data. A distinct feature of our approach is that, in addition to the standard dimensions such as time and location, our topic-concept cube has a dimension of topics and concepts, which substantially facilitates the analysis of log data. To handle a huge amount of log data, we develop distributed algorithms for learning model parameters efficiently. We also devise approaches to computing a topic-concept cube. We report an empirical study verifying the effectiveness and efficiency of our approach on a real data set of 1.96 billion queries and 2.73 billion clicks.

【Keywords】: olap; search log; topic-concept cube

47. Optimizing merchant revenue with rebates.

Paper Link】 【Pages】:395-404

【Authors】: Rakesh Agrawal ; Samuel Ieong ; Raja Velu

【Abstract】: We study an online advertising model in which the merchant reimburses a portion of the transacted amount to the customer in the form of a rebate. The customer referral and the rebate transfer might be mediated by a search engine. We investigate how merchants can set rebate rates across different products to maximize their revenue. We consider two widely used demand models in economics---linear and log-linear---and explain how the effects of rebates can be incorporated into these models. Treating the estimated parameters as inputs to a revenue maximization problem, we develop convex optimization formulations of the problem and combinatorial algorithms for solving them. We validate our modeling assumptions using real transaction data. We conduct an extensive simulation study to evaluate the performance of our approach on maximizing revenue, and find that it generates significantly higher revenues for merchants compared to other rebate strategies. The rebate rates selected are extremely close to the optimal rates selected in hindsight.

【Keywords】: internet advertising; rebates
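
As a toy illustration of the demand-model idea for a single product, the sketch below picks the rebate rate maximizing merchant revenue under an assumed log-linear demand curve; the paper instead fits demand parameters from transaction data and solves the resulting optimization across products.

```python
import numpy as np

def best_rebate(price, a, b, grid=1001):
    """Grid-search the rebate rate r in [0, 1] maximizing revenue
    (1 - r) * price * demand(r) under an illustrative log-linear
    demand model demand(r) = exp(a + b * r) with b > 0."""
    r = np.linspace(0.0, 1.0, grid)
    revenue = (1 - r) * price * np.exp(a + b * r)
    best = np.argmax(revenue)
    return r[best], revenue[best]

rate, rev = best_rebate(price=100.0, a=0.0, b=2.0)  # optimum near r = 0.5
```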

48. Searchable web sites recommendation.

Paper Link】 【Pages】:405-414

【Authors】: Yang Song ; Nam Nguyen ; Li-wei He ; Scott Imig ; Robert Rounthwaite

【Abstract】: In this paper, we propose a new framework for searchable web site recommendation. Given a query, our system recommends a list of searchable web sites ranked by relevance, which can be used to complement the web page results and ads from a search engine. We model the conditional probability of a searchable web site being relevant to a given query in terms of three main components: the language model of the query, the language model of the content within the web site, and the reputation of the web site's search capability (static rank). The language models for queries and searchable sites are built using information mined from client-side browsing logs. The static rank for each searchable site leverages features extracted from these client-side logs, such as the number of queries submitted to the site, and features extracted from general search engines, such as the number of web pages indexed for the site, the number of clicks per query, and the dwell time that a user spends on the search result page and on the clicked result web pages. We also learn a weight for each kind of feature to optimize the ranking performance. In our experiment, we discover 10.5 thousand searchable sites and use 5 million unique queries, extracted from one week of log data, to build and demonstrate the effectiveness of our searchable web site recommendation system.

【Keywords】: vertical search engines
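
A minimal sketch of a scoring rule with the three components named above: a Dirichlet-smoothed query likelihood under the site's language model, combined with a static rank. Dirichlet smoothing and the log-linear combination are standard choices assumed here, not necessarily the paper's exact formulation.

```python
import math

def site_score(query_terms, site_lm, collection_lm, static_rank,
               mu=2000, w=1.0):
    """Score a searchable site for a query: Dirichlet-smoothed query
    likelihood under the site's language model (term counts mined from
    browsing logs) plus a weighted log static rank (assumed > 0)."""
    site_len = sum(site_lm.values())
    log_p = 0.0
    for t in query_terms:
        p = (site_lm.get(t, 0) + mu * collection_lm.get(t, 1e-9)) \
            / (site_len + mu)
        log_p += math.log(p)
    return log_p + w * math.log(static_rank)
```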

49. Inferring search behaviors using partially observable Markov model with duration (POMD).

Paper Link】 【Pages】:415-424

【Authors】: Yin He ; Kuansan Wang

【Abstract】: This paper presents Partially Observable Markov model with Duration (POMD), a statistical method that addresses the challenge of understanding sophisticated user behaviors from the search log, in which some user actions, such as reading and skipping search results, cannot be observed and recorded. POMD utilizes not only the positional but also the temporal information of the clicks in the log. In this work, we treat a user's engagements with a search engine as a Markov process, and model the unobservable engagements as hidden states. POMD differs from the traditional hidden Markov model (HMM) in that not all hidden state transitions emit observable events, and in that the duration of staying in each state is explicitly factored into the core statistical model. To address the training and decoding issues that emerge as a result of these variations, we propose an iterative two-stage training algorithm and a greedy segmental decoding algorithm, respectively. We validate the proposed algorithm with two sets of experiments. First, we show that the search behavioral patterns inferred by POMD match well with those reported in eye tracking experiments. Secondly, through a series of A/B comparison experiments, we demonstrate that POMD can distinguish the ranking qualities of different search engine configurations much better than the patterns inferred by the model proposed in previous work. Both experimental results suggest that POMD provides a statistical and quantitative way to understand sophisticated search behaviors by simply mining the search logs.

【Keywords】: duration; eye tracking; markov model; search behaviors

50. Learning website hierarchies for keyword enrichment in contextual advertising.

Paper Link】 【Pages】:425-434

【Authors】: Pavan Kumar GM ; Krishna P. Leela ; Mehul Parsana ; Sachin Garg

【Abstract】: In contextual advertising, textual ads relevant to the content of a webpage are embedded in the page. Content keywords are extracted offline by crawling webpages and then stored in an index for fast serving. Given a page, ad selection involves index lookup, computing similarity between the keywords of the page and those of candidate ads, and returning the top-k scoring ads. In this approach, ad relevance can suffer in two scenarios. First, since page-ad similarity is computed using keywords extracted only from that particular page, a few non-pertinent keywords can skew ad selection. Second, the requesting page may not be present in the index, but we still need to serve relevant ads. We propose a novel mechanism that mitigates both problems in the same framework. The basic idea is to enrich the keywords of a particular page with keywords from other but "similar" pages. The scheme involves learning a website-specific hierarchy from (page, URL) pairs of the website. Next, keywords are populated on the nodes via successive top-down and bottom-up iterations over the hierarchy. We evaluate our approach on three data sets: one small human-labeled set and two large-scale sets from Yahoo's contextual advertising system. Empirical evaluation shows that ads fetched via keyword enrichment have 2-3% higher nDCG than ads fetched by a recent semantic approach, even though our index is 7 times smaller than the semantic approach's index. Evaluation over pages not present in the index shows that ads fetched by our method have 6-7% higher nDCG than ads fetched by a recent approach that uses the first N bytes of the page content. Scalability is demonstrated via a map-reduce adaptation of our method and training on a large data set of 220 million pages from 95,104 websites.

【Keywords】: cache miss; contextual advertising; keyword aggregation; keyword retrieval accuracy; website hierarchy
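
As a rough illustration of enrichment over a URL-derived hierarchy, the sketch below aggregates keywords by URL path prefix and backs off to the nearest ancestor for pages missing from the index (cache misses). The paper learns the hierarchy and propagates keywords with top-down and bottom-up iterations; this prefix grouping is only a simplified stand-in.

```python
from collections import Counter
from urllib.parse import urlparse

def _dir_path(url):
    """Directory part of a URL path as a tuple of segments."""
    return tuple(urlparse(url).path.strip('/').split('/')[:-1])

def build_hierarchy(pages):
    """Aggregate keyword counts per directory node from
    (url, keyword_list) pairs."""
    tree = {}
    for url, kws in pages:
        tree.setdefault(_dir_path(url), Counter()).update(kws)
    return tree

def enrich(url, tree, top_k=20):
    """Return keywords from the nearest ancestor node, so an unseen
    page still receives terms from "similar" pages on the same site."""
    path = _dir_path(url)
    while path and path not in tree:
        path = path[:-1]  # back off toward the site root
    return [kw for kw, _ in tree.get(path, Counter()).most_common(top_k)]
```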

51. Action prediction and identification from mining temporal user behaviors.

Paper Link】 【Pages】:435-444

【Authors】: Dakan Wang ; Gang Wang ; Xiaofeng Ke ; Weizhu Chen

【Abstract】: Predicting a user's actions provides many monetization opportunities to web service providers. If a user's future action can be predicted and identified correctly in time or in advance, we can not only satisfy the user's current need, but also facilitate and simplify the user's future online activities. Traditional work on user behavior modeling, such as implicit feedback or personalization, mainly investigates users' immediate, short-term or aggregate behaviors. Hence, it is difficult to understand the diversity in temporal user behavior and predict a user's future action. In this paper, we consider a forecasting problem of temporal user behavior modeling. Our first objective is to capture relevant users that will perform an action. The second objective is to identify whether a user has finished the action, even when the action happened offline. We propose an ensemble algorithm to achieve both objectives. The experiment compares several implementation methods and demonstrates that temporal user behavior modeling using the ensemble algorithm significantly outperforms other methods.

【Keywords】: behavior targeting; ensemble modeling; temporal user behavior modeling

52. Collective extraction from heterogeneous web lists.

Paper Link】 【Pages】:445-454

【Authors】: Ashwin Machanavajjhala ; Arun Shankar Iyer ; Philip Bohannon ; Srujana Merugu

【Abstract】: Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites. We present a novel solution to this extraction problem that is based on i) collective extraction from multiple lists simultaneously and ii) careful exploitation of a small database of seed entities. Our approach addresses the layout homogeneity within the individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring mild and thus inexpensive supervision suitable for extraction from the tail of the web.

【Keywords】: collective bayesian models; hidden markov models; incremental; information extraction

53. CMAP: effective fusion of quality and relevance for multi-criteria recommendation.

Paper Link】 【Pages】:455-464

【Authors】: Xin Xin ; Michael R. Lyu ; Irwin King

【Abstract】: The research issue of recommender systems has been treated as a classical regression problem over the past decades and has achieved great success. In the next generation of recommender systems, multi-criteria recommendation has been predicted to be an important direction. Different from traditional recommender systems that aim particularly at recommending high-quality items evaluated by users' ratings, in multi-criteria recommendation quality serves as only one criterion, and many other criteria such as relevance, coverage, and diversity should be simultaneously optimized. Although there is recent work investigating each single criterion, there is rarely any literature that reports how the criteria impact each other and how to combine them in real applications. Thus in this paper, we study the relationship of two criteria, quality and relevance, as a preliminary work in multi-criteria recommendation. We first give a qualitative and quantitative analysis of competitive quality-based and relevance-based algorithms on these two criteria to show that neither kind of algorithm works well on the opposite criterion. Then we propose an integrated metric and finally investigate how to combine previous work into a unified model. In the combination, we introduce a Continuous-time MArkov Process (CMAP) algorithm for ranking, which enables principled and natural integration with features derived from both quality-based and relevance-based algorithms. Through experimental verification, the combined methods can significantly outperform either single quality-based or relevance-based algorithms on the integrated metric, and the CMAP model outperforms traditional combination methods by around 3%. Its linear complexity with respect to the number of users and items leads to satisfactory performance, as demonstrated by the roughly 7-hour computation time for over 480k users and almost 20k items.

【Keywords】: collaborative filtering; continuous-time markov process; recommender system

54. CoBayes: bayesian knowledge corroboration with assessors of unknown areas of expertise.

Paper Link】 【Pages】:465-474

【Authors】: Gjergji Kasneci ; Jurgen Van Gael ; David H. Stern ; Thore Graepel

【Abstract】: Our work aims at building probabilistic tools for constructing and maintaining large-scale knowledge bases containing entity-relationship-entity triples (statements) extracted from the Web. In order to mitigate the uncertainty inherent in information extraction and integration we propose leveraging the "wisdom of the crowds" by aggregating truth assessments that users provide about statements. The suggested method, CoBayes, operates on a collection of statements, a set of deduction rules (e.g. transitivity), a set of users, and a set of truth assessments of users about statements. We propose a joint probabilistic model of the truth values of statements and the expertise of users for assessing statements. The truth values of statements are interconnected through derivations based on the deduction rules. The correctness of a user's assessment for a given statement is modeled by linear mappings from user descriptions and statement descriptions into a common latent knowledge space where the inner product between user and statement vectors determines the probability that the user assessment for that statement will be correct. Bayesian inference in this complex graphical model is performed using mixed variational and expectation propagation message passing. We demonstrate the viability of CoBayes in comparison to other approaches, on real-world datasets and user feedback collected from Amazon Mechanical Turk.

【Keywords】: bayesian; expertise; feedback; knowledge; model; user

55. Multi-dimensional search result diversification.

Paper Link】 【Pages】:475-484

【Authors】: Zhicheng Dou ; Sha Hu ; Kun Chen ; Ruihua Song ; Ji-Rong Wen

【Abstract】: Most existing search result diversification algorithms diversify search results along a single, specific dimension. In this paper, we argue that search results should be diversified in a multi-dimensional way, as queries are usually ambiguous at different levels and dimensions. We first explore mining subtopics from four types of data sources, including anchor texts, query logs, search result clusters, and web sites. Then we propose a general framework that explicitly diversifies search results based on multiple dimensions of subtopics. It balances the relevance of documents with respect to the query and the novelty of documents by measuring the coverage of subtopics. Experimental results on the TREC 2009 Web track dataset indicate that combining multiple types of subtopics does help to better understand user intents. By incorporating multiple types of subtopics, our models improve the diversity of search results over the sole use of any one of them, and outperform two state-of-the-art models.

【Keywords】: intent; multi-dimensional; search result diversity; subtopic
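
The relevance/novelty balance described above can be illustrated with a greedy, xQuAD-style re-ranker that rewards documents covering not-yet-covered subtopics. The objective, weights, and data structures below are simplified assumptions, not the paper's exact model.

```python
def diversify(docs, rel, cover, subtopic_weight, k=10, lam=0.5):
    """Greedy re-ranking trading off relevance against coverage of
    new subtopics. rel[d]: relevance score; cover[d]: set of subtopics
    document d covers; subtopic_weight[s]: importance of subtopic s."""
    selected, covered = [], set()
    candidates = set(docs)
    while candidates and len(selected) < k:
        def gain(d):
            novelty = sum(subtopic_weight[s] for s in cover[d] - covered)
            return (1 - lam) * rel[d] + lam * novelty
        best = max(candidates, key=gain)
        selected.append(best)
        covered |= cover[best]
        candidates.remove(best)
    return selected
```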

56. Improving social bookmark search using personalised latent variable language models.

Paper Link】 【Pages】:485-494

【Authors】: Morgan Harvey ; Ian Ruthven ; Mark James Carman

【Abstract】: Social tagging systems have recently become very popular as a method of categorising information online and have been used to annotate a wide range of different resources. In such systems users are free to choose whatever keywords or "tags" they wish to annotate each resource with, resulting in a highly personalised, unrestricted vocabulary. While this freedom of choice has several notable advantages, it does come at the cost of making searching of these systems more difficult, as the vocabulary problem introduced is more pronounced than in a normal information retrieval setting. In this paper we propose to use hidden topic models as a principled way of reducing the dimensionality of this data to provide more accurate resource rankings with higher recall. We first describe Latent Dirichlet Allocation (LDA), a simple topic model, and then introduce two extended models which can be used to personalise the results by including information about the user who made each annotation. We test these three models and compare them with three non-topic-model baselines on a large data sample obtained from the Delicious social bookmarking site. Our evaluations show that our methods significantly outperform all of the baselines, with the personalised models also improving significantly upon unpersonalised LDA.

【Keywords】: collaborative tagging; personalised search; social bookmarking; topic models

57. Strength of social influence in trust networks in product review sites.

Paper Link】 【Pages】:495-504

【Authors】: Ching-man Au Yeung ; Tomoharu Iwata

【Abstract】: Some popular product review sites such as Epinions allow users to establish a trust network among themselves, indicating whom they trust in providing product reviews and ratings. While trust relations have been found to be useful in generating personalised recommendations, the relation between trust and product ratings has so far been overlooked. In this paper, we examine large datasets collected from Epinions and Ciao, two popular product review sites. We discover that in general users who trust each other tend to have smaller differences in their ratings as time passes, giving support to the theories of homophily and social influence. However, we also discover that this does not hold true across all trusted users. A trust relation does not guarantee that two users have similar preferences, implying that personalised recommendations based on trust relations do not necessarily produce more accurate predictions. We propose a method to estimate the strengths of trust relations so as to estimate the true influence among the trusted users. Our method extends the popular matrix factorisation technique for collaborative filtering, which allows us to generate more accurate rating predictions at the same time. We also show that the estimated strengths of trust relations correlate with the similarity among the users. Our work contributes to the understanding of the interplay between trust relations and product ratings, and suggests that trust networks may serve as a more general socialising venue than merely an indication of similarity in user preferences.

【Keywords】: collaborative filtering; matrix factorization; recommender systems; social influence; trust network

58. A combined topical/non-topical approach to identifying web sites for children.

Paper Link】 【Pages】:505-514

【Authors】: Carsten Eickhoff ; Pavel Serdyukov ; Arjen P. de Vries

【Abstract】: Today children interact more and more frequently with information services. Especially in on-line scenarios, there is a great amount of content that is not suitable for their age group. Due to the growing importance and ubiquity of the Internet in today's world, denying children any unsupervised Web access is often not possible. This work presents an automatic way of distinguishing web pages for children from those for adults in order to improve the performance of child-appropriate web search engines. A range of 80 different features based on findings from the cognitive sciences and children's psychology is discussed and evaluated. We conducted a large-scale user study on the suitability of web sites and give detailed information about the insights gained. Finally, a comparison to traditional web classification methods as well as to human annotator performance reveals that our automatic classifier can reach a performance close to that of human agreement.

【Keywords】: children; classification; filtering; suitability; web search

59. Bid generation for advanced match in sponsored search.

Paper Link】 【Pages】:515-524

【Authors】: Andrei Z. Broder ; Evgeniy Gabrilovich ; Vanja Josifovski ; George Mavromatis ; Alexander J. Smola

【Abstract】: Sponsored search is a three-way interaction between advertisers, users, and the search engine. The basic ad selection mechanism in sponsored search lets the advertiser choose the exact queries for which the ad is to be shown. To increase advertising volume, many advertisers opt into advanced match, where the search engine can select additional queries that are deemed relevant to the advertiser's ad. In advanced match, the search engine is effectively bidding on behalf of the advertisers. While advanced match has been extensively studied in the literature from the ad relevance perspective, there is little work that discusses how to infer the appropriate bid value for a given advanced match. The bid value is crucial, as it affects both the ad placement in revenue reordering and the amount advertisers are charged in case of a click. We propose a statistical approach to the bid generation problem and examine two information sources: the bidding behavior of advertisers, and conversion data. Our key finding suggests that sophisticated advertisers' bids are driven by many factors beyond clicks and immediately measurable conversions, likely capturing the value chain of an ad display, ranging from views and clicks to profit margins, and representing the total ROI from the advertising.

【Keywords】: machine learning; optimization; sponsored search

60. Let web spammers expose themselves.

Paper Link】 【Pages】:525-534

【Authors】: Zhicong Cheng ; Bin Gao ; Congkai Sun ; Yanbing Jiang ; Tie-Yan Liu

【Abstract】: This paper is concerned with mining link spams (e.g., link farms and link exchanges) from search engine optimization (SEO) forums. To provide quality services, it is critical for search engines to address web spam. Several techniques such as TrustRank, BadRank, and SpamRank have been proposed for this purpose. Most of these methods try to downgrade the effects of spam websites by identifying their specific link patterns. However, by reforming their spam techniques, spam websites have come to look more and more similar to normal or even good websites in their link structures. As a result, it is very challenging to automatically detect link spams from the Web graph. In this paper, we propose a different approach, which detects link spams by looking at how web spammers make link spam happen. We find that web spammers usually ally with each other, and SEO forums are one of the major means for them to form such alliances. We therefore propose mining suspicious link spams directly from the posts in SEO forums. However, the task is non-trivial because these posts also contain other information and even noise, in addition to useful clues about link spam. To tackle the challenges, we first extract all the URLs contained in the posts of the SEO forums. Second, we extract features for the URLs from their relationships with forum users (potential spammers) and from their link structure in the web graph. Third, we build a semi-supervised learning framework to calculate spam scores for the URLs, which encodes several heuristics, such as spam websites usually linking to each other and good websites seldom linking to spam websites. We tested our approach on seven major SEO forums. Many spam websites were identified, a significant proportion of which cannot be detected by conventional anti-spam methods, indicating that the proposed approach can be a good complement to existing anti-spam techniques.

【Keywords】: anti-spam; forum mining; semi-supervised learning

61. Efficient entity resolution for large heterogeneous information spaces.

Paper Link】 【Pages】:535-544

【Authors】: George Papadakis ; Ekaterini Ioannou ; Claudia Niederée ; Peter Fankhauser

【Abstract】: We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data impose new challenges on entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.

【Keywords】: attribute-agnostic blocking; data cleaning; entity resolution
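
A minimal sketch of the attribute-agnostic idea: place two profiles in a common block whenever they share a token in any attribute value, ignoring attribute names, then skip oversized low-utility blocks before pairwise comparison. The token-level key and size cutoff are illustrative; the paper's block processing techniques are considerably more refined.

```python
from collections import defaultdict
from itertools import combinations

def attribute_agnostic_blocks(entities):
    """Block on tokens from any attribute value, ignoring schemata.
    `entities` maps entity id -> {attribute: value string}."""
    blocks = defaultdict(set)
    for eid, attrs in entities.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}

def candidate_pairs(blocks, max_block=100):
    """Yield deduplicated candidate pairs, pruning oversized blocks
    (a crude stand-in for utility-driven block scheduling)."""
    seen = set()
    for ids in blocks.values():
        if len(ids) > max_block:
            continue
        for pair in combinations(sorted(ids), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair
```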

62. Web-scale table census and classification.

Paper Link】 【Pages】:545-554

【Authors】: Eric Crestan ; Patrick Pantel

【Abstract】: We report on a census of the types of HTML tables on the Web according to a fine-grained classification taxonomy describing the semantics that they express. For each relational table type, we describe open challenges for extracting from them semantic triples, i.e., knowledge. We also present TabEx, a supervised framework for web-scale HTML table classification and apply it to the task of classifying HTML tables into our taxonomy. We show empirical evidence, through a large-scale experimental analysis over a crawl of the Web, that classification accuracy significantly outperforms several baselines. We present a detailed feature analysis and outline the most salient features for each table type.

【Keywords】: classification; information extraction; structured data; web tables

63. A probabilistic approach for learning folksonomies from structured data.

Paper Link】 【Pages】:555-564

【Authors】: Anon Plangprasopchok ; Kristina Lerman ; Lise Getoor

【Abstract】: Learning structured representations has emerged as an important problem in many domains, including document and Web data mining, bioinformatics, and image analysis. One approach to learning complex structures is to integrate many smaller, incomplete and noisy structure fragments. In this work, we present an unsupervised probabilistic approach that extends affinity propagation [7] to combine the small ontological fragments into a collection of integrated, consistent, and larger folksonomies. This is a challenging task because the method must aggregate similar structures while avoiding structural inconsistencies and handling noise. We validate the approach on a real-world social media dataset, comprised of shallow personal hierarchies specified by many individual users, collected from the photosharing website Flickr. Our empirical results show that our proposed approach is able to construct deeper and denser structures, compared to an approach using only the standard affinity propagation algorithm. Additionally, the approach yields better overall integration quality than a state-of-the-art approach based on incremental relational clustering.

【Keywords】: collective knowledge; data mining; folksonomies; social information processing; social metadata; taxonomies

64. Linking online news and social media.

Paper Link】 【Pages】:565-574

【Authors】: Manos Tsagkias ; Maarten de Rijke ; Wouter Weerkamp

【Abstract】: Much of what is discussed in social media is inspired by events in the news and, vice versa, social media provide us with a handle on the impact of news events. We address the following linking task: given a news article, find social media utterances that implicitly reference it. We follow a three-step approach: we derive multiple query models from a given source news article, which are then used to retrieve utterances from a target social media index, resulting in multiple ranked lists that we then merge using data fusion techniques. Query models are created by exploiting the structure of the source article and by using explicitly linked social media utterances that discuss the source article. To combat query drift resulting from the large volume of text, either in the source news article itself or in social media utterances explicitly linked to it, we introduce a graph-based method for selecting discriminative terms. For our experimental evaluation, we use data from Twitter, Digg, Delicious, the New York Times Community, Wikipedia, and the blogosphere to generate query models. We show that different query models, based on different data sources, provide complementary information and manage to retrieve different social media utterances from our target index. As a consequence, data fusion methods manage to significantly boost retrieval performance over individual approaches. Our graph-based term selection method is shown to help improve both effectiveness and efficiency.

【Keywords】: linking; online news; social media; user generated content
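
As one concrete instance of the fusion step, here is CombSUM over min-max normalized scores, a standard data fusion baseline; whether this matches the exact fusion methods the authors evaluate is an assumption.

```python
def combsum(ranked_lists):
    """Merge ranked lists from different query models. Each element of
    `ranked_lists` maps item -> retrieval score; scores are min-max
    normalized per list and summed across lists."""
    fused = {}
    for scores in ranked_lists:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for item, s in scores.items():
            fused[item] = fused.get(item, 0.0) + (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)
```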

65. Document assignment in multi-site search engines.

Paper Link】 【Pages】:575-584

【Authors】: Ulf Brefeld ; Berkant Barla Cambazoglu ; Flavio Paiva Junqueira

【Abstract】: Assigning documents accurately to sites is critical for the performance of multi-site Web search engines. In such settings, sites crawl only documents they index and forward queries to obtain best-matching documents from other sites. Inaccurate assignments may lead to inefficiencies when crawling Web pages or processing user queries. In this work, we propose a machine-learned document assignment strategy that uses the locality of document views in search results to decide upon assignments. We evaluate the performance of our strategy using various document features extracted from a large Web collection. Our experimental setup uses query logs from a number of search front-ends spread across different geographic locations to learn the document access patterns. We compare our technique against baselines such as region- and language-based document assignment and observe that our technique achieves substantial performance improvements with respect to recall. With our technique, we are able to obtain a small query forwarding rate (0.04) while requiring roughly 45% less replication of documents compared to replicating all documents across all sites.

【Keywords】: classification; document replication; multi-site web search engines

Poster session 2 25

66. Transient crowd discovery on the real-time social web.

Paper Link】 【Pages】:585-594

【Authors】: Krishna Yeswanth Kamath ; James Caverlee

【Abstract】: In this paper, we study the problem of automatically discovering and tracking transient crowds in highly-dynamic social messaging systems like Twitter and Facebook. Unlike the more static and long-lived group-based membership offered on many social networks (e.g., fan of the LA Lakers), a transient crowd is a short-lived ad-hoc collection of users, representing a "hotspot" on the real-time web. Successful detection of these hotspots can positively impact related research directions in online event detection, content personalization, social information discovery, etc. Concretely, we propose to model crowd formation and dispersion through a message-based communication clustering approach over time-evolving graphs that captures the natural conversational nature of social messaging systems. Two of the salient features of the proposed approach are (i) an efficient locality-based clustering approach for identifying crowds of users in near real-time compared to more heavyweight static clustering algorithms; and (ii) a novel crowd tracking and evolution approach for linking crowds across time periods. We find that the locality-based clustering approach results in empirically high-quality clusters relative to static graph clustering techniques at a fraction of the computational cost. Based on a three month snapshot of Twitter consisting of 711,612 users and 61.3 million messages, we show how the proposed approach can successfully identify and track interesting crowds based on the Twitter communication structure and uncover crowd-based topics of interest.

【Keywords】: clustering; community detection; real-time web; social media

67. Adaptive bootstrapping of recommender systems using decision trees.

Paper Link】 【Pages】:595-604

【Authors】: Nadav Golbandi ; Yehuda Koren ; Ronny Lempel

【Abstract】: Recommender systems perform much better on users for which they have more information. This gives rise to the problem of satisfying users new to a system. The problem is even more acute considering that some of these hard-to-profile new users judge the unfamiliar system by its ability to immediately provide them with satisfying recommendations, and may quickly abandon the system when disappointed. Rapid profiling of new users by a recommender system is often achieved through a bootstrapping process - a kind of initial interview - that solicits users' opinions on certain carefully chosen items or categories. The elicitation process becomes particularly effective when adapted to users' responses, making best use of users' time by dynamically modifying the questions to improve the evolving profile. In particular, we advocate a specialized version of decision trees as the most appropriate tool for this task. We detail an efficient tree learning algorithm, specifically tailored to the unique properties of the problem. Several extensions to the tree construction are also introduced, which enhance the efficiency and utility of the method. We implemented our methods within a movie recommendation service. The experimental study delivered encouraging results, with the tree-based bootstrapping process significantly outperforming previous approaches.

【Keywords】: collaborative filtering; decision tree; new user; recommender systems; user cold start
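
To illustrate the kind of criterion such an interview tree can use, the sketch below scores a candidate interview item by the squared error left after routing users into lover/hater/unknown branches, each predicting its branch mean; the rating cutoff and error measure are illustrative, not the paper's exact tree-learning algorithm.

```python
def split_error(item, users, ratings):
    """Squared error if users are split on `item` into lover / hater /
    unknown branches. `ratings[u]` maps item -> rating for user u;
    a node's interview question is the item minimizing this error."""
    branches = {'lover': [], 'hater': [], 'unknown': []}
    for u in users:
        r = ratings[u].get(item)
        key = 'unknown' if r is None else ('lover' if r >= 4 else 'hater')
        branches[key].append(u)
    err = 0.0
    for members in branches.values():
        vals = [v for u in members for v in ratings[u].values()]
        if vals:
            mean = sum(vals) / len(vals)
            err += sum((v - mean) ** 2 for v in vals)
    return err, branches
```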

68. Predicting future reviews: sentiment analysis models for collaborative filtering.

Paper Link】 【Pages】:605-614

【Authors】: Noriaki Kawamae

【Abstract】: This paper presents hierarchical topic models for integrating sentiment analysis with collaborative filtering. Our goal is to automatically predict future reviews for a given author from his/her previous reviews. For this goal, we focus on differentiating authors' preferences, whereas previous sentiment analysis models process review articles without this distinction. We therefore propose a Latent Evaluation Topic model (LET) that infers each author's preference by introducing novel latent variables into the author and document layers. Because these variables distinguish the variety of words in each article by merging similar word distributions, LET incorporates differences in writers' preferences into sentiment analysis. Consequently, LET can determine the attitude of writers, and predict their reviews based on like-minded writers' reviews in the collaborative filtering approach. Experiments on review articles show that the proposed model can reduce the dimensionality of reviews to a low-dimensional set of these latent variables, and is a significant improvement over standard sentiment analysis models and collaborative filtering algorithms.

【Keywords】: collaborative filtering; information extraction; latent variable modeling; sentiment analysis; topic modeling

69. Learning similarity function for rare queries.

Paper Link】 【Pages】:615-624

【Authors】: Jingfang Xu ; Gu Xu

【Abstract】: The key element of many query processing tasks can be formalized as the calculation of similarities between queries. These tasks include query suggestion, query reformulation, and query expansion. Although many methods have been proposed for query similarity calculation, they can perform poorly on rare queries. As far as we know, there is no previous work specifically on rare-query similarity calculation, and this paper studies that problem. Specifically, we address three problems. Firstly, we define an n-gram space to represent queries by their own content and a similarity function to measure the similarities between queries. Secondly, we propose learning the similarity function by leveraging training data derived from user behavior data. This is formalized as an optimization problem, and a metric learning approach is employed to solve it efficiently. Finally, we exploit locality sensitive hashing for efficient retrieval of similar queries from a large query repository. We experimentally verified the effectiveness of the proposed approach by showing that our method can indeed enhance the accuracy of query similarity calculation for rare queries and efficiently retrieve similar queries. As an application, we also experimentally demonstrated that the similar queries found by our method can significantly improve search relevance.

【Keywords】: learning similarity function; query similarity; rare query
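
As a sketch of the retrieval step, the code below hashes queries into a fixed-size character n-gram vector space and applies random-hyperplane LSH, so that cosine-similar queries tend to collide in the same bucket. It omits the learned metric the paper places before hashing, and all dimensions and names are assumptions.

```python
import numpy as np

def ngram_vector(query, dims=4096, n=3):
    """Hash character n-grams of a query into a fixed-size count vector
    (a crude stand-in for the paper's n-gram space)."""
    v = np.zeros(dims)
    q = query.lower()
    for i in range(len(q) - n + 1):
        v[hash(q[i:i + n]) % dims] += 1.0
    return v

def lsh_signature(vec, planes):
    """Random-hyperplane LSH bit signature: cosine-similar vectors
    agree on most bits with high probability."""
    return tuple(bool(x) for x in (planes @ vec) >= 0)

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 4096))
buckets = {}
for q in ["cheap flights paris", "paris cheap flight"]:
    sig = lsh_signature(ngram_vector(q), planes)
    buckets.setdefault(sig, []).append(q)  # candidates share a bucket
```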

70. A two-view learning approach for image tag ranking.

Paper Link】 【Pages】:625-634

【Authors】: Jinfeng Zhuang ; Steven C. H. Hoi

【Abstract】: Tags of social images play a central role for text-based social image retrieval and browsing tasks. However, the original tags annotated by web users could be noisy, irrelevant, and often incomplete for describing the image contents, which may severely deteriorate the performance of text-based image retrieval models. In this paper, we aim to overcome the challenge of social tag ranking for a corpus of social images with rich user-generated tags by proposing a novel two-view learning approach. It can effectively exploit both textual and visual contents of social images to discover the complicated relationship between tags and images. Unlike the conventional learning approaches that usually assume some parametric models, our method is completely data-driven and makes no assumption of the underlying models, making the proposed solution practically more effective. We formally formulate our method as an optimization task and present an efficient algorithm to solve it. To evaluate the efficacy of our method, we conducted an extensive set of experiments by applying our technique to both text-based social image retrieval and automatic image annotation tasks, in which encouraging results showed that the proposed method is more effective than the conventional approaches.

【Keywords】: annotation; image search; optimization; recommendation; social images; tag ranking; two-view learning

71. Supervised random walks: predicting and recommending links in social networks.

Paper Link】 【Pages】:635-644

【Authors】: Lars Backstrom ; Jure Leskovec

【Abstract】: Predicting the occurrence of links is a fundamental problem in networks. In the link prediction problem we are given a snapshot of a network and would like to infer which interactions among existing members are likely to occur in the near future or which existing interactions we are missing. Although this problem has been extensively studied, the challenge of how to effectively combine the information from the network structure with rich node and edge attribute data remains largely open. We develop an algorithm based on Supervised Random Walks that naturally combines the information from the network structure with node and edge level attributes. We achieve this by using these attributes to guide a random walk on the graph. We formulate a supervised learning task where the goal is to learn a function that assigns strengths to edges in the network such that a random walker is more likely to visit the nodes to which new links will be created in the future. We develop an efficient training algorithm to directly learn the edge strength estimation function. Our experiments on the Facebook social graph and large collaboration networks show that our approach outperforms state-of-the-art unsupervised approaches as well as approaches that are based on feature extraction.

【Keywords】: link prediction; social networks
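
The scoring side of the method admits a compact sketch: edge strengths come from a logistic function of edge features, and a personalized PageRank is run from the source node so that strong edges attract more probability mass. Learning the weight vector requires the gradient machinery in the paper, which this sketch omits; data structures and parameters are assumptions.

```python
import numpy as np

def supervised_walk_scores(adj_feats, w, source, n, d=0.85, iters=50):
    """Personalized PageRank with learned edge strengths
    a_uv = sigmoid(w . psi_uv). adj_feats[u] lists (v, feature_vec);
    nodes are 0..n-1. Rank non-neighbors of `source` by the result."""
    def strength(f):
        return 1.0 / (1.0 + np.exp(-(w @ f)))
    p = np.zeros(n)
    p[source] = 1.0
    for _ in range(iters):
        nxt = np.zeros(n)
        nxt[source] = 1.0 - d  # restart mass
        for u in range(n):
            edges = adj_feats.get(u, [])
            if not edges:
                nxt[source] += d * p[u]  # dangling node restarts
                continue
            s = np.array([strength(f) for _, f in edges])
            s /= s.sum()
            for (v, _), pv in zip(edges, s):
                nxt[v] += d * p[u] * pv
        p = nxt
    return p
```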

72. OOLAM: an opinion oriented link analysis model for influence persona discovery.

Paper Link】 【Pages】:645-654

【Authors】: Keke Cai ; Shenghua Bao ; Zi Yang ; Jie Tang ; Rui Ma ; Li Zhang ; Zhong Su

【Abstract】: Social influence is a complex and subtle force that governs the dynamics of social networks. In the past years, a lot of research work has been conducted to understand the spread patterns of social influence. However, most approaches assume that influence exists between users with active social interactions, while ignoring the question of what kind of influence passes between them. As such, one interesting and also fundamental question is raised here: "in a social network, can the social connection reflect users' influence from both positive and negative aspects?". To this end, an Opinion Oriented Link Analysis Model (OOLAM) is proposed in this paper to characterize users' influence personae in order to exhibit their distinguishing influence ability in the social network. In particular, three types of influence personae are generalized and the problem of influence persona discovery is formally defined. Within the OOLAM model, two factors, i.e., opinion consistency and opinion creditability, are defined to capture persona information from the public opinion perspective. Extensive experimental studies have been performed to demonstrate the effectiveness of the proposed approach on influence persona analysis using real web data sets.

【Keywords】: creditability analysis; influence persona discovery; link analysis; opinion consistency

73. eBay: an E-commerce marketplace as a complex network.

Paper Link】 【Pages】:655-664

【Authors】: Zeqian Shen ; Neel Sundaresan

【Abstract】: Commerce networks involve buying and selling activities among individuals or organizations. With the growth of the Internet and e-commerce, it has become possible to obtain real-world online commerce networks that are orders of magnitude larger than before. Gaining a deeper understanding of e-commerce networks, such as the eBay marketplace, in terms of what structure they have, what kind of interactions they afford, what trust and reputation measures exist, and how they evolve has tremendous value in suggesting business opportunities and building effective user applications. In this paper, we modeled the eBay marketplace as a complex network. We analyzed the macroscopic shape of the network using its degree distribution and the bow-tie model. Networks of different eBay categories are also compared. The results suggest that the categories range from collector networks to retail networks. We also studied the local structures of the networks using motif profiling. Finally, patterns of preferential connections are visually analyzed using Auroral diagrams.

【Keywords】: auction; complex network; ecommerce; network motif; power law; preferential connection

74. A framework for quantitative analysis of cascades on networks.

Paper Link】 【Pages】:665-674

【Authors】: Rumi Ghosh ; Kristina Lerman

【Abstract】: How does information flow in online social networks? How does the structure and size of the information cascade evolve in time? How can we efficiently mine the information contained in cascade dynamics? We approach these questions empirically and present an efficient and scalable mathematical framework for quantitative analysis of cascades on networks. We define a cascade generating function that captures the details of the microscopic dynamics of the cascades. We show that this function can also be used to compute the macroscopic properties of cascades, such as their size, spread, diameter, number of paths, and average path length. We present an algorithm to efficiently compute cascade generating function and demonstrate that while significantly compressing information within a cascade, it nevertheless allows us to accurately reconstruct its structure. We use this framework to study information dynamics on the social network of Digg. Digg allows users to post and vote on stories, and easily see the stories that friends have voted on. As a story spreads on Digg through voting, it generates cascades. We extract cascades of more than 3,500 Digg stories and calculate their macroscopic and microscopic properties. We identify several trends in cascade dynamics: spreading via chaining, branching and community. We discuss how these affect the spread of the story through the Digg social network. Our computational framework is general and offers a practical solution to quantitative analysis of the microscopic structure of even very large cascades.

【Keywords】: cascades; diffusion; information spread; online social networks
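
A minimal sketch (not the paper's generating-function formalism) of computing the macroscopic cascade properties the abstract lists, such as size, depth, and average path length from the seed, via BFS over the cascade's edges. The toy cascade below is an assumption.

```python
from collections import deque

cascade = {"seed": ["u1", "u2"], "u1": ["u3"], "u2": ["u3", "u4"],
           "u3": [], "u4": []}

def cascade_stats(edges, seed):
    """BFS from the seed; a node's depth is its first-exposure distance."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in edges.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    depths = [d for n, d in dist.items() if n != seed]
    return {"size": len(dist),                       # nodes reached
            "depth": max(dist.values()),             # longest chain from seed
            "avg_path_len": sum(depths) / max(len(depths), 1)}

print(cascade_stats(cascade, "seed"))   # {'size': 5, 'depth': 2, 'avg_path_len': 1.5}
```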

75. Scalable clustering of news search results.

Paper Link】 【Pages】:675-684

【Authors】: Srinivas Vadrevu ; Choon Hui Teo ; Suju Rajan ; Kunal Punera ; Byron Dom ; Alexander J. Smola ; Yi Chang ; Zhaohui Zheng

【Abstract】: In this paper, we present a system for clustering the search results of a news search engine. The news search interface presents the articles relevant to a given query organized into related news stories: each cluster corresponds to a news story, and the news articles are clustered into stories. Our system clusters news search results in a fast and scalable manner and is organized into three components: offline clustering, incremental clustering, and realtime clustering. We propose novel techniques for clustering the search results in realtime. Experimental results with large collections of news documents reveal that our system is scalable and also achieves good accuracy in clustering the news search results.

【Keywords】: clustering; news clustering; news search; query based clustering; realtime clustering
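
A hedged sketch of an incremental-clustering component of the kind the abstract names: each incoming article vector is assigned to the nearest existing story centroid if similar enough, otherwise it seeds a new story. The cosine threshold and the running-mean centroid update are illustrative choices, not the system's.

```python
import numpy as np

class IncrementalClusterer:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.centroids = []      # one L2-normalised centroid per story
        self.counts = []

    def add(self, vec):
        """Assign a document vector to a story; return the story id."""
        vec = vec / np.linalg.norm(vec)
        if self.centroids:
            sims = [float(vec @ c) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # running-mean update of the matched story's centroid
                c = self.centroids[best] * self.counts[best] + vec
                self.centroids[best] = c / np.linalg.norm(c)
                self.counts[best] += 1
                return best
        self.centroids.append(vec)
        self.counts.append(1)
        return len(self.centroids) - 1

clusterer = IncrementalClusterer()
print(clusterer.add(np.array([1.0, 0.0])))   # 0: first story
print(clusterer.add(np.array([0.9, 0.1])))   # 0: same story
print(clusterer.add(np.array([0.0, 1.0])))   # 1: new story
```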

76. Large-scale hierarchical text classification without labelled data.

Paper Link】 【Pages】:685-694

【Authors】: Viet Ha-Thuc ; Jean-Michel Renders

【Abstract】: Traditional machine learning approaches for text classification require labelled data for learning classifiers. However, when applied to large-scale classification involving thousands of categories, creating such labelled data is extremely expensive, since the data is typically labelled manually by humans. Motivated by this, we propose a novel approach for large-scale hierarchical text classification which does not require any labelled data. We explore a perspective where the meaning of a category is defined not by human-labelled documents but by its description and, more importantly, its relationships with other categories (e.g. its ascendants and descendants). Specifically, we take advantage of this ontological knowledge in all phases of the whole process, namely when retrieving pseudo-labelled documents, when iteratively training the category models, and when categorizing test documents. Our experiments, based on a taxonomy containing 1131 categories that is widely adopted in the news industry as a standard for the NewsML framework, demonstrate the effectiveness of our approach in these phases both qualitatively and quantitatively. In particular, we emphasize that just by exploiting the simple ontological knowledge defined in the category hierarchy, we can automatically build a large-scale hierarchical classifier with reasonable performance of 67% in terms of the hierarchy-based F-1 measure.

【Keywords】: classification with no labelled data; hierarchical text classification; topic models; weakly supervised classification
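
A rough sketch of the bootstrap step implied above: score documents against each category using only its textual description (the paper additionally exploits ancestors and descendants), and take the best matches as pseudo-labelled training data. Uses scikit-learn; the toy category descriptions are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

category_desc = {"sports/tennis": "tennis grand slam atp wta racquet",
                 "economy/markets": "stocks bonds trading stock market"}
docs = ["the atp tour announced a new grand slam schedule",
        "bond yields fell as stock markets rallied"]

# One TF-IDF space for documents and category descriptions alike.
vec = TfidfVectorizer()
X = vec.fit_transform(docs + list(category_desc.values()))
doc_X, cat_X = X[:len(docs)], X[len(docs):]

sims = cosine_similarity(doc_X, cat_X)
for i, doc in enumerate(docs):
    best = sims[i].argmax()
    print(doc[:40], "->", list(category_desc)[best])
```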

77. Low-order tensor decompositions for social tagging recommendation.

Paper Link】 【Pages】:695-704

【Authors】: Yuanzhe Cai ; Miao Zhang ; Dijun Luo ; Chris H. Q. Ding ; Sharma Chakravarthy

【Abstract】: Social tagging recommendation is an urgent and useful enabling technology for Web 2.0. In this paper, we present a systematic study of low-order tensor decomposition approaches that are specifically targeted at the very sparse data encountered in tagging recommendation. Low-order polynomials have low functional complexity, are uniquely capable of enhancing statistics, and avoid over-fitting better than traditional tensor decompositions such as Tucker and Parafac. We perform extensive experiments on several datasets and compare with six existing methods. Experimental results demonstrate that our approach outperforms existing approaches.

【Keywords】: graph mining; recommendation system; social network
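
A minimal sketch of scoring with a low-order (pairwise) factorisation instead of a full Tucker/Parafac tensor model, in the spirit of the abstract: the (user, item, tag) affinity is a sum of pairwise inner products, which keeps the parameter count small on sparse data. The dimensions, random initialisation, and the particular pairwise terms are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_tags, k = 5, 4, 6, 3
U = rng.normal(size=(n_users, k))   # user factors
I = rng.normal(size=(n_items, k))   # item factors
T = rng.normal(size=(n_tags, k))    # tag factors

def score(u, i, t):
    """Low-order model: user-tag plus item-tag interactions only."""
    return U[u] @ T[t] + I[i] @ T[t]

# Recommend the top-2 tags for user 0 on item 1.
ranked = sorted(range(n_tags), key=lambda t: -score(0, 1, t))
print(ranked[:2])
```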

78. Shopping for products you don't know you need.

Paper Link】 【Pages】:705-714

【Authors】: Srikanth Jagabathula ; Nina Mishra ; Sreenivas Gollapudi

【Abstract】: Recommendation engines today suggest one product to another, e.g., an accessory to a product. However, intent to buy often precedes a user's appearance in a commerce vertical: someone interested in buying a skateboard may have earlier searched for {varial heelflip}, a trick performed on a skateboard. This paper considers how a search engine can provide early warning of commercial intent. The naive algorithm of counting how often an interest precedes a commercial query is not sufficient due to the number of related ways of expressing an interest. Thus, methods are needed for finding sets of queries where all pairs are related, what we call a query community, and this is the technical contribution of the paper. We describe a random model by which we obtain relationships between search queries and then prove general conditions under which we can reconstruct query communities. We propose two complementary approaches for inferring recommendations that utilize query communities in order to magnify the recommendation signal beyond what an individual query can provide. An extensive series of experiments on real search logs shows that the query communities found by our algorithm are more interesting and unexpected than a baseline of clustering the query-click graph. Also, whereas existing query suggestion algorithms are not designed for making commercial recommendations, we show that our algorithms do succeed in forecasting commercial intent. Query communities increase both the quantity and quality of recommendations.

【Keywords】: long-range recommendations; query communities
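
A toy sketch of the "query community" notion above, sets of queries in which every pair is related, found here as cliques of a query-relation graph via networkx. The relation edges are assumptions, and the paper's random-model reconstruction is far more involved than plain clique enumeration.

```python
import networkx as nx

related = [("varial heelflip", "kickflip"),
           ("kickflip", "skateboard tricks"),
           ("varial heelflip", "skateboard tricks"),
           ("buy skateboard", "skate shop")]

G = nx.Graph(related)
# A query community: a maximal set of queries that are pairwise related.
communities = [c for c in nx.find_cliques(G) if len(c) >= 3]
print(communities)   # [['varial heelflip', 'kickflip', 'skateboard tricks']]
```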

79. On composition of a federated web search result page: using online users to provide pairwise preference for heterogeneous verticals.

Paper Link】 【Pages】:715-724

【Authors】: Ashok Kumar Ponnuswami ; Kumaresh Pattabiraman ; Qiang Wu ; Ran Gilad-Bachrach ; Tapas Kanungo

【Abstract】: Modern web search engines are federated --- a user query is sent to numerous specialized search engines, called verticals, such as web (text documents), News, Image, and Video, and the results returned by these engines are then aggregated, composed into a search result page (SERP), and presented to the user. For a specific query, multiple verticals could be relevant, which makes the placement of vertical results within blocks of textual web results challenging: how do we represent, assess, and compare the relevance of these heterogeneous entities? In this paper we present a machine-learning framework for SERP composition in the presence of multiple relevant verticals. First, instead of using the traditional label generation method of human judgment guidelines and trained judges, we use a randomized online auditioning system that allows us to evaluate triples of the form <query, web block, vertical>. We use pairwise click preference to evaluate whether the web block or the vertical block drew better user engagement. Next, we use a hinged feature vector that contains features from the web block to create a common reference frame, and augment it with features representing the specific vertical judged by the user. A gradient boosted decision tree is then learned from the training data. For the final composition of the SERP, we place a vertical result at a slot if its score is higher than a computed threshold. The thresholds are algorithmically determined to guarantee specific coverage for verticals at each slot. We use correlation of clicks as our offline metric and show that the click-preference target has better correlation than models based on human judgments. Furthermore, in online tests for the News and Image verticals we show higher user engagement for both head and tail queries.

【Keywords】: federated web search; heterogeneous verticals; machine learning; pairwise preference from clicks; randomized flights

80. Stochastic query covering.

Paper Link】 【Pages】:725-734

【Authors】: Aris Anagnostopoulos ; Luca Becchetti ; Stefano Leonardi ; Ida Mele ; Piotr Sankowski

【Abstract】: In this paper we introduce the problem of query covering as a means to efficiently cache query results. The general idea is to populate the cache with documents that contribute to the result pages of a large number of queries, as opposed to caching the top documents for each query. It turns out that the problem is hard, and solving it requires knowledge of the structure of the queries and the result space, as well as knowledge of the input query distribution. We formulate the problem under the framework of stochastic optimization; theoretically it can be seen as a stochastic universal version of set multicover. While the problem is NP-hard to solve exactly, we show that for any distribution it can be approximated using a simple greedy approach. Our theoretical findings are complemented by experiments on real datasets, showing the feasibility and potential interest of query-covering approaches in practice.

【Keywords】: caching; query covering; stochastic analysis
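
A small sketch of the greedy approximation mentioned above: repeatedly add to the cache the document that contributes to the most not-yet-covered query mass, a standard greedy set-cover step (the paper's stochastic universal multicover formulation is richer). The query weights stand in for the input query distribution; all data is toy.

```python
def greedy_query_cover(results, weights, budget):
    """results: query -> set of documents that serve that query's result page."""
    cache, covered = set(), set()
    docs = {d for res in results.values() for d in res}
    while len(cache) < budget:
        remaining = docs - cache
        if not remaining:
            break
        def gain(d):   # weighted mass of still-uncovered queries d would serve
            return sum(weights[q] for q, res in results.items()
                       if q not in covered and d in res)
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break
        cache.add(best)
        covered |= {q for q, res in results.items() if best in res}
    return cache

results = {"q1": {"d1", "d2"}, "q2": {"d2"}, "q3": {"d3"}}
weights = {"q1": 0.5, "q2": 0.3, "q3": 0.2}   # stand-in query distribution
print(greedy_query_cover(results, weights, budget=2))   # {'d2', 'd3'}
```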

81. Learning to re-rank web search results with multiple pairwise features.

Paper Link】 【Pages】:735-744

【Authors】: Changsung Kang ; Xuanhui Wang ; Jiang Chen ; Ciya Liao ; Yi Chang ; Belle L. Tseng ; Zhaohui Zheng

【Abstract】: Web search ranking functions are typically learned to rank search results based on features of individual documents, i.e., pointwise features. Hence, the rich relationships among documents, which contain multiple types of useful information, are either totally ignored or explored only in a very limited way. In this paper, we propose to explore multiple pairwise relationships between documents in a learning setting to rerank search results. In particular, we use a set of pairwise features to capture various kinds of pairwise relationships and design two machine-learned re-ranking methods to effectively combine these features with a base ranking function: a pairwise comparison method and a pairwise function decomposition method. Furthermore, we propose several schemes to estimate the potential gains of our re-ranking methods on each query and selectively apply them to queries with high confidence. Our experiments on a large-scale commercial search engine editorial data set show that considering multiple pairwise relationships is quite beneficial and that our proposed methods achieve significant gains over methods which consider only pointwise features or a single type of pairwise relationship.

【Keywords】: pairwise features; re-rank
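
A hedged sketch of the pairwise-comparison idea above: a pairwise scorer votes on document pairs, and its votes adjust a base ranking. The linear scorer squashed through tanh and the 0.1 mixing weight are illustrative stand-ins for the paper's machine-learned comparators.

```python
import numpy as np

def rerank(base_scores, pair_features, w, mix=0.1):
    """base_scores: doc -> base ranker score.
    pair_features: (doc_a, doc_b) -> feature vector describing the pair."""
    adjust = {d: 0.0 for d in base_scores}
    for (a, b), f in pair_features.items():
        vote = np.tanh(f @ w)      # >0 means "a should beat b"
        adjust[a] += vote
        adjust[b] -= vote
    final = {d: base_scores[d] + mix * adjust[d] for d in base_scores}
    return sorted(final, key=final.get, reverse=True)

base = {"d1": 2.0, "d2": 1.9}
pairs = {("d2", "d1"): np.array([1.0, 0.5])}   # e.g. same-host, snippet overlap
print(rerank(base, pairs, w=np.array([0.7, 0.3])))   # ['d2', 'd1']
```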

82. The tube over time: characterizing popularity growth of youtube videos.

Paper Link】 【Pages】:745-754

【Authors】: Flavio Figueiredo ; Fabrício Benevenuto ; Jussara M. Almeida

【Abstract】: Understanding content popularity growth is of great importance to Internet service providers, content creators and online marketers. In this work, we characterize the growth patterns of video popularity on the currently most popular video sharing application, namely YouTube. Using data newly provided by the application, we analyze how the popularity of individual videos evolves from the video's upload time. Moreover, addressing a key aspect that has been mostly overlooked by previous work, we characterize the types of referrers that most often attracted users to each video, aiming to shed some light on the mechanisms (e.g., searching or external linking) that drive users towards a video and thus contribute to popularity growth. Our analyses are performed separately on three video datasets: videos that appear in the YouTube top lists, videos removed from the system due to copyright violation, and videos selected according to random queries submitted to YouTube's search engine. Our results show that popularity growth patterns depend on the video dataset. In particular, copyright-protected videos tend to get most of their views much earlier in their lifetimes, often exhibiting popularity growth characterized by a viral, epidemic-like propagation process. In contrast, videos in the top lists tend to experience sudden, significant bursts of popularity. We also show that not only search but also other YouTube-internal mechanisms play important roles in attracting users to videos in all three datasets.

【Keywords】: youtube; popularity growth; referrer; video popularity

83. Citation recommendation without author supervision.

Paper Link】 【Pages】:755-764

【Authors】: Qi He ; Daniel Kifer ; Jian Pei ; Prasenjit Mitra ; C. Lee Giles

【Abstract】: Automatic recommendation of citations for a manuscript is highly valuable for scholarly activities since it can substantially improve the efficiency and quality of literature search. The prior techniques placed a considerable burden on users, who were required to provide a representative bibliography or to mark passages where citations are needed. In this paper we present a system that considerably reduces this burden: a user simply inputs a query manuscript (without a bibliography) and our system automatically finds locations where citations are needed. We show that naïve approaches do not work well due to massive noise in the document corpus. We produce a successful approach by carefully examining the relevance between segments in a query manuscript and the representative segments extracted from a document corpus. An extensive empirical evaluation using the CiteSeerX data set shows that our approach is effective.

【Keywords】: bibliometrics; context; extraction; recommender systems

84. Query suggestion for E-commerce sites.

Paper Link】 【Pages】:765-774

【Authors】: Mohammad Al Hasan ; Nish Parikh ; Gyanit Singh ; Neel Sundaresan

【Abstract】: The query suggestion module is an integral part of every search engine. It helps search engine users narrow or broaden their searches. Published work on query suggestion methods has mainly focused on the web domain, but the module is also popular in the domain of e-commerce for product search. In this paper, we discuss query suggestion and its methodologies in the context of e-commerce search engines. We show that dynamic inventory combined with a long and sparse tail of the query distribution poses unique challenges to building a query suggestion method for an e-commerce marketplace. We compare and contrast the design of a query suggestion system for web search engines and e-commerce search engines. Further, we discuss interesting measures to quantify the effectiveness of our query suggestion methodologies. We also describe the lessons learned from exposing our query suggestion module to a vibrant community of millions of users.

【Keywords】: query reformulation; query suggestion; recommendation system; suggestion ranking

85. An algorithmic treatment of strong queries.

Paper Link】 【Pages】:775-784

【Authors】: Ravi Kumar ; Silvio Lattanzi ; Prabhakar Raghavan

【Abstract】: A strong query for a target document with respect to an index is the smallest query for which the target document is returned by the index as the top result. The strong query problem was first studied more than a decade ago in the context of measuring search engine overlap. Despite its simple-to-state nature and its longevity in the field, this problem has not been sufficiently addressed in a formal manner. In this paper we provide the first rigorous treatment of the strong query problem. We show an interesting connection between this problem and the set cover problem, and use it to obtain basic hardness and algorithmic results. Experiments on more than 10K documents show that our proposed algorithm performs much better than the widely-used word frequency-based heuristic. En route, our study suggests that fewer than four words on average can be sufficient to uniquely identify web pages.

【Keywords】: greedy algorithm; set cover; strong query
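
A minimal sketch of the set-cover-flavoured greedy the connection above suggests: to build a small query that uniquely retrieves a target document, repeatedly add the target word that eliminates the most competing documents. The toy corpus and the bag-of-words retrieval model are assumptions.

```python
def strong_query(target, corpus):
    """corpus: doc id -> set of words; returns words that isolate target."""
    candidates = set(corpus) - {target}
    query = []
    while candidates:
        gain = lambda w: sum(1 for d in candidates if w not in corpus[d])
        word = max(corpus[target], key=gain)
        if gain(word) == 0:
            break   # no word of the target separates it further
        query.append(word)
        candidates = {d for d in candidates if word in corpus[d]}
    return query

corpus = {"t": {"zebra", "market", "news"},
          "d1": {"market", "news"}, "d2": {"zebra", "news"}}
print(strong_query("t", corpus))   # e.g. ['zebra', 'market']
```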

86. Enhanced email spam filtering through combining similarity graphs.

Paper Link】 【Pages】:785-794

【Authors】: Anirban Dasgupta ; Maxim Gurevich ; Kunal Punera

【Abstract】: Over the last decade Email Spam has evolved from being just an irritant to users to being truly dangerous. This has led web-mail providers and academic researchers to dedicate considerable resources towards tackling this problem [9, 21, 22, 24, 26]. However, we argue that some aspects of the spam filtering problem are not handled appropriately in existing work. Principal among these are adversarial spammer efforts -- spammers routinely tune their spam emails to bypass spam-filters, and contaminate ground truth via fake HAM/SPAM votes -- and the scale and sparsity of the problem, which essentially precludes learning with a very large set of parameters. In this paper we propose an approach that learns to filter spam by striking a balance between generalizing HAM/SPAM votes across users and emails (to alleviate sparsity) and learning local models for each user (to limit effect of adversarial votes); votes are shared only amongst users and emails that are "similar" to one another. Moreover, we define user-user and email-email similarities using spam-resilient features that are extremely difficult for spammers to fake. We give a methodology that learns to combine multiple features into similarity values while directly optimizing the objective of better spam filtering. A useful side effect of this methodology is that the number of parameters that need to be estimated is very small: this helps us use off-the-shelf learning algorithms to achieve good accuracy while preventing over-training to the adversarial noise in the data. Finally, our approach gives a systematic way to incorporate existing spam-fighting technologies such as IP blacklists, keyword based classifiers, etc into one framework. Experiments on a real-world email dataset show that our approach leads to significant improvements compared to two state-of-the-art baselines.

【Keywords】: collaborative filtering; email spam filtering; indexing; personalization; similarity graphs; top-k retrieval

87. LambdaMerge: merging the results of query reformulations.

Paper Link】 【Pages】:795-804

【Authors】: Daniel Sheldon ; Milad Shokouhi ; Martin Szummer ; Nick Craswell

【Abstract】: Search engines can automatically reformulate user queries in a variety of ways, often leading to multiple queries that are candidates to replace the original. However, selecting a replacement can be risky: a reformulation may be more effective than the original or significantly worse, depending on the nature of the query, the source of reformulation candidates, and the corpus. In this paper, we explore methods to mitigate this risk by issuing several versions of the query (including the original) and merging their results. We focus on reformulations generated by random walks on the click graph, a method that can produce very good reformulations but is also variable and prone to topic drift. Our primary contribution is λ-Merge, a supervised merging method that is trained to directly optimize a retrieval metric (such as NDCG or MAP) using features that describe both the reformulations and the documents they return. In experiments on Bing data and GOV2, λ-Merge outperforms the original query and several unsupervised merging methods. λ-Merge also outperforms a supervised method to predict and select the best single formulation, and is competitive with an oracle that always selects the best formulation.

【Keywords】: lambdamerge; learning to rank; query expansion; query reformulation
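
A rough sketch of the merging idea above (not the trained λ-Merge model): each reformulation's result list carries document scores, and a softmax gate over reformulation-quality features decides how much each list contributes to a document's merged score. The gate features and weights here are assumptions; λ-Merge learns them to directly optimize a retrieval metric.

```python
import numpy as np

def merge(doc_scores, reform_features, v):
    """doc_scores: list (one dict doc -> score per reformulation).
    reform_features: matrix, one feature row per reformulation."""
    gate = np.exp(reform_features @ v)
    gate /= gate.sum()                    # softmax over reformulations
    merged = {}
    for g, scores in zip(gate, doc_scores):
        for doc, s in scores.items():
            merged[doc] = merged.get(doc, 0.0) + g * s
    return sorted(merged, key=merged.get, reverse=True)

lists = [{"d1": 0.9, "d2": 0.4}, {"d2": 0.8, "d3": 0.7}]   # original + rewrite
feats = np.array([[1.0, 0.2], [0.6, 0.9]])                 # e.g. drift, overlap
print(merge(lists, feats, v=np.array([0.5, -0.3])))        # ['d2', 'd1', 'd3']
```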

88. Normalizing web product attributes and discovering domain ontology with minimal effort.

Paper Link】 【Pages】:805-814

【Authors】: Tak-Lam Wong ; Lidong Bing ; Wai Lam

【Abstract】: We have developed a framework aiming at normalizing product attributes from Web pages collected from different Web sites without the need for labeled training examples. It can deal with pages of different layout formats and content in an unsupervised manner, and can therefore handle a variety of domains with minimal effort. Our model is based on a generative probabilistic graphical model incorporating Hidden Markov Models (HMM), considering both attribute names and attribute values to extract and normalize text fragments from Web pages in a unified manner. A Dirichlet Process is employed to handle the unlimited number of attributes in a domain. An unsupervised inference method is proposed to predict the unobservable variables. We have also developed a method to automatically construct a domain ontology using the normalized product attributes, which are the output of the inference on the graphical model. We have conducted extensive experiments, comparing with existing works using product Web pages collected from real-world Web sites in three different domains, to demonstrate the effectiveness of our framework.

【Keywords】: graphical models; information extraction; web mining

89. Aspect and sentiment unification model for online review analysis.

Paper Link】 【Pages】:815-824

【Authors】: Yohan Jo ; Alice H. Oh

【Abstract】: User-generated reviews on the Web contain sentiments about detailed aspects of products and services. However, most of the reviews are plain text and thus require much effort to obtain information about relevant details. In this paper, we tackle the problem of automatically discovering what aspects are evaluated in reviews and how sentiments for different aspects are expressed. We first propose Sentence-LDA (SLDA), a probabilistic generative model that assumes all words in a single sentence are generated from one aspect. We then extend SLDA to Aspect and Sentiment Unification Model (ASUM), which incorporates aspect and sentiment together to model sentiments toward different aspects. ASUM discovers pairs of {aspect, sentiment} which we call senti-aspects. We applied SLDA and ASUM to reviews of electronic devices and restaurants. The results show that the aspects discovered by SLDA match evaluative details of the reviews, and the senti-aspects found by ASUM capture important aspects that are closely coupled with a sentiment. The results of sentiment classification show that ASUM outperforms other generative models and comes close to supervised classification methods. One important advantage of ASUM is that it does not require any sentiment labels of the reviews, which are often expensive to obtain.

【Keywords】: aspect detection; sentiment analysis; topic modeling

90. Searching patterns for relation extraction over the web: rediscovering the pattern-relation duality.

Paper Link】 【Pages】:825-834

【Authors】: Yuan Fang ; Kevin Chen-Chuan Chang

【Abstract】: While tuple extraction for a given relation has been an active research area, its dual problem of pattern search-- to find and rank patterns in a principled way-- has not been studied explicitly. In this paper, we propose and address the problem of pattern search, in addition to tuple extraction. As our objectives, we stress reusability for pattern search and scalability of tuple extraction, such that our approach can be applied to very large corpora like the Web. As the key foundation, we propose a conceptual model PRDualRank to capture the notion of precision and recall for both tuples and patterns in a principled way, leading to the "rediscovery" of the Pattern-Relation Duality-- the formal quantification of the reinforcement between patterns and tuples with the metrics of precision and recall. We also develop a concrete framework for PRDualRank, guided by the principles of a perfect sampling process over a complete corpus. Finally, we evaluated our framework over the real Web. Experiments show that on all three target relations our principled approach greatly outperforms the previous state-of-the-art system in both effectiveness and efficiency. In particular, we improved optimal F-score by up to 64%.

【Keywords】: information extraction; pattern; ranking; relation; web mining
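
A toy sketch of the pattern-relation duality the abstract formalises: pattern scores and tuple scores reinforce each other through the pattern-tuple extraction graph, HITS-style. The seed tuples, the averaging update, and the normalisation are illustrative, not PRDualRank's exact precision/recall metrics.

```python
def dual_rank(extractions, seeds, iters=10):
    """extractions: pattern -> set of tuples it extracts from the corpus."""
    tuples = {t for ts in extractions.values() for t in ts}
    t_score = {t: 1.0 if t in seeds else 0.0 for t in tuples}
    for _ in range(iters):
        # a pattern is roughly as precise as the tuples it extracts
        p_score = {p: sum(t_score[t] for t in ts) / len(ts)
                   for p, ts in extractions.items()}
        # a tuple is roughly as credible as the patterns extracting it
        t_score = {t: sum(s for p, s in p_score.items()
                          if t in extractions[p]) for t in tuples}
        top = max(t_score.values()) or 1.0
        t_score = {t: v / top for t, v in t_score.items()}
    return p_score, t_score

ext = {"<X capital of Y>": {("paris", "france"), ("oslo", "norway")},
       "<X city in Y>": {("paris", "france"), ("lyon", "france")}}
pats, _ = dual_rank(ext, seeds={("paris", "france"), ("oslo", "norway")})
print(max(pats, key=pats.get))   # '<X capital of Y>'
```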

91. On the selection of tags for tag clouds.

Paper Link】 【Pages】:835-844

【Authors】: Petros Venetis ; Georgia Koutrika ; Hector Garcia-Molina

【Abstract】: We examine the creation of a tag cloud for exploring and understanding a set of objects (e.g., web pages, documents). In the first part of our work, we present a formal system model for reasoning about tag clouds. We then present metrics that capture the structural properties of a tag cloud, and we briefly present a set of tag selection algorithms that are used in current sites (e.g., del.icio.us, Flickr, Technorati) or that have been described in recent work. In order to evaluate the results of these algorithms, we devise a novel synthetic user model. This user model is specifically tailored for tag cloud evaluation and assumes an "ideal" user. We evaluate the algorithms under this user model, as well as the model itself, using two datasets: CourseRank (a Stanford social tool containing information about courses) and del.icio.us (a social bookmarking site). The results yield insights as to when and why certain selection schemes work best.

【Keywords】: tag clouds; tag selection; tags
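
A brief sketch of one tag-selection scheme of the kind the abstract surveys: greedily pick tags that maximise coverage of the object set, a common baseline alongside popularity-based selection. The tagging data is toy, and real selection algorithms also weigh the structural metrics the paper defines.

```python
def select_tags(tagged, k):
    """tagged: tag -> set of objects carrying that tag; pick k tags greedily."""
    chosen, covered = [], set()
    while len(chosen) < k:
        tag = max((t for t in tagged if t not in chosen),
                  key=lambda t: len(tagged[t] - covered))   # marginal coverage
        chosen.append(tag)
        covered |= tagged[tag]
    return chosen

tagged = {"python": {1, 2, 3}, "web": {3, 4}, "ml": {2, 3}}
print(select_tags(tagged, 2))   # ['python', 'web']
```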