Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016. ACM 【DBLP Link】
【Paper Link】 【Pages】:1
【Authors】: Peter Norvig
【Abstract】: We would like to understand the meaning of content on the web. But where should that meaning come from? From markup language annotations created by the authors of the content? Crowdsourced from readers of the content? Automatically extracted by machine learning algorithms? This talk investigates the possibilities.
【Keywords】: knowledge extraction; question answering; semantic web
【Paper Link】 【Pages】:3
【Authors】: Martha Lane Fox
【Abstract】:
【Keywords】:
【Paper Link】 【Pages】:5
【Authors】: Mary Ellen Zurko
【Abstract】: Open has meant a lot of things in the web thus far. The openness of the web has had profound implications for web security, from the beginning through to today. Each time the underlying web technology changes, we do a reset on the security it provides. Patterns and differences emerge in each round of security responses and challenges. What has that brought us as web users, technologists, researchers, and as a global community? What can we expect going forward? And what should we work towards as web technologists and caretakers?
【Keywords】: human factors; security
【Paper Link】 【Pages】:9-20
【Authors】: Daniel M. Romero ; Brian Uzzi ; Jon M. Kleinberg
【Abstract】: Social network research has begun to take advantage of fine-grained communications regarding coordination, decision-making, and knowledge sharing. These studies, however, have not generally analyzed how external events are associated with a social network's structure and communicative properties. Here, we study how external events are associated with a network's change in structure and communications. Analyzing a complete dataset of millions of instant messages among the decision-makers in a large hedge fund and their network of outside contacts, we investigate the link between price shocks, network structure, and change in the affect and cognition of decision-makers embedded in the network. When price shocks occur, the communication network tends not to display structural changes associated with adaptiveness. Rather, the network 'turtles up'. It displays a propensity for higher clustering, strong tie interaction, and an intensification of insider vs. outsider communication. Further, we find changes in network structure predict shifts in cognitive and affective processes, execution of new transactions, and local optimality of transactions better than prices, revealing the important predictive relationship between network structure and collective behavior within a social network.
【Keywords】: collective behavior; organizations; social networks; temporal dynamics
【Paper Link】 【Pages】:21-30
【Authors】: Desislava Hristova ; Matthew J. Williams ; Mirco Musolesi ; Pietro Panzarasa ; Cecilia Mascolo
【Abstract】: Large metropolitan cities bring together diverse individuals, creating opportunities for cultural and intellectual exchanges, which can ultimately lead to social and economic enrichment. In this work, we present a novel network perspective on the interconnected nature of people and places, allowing us to capture the social diversity of urban locations through the social network and mobility patterns of their visitors. We use a dataset of approximately 37K users and 42K venues in London to build a network of Foursquare places and the parallel Twitter social network of visitors through check-ins. We define four metrics of the social diversity of places which relate to their social brokerage role, their entropy, the homogeneity of their visitors and the amount of serendipitous encounters they are able to induce. This allows us to distinguish between places that bring together strangers versus those which tend to bring together friends, as well as places that attract diverse individuals as opposed to those which attract regulars. We correlate these properties with wellbeing indicators for London neighbourhoods and discover signals of gentrification in deprived areas with high entropy and brokerage, where an influx of more affluent and diverse visitors points to an overall improvement of their rank according to the UK Index of Multiple Deprivation for the area over the five-year census period. Our analysis sheds light on the relationship between the prosperity of people and places, distinguishing between different categories and urban geographies of consequence to the development of urban policy and the next generation of socially-aware location-based applications.
【Keywords】: geo-social networks; social diversity; urban deprivation
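【Illustrative Sketch】: One of the four diversity measures above is an entropy over a place's visitors; a minimal sketch of such an entropy-style metric on hypothetical check-in data (the paper's exact metric definitions differ in detail):

```python
import math
from collections import Counter

def place_entropy(visitor_ids):
    """Shannon entropy of a place's check-ins over distinct visitors:
    high when check-ins are spread over many different people (a diverse place),
    low when a few regulars dominate."""
    counts = Counter(visitor_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical check-in logs: place -> list of visiting user ids.
checkins = {
    "cafe_A":   ["u1", "u1", "u1", "u1", "u2"],           # dominated by a regular
    "market_B": ["u1", "u2", "u3", "u4", "u5", "u6"],     # many distinct visitors
}
for place, visitors in checkins.items():
    print(place, round(place_entropy(visitors), 3))
```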
【Paper Link】 【Pages】:31-40
【Authors】: Jiliang Tang ; Charu C. Aggarwal ; Huan Liu
【Abstract】: Recommender systems play a crucial role in mitigating the information overload problem in social media by suggesting relevant information to users. The popularity of pervasively available social activities for social media users has encouraged a large body of literature on exploiting social networks for recommendation. The vast majority of these systems focus on unsigned social networks (or social networks with only positive links), while little work exists for signed social networks (or social networks with positive and negative links). The availability of negative links in signed social networks presents both challenges and opportunities in the recommendation process. We provide a principled and mathematical approach to exploit signed social networks for recommendation, and propose a model, RecSSN, to leverage positive and negative links in signed social networks. Empirical results on real-world datasets demonstrate the effectiveness of the proposed framework. We also perform further experiments to explicitly understand the effect of signed networks in RecSSN.
【Keywords】: negative links; signed social networks; social recommendation
【Paper Link】 【Pages】:41-50
【Authors】: Qingbo Hu ; Sihong Xie ; Jiawei Zhang ; Qiang Zhu ; Songtao Guo ; Philip S. Yu
【Abstract】: Nowadays, a modern e-commerce company may have both online sales and offline sales departments. Normally, online sales attempt to sell in small quantities to individual customers through broadcasting a large number of emails or promotion codes, and rely heavily on the designed backend algorithms. Offline sales, on the other hand, try to sell in much larger quantities to enterprise customers through contacts initiated by sales representatives, which are more costly compared to online sales. Unlike many previous research works focusing on machine learning algorithms to support online sales, this paper introduces an approach that utilizes heterogeneous social networks to improve the effectiveness of offline sales. More specifically, we propose a two-phase framework, HeteroSales, which first constructs a company-to-company graph, a.k.a. the Company Homophily Graph (CHG), from semantics-based meta-path learning, and then adopts label propagation on the graph to predict promising companies with which we may successfully close an offline deal. Based on statistical analysis of the world's largest professional social network, LinkedIn, we present interesting discoveries showing that not all the social connections in a heterogeneous social network are useful in this task. In other words, proper data preprocessing is essential to ensure the effectiveness of offline sales. Finally, through experiments on LinkedIn social network data and third-party offline sales records, we demonstrate the power of HeteroSales to identify potential enterprise customers in offline sales.
【Keywords】: heterogeneous social networks; label propagation; maximum likelihood estimation; meta-path learning; offline sales
【Paper Link】 【Pages】:51-62
【Authors】: Cheng-Kang Hsieh ; Longqi Yang ; Honghao Wei ; Mor Naaman ; Deborah Estrin
【Abstract】: We propose a new user-centric recommendation model, called Immersive Recommendation, that incorporates cross-platform and diverse personal digital traces into recommendations. Our context-aware topic modeling algorithm systematically profiles users' interests based on their traces from different contexts, and our hybrid recommendation algorithm makes high-quality recommendations by fusing users' personal profiles, item profiles, and existing ratings. Specifically, in this work we target personalized news and local event recommendations for their utility and societal importance. We evaluated the model with a large-scale offline evaluation leveraging users' public Twitter traces. In addition, we conducted a direct evaluation of the model's recommendations in a 33-participant study using Twitter, Facebook and email traces. In both cases, the proposed model showed significant improvement over the state-of-the-art algorithms, suggesting the value of using this new user-centric recommendation model to improve recommendation quality, including in cold-start situations.
【Keywords】: personal digital traces; personalization; recommendations; small data
【Paper Link】 【Pages】:63-72
【Authors】: Oren Sar Shalom ; Noam Koenigstein ; Ulrich Paquet ; Hastagiri P. Vanchinathan
【Abstract】: Most Collaborative Filtering (CF) algorithms are optimized using a dataset of isolated user-item tuples. However, in commercial applications recommended items are usually served as an ordered list of several items and not as isolated items. In this setting, inter-item interactions have an effect on the list's Click-Through Rate (CTR) that is unaccounted for using traditional CF approaches. Most CF approaches also ignore additional important factors like click propensity variation, item fatigue, etc. In this work, we introduce the list recommendation problem. We present useful insights gleaned from user behavior and consumption patterns from a large scale real world recommender system. We then propose a novel two-layered framework that builds upon existing CF algorithms to optimize a list's click probability. Our approach accounts for inter-item interactions as well as additional information such as item fatigue, trendiness patterns, contextual information etc. Finally, we evaluate our approach using a novel adaptation of Inverse Propensity Scoring (IPS) which facilitates off-policy estimation of our method's CTR and showcases its effectiveness in real-world settings.
【Keywords】: click prediction; collaborative filtering
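【Illustrative Sketch】: The off-policy CTR estimation mentioned above rests on Inverse Propensity Scoring; a minimal sketch on hypothetical logs (the paper uses its own adaptation of IPS for ranked lists):

```python
def ips_ctr_estimate(logs, new_policy):
    """Estimate the CTR a new list-serving policy would achieve, using logs
    collected under an old policy. Each log entry is
    (context, shown_list, click, propensity), where `propensity` is the
    probability the old policy had of showing `shown_list` in that context."""
    total = 0.0
    for context, shown_list, click, propensity in logs:
        if new_policy(context) == shown_list:        # the new policy agrees with the log
            total += click / propensity              # re-weight by inverse propensity
    return total / len(logs)

# Hypothetical logs: two candidate lists served uniformly at random (propensity 0.5).
logs = [
    ("user1", ("a", "b"), 1, 0.5),
    ("user2", ("b", "a"), 0, 0.5),
    ("user3", ("a", "b"), 0, 0.5),
]
always_ab = lambda context: ("a", "b")               # candidate policy: always show ("a", "b")
print(ips_ctr_estimate(logs, always_ab))             # unbiased CTR estimate for that policy
```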
【Paper Link】 【Pages】:73-83
【Authors】: Yongfeng Zhang ; Qi Zhao ; Yi Zhang ; Daniel Friedman ; Min Zhang ; Yiqun Liu ; Shaoping Ma
【Abstract】: A prime function of many major World Wide Web applications is Online Service Allocation (OSA), the task of matching individual consumers with particular services/goods (which may include loans or jobs as well as products), each with its own producer. In the applications of interest, consumers are free to choose, so OSA usually takes the form of personalized recommendation or search in practice. The performance metrics of recommender and search systems currently tend to focus on just one side of the match, in some cases the consumers (e.g., satisfaction) and in other cases the producers (e.g., profit). However, a sustainable OSA platform needs to benefit both consumers and producers; otherwise the neglected party may eventually stop using it. In this paper, we show how to adapt economists' traditional idea of maximizing total surplus (the sum of consumer net benefit and producer profit) to the heterogeneous world of online service allocation, in an effort to promote web intelligence for social good in online ecosystems. Modifications of traditional personalized recommendation algorithms enable us to apply Total Surplus Maximization (TSM) to three very different types of real-world tasks -- e-commerce, P2P lending and freelancing. The results for all three tasks suggest that TSM compares very favorably to currently popular approaches, to the benefit of both producers and consumers.
【Keywords】: computational economics; online service allocation; recommendation systems; total surplus maximization; web-based services
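【Illustrative Sketch】: In standard textbook notation, the objective adapted above is (notation introduced here; the paper adds task-specific modelling of values, costs and heterogeneity):

```latex
\text{Total Surplus}
  = \underbrace{\sum_{(i,j)\in A} (v_{ij} - p_{ij})}_{\text{consumer net benefit}}
  + \underbrace{\sum_{(i,j)\in A} (p_{ij} - c_{ij})}_{\text{producer profit}}
  = \sum_{(i,j)\in A} (v_{ij} - c_{ij}),
```

where $A$ is the set of consumer-item allocations, $v_{ij}$ the value consumer $i$ derives from item $j$, $p_{ij}$ the price paid, and $c_{ij}$ the producer's cost. Prices cancel, so a TSM allocator ranks candidate matches by estimated value minus cost rather than by predicted rating or profit alone.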
【Paper Link】 【Pages】:85-97
【Authors】: Dokyun Lee ; Kartik Hosanagar
【Abstract】: We investigate the moderating effect of product attributes and consumer reviews on the efficacy of a collaborative filtering recommender system on an e-commerce site. We run a randomized field experiment on a top North American retailer's website with 184,375 users split into a recommender-treated group and a control group, with 37,215 unique products in the dataset. By augmenting the dataset with Amazon Mechanical Turk tagged product attributes and consumer review data from the website, we study their moderating influence on recommenders in generating conversion. We first confirm that the use of recommenders increases the baseline conversion rate by 5.9%. We find that the recommenders act as substitutes for high average review ratings, with the effect of using recommenders increasing the conversion rate by as much as about 1.4 additional average star ratings. Additionally, we find that the positive impacts on conversion from recommenders are greater for hedonic products compared to utilitarian products, while search-experience quality did not have any impact. We also find that the higher the price, the lower the positive impact of recommenders, while lengthier product descriptions and higher review volumes increased the recommender's effectiveness. More findings are discussed in the Results. For managers, we 1) identify the products and product attributes for which the recommenders work well, and 2) show how other product information sources on e-commerce sites interact with recommenders. Additionally, the insights from the results could inform novel recommender algorithm designs that are aware of their strengths and shortcomings. From an academic standpoint, we provide insight into the underlying mechanism behind how recommenders cause consumers to purchase.
【Keywords】: business analytics; data science; e-commerce; machine learning in business; recommender systems; social media marketing
【Paper Link】 【Pages】:99-109
【Authors】: Wei Meng ; Byoungyoung Lee ; Xinyu Xing ; Wenke Lee
【Abstract】: Recent advances in web tracking technologies have raised many privacy concerns. To combat users' fear of privacy invasion, online vendors have taken measures such as being more transparent with users about their data use and providing options for users to manage their online activities. Such efforts gain users' trust in online vendors and improve their willingness to share their digital footprints. However, there is still a significant number of users who actively limit involuntary sharing of data, because vendor-provided management tools only restrict the use of collected data and users worry that vendors do not have enough measures in place to protect their privacy-sensitive information. In this paper, we propose TrackMeOrNot, a new anti-tracking mechanism. It allows users to selectively share their online footprints with vendors, so that users need no longer be concerned about involuntary tracking: they can specify their privacy-sensitive activities and selectively disclose their activities to vendors based on their specified privacy demands. We implemented TrackMeOrNot on the Chromium browser and systematically evaluated its performance using a large set of test cases. We show that TrackMeOrNot can efficiently and effectively shield privacy-sensitive browsing activities.
【Keywords】: browser; privacy; tracking
【Paper Link】 【Pages】:111-120
【Authors】: Yixuan Li ; Oscar Martinez ; Xing Chen ; Yi Li ; John E. Hopcroft
【Abstract】: How can web services that depend on user generated content discern fake social engagement activities by spammers from legitimate ones? In this paper, we focus on the social site of YouTube and the problem of identifying bad actors posting inorganic content and inflating the count of social engagement metrics. We propose an effective method, Leas (Local Expansion at Scale), and show how the fake engagement activities on YouTube can be tracked over time by analyzing the temporal graph based on the engagement behavior pattern between users and YouTube videos. With the domain knowledge of spammer seeds, we formulate and tackle the problem in a semi-supervised manner --- with the objective of searching for individuals that have a similar pattern of behavior to the known seeds --- based on a graph diffusion process via a local spectral subspace. We offer a fast, scalable MapReduce deployment adapted from the localized spectral clustering algorithm. We demonstrate the effectiveness of our deployment at Google by achieving a manual review accuracy of 98% on the YouTube Comments graph in practice. Compared with the state-of-the-art algorithm CopyCatch, Leas achieves 10 times faster running time on average. Leas is now actively in use at Google, searching for daily deceptive practices on YouTube's engagement graph spanning over a billion users.
【Keywords】: anomaly detection; fake social engagement; local spectral clustering; seed expansion; social networks
【Paper Link】 【Pages】:121-132
【Authors】: Zhonghao Yu ; Sam Macbeth ; Konark Modi ; Josep M. Pujol
【Abstract】: Online tracking poses a serious privacy challenge that has drawn significant attention in both academia and industry. Existing approaches for preventing user tracking, based on curated blocklists, suffer from limited coverage and coarse-grained resolution for classification, rely on exceptions that impact sites' functionality and appearance, and require significant manual maintenance. In this paper we propose a novel approach, based on concepts leveraged from $k$-Anonymity, in which users collectively identify unsafe data elements, which have the potential to uniquely identify an individual user, and remove them from requests. We deployed our system to 200,000 German users running the Cliqz Browser or the Cliqz Firefox extension to evaluate its efficiency and feasibility. Results indicate that our approach achieves better privacy protection than blocklists, as provided by Disconnect, while keeping site breakage to a minimum, even lower than the community-optimized AdBlock Plus. We also provide evidence of the prevalence and reach of trackers across over 21 million pages of 350,000 unique sites, the largest-scale empirical evaluation to date. 95% of the pages visited contain 3rd party requests to potential trackers and 78% attempt to transfer unsafe data. Tracker organizations are also ranked, showing that a single organization can reach up to 42% of all page visits in Germany.
【Keywords】: data privacy; k-anonymity; privacy protection; tracking
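【Illustrative Sketch】: A minimal sketch of the collective k-anonymity idea described above on a hypothetical request (the deployed system aggregates these counts across the user population; the counts, parameter names and threshold here are invented):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical crowd-sourced counts: number of distinct users observed sending
# each (parameter, value) pair to a third party.
observed_user_counts = {
    ("lang", "de"): 54210,      # common value: cannot single out an individual
    ("uid", "a91f3c7e"): 1,     # seen for one user only: potentially identifying
}

K = 10  # a value is treated as safe only if at least K distinct users sent it

def strip_unsafe_params(url, counts, k=K):
    """Drop query parameters whose values are too rare to be k-anonymous."""
    parts = urlparse(url)
    safe = [(name, value) for name, value in parse_qsl(parts.query)
            if counts.get((name, value), 0) >= k]
    return urlunparse(parts._replace(query=urlencode(safe)))

print(strip_unsafe_params("https://tracker.example/collect?lang=de&uid=a91f3c7e",
                          observed_user_counts))
# -> https://tracker.example/collect?lang=de
```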
【Paper Link】 【Pages】:133-143
【Authors】: Shomir Wilson ; Florian Schaub ; Rohan Ramanath ; Norman M. Sadeh ; Fei Liu ; Noah A. Smith ; Frederick Liu
【Abstract】: Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have time to understand the policies of every website they visit, and most users hardly ever read privacy policies. Several recent efforts aim to crowdsource the interpretation of privacy policies and use the resulting annotations to build more effective user interfaces that provide users with salient policy summaries. However, very little attention has been devoted to studying the accuracy and scalability of crowdsourced privacy policy annotations, the types of questions crowdworkers can effectively answer, and the ways in which their productivity can be enhanced. Prior research indicates that most Internet users often have great difficulty understanding privacy policies, suggesting limits to the effectiveness of crowdsourcing approaches. In this paper, we assess the viability of crowdsourcing privacy policy annotations. Our results suggest that, if carefully deployed, crowdsourcing can indeed result in the generation of non-trivial annotations and can also help identify elements of ambiguity in policies. We further introduce and evaluate a method to improve the annotation process by predicting and highlighting paragraphs relevant to specific data practices.
【Keywords】: HCI; crowdsourcing; machine learning; privacy; privacy policies
【Paper Link】 【Pages】:145-153
【Authors】: Chikashi Nobata ; Joel R. Tetreault ; Achint Thomas ; Yashar Mehdad ; Yi Chang
【Abstract】: Detection of abusive language in user generated online content has become an issue of increasing importance in recent years. Most current commercial methods make use of blacklists and regular expressions; however, these measures fall short when contending with more subtle, less ham-fisted examples of hate speech. In this work, we develop a machine learning based method to detect hate speech on online user comments from two domains which outperforms a state-of-the-art deep learning approach. We also develop a corpus of user comments annotated for abusive language, the first of its kind. Finally, we use our detection tool to analyze abusive language over time and in different settings to further enhance our knowledge of this behavior.
【Keywords】: abusive language; discourse classification; hate speech; natural language processing; nlp; stylistic classification
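【Illustrative Sketch】: A minimal supervised detector in the spirit described above, using generic character n-gram features on toy comments (the paper's model combines many more NLP features and far larger annotated data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled comments (1 = abusive, 0 = clean); purely illustrative.
comments = ["you are a wonderful person", "go away you idiot",
            "thanks for the helpful answer", "nobody wants you here, loser"]
labels = [0, 1, 0, 1]

# Character n-grams are robust to the misspellings and obfuscation ("id1ot")
# that defeat blacklists and regular expressions.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression())
model.fit(comments, labels)

print(model.predict(["what an idiot", "have a great day"]))  # toy predictions
```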
【Paper Link】 【Pages】:155-165
【Authors】: Md Mustafizur Rahman ; Hongning Wang
【Abstract】: Various topic models have been developed for sentiment analysis tasks. But the simple topic-sentiment mixture assumption prohibits them from finding fine-grained dependency between topical aspects and sentiments. In this paper, we build a Hidden Topic Sentiment Model (HTSM) to explicitly capture topic coherence and sentiment consistency in an opinionated text document to accurately extract latent aspects and corresponding sentiment polarities. In HTSM, 1) topic coherence is achieved by enforcing words in the same sentence to share the same topic assignment and modeling topic transition between successive sentences; 2) sentiment consistency is imposed by constraining topic transitions via tracking sentiment changes; and 3) both topic transition and sentiment transition are guided by a parameterized logistic function based on the linguistic signals directly observable in a document. Extensive experiments on four categories of product reviews from both Amazon and NewEgg validate the effectiveness of the proposed model.
【Keywords】: aspect detection; sentiment analysis; topic modeling
【Paper Link】 【Pages】:167-176
【Authors】: Shuai Wang ; Zhiyuan Chen ; Bing Liu
【Abstract】: Aspect-level sentiment analysis or opinion mining consists of several core sub-tasks: aspect extraction, opinion identification, polarity classification, and separation of general and aspect-specific opinions. Various topic models have been proposed by researchers to address some of these sub-tasks. However, there is little work on modeling all of them together. In this paper, we first propose a holistic fine-grained topic model, called the JAST (Joint Aspect-based Sentiment Topic) model, that can simultaneously model all of above problems under a unified framework. To further improve it, we incorporate the idea of lifelong machine learning and propose a more advanced model, called the LAST (Lifelong Aspect-based Sentiment Topic) model. LAST automatically mines the prior knowledge of aspect, opinion, and their correspondence from other products or domains. Such knowledge is automatically extracted and incorporated into the proposed LAST model without any human involvement. Our experiments using reviews of a large number of product domains show major improvements of the proposed models over state-of-the-art baselines.
【Keywords】: aspect-specific opinion; lifelong machine learning; opinion mining; topic model
【Paper Link】 【Pages】:177-187
【Authors】: Pu Yang ; Krishnamurthy Iyer ; Peter I. Frazier
【Abstract】: We consider a model of nomadic agents exploring and competing for time-varying location-specific resources, arising in crowdsourced transportation services, online communities, and in traditional location-based economic activity. This model comprises a group of agents, and a set of locations each endowed with a dynamic stochastic resource process. Each agent derives a periodic reward determined by the overall resource level at her location, and the number of other agents there. Each agent is strategic and free to move between locations, and at each time decides whether to stay at the same node or switch to another one. We study the equilibrium behavior of the agents as a function of dynamics of the stochastic resource process and the nature of the externality each agent imposes on others at the same location. In the asymptotic limit with the number of agents and locations increasing proportionally, we show that an equilibrium exists and has a threshold structure, where each agent decides to switch to a different location based only on their current location's resource level and the number of other agents at that location. This result provides insight into how system structure affects the agents' collective ability to explore their domain to find and effectively utilize resource-rich areas. It also allows assessing the impact of changing the reward structure through penalties or subsidies.
【Keywords】: exploration; mean field equilibrium; resource sharing
【Paper Link】 【Pages】:189-203
【Authors】: Eric Balkanski ; Jason D. Hartline
【Abstract】: We consider the problem of budget feasible mechanism design proposed by Singer, but in a Bayesian setting. A principal has a public value for hiring a subset of the agents and a budget, while the agents have private costs for being hired. We consider both additive and submodular value functions of the principal. We show that there are simple, practical, ex post budget balanced posted pricing mechanisms that approximate the value obtained by the Bayesian optimal mechanism that is budget balanced only in expectation. A main motivating application for this work is crowdsourcing, e.g., on Mechanical Turk, where workers are drawn from a large population and posted pricing is standard. Our analysis methods relate to contention resolution schemes in submodular optimization of Vondrák et al. and the correlation gap analysis of Yan.
【Keywords】: bayesian mechanism design; budget feasible mechanism design; crowdsourcing; posted pricing; submodular optimization
【Paper Link】 【Pages】:205-214
【Authors】: Yingjie Zhang ; Beibei Li ; Jason Hong
【Abstract】: The pervasiveness of mobile technologies today has facilitated the creation of massive crowdsourced and geotagged data from individual users in real time and at different locations in the city. Such ubiquitous user-generated data allow us to infer various patterns of human behavior, which help us understand the interactions between humans and cities. In this study, we focus on understanding users' economic behavior in the city by examining the economic value of crowdsourced and geotagged data. Specifically, we extract multiple traffic and human mobility features from publicly available data sources using NLP and geo-mapping techniques, and examine the effects of both static and dynamic features on the economic outcomes of local businesses. Our study is instantiated on a unique dataset of restaurant bookings from OpenTable for 3,187 restaurants in New York City from November 2013 to March 2014. Our results suggest that foot traffic can increase local popularity and business performance, while mobility and traffic from automobiles may hurt local businesses, especially well-established chains and high-end restaurants. We also find that, on average, one more street closure nearby leads to a 4.7% decrease in the probability of a restaurant being fully booked during the dinner peak. Our study demonstrates the potential of making the best use of the large volumes and diverse sources of crowdsourced and geotagged user-generated data to create metrics that predict local economic demand in a manner that is fast, cheap, accurate, and meaningful.
【Keywords】: city demand; crowdsourced user behavior; econometrics; economic analysis; geotagged social media; location-based service; mobility analytics; nlp
【Paper Link】 【Pages】:215-225
【Authors】: Yoram Bachrach ; Sofia Ceppi ; Ian A. Kash ; Peter B. Key ; Mohammad Reza Khani
【Abstract】: The Generalized Second Price (GSP) auction has appealing properties when ads are simple (text based and identical in size), but does not generalize to richer ad settings, whereas truthful mechanisms such as VCG do. However, a straight switch from GSP to VCG incurs significant revenue loss for the search engine. We introduce a transitional mechanism which encourages advertisers to update their bids to their valuations, while mitigating revenue loss. In this setting, it is easier to propose first a payment function rather than an allocation function, so we give a general framework which guarantees incentive compatibility by requiring that the payment functions satisfy two specific properties. Finally, we analyze the revenue impacts of our mechanism on a sample of Bing data.
【Keywords】: algorithmic game theory; mechanism design; revenue optimization
【Paper Link】 【Pages】:227-238
【Authors】: Mark Kaminski ; Egor V. Kostylev ; Bernardo Cuenca Grau
【Abstract】: Answering aggregate queries is a key requirement of emerging applications of Semantic Technologies, such as data warehousing, business intelligence and sensor networks. In order to fulfill the requirements of such applications, the standardisation of SPARQL 1.1 led to the introduction of a wide range of constructs that enable value computation, aggregation, and query nesting. In this paper we provide an in-depth formal analysis of the semantics and expressive power of these new constructs as defined in the SPARQL 1.1 specification, and hence lay the necessary foundations for the development of robust, scalable and extensible query engines supporting complex numerical and analytics tasks.
【Keywords】: RDF; SPARQL; aggregation; data analytics; expressive power; grouping; linked data; query languages; query nesting; semantics; subqueries; value computation
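【Illustrative Sketch】: For readers unfamiliar with the constructs under study, a small aggregation query combining value computation, grouping and a HAVING filter, executed here with the Python rdflib library on toy data (the paper analyses the SPARQL 1.1 semantics itself, not any particular engine):

```python
import rdflib

ttl = """
@prefix ex: <http://example.org/> .
ex:alice ex:bought ex:book1 , ex:book2 .
ex:bob   ex:bought ex:book3 .
"""
g = rdflib.Graph()
g.parse(data=ttl, format="turtle")

query = """
PREFIX ex: <http://example.org/>
SELECT ?buyer (COUNT(?item) AS ?n)    # aggregation bound to an output variable
WHERE { ?buyer ex:bought ?item }
GROUP BY ?buyer                       # grouping
HAVING (COUNT(?item) > 1)             # filter on the aggregated value
"""
for buyer, n in g.query(query):
    print(buyer, n.toPython())        # -> http://example.org/alice 2
```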
【Paper Link】 【Pages】:239-249
【Authors】: Marcelo Arenas ; Gonzalo I. Diaz ; Egor V. Kostylev
【Abstract】: Semantic Web systems provide open interfaces for end-users to access data via a powerful high-level query language, SPARQL. But users unfamiliar with either the details of SPARQL or properties of the target dataset may find it easier to query by example -- give examples of the information they want (or examples of both what they want and what they do not want) and let the system reverse engineer the desired query from the examples. This approach has been heavily used in the setting of relational databases. We provide here an investigation of the reverse engineering problem in the context of SPARQL. We first provide a theoretical study, formalising variants of the reverse engineering problem and giving tight bounds on its complexity. We next explain an implementation of a reverse engineering tool for positive examples. An experimental analysis of the tool shows that it scales well in the data size, number of examples, and in the size of the smallest query that fits the data. We also give evidence that reverse engineering tools can provide benefits on real-life datasets.
【Keywords】: RDF; SPARQL; query by example; reverse engineering; semantic web
【Paper Link】 【Pages】:251-261
【Authors】: Dominique Ritze ; Oliver Lehmberg ; Yaser Oulabi ; Christian Bizer
【Abstract】: Cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph have gained increasing attention over the last years and are starting to be deployed within various use cases. However, the content of such knowledge bases is far from being complete, far from always being correct, and suffers from deprecation (i.e. population numbers become outdated after some time). Hence, there are efforts to leverage various types of Web data to complement, update and extend such knowledge bases. A source of Web data that potentially provides very wide coverage is the millions of relational HTML tables found on the Web. The existing work on using data from Web tables to augment cross-domain knowledge bases reports only aggregated performance numbers. The actual content of the Web tables and the topical areas of the knowledge bases that can be complemented using the tables remain unclear. In this paper, we match a large, publicly available Web table corpus to the DBpedia knowledge base. Based on the matching results, we profile the potential of Web tables for augmenting different parts of cross-domain knowledge bases and report detailed statistics about classes, properties, and instances for which missing values can be filled using Web table data as evidence. In order to estimate the potential quality of the new values, we empirically examine the Local Closed World Assumption and use it to determine the maximal number of correct facts that an ideal data fusion strategy could generate. Using this as ground truth, we compare three data fusion strategies and conclude that knowledge-based trust outperforms PageRank- and voting-based fusion.
【Keywords】: data fusion; data profiling; knowledge base augmentation; schema and data matching; slot filling; web tables
【Paper Link】 【Pages】:263-273
【Authors】: Felipe Pezoa ; Juan L. Reutter ; Fernando Suarez ; Martín Ugarte ; Domagoj Vrgoc
【Abstract】: JSON -- the most popular data format for sending API requests and responses -- is still lacking a standardized schema or meta-data definition that allows the developers to specify the structure of JSON documents. JSON Schema is an attempt to provide a general purpose schema language for JSON, but it is still work in progress, and the formal specification has not yet been agreed upon. Why this could be a problem becomes evident when examining the behaviour of numerous tools for validating JSON documents against this initial schema proposal: although they agree on most general cases, when presented with the greyer areas of the specification they tend to differ significantly. In this paper we provide the first formal definition of syntax and semantics for JSON Schema and use it to show that implementing this layer on top of JSON is feasible in practice. This is done both by analysing the theoretical aspects of the validation problem and by showing how to set up and validate a JSON Schema for Wikidata, the central storage for Wikimedia.
【Keywords】: JSON; JSON schema; JSON validation; expressiveness of schema languages
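【Illustrative Sketch】: A toy example of the schema layer discussed above, validated here with the Python jsonschema package (the schema is invented and far simpler than the Wikidata schema set up in the paper):

```python
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "id":     {"type": "string"},
        "labels": {"type": "object"},
        "claims": {"type": "array", "items": {"type": "object"}},
    },
    "required": ["id", "labels"],
}

document = {"id": "Q42", "labels": {"en": "Douglas Adams"}, "claims": []}

try:
    validate(instance=document, schema=schema)   # raises if the document does not conform
    print("document conforms to the schema")
except ValidationError as err:
    print("validation failed:", err.message)
```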
【Paper Link】 【Pages】:275-285
【Authors】: Hong-Han Shuai ; Chih-Ya Shen ; De-Nian Yang ; Yi-Feng Lan ; Wang-Chien Lee ; Philip S. Yu ; Ming-Syan Chen
【Abstract】: An increasing number of social network mental disorders (SNMDs), such as Cyber-Relationship Addiction, Information Overload, and Net Compulsion, have been recently noted. Symptoms of these mental disorders are usually observed passively today, resulting in delayed clinical intervention. In this paper, we argue that mining online social behavior provides an opportunity to actively identify SNMDs at an early stage. It is challenging to detect SNMDs because the mental factors considered in standard diagnostic criteria (questionnaire) cannot be observed from online social activity logs. Our approach, new and innovative to the practice of SNMD detection, does not rely on self-revealing of those mental factors via questionnaires. Instead, we propose a machine learning framework, namely, Social Network Mental Disorder Detection (SNMDD), that exploits features extracted from social network data to accurately identify potential cases of SNMDs. We also exploit multi-source learning in SNMDD and propose a new SNMD-based Tensor Model (STM) to improve the performance. Our framework is evaluated via a user study with 3126 online social network users. We conduct a feature analysis, and also apply SNMDD on large-scale datasets and analyze the characteristics of the three SNMD types. The results show that SNMDD is promising for identifying online social network users with potential SNMDs.
【Keywords】: feature extraction; mental disorder detection; online social network; tensor factorization
【Paper Link】 【Pages】:287-297
【Authors】: Jian Tang ; Jingzhou Liu ; Ming Zhang ; Qiaozhu Mei
【Abstract】: We study the problem of visualizing large-scale and high-dimensional data in a low-dimensional (typically 2D or 3D) space. Much success has been reported recently by techniques that first compute a similarity structure of the data points and then project them into a low-dimensional space with the structure preserved. These two steps suffer from considerable computational costs, preventing state-of-the-art methods such as t-SNE from scaling to large-scale and high-dimensional data (e.g., millions of data points and hundreds of dimensions). We propose LargeVis, a technique that first constructs an accurately approximated K-nearest neighbor graph from the data and then lays out the graph in the low-dimensional space. Compared to t-SNE, LargeVis significantly reduces the computational cost of the graph construction step and employs a principled probabilistic model for the visualization step, the objective of which can be effectively optimized through asynchronous stochastic gradient descent with a linear time complexity. The whole procedure thus easily scales to millions of high-dimensional data points. Experimental results on real-world data sets demonstrate that LargeVis outperforms the state-of-the-art methods in both efficiency and effectiveness. The hyper-parameters of LargeVis are also much more stable over different data sets.
【Keywords】: big data; high-dimensional data; visualization
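【Illustrative Sketch】: A deliberately simplified sketch of the two-stage idea described above: build a K-nearest-neighbour graph, then fit a 2D layout with stochastic attract/repel updates over sampled edges. This illustrates the pipeline only, not the LargeVis objective, its approximate KNN construction, or its asynchronous optimizer:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                    # toy high-dimensional data

# Stage 1: K-nearest-neighbour graph (exact here; LargeVis approximates this step).
_, neighbors = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X)

# Stage 2: 2D layout by sampling edges; pull neighbours together, push random pairs apart.
Y = rng.normal(scale=1e-2, size=(len(X), 2))
lr = 0.1
for _ in range(20000):
    i = rng.integers(len(X))
    j = rng.choice(neighbors[i][1:])              # a positive (neighbour) sample
    k = rng.integers(len(X))                      # a crude negative sample
    Y[i] -= lr * (Y[i] - Y[j])                    # attraction toward the neighbour
    Y[k] += lr * 0.01 * (Y[k] - Y[i])             # weak repulsion from a random point
print(Y[:3])                                      # low-dimensional coordinates
```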
【Paper Link】 【Pages】:299-310
【Authors】: Ke Zhou ; Miriam Redi ; Andrew Haines ; Mounia Lalmas
【Abstract】: Native advertising is a specific form of online advertising where ads replicate the look-and-feel of their serving platform. In such context, providing a good user experience with the served ads is crucial to ensure long-term user engagement. In this work, we explore the notion of ad quality, namely the effectiveness of advertising from a user experience perspective. We design a learning framework to predict the pre-click quality of native ads. More specifically, we look at detecting offensive native ads, showing that, to quantify ad quality, ad offensive user feedback rates are more reliable than the commonly used click-through rate metrics. We then conduct a crowd-sourcing study to identify which criteria drive user preferences in native advertising. We translate these criteria into a set of ad quality features that we extract from the ad text, image and advertiser, and then use them to train a model able to identify offensive ads. We show that our model is very effective in detecting offensive ads, and provide in-depth insights on how different features affect ad quality. Finally, we deploy a preliminary version of such model and show its effectiveness in the reduction of the offensive ad feedback rate.
【Keywords】: ad feedback; ad quality; features; image and text; native advertising; offensive rate; pre-click experience
【Paper Link】 【Pages】:311-320
【Authors】: Jiezhong Qiu ; Yixuan Li ; Jie Tang ; Zheng Lu ; Hao Ye ; Bo Chen ; Qiang Yang ; John E. Hopcroft
【Abstract】: Social instant messaging services are emerging as a transformative medium through which people connect and communicate with friends in their daily life: they catalyze the formation of social groups, and they bring people a stronger sense of community and connection. However, the research community still knows little about the formation and evolution of groups in the context of social messaging: their lifecycles, the change in their underlying structures over time, and the diffusion processes by which they develop new members. In this paper, we analyze daily usage logs from the WeChat group messaging platform, the largest standalone messaging communication service in China, with the goal of understanding the processes by which social messaging groups come together, grow new members, and evolve over time. Specifically, we discover a strong dichotomy among groups in terms of their lifecycle, and develop a separability model by taking into account a broad range of group-level features, showing that long-term and short-term groups are inherently distinct. We also find that the lifecycle of messaging groups is largely dependent on their social roles and functions in users' daily social experiences and specific purposes. Given the strong separability between long-term and short-term groups, we further address the problem of early prediction of successful communities. In addition to modeling growth and evolution from a group-level perspective, we investigate the individual-level attributes of group members and study the diffusion process by which groups gain new members. By considering members' historical engagement behavior as well as the local social network structure in which they are embedded, we develop a membership cascade model and demonstrate its effectiveness, achieving an AUC of 95.31% in predicting inviters and an AUC of 98.66% in predicting invitees.
【Keywords】: group formation; information diffusion; online community; social messaging
【Paper Link】 【Pages】:321-332
【Authors】: Xiaojing Liao ; Chang Liu ; Damon McCoy ; Elaine Shi ; Shuang Hao ; Raheem A. Beyah
【Abstract】: The popularity of long-tail search engine optimization (SEO) brings with it new security challenges: incidents of long-tail keyword poisoning to lower competition and increase revenue have been reported. The emergence of cloud web hosting services provides a new and effective platform for long-tail SEO spam attacks. There is growing evidence that large-scale long-tail SEO campaigns are being carried out on cloud hosting platforms because they offer low-cost, high-speed hosting services. In this paper, we take the first step toward understanding how long-tail SEO spam is implemented on cloud hosting platforms. After identifying 3,186 cloud directories and 318,470 doorway pages on the leading cloud platforms for long-tail SEO spam, we characterize their abusive behavior. One highlight of our findings is the effectiveness of cloud-based long-tail SEO spam, with 6% of the doorway pages successfully appearing in the top 10 search results of the poisoned long-tail keywords. Examples of other important discoveries include how such doorway pages monetize traffic and their ability to manage the cloud platforms' countermeasures. These findings bring such abuse to the spotlight and provide some insights into eliminating this practice.
【Keywords】: cloud hosting; cloud security; longtail seo; seo
【Paper Link】 【Pages】:333-343
【Authors】: Onur Catakoglu ; Marco Balduzzi ; Davide Balzarotti
【Abstract】: Indicators of Compromise (IOCs) are forensic artifacts that are used as signs that a system has been compromised by an attack or that it has been infected with a particular malicious software. In this paper we propose for the first time an automated technique to extract and validate IOCs for web applications, by analyzing the information collected by a high-interaction honeypot. Our approach has several advantages compared with traditional techniques used to detect malicious websites. First of all, not all compromised web pages are malicious or harmful for the user. Some may be defaced to advertise products or services, and some may be part of affiliate programs to redirect users toward (more or less legitimate) online shopping websites. In any case, it is important to detect these pages to inform their owners and to alert users that the content of the page has been compromised and cannot be trusted. Also in the case of more traditional drive-by-download pages, the use of IOCs allows for prompt detection and correlation of infected pages, even before they may be blocked by more traditional URL blacklists. Our experiments show that our system is able to automatically generate web indicators of compromise that have been used by attackers for several months (and sometimes years) in the wild without being detected. So far, these apparently harmless scripts were able to stay under the radar of the existing detection methodologies -- despite being hosted for a long time on public web sites.
【Keywords】: defacements; indicators of compromise; phishing; security; web security
【Paper Link】 【Pages】:345-356
【Authors】: Bin Liang ; Miaoqiang Su ; Wei You ; Wenchang Shi ; Gang Yang
【Abstract】: Various classifiers based on machine learning techniques have been widely used in security applications. Meanwhile, they have also become an attack target of adversaries. Many existing studies have paid much attention to evasion attacks on online classifiers and discussed defensive methods. However, the security of classifiers deployed in the client environment has not received the attention it deserves. Besides, earlier studies concentrated only on experimental classifiers developed for research purposes. The security of widely-used commercial classifiers still remains unclear. In this paper, we use Google's phishing pages filter (GPPF), a classifier deployed in the Chrome browser, which has over one billion users, as a case to investigate the security challenges for client-side classifiers. We present a new attack methodology targeting client-side classifiers, called classifiers cracking. With this methodology, we successfully cracked the classification model of GPPF and extracted sufficient knowledge that can be exploited for evasion attacks, including the classification algorithm, scoring rules and features, etc. Most importantly, we completely reverse engineered 84.8% of the scoring rules, covering most of the high-weighted rules. Based on the cracked information, we performed two kinds of evasion attacks against GPPF, using 100 real phishing pages for evaluation. The experiments show that all the phishing pages (100%) can be easily manipulated to bypass the detection of GPPF. Our study demonstrates that existing client-side classifiers are very vulnerable to classifiers cracking attacks.
【Keywords】: classifiers; collision attacks; cracking; evasion attacks; machine learning; phishing detection
【Paper Link】 【Pages】:357-368
【Authors】: Miriam Marciel ; Rubén Cuevas ; Albert Banchs ; Roberto Gonzalez ; Stefano Traverso ; Mohamed Ahmed ; Arturo Azcorra
【Abstract】: While substantial effort has been devoted to understanding fraudulent activity in traditional online advertising (search and banner), more recent forms such as video ads have received little attention. For advertisers, understanding and identifying fraudulent activity (i.e., fake views) in video ads is complicated, as they rely exclusively on the detection mechanisms deployed by the video hosting portals. In this context, independent tools able to monitor and audit the fidelity of these systems are missing today and are needed by both industry and regulators. In this paper we present a first set of tools to serve this purpose. Using our tools, we evaluate the performance of the audit systems of five major online video portals. Our results reveal that YouTube's detection system significantly outperforms all the others. Despite this, a systematic evaluation indicates that it may still be susceptible to simple attacks. Furthermore, we find that YouTube penalizes its videos' public and monetized view counters differently, with the former being penalized more aggressively. This means that views identified as fake and discounted from the public view counter are still monetized. We speculate that even though YouTube puts a lot of effort into compensating users after an attack is discovered, this practice places the burden of the risk on the advertisers, who pay to get their ads displayed.
【Keywords】: active probing; advertising; fake views; fraud; youtube
【Paper Link】 【Pages】:369-379
【Authors】: Santosh K. C. ; Arjun Mukherjee
【Abstract】: Recently, the problem of opinion spam has become widespread and has attracted a lot of research attention. While the problem has been approached on a variety of dimensions, the temporal dynamics in which opinion spamming operates remain unclear. Are there specific spamming policies that spammers employ? What kinds of changes happen in the dynamics of the truthful ratings on entities? How does buffered spamming operate for entities that need spamming to retain threshold popularity, and how does reduced spamming operate for entities enjoying greater success? We analyze these questions in the light of time-series analysis on Yelp. Our analyses discover various temporal patterns and their relationships with the rate at which fake reviews are posted. Building on our analyses, we employ vector autoregression to predict the rate of deception across different spamming policies. Next, we explore the effect of filtered reviews on (long-term and imminent) future rating and popularity prediction of entities. Our results uncover novel temporal dynamics of spamming which are intuitive and arguable, and which also lend confidence to Yelp's filtering. Lastly, we leverage our discovered temporal patterns in deception detection. Experimental results on large-scale reviews show the effectiveness of our approach, which significantly improves on existing approaches.
【Keywords】: opinion spam; spam detection; time series
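【Illustrative Sketch】: The vector autoregression step mentioned above, shown on invented toy series with statsmodels (the real model is fit on temporal features and filtered-review rates extracted from Yelp):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(1)
n = 120                                                # e.g. 120 weekly observations
avg_rating = 3.5 + 0.3 * np.sin(np.arange(n) / 8) + rng.normal(0, 0.05, n)
fake_rate = 0.10 + 0.05 * np.roll(avg_rating - 3.5, 2) + rng.normal(0, 0.01, n)

df = pd.DataFrame({"avg_rating": avg_rating, "fake_review_rate": fake_rate})

model = VAR(df)                                        # each series regressed on lags of both
results = model.fit(4)                                 # fixed lag order of 4 for the toy data
forecast = results.forecast(df.values[-results.k_ar:], steps=4)
print(forecast[:, 1])                                  # predicted fake-review rate, next 4 steps
```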
【Paper Link】 【Pages】:381-390
【Authors】: Arnab Bhadury ; Jianfei Chen ; Jun Zhu ; Shixia Liu
【Abstract】: Dynamic topic models (DTMs) are very effective in discovering topics and capturing their evolution trends in time series data. To do posterior inference of DTMs, existing methods are all batch algorithms that scan the full dataset before each update of the model and make inexact variational approximations with mean-field assumptions. Due to the lack of a more scalable inference algorithm, DTMs, despite their usefulness, have not been used to capture large topic dynamics. This paper fills this research void, and presents a fast and parallelizable inference algorithm using Gibbs Sampling with Stochastic Gradient Langevin Dynamics that does not make any unwarranted assumptions. We also present a Metropolis-Hastings based $O(1)$ sampler for topic assignments for each word token. In a distributed environment, our algorithm requires very little communication between workers during sampling (almost embarrassingly parallel) and scales up to large-scale applications. We are able to learn the largest Dynamic Topic Model to our knowledge, learning the dynamics of 1,000 topics from 2.6 million documents in less than half an hour, and our empirical results show that our algorithm is not only orders of magnitude faster than the baselines but also achieves lower perplexity.
【Keywords】: MCMC; MPI; dynamic topic model; large scale machine learning; parallel computing; topic model
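【Illustrative Sketch】: The Stochastic Gradient Langevin Dynamics update used inside the sampler is, in its generic form, a gradient step with matched Gaussian noise; a generic sketch on a toy target, not the paper's DTM-specific sampler:

```python
import numpy as np

def sgld_step(theta, grad_log_posterior, step_size, rng):
    """One SGLD update: theta + (eps/2) * grad log p(theta) + N(0, eps) noise.
    The injected noise turns noisy gradient ascent into approximate posterior sampling."""
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad_log_posterior(theta) + noise

# Toy target: a standard normal, whose gradient of the log-density is -theta.
rng = np.random.default_rng(0)
theta = np.array([3.0])
samples = []
for _ in range(20000):
    theta = sgld_step(theta, lambda th: -th, step_size=0.01, rng=rng)
    samples.append(theta[0])
print(np.mean(samples[2000:]), np.std(samples[2000:]))  # roughly 0 and 1
```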
【Paper Link】 【Pages】:391-400
【Authors】: Yupeng Gu ; Bo Zhao ; David Hardtke ; Yizhou Sun
【Abstract】: Recommender systems typically leverage two types of signals to effectively recommend items to users: user activities and content matching between user and item profiles, and recommendation models in the literature are usually categorized into collaborative filtering models, content-based models and hybrid models. In practice, when rich profiles about users and items are available, and user activities are sparse (cold-start), effective content matching signals become much more important to the relevance of the recommendation. The de facto method to measure similarity between two pieces of text is computing the cosine similarity of the two bags of words, where each word is weighted by TF (term frequency within the document) x IDF (inverted document frequency of the word within the corpus). In a general sense, TF can represent any local weighting scheme of the word within each document, and IDF can represent any global weighting scheme of the word across the corpus. In this paper, we focus on the latter, i.e., optimizing the global term weights, for a particular recommendation domain by leveraging supervised approaches. The intuition is that some frequent words (lower IDF, e.g. "database") can be essential and predictive for relevant recommendation, while some rare words (higher IDF, e.g. the name of a small company) could have less predictive power. Given plenty of observed activities between users and items as training data, we should be able to learn better domain-specific global term weights, which can further improve the relevance of recommendation. We propose a unified method that can simultaneously learn the weights of multiple content matching signals, as well as global term weights for specific recommendation tasks. Our method is efficient enough to handle large-scale training data generated by production recommender systems. Experiments on LinkedIn job recommendation data justify the effectiveness of our approach.
【Keywords】: feature selection; recommender systems; term weighting
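【Illustrative Sketch】: The content-matching signal described above is, at its core, a cosine over bags of words with per-term global weights; a minimal sketch with invented weights (a supervised method such as the paper's would learn the global weights from user-item activity instead of fixing them to IDF):

```python
import math
from collections import Counter

def weighted_cosine(doc_a, doc_b, global_weight):
    """Cosine similarity of two bags of words; each term weighted by
    TF (its count in the document) times a global, corpus-level weight."""
    vec_a = {w: tf * global_weight.get(w, 0.0) for w, tf in Counter(doc_a).items()}
    vec_b = {w: tf * global_weight.get(w, 0.0) for w, tf in Counter(doc_b).items()}
    dot = sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)
    norm = (math.sqrt(sum(v * v for v in vec_a.values()))
            * math.sqrt(sum(v * v for v in vec_b.values())))
    return dot / norm if norm else 0.0

member_profile = "software engineer distributed systems database".split()
job_posting = "database engineer for large distributed systems".split()

# Hypothetical IDF-style global weights; supervised training would raise the weight of
# frequent-but-predictive terms (e.g. "database") and lower that of rare, uninformative ones.
global_w = {"software": 1.2, "engineer": 0.8, "distributed": 2.0,
            "systems": 1.0, "database": 1.5, "for": 0.1, "large": 0.9}
print(round(weighted_cosine(member_profile, job_posting, global_w), 3))
```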
【Paper Link】 【Pages】:401-412
【Authors】: Kenneth Joseph ; Wei Wei ; Kathleen M. Carley
【Abstract】: Sociologists have long been interested in the ways that identities, or labels for people, are created, used and applied across various social contexts. The present work makes two contributions to the study of identity, in particular the study of identity in text. We first consider the following novel NLP task: given a set of text data (here, from Twitter), label each word in the text as being representative of a (possibly multi-word) identity. To address this task, we develop a comprehensive feature set that leverages several avenues of recent NLP work on Twitter and use these features to train a supervised classifier. Our model outperforms a surprisingly strong rule-based baseline by 33%. We then use our model for a case study, applying it to a large corpus of Twitter data from users who actively discussed the Eric Garner and Michael Brown cases. Among other findings, we observe that the identities used by individuals differ in interesting ways based on social context measures derived from census data.
【Keywords】: Ferguson; computational social science; identity; natural language processing; twitter
【Paper Link】 【Pages】:413-423
【Authors】: Marco De Nadai ; Jacopo Staiano ; Roberto Larcher ; Nicu Sebe ; Daniele Quercia ; Bruno Lepri
【Abstract】: The Death and Life of Great American Cities was written in 1961 and is now one of the most influential books in city planning. In it, Jane Jacobs proposed four conditions that promote life in a city. However, these conditions had not been empirically tested until recently. This is mainly because it is hard to collect data about "city life". The city of Seoul recently collected pedestrian activity through surveys at an unprecedented scale, with an effort spanning more than a decade, allowing researchers to conduct the first study successfully testing Jacobs's conditions. In this paper, we identify a valuable alternative to the lengthy and costly collection of activity survey data: mobile phone data. We extract human activity from such data, collect land use and socio-demographic information from the Italian Census and Open Street Map, and test the four conditions in six Italian cities. Although these cities are very different from the places for which Jacobs's conditions were spelled out (i.e., great American cities) and from the places in which they were recently tested (i.e., the Asian city of Seoul), we find those conditions to be indeed associated with urban life in Italy as well. Our methodology promises to have a great impact on urban studies, not least because, if replicated, it will make it possible to test Jacobs's theories at scale.
【Keywords】: cities; mobile phone data; open data; urban informatics
【Paper Link】 【Pages】:425-434
【Authors】: Chris Smith-Clarke ; Licia Capra
【Abstract】: Within the remit of 'Data for Development' there have been a number of promising recent works that investigate the use of mobile phone Call Detail Records (CDRs) to estimate the spatial distribution of poverty or socio-economic status. The methods being developed have the potential to offer immense value to organisations and agencies who currently struggle to identify the poorest parts of a country, due to the lack of reliable and up-to-date survey data in certain parts of the world. However, the results of this research have thus far only been presented in isolation rather than in comparison to any alternative approach or benchmark. Consequently, the true practical value of these methods remains unknown. Here, we seek to address this shortcoming by proposing two baseline poverty estimators grounded on concrete usage scenarios: one that exploits correlation with population density only, to be used when no poverty data exists at all; and one that also exploits spatial autocorrelation, to be used when poverty data has been collected for a few regions within a country. We then compare the predictive performance of these baseline models with models that also include features derived from CDRs, so as to establish their real added value. We present extensive analysis of the performance of all these models on data acquired for two developing countries -- Senegal and Ivory Coast. Our results reveal that CDR-based models do provide more accurate estimates in most cases; however, the improvement is modest and more significant when estimating (extreme) poverty intensity rates rather than mean wealth.
【Keywords】: call detail records; data for development; mobile phone data; poverty
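To make the two baselines above concrete, here is a minimal Python sketch under assumed inputs (per-region population density, survey poverty values for a few regions, and region centroids; all names and values are illustrative, not from the paper): the first baseline regresses poverty on log population density, the second exploits spatial autocorrelation via an inverse-distance-weighted average of surveyed neighbours.

import numpy as np

# Toy per-region data (illustrative values, not from the paper).
density   = np.array([120.0, 80.0, 300.0, 40.0, 15.0, 220.0])    # people / km^2
centroids = np.array([[0, 0], [1, 0], [0, 1], [2, 2], [3, 1], [1, 2]], dtype=float)
poverty   = np.array([0.35, 0.42, np.nan, 0.55, np.nan, np.nan])  # surveyed for a few regions only

known = ~np.isnan(poverty)

# Baseline 1: population density only (the scenario with no poverty data for the
# target regions; here the fit is borrowed from surveyed regions for illustration).
slope, intercept = np.polyfit(np.log(density[known]), poverty[known], deg=1)
pred_density = intercept + slope * np.log(density)

# Baseline 2: spatial autocorrelation, i.e. an inverse-distance-weighted average
# of the surveyed regions, usable when poverty is known for a few regions.
def idw_estimate(i, power=2.0):
    d = np.linalg.norm(centroids[known] - centroids[i], axis=1)
    w = 1.0 / np.maximum(d, 1e-9) ** power
    return np.sum(w * poverty[known]) / np.sum(w)

pred_spatial = np.array([poverty[i] if known[i] else idw_estimate(i)
                         for i in range(len(poverty))])
print(np.round(pred_density, 3), np.round(pred_spatial, 3))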
【Paper Link】 【Pages】:435-445
【Authors】: Rodérick Fanou ; Gareth Tyson ; Pierre François ; Arjuna Sathiaseelan
【Abstract】: It is well known that Africa's mobile and fixed Internet infrastructure is progressing at a rapid pace. A flurry of recent research has quantified this, highlighting the expansion of its underlying connectivity network. However, improving the infrastructure is not useful without appropriately provisioned services to utilise it. This paper measures the availability of web content infrastructure in Africa. Whereas others have explored web infrastructure in developed regions, we shed light on practices in developing regions. To achieve this, we apply a comprehensive measurement methodology to collect data from a variety of sources. We focus on a large content delivery network to reveal that Africa's content infrastructure is, indeed, expanding. However, we find much web content is still served from the US and Europe. We discover that many of the problems faced are actually caused by significant inter-AS delays in Africa, which contribute to local ISPs not sharing their cache capacity. We discover that a related problem is the poor DNS configuration used by some ISPs, which confounds the attempts of providers to optimise their delivery. We then explore a number of other websites to show that large web infrastructure deployments are a rarity in Africa and that even regional websites host their services abroad. We conclude by making suggestions for improvements.
【Keywords】: DNS; content infrastructure; measurements; web
【Paper Link】 【Pages】:447-458
【Authors】: Yoon-Sik Cho ; Greg Ver Steeg ; Emilio Ferrara ; Aram Galstyan
【Abstract】: With the emergence of social networking services, researchers enjoy the increasing availability of large-scale heterogeneous datasets capturing online user interactions and behaviors. Traditional analysis of techno-social systems data has focused mainly on describing either the dynamics of social interactions, or the attributes and behaviors of the users. However, overwhelming empirical evidence suggests that the two dimensions affect one another, and therefore they should be jointly modeled and analyzed in a multi-modal framework. The benefits of such an approach include the ability to build better predictive models, leveraging social network information as well as user behavioral signals. To this end, we propose the Constrained Latent Space Model (CLSM), a generalized framework that combines Mixed Membership Stochastic Blockmodels (MMSB) and Latent Dirichlet Allocation (LDA) incorporating a constraint that forces the latent space to concurrently describe the multiple data modalities. We derive an efficient inference algorithm based on Variational Expectation Maximization that has a computational cost linear in the size of the network, thus making it feasible to analyze massive social datasets. We validate the proposed framework on two problems: prediction of social interactions from user attributes and behaviors, and behavior prediction exploiting network information. We perform experiments with a variety of multi-modal social systems, spanning location-based social networks (Gowalla), social media services (Instagram, Orkut), e-commerce and review sites (Amazon, Ciao), and finally citation networks (Cora). The results indicate significant improvement in prediction accuracy over state-of-the-art methods, and demonstrate the flexibility of the proposed approach for addressing a variety of different learning problems commonly occurring with multi-modal social data.
【Keywords】: LDA; multi-modal social networks; topic models
【Paper Link】 【Pages】:459-469
【Authors】: Bin Bi ; Junghoo Cho
【Abstract】: Twitter (and similar microblogging services) has become a central nexus for discussion of the topics of the day. Twitter data contains rich content and structured information on users' topics of interest and behavior patterns. Correctly analyzing and modeling Twitter data enables the prediction of user behavior and preferences in a variety of practical applications, such as tweet recommendation and followee recommendation. Although a number of models have been developed on Twitter data in prior work, most of these only model the tweets from users, while neglecting their valuable retweet information in the data. Models would enhance their predictive power by incorporating users' retweet content as well as their retweet behavior. In this paper, we propose two novel Bayesian nonparametric models, URM and UCM, on retweet data. Both of them are able to integrate the analysis of tweet text and users' retweet behavior in the same probabilistic framework. Moreover, they both jointly model users' interests in tweets and retweets. As nonparametric models, URM and UCM can automatically determine the parameters of the models based on input data, avoiding arbitrary parameter settings. Extensive experiments on real-world Twitter data show that both URM and UCM are superior to all the baselines, while UCM further outperforms URM, confirming the appropriateness of our models in retweet modeling.
【Keywords】: bayesian nonparametric; retweet; topic modeling; twitter modeling
【Paper Link】 【Pages】:471-481
【Authors】: Flavio Chierichetti ; Anirban Dasgupta ; Ravi Kumar ; Silvio Lattanzi ; Tamás Sarlós
【Abstract】: Random walks are an important tool in many graph mining applications including estimating graph parameters, sampling portions of the graph, and extracting dense communities. In this paper we consider the problem of sampling nodes from a large graph according to a prescribed distribution by using random walk as the basic primitive. Our goal is to obtain algorithms that make a small number of queries to the graph but output a node that is sampled according to the prescribed distribution. Focusing on the uniform distribution case, we study the query complexity of three algorithms and show a near-tight bound expressed in terms of the parameters of the graph such as average degree and the mixing time. Both theoretically and empirically, we show that some algorithms are preferable in practice to others. We also extend our study to the problem of sampling nodes according to some polynomial function of their degrees; this has implications for designing efficient algorithms for applications such as triangle counting.
【Keywords】: metropolis-hastings; mixing time; random walks; stationary distribution
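As a concrete illustration of the Metropolis-Hastings random-walk primitive named in the keywords above, the sketch below shows the textbook walk whose stationary distribution is uniform over nodes; it is only a baseline primitive, not the specific query-optimal algorithms analysed in the paper.

import random

def mh_uniform_sample(graph, start, steps):
    """Metropolis-Hastings random walk with uniform stationary distribution:
    a move from u to v is accepted with probability min(1, deg(u)/deg(v)).
    `graph` is a dict mapping each node to a list of neighbours."""
    u = start
    for _ in range(steps):
        v = random.choice(graph[u])
        if random.random() < min(1.0, len(graph[u]) / len(graph[v])):
            u = v                      # accept the move
        # otherwise stay at u (self-loop), which corrects the degree bias
    return u

# Toy usage: a star graph, where a plain random walk would oversample the hub.
g = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
samples = [mh_uniform_sample(g, start=0, steps=200) for _ in range(2000)]
print({n: samples.count(n) for n in g})   # roughly equal counts after mixing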
【Paper Link】 【Pages】:483-493
【Authors】: Ethan R. Elenberg ; Karthikeyan Shanmugam ; Michael Borokhovich ; Alexandros G. Dimakis
【Abstract】: We present a novel distributed algorithm for counting all four-node induced subgraphs in a big graph. These counts, called the 4-profile, describe a graph's connectivity properties and have found several uses ranging from bioinformatics to spam detection. We also study the more complicated problem of estimating the local 4-profiles centered at each vertex of the graph. The local 4-profile embeds every vertex in an 11-dimensional space that characterizes the local geometry of its neighborhood: vertices that connect different clusters will have different local 4-profiles compared to those that are only part of one dense cluster. Our algorithm is a local, distributed message-passing scheme on the graph and computes all the local 4-profiles in parallel. We rely on two novel theoretical contributions: we show that local 4-profiles can be calculated using compressed two-hop information and also establish novel concentration results that show that graphs can be substantially sparsified and still retain good approximation quality for the global 4-profile. We empirically evaluate our algorithm using a distributed GraphLab implementation that we scaled up to 640 cores. We show that our algorithm can compute global and local 4-profiles of graphs with millions of edges in a few minutes, significantly improving upon the previous state of the art.
【Keywords】: 4-profiles; graph analytics; graph engines; graph sparsifiers; motifs
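For reference, a naive (non-distributed) way to compute the global 4-profile of a small graph is to enumerate all 4-node subsets and classify each induced subgraph; the sketch below keys the 11 isomorphism classes of 4-node graphs by their edge count and sorted degree sequence, which suffices to tell them apart. This is only a brute-force check for intuition, not the paper's message-passing algorithm.

from itertools import combinations
from collections import Counter

def global_4profile(edges):
    """Count induced 4-node subgraphs by isomorphism class, keyed by the
    (edge count, sorted degree sequence) signature. Brute force; small graphs only."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = list(adj)
    profile = Counter()
    for quad in combinations(nodes, 4):
        sub = [(a, b) for a, b in combinations(quad, 2) if b in adj[a]]
        degs = Counter()
        for a, b in sub:
            degs[a] += 1
            degs[b] += 1
        sig = (len(sub), tuple(sorted(degs.get(n, 0) for n in quad)))
        profile[sig] += 1
    return profile

# Toy usage on a 5-node graph: a 4-cycle plus a pendant vertex.
print(global_4profile([(0, 1), (1, 2), (2, 3), (3, 0), (3, 4)]))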
【Paper Link】 【Pages】:495-505
【Authors】: Kyle Williams ; Julia Kiseleva ; Aidan C. Crook ; Imed Zitouni ; Ahmed Hassan Awadallah ; Madian Khabsa
【Abstract】: Web search queries for which there are no clicks are referred to as abandoned queries and are usually considered as leading to user dissatisfaction. However, there are many cases where a user may not click on anything on the search engine result page (SERP) but still be satisfied. This scenario is referred to as good abandonment and presents a challenge for most approaches measuring search satisfaction, which are usually based on clicks and dwell time. The problem is exacerbated further on mobile devices where search providers try to increase the likelihood of users being satisfied directly by the SERP. This paper proposes a solution to this problem using gesture interactions, such as reading times and touch actions, as signals for differentiating between good and bad abandonment. These signals go beyond clicks and characterize user behavior in cases where clicks are not needed to achieve satisfaction. We study different good abandonment scenarios and investigate the different elements on a SERP that may lead to good abandonment. We also present an analysis of the correlation between user gesture features and satisfaction. Finally, we use this analysis to build models to automatically identify good abandonment in mobile search, achieving an accuracy of 75%, which is significantly better than considering query and session signals alone. Our findings have implications for the study and application of user satisfaction in search systems.
【Keywords】: good abandonment; implicit relevance feedback; mobile search behavior; touch interaction models
【Paper Link】 【Pages】:507-517
【Authors】: Ruining He ; Julian McAuley
【Abstract】: Building a successful recommender system depends on understanding both the dimensions of people's preferences as well as their dynamics. In certain domains, such as fashion, modeling such preferences can be incredibly difficult, due to the need to simultaneously model the visual appearance of products as well as their evolution over time. The subtle semantics and non-linear dynamics of fashion evolution raise unique challenges especially considering the sparsity and large scale of the underlying datasets. In this paper we build novel models for the One-Class Collaborative Filtering setting, where our goal is to estimate users' fashion-aware personalized ranking functions based on their past feedback. To uncover the complex and evolving visual factors that people consider when evaluating products, our method combines high-level visual features extracted from a deep convolutional neural network, users' past feedback, as well as evolving trends within the community. Experimentally we evaluate our method on two large real-world datasets from Amazon.com, where we show it to outperform state-of-the-art personalized ranking measures, and also use it to visualize the high-level fashion trends across the 11-year span of our dataset.
【Keywords】: fashion evolution; personalized ranking; recommender systems; visual dimensions
【Paper Link】 【Pages】:519-529
【Authors】: Austin R. Benson ; Ravi Kumar ; Andrew Tomkins
【Abstract】: We study sequences of consumption in which the same item may be consumed multiple times. We identify two macroscopic behavior patterns of repeated consumptions. First, in a given user's lifetime, very few items live for a long time. Second, the last consumptions of an item exhibit growing inter-arrival gaps consistent with the notion of increasing boredom leading up to eventual abandonment. We then present what is to our knowledge the first holistic model of sequential repeated consumption, covering all observed aspects of this behavior. Our simple and purely combinatorial model includes no planted notion of lifetime distributions or user boredom; nonetheless, the model correctly predicts both of these phenomena. Further, we provide theoretical analysis of the behavior of the model confirming these phenomena. Additionally, the model quantitatively matches a number of microscopic phenomena across a broad range of datasets. Intriguingly, these findings suggest that the observation in a variety of domains of increasing user boredom leading to abandonment may be explained simply by probabilistic conditioning on an extinction event in a simple model, without resort to explanations based on complex human dynamics.
【Keywords】: boredom; repeat consumption; sequence mining
【Paper Link】 【Pages】:531-541
【Authors】: Alexey Borisov ; Ilya Markov ; Maarten de Rijke ; Pavel Serdyukov
【Abstract】: Understanding user browsing behavior in web search is key to improving web search effectiveness. Many click models have been proposed to explain or predict user clicks on search engine results. They are based on the probabilistic graphical model (PGM) framework, in which user behavior is represented as a sequence of observable and hidden events. The PGM framework provides a mathematically solid way to reason about a set of events given some information about other events. But the structure of the dependencies between the events has to be set manually. Different click models use different hand-crafted sets of dependencies. We propose an alternative based on the idea of distributed representations: to represent the user's information need and the information available to the user with a vector state. The components of the vector state are learned to represent concepts that are useful for modeling user behavior. User behavior is then modeled as a sequence of vector states associated with a query session: the vector state is initialized with a query, and then iteratively updated based on information about interactions with the search engine results. This approach allows us to learn user browsing behavior directly from click-through data, i.e., without the need for a predefined set of rules as is customary for PGM-based click models. We illustrate our approach using a set of neural click models. Our experimental results show that a neural click model trained on the same data as traditional PGM-based click models achieves better performance on the click prediction task (i.e., predicting user clicks on search engine results) and the relevance prediction task (i.e., ranking documents by their relevance to a query). An analysis of the best performing neural click model shows that it learns similar concepts to those used in traditional click models, and that it also learns other concepts that cannot be designed manually.
【Keywords】: click modeling; deep learning; distributed representations; recurrent neural networks; user behavior; web search
【Paper Link】 【Pages】:543-553
【Abstract】: Web search has been a reactive scenario for decades, one that typically starts with users issuing queries. By studying user behavior in search engine logs, we have discovered that many search tasks, such as stock-price checking and news reading, exhibit strong repeated patterns from day to day. In addition, users exhibit even stronger repetition on mobile devices. This offers the chance to make proactive recommendations without the user issuing a query. In this work, we aim at discovering and characterizing these types of tasks so that we can automatically predict when and what types of tasks will be repeated by users in the future, through analyzing search logs from a commercial Web search engine and user interaction logs from a mobile App that offers proactive recommendations. We first introduce a set of novel features that can accurately capture task repetition. We then propose a novel deep learning framework that learns user preferences and makes automatic predictions. Our framework is capable of learning user-independent global models as well as deriving personalized models via model adaptation. The model we developed significantly outperforms other state-of-the-art predictive models by large margins. We also demonstrate the power of our model and features through an application that improves the recommendation quality of the mobile App. Results indicate a significant relevance improvement over the current production system.
【Keywords】: predicting future events; user behavior; web log analysis
【Paper Link】 【Pages】:555-565
【Authors】: Liangda Li ; Hongbo Deng ; Yunlong He ; Anlei Dong ; Yi Chang ; Hongyuan Zha
【Abstract】: Search tasks in users' query sequences are dynamic and interconnected. The formulation of search tasks can be influenced by multiple latent factors such as user characteristics, product features and search interactions, which makes search task identification a challenging problem. In this paper, we propose an unsupervised approach to identify search tasks via topic membership along with topic transition probabilities, making it possible to interpret how a user's search intent emerges and evolves over time. Moreover, a novel hidden semi-Markov model is introduced to model topic transitions by considering not only the semantic information of queries but also the latent search factors originating from user search behaviors. A variational inference algorithm is developed to identify remarkable search behavior patterns, typical topic transition tracks, and the topic membership of each query from query logs. The learned topic transition tracks and the inferred topic memberships enable us to identify both small search tasks, where a user searches the same topic, and big search tasks, where a user searches a series of related topics. We extensively evaluate the proposed approach and compare it with several state-of-the-art search task identification methods on both synthetic and real-world query log data, and experimental results illustrate the effectiveness of our proposed model.
【Keywords】: Markov model; search behavior; search task identification
【Paper Link】 【Pages】:567-578
【Authors】: Marco Cornolti ; Paolo Ferragina ; Massimiliano Ciaramita ; Stefan Rüd ; Hinrich Schütze
【Abstract】: In this paper we study the problem of linking open-domain web-search queries to entities drawn from the full entity inventory of Wikipedia articles. We introduce SMAPH-2, a second-order approach that, by piggybacking on a web search engine, alleviates the noise and irregularities that characterize the language of queries and puts queries in a larger context in which it is easier to make sense of them. The key algorithmic idea underlying SMAPH-2 is to first discover a candidate set of entities and then link-back those entities to their mentions occurring in the input query. This allows us to confine the possible concepts pertinent to the query to only the ones really mentioned in it. The link-back is implemented via a collective disambiguation step based upon a supervised ranking model that makes one joint prediction for the annotation of the complete query, optimizing directly the F1 measure. We evaluate both known features, such as word embeddings and semantic relatedness among entities, and several novel features such as an approximate distance between mentions and entities (which can handle spelling errors). We demonstrate that SMAPH-2 achieves state-of-the-art performance on the ERD@SIGIR2014 benchmark. We also publish GERDAQ (General Entity Recognition, Disambiguation and Annotation in Queries), a novel, public dataset built specifically for web-query entity linking via a crowdsourcing effort. SMAPH-2 also outperforms the benchmarks by comparable margins on GERDAQ.
【Keywords】: entity linking; erd; piggyback; query annotation
【Paper Link】 【Pages】:579-590
【Authors】: Aston Zhang ; Amit Goyal ; Ricardo A. Baeza-Yates ; Yi Chang ; Jiawei Han ; Carl A. Gunter ; Hongbo Deng
【Abstract】: We study the new mobile query auto-completion (QAC) problem to exploit mobile devices' exclusive signals, such as those related to mobile applications (apps). We propose AppAware, a novel QAC model using installed app and recently opened app signals to suggest queries for matching input prefixes on mobile devices. To overcome the challenge of noisy and voluminous signals, AppAware optimizes composite objectives with a lighter processing cost at a linear rate of convergence. We conduct experiments on a large commercial data set of mobile queries and apps. Installed app and recently opened app signals consistently and significantly boost the accuracy of various baseline QAC models on mobile devices.
【Keywords】: mobile application; mobile device; query auto-completion
【Paper Link】 【Pages】:591-602
【Authors】: Srijan Kumar ; Robert West ; Jure Leskovec
【Abstract】: Wikipedia is a major source of information for many people. However, false information on Wikipedia raises concerns about its credibility. One way in which false information may be presented on Wikipedia is in the form of hoax articles, i.e., articles containing fabricated facts about nonexistent entities or events. In this paper we study false information on Wikipedia by focusing on the hoax articles that have been created throughout its history. We make several contributions. First, we assess the real-world impact of hoax articles by measuring how long they survive before being debunked, how many pageviews they receive, and how heavily they are referred to by documents on the Web. We find that, while most hoaxes are detected quickly and have little impact on Wikipedia, a small number of hoaxes survive long and are well cited across the Web. Second, we characterize the nature of successful hoaxes by comparing them to legitimate articles and to failed hoaxes that were discovered shortly after being created. We find characteristic differences in terms of article structure and content, embeddedness into the rest of Wikipedia, and features of the editor who created the hoax. Third, we successfully apply our findings to address a series of classification tasks, most notably to determine whether a given article is a hoax. And finally, we describe and evaluate a task involving humans distinguishing hoaxes from non-hoaxes. We find that humans are not particularly good at the task and that our automated classifier outperforms them by a big margin.
【Keywords】: detection; disinformation; hoax; misinformation; rumor; wikipedia
【Paper Link】 【Pages】:603-612
【Authors】: Lu Zhou ; Wenbo Wang ; Keke Chen
【Abstract】: Inappropriate tweets can cause severe damage to authors' reputation or privacy. However, many users do not realize the negative consequences until they publish these tweets. Published tweets have lasting effects that may not be eliminated by simple deletion because other users may have read them or third-party tweet analysis platforms have cached them. Regrettable tweets, i.e., tweets with identifiable regrettable contents, cause the most damage to their authors because other users can easily notice them. In this paper, we study how to identify the regrettable tweets published by normal individual users via their content and the users' historical deletion patterns. We identify normal individual users based on their publishing, deleting, followers and friends statistics. We manually examine a set of randomly sampled deleted tweets from these users to identify regrettable tweets and understand the corresponding regrettable reasons. By applying content-based features and personalized history-based features, we develop classifiers that can effectively predict regrettable tweets.
【Keywords】: deleted tweets; regret; regrettable tweets; twitter; user clustering
【Paper Link】 【Pages】:613-624
【Authors】: Chenhao Tan ; Vlad Niculae ; Cristian Danescu-Niculescu-Mizil ; Lillian Lee
【Abstract】: Changing someone's opinion is arguably one of the most important challenges of social interaction. The underlying process proves difficult to study: it is hard to know how someone's opinions are formed and whether and how someone's views shift. Fortunately, ChangeMyView, an active community on Reddit, provides a platform where users present their own opinions and reasoning, invite others to contest them, and acknowledge when the ensuing discussions change their original views. In this work, we study these interactions to understand the mechanisms behind persuasion. We find that persuasive arguments are characterized by interesting patterns of interaction dynamics, such as participant entry-order and degree of back-and-forth exchange. Furthermore, by comparing similar counterarguments to the same opinion, we show that language factors play an essential role. In particular, the interplay between the language of the opinion holder and that of the counterargument provides highly predictive cues of persuasiveness. Finally, since even in this favorable setting people may not be persuaded, we investigate the problem of determining whether someone's opinion is susceptible to being changed at all. For this more difficult task, we show that stylistic choices in how the opinion is expressed carry predictive power.
【Keywords】: change my view; persuasion; reddit; social media
【Paper Link】 【Pages】:625-635
【Authors】: Julian McAuley ; Alex Yang
【Abstract】: Online reviews are often our first port of call when considering products and purchases online. When evaluating a potential purchase, we may have a specific query in mind, e.g. "will this baby seat fit in the overhead compartment of a 747?" or "will I like this album if I liked Taylor Swift's 1989?". To answer such questions we must either wade through huge volumes of consumer reviews hoping to find one that is relevant, or otherwise pose our question directly to the community via a Q/A system. In this paper we hope to fuse these two paradigms: given a large volume of previously answered queries about products, we hope to automatically learn whether a review of a product is relevant to a given query. We formulate this as a machine learning problem using a mixture-of-experts-type framework: here each review is an "expert" that gets to vote on the response to a particular query; simultaneously we learn a relevance function such that "relevant" reviews are those that vote correctly. At test time this learned relevance function allows us to surface reviews that are relevant to new queries on-demand. We evaluate our system, Moqa, on a novel corpus of 1.4 million questions (and answers) and 13 million reviews. We show quantitatively that it is effective at addressing both binary and open-ended queries, and qualitatively that it surfaces reviews that human evaluators consider to be relevant.
【Keywords】: bilinear models; question answering; relevance ranking; reviews; text modeling
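A minimal numpy sketch of the mixture-of-experts idea described above, with a fixed bilinear relevance matrix standing in for the learned relevance function; all vectors are random placeholders for bag-of-words representations, so this is an illustration of the voting scheme, not the paper's trained model.

import numpy as np

def score_answer(query_vec, answer_vec, review_vecs, M):
    # Each review is an "expert": its vote for the candidate answer is weighted
    # by a bilinear relevance score between the query and the review.
    relevance = np.array([query_vec @ M @ r for r in review_vecs])
    weights = np.exp(relevance - relevance.max())
    weights /= weights.sum()                      # softmax over reviews
    votes = review_vecs @ answer_vec              # how well each review supports the answer
    return float(weights @ votes)

rng = np.random.default_rng(0)
d = 8
query, answer = rng.random(d), rng.random(d)
reviews = rng.random((5, d))                      # 5 reviews as bag-of-words-style vectors
print(score_answer(query, answer, reviews, M=np.eye(d)))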
【Paper Link】 【Pages】:637-648
【Authors】: Gabriel Doyle ; Daniel Yurovsky ; Michael C. Frank
【Abstract】: When people talk, they tend to adopt the behaviors, gestures, and language of their conversational partners. This "accommodation" to one's partners is largely automatic, but the degree to which it occurs is influenced by social factors, such as gender, relative power, and attraction. In settings where such social information is not known, this accommodation can be a useful cue for the missing information. This is especially important in web-based communication, where social dynamics are often fluid and rarely stated explicitly. But connecting accommodation and social dynamics on the web requires accurate quantification of the different amounts of accommodation being made. We focus specifically on accommodation in the form of "linguistic alignment": the amount that one person's word use is influenced by another's. Previous studies have used many measures for linguistic alignment, with no clear standard. In this paper, we lay out a set of desiderata for a linguistic alignment measure, including robustness to sparse and short messages, explicit conditionality, and consistency across linguistic features with different baseline frequencies. We propose a straightforward and flexible model-based framework for calculating linguistic alignment, with a focus on the sparse data and limited social information observed in social media. We show that this alignment measure fulfills our desiderata on simulated data. We then analyze a large corpus of Twitter data, both replicating previous results and extending them: Our measure's improved resolution reveals a previously undetectable effect of interpersonal power in Twitter interactions.
【Keywords】: communication; conversation; coordination; language; psychology; social media; social networks; social status; twitter
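For intuition, one simple measure in this family is a subtractive conditional probability for a single marker category (e.g. first-person pronouns): how much more likely a reply is to use the marker when the preceding message used it. The sketch below implements that baseline measure, not the paper's model-based framework.

def alignment(pairs, marker_words):
    """Subtractive conditional-probability alignment for one marker category.
    `pairs` is a list of (message, reply) token lists; the score compares how
    often the reply uses the marker when the message did vs. when it did not."""
    def has_marker(tokens):
        return any(t in marker_words for t in tokens)

    used_given_used = used_given_unused = n_used = n_unused = 0
    for msg, reply in pairs:
        if has_marker(msg):
            n_used += 1
            used_given_used += has_marker(reply)
        else:
            n_unused += 1
            used_given_unused += has_marker(reply)
    if n_used == 0 or n_unused == 0:
        return 0.0
    return used_given_used / n_used - used_given_unused / n_unused

pairs = [("i think so".split(), "i agree".split()),
         ("see you there".split(), "i will".split()),
         ("i am not sure".split(), "me neither i guess".split())]
print(alignment(pairs, marker_words={"i", "me", "my"}))   # 1st-person pronoun alignment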
【Paper Link】 【Pages】:649-660
【Authors】: Yating Zhang ; Adam Jatowt ; Katsumi Tanaka
【Abstract】: Analyzing how technology evolves is important for understanding technological progress and its impact on society. Although the concept of evolution has been explored in many domains (e.g., evolution of topics, events or terminology, evolution of species), little research has been done on automatically analyzing the evolution of products and technology in general. In this paper, we propose a novel approach for investigating the technology evolution based on collections of product reviews. We are particularly interested in understanding social impact of technology and in discovering how changes of product features influence changes in our social lives. We address this challenge by first distinguishing two kinds of product-related terms: physical product features and terms describing situations when products are used. We then detect changes in both types of terms over time by tracking fluctuations in their popularity and usage. Finally, we discover cases when changes of physical product features trigger the changes in product's use. We experimentally demonstrate the effectiveness of our approach on the Amazon Product Review Dataset that spans over 18 years.
【Keywords】: causality detection; product evolution analysis; social influence; technology evolution analysis
【Paper Link】 【Pages】:661-670
【Authors】: David Garcia ; Markus Strohmaier
【Abstract】: The QWERTY effect postulates that the keyboard layout influences word meanings by linking positivity to the use of the right hand and negativity to the use of the left hand. For example, previous research has established that words with more right hand letters are rated more positively than words with more left hand letters by human subjects in small scale experiments. In this paper, we perform large scale investigations of the QWERTY effect on the web. Using data from eleven web platforms related to products, movies, books, and videos, we conduct observational tests of whether a hand-meaning relationship can be found in text interpretations by web users. Furthermore, we investigate whether writing text on the web exhibits the QWERTY effect as well, by analyzing the relationship between the text of online reviews and their star ratings in four additional datasets. Overall, we find robust evidence for the QWERTY effect both at the point of text interpretation (decoding) and at the point of text creation (encoding). We also find under which conditions the effect might not hold. Our findings have implications for any algorithmic method aiming to evaluate the meaning of words on the web, including for example semantic or sentiment analysis, and show the existence of "dactilar onomatopoeias" that shape the dynamics of word-meaning associations. To the best of our knowledge, this is the first work to reveal the extent to which the QWERTY effect exists in large scale human-computer interaction on the web.
【Keywords】: human-computer interaction; semantics
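The quantity typically correlated with ratings in QWERTY-effect studies is the "right-side advantage" of a word, the normalised difference between its right-hand and left-hand QWERTY letters. A small helper assuming a standard US QWERTY split (the paper's exact formula may differ):

RIGHT = set("yuiophjklnm")
LEFT  = set("qwertasdfgzxcvb")

def right_side_advantage(word):
    """(#right-hand letters - #left-hand letters) / #letters on a QWERTY layout."""
    letters = [c for c in word.lower() if c in RIGHT or c in LEFT]
    if not letters:
        return 0.0
    r = sum(c in RIGHT for c in letters)
    return (2 * r - len(letters)) / len(letters)

for w in ["lumpy", "sadness", "pool", "crater"]:
    print(w, round(right_side_advantage(w), 2))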
【Paper Link】 【Pages】:671-681
【Authors】: Justin Cheng ; Lada A. Adamic ; Jon M. Kleinberg ; Jure Leskovec
【Abstract】: Cascades of information-sharing are a primary mechanism by which content reaches its audience on social media, and an active line of research has studied how such cascades, which form as content is reshared from person to person, develop and subside. In this paper, we perform a large-scale analysis of cascades on Facebook over significantly longer time scales, and find that a more complex picture emerges, in which many large cascades recur, exhibiting multiple bursts of popularity with periods of quiescence in between. We characterize recurrence by measuring the time elapsed between bursts, their overlap and proximity in the social network, and the diversity in the demographics of individuals participating in each peak. We discover that content virality, as revealed by its initial popularity, is a main driver of recurrence, with the availability of multiple copies of that content helping to spark new bursts. Still, beyond a certain popularity of content, the rate of recurrence drops as cascades start exhausting the population of interested individuals. We reproduce these observed patterns in a simple model of content recurrence simulated on a real social network. Using only characteristics of a cascade's initial burst, we demonstrate strong performance in predicting whether it will recur in the future.
【Keywords】: cascade prediction; content recurrence; information diffusion; memes; virality
【Paper Link】 【Pages】:683-694
【Authors】: Travis Martin ; Jake M. Hofman ; Amit Sharma ; Ashton Anderson ; Duncan J. Watts
【Abstract】: How predictable is success in complex social systems? In spite of a recent profusion of prediction studies that exploit online social and information network data, this question remains unanswered, in part because it has not been adequately specified. In this paper we attempt to clarify the question by presenting a simple stylized model of success that attributes prediction error to one of two generic sources: insufficiency of available data and/or models on the one hand; and inherent unpredictability of complex social systems on the other. We then use this model to motivate an illustrative empirical study of information cascade size prediction on Twitter. Despite an unprecedented volume of information about users, content, and past performance, our best performing models can explain less than half of the variance in cascade sizes. In turn, this result suggests that even with unlimited data predictive performance would be bounded well below deterministic accuracy. Finally, we explore this potential bound theoretically using simulations of a diffusion process on a random scale free network similar to Twitter. We show that although higher predictive power is possible in theory, such performance requires a homogeneous system and perfect ex-ante knowledge of it: even a small degree of uncertainty in estimating product quality or slight variation in quality across products leads to substantially more restrictive bounds on predictability. We conclude that realistic bounds on predictive accuracy are not dissimilar from those we have obtained empirically, and that such bounds for other complex social systems for which data is more difficult to obtain are likely even lower.
【Keywords】: cascade prediction; contagion; ex-ante prediction; information diffusion; limits to prediction
【Paper Link】 【Pages】:695-706
【Authors】: Flavio Figueiredo ; Bruno Ribeiro ; Jussara M. Almeida ; Christos Faloutsos
【Abstract】: Which song will Smith listen to next? Which restaurant will Alice go to tomorrow? Which product will John click next? These applications have in common the prediction of user trajectories that are in a constant state of flux over a hidden network (e.g. website links, geographic location). Moreover, what users are doing now may be unrelated to what they will be doing in an hour from now. Mindful of these challenges we propose TribeFlow, a method designed to cope with the complex challenges of learning personalized predictive models of non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow is a general method that can perform next product recommendation, next song recommendation, next location prediction, and general arbitrary-length user trajectory prediction without domain-specific knowledge. TribeFlow is more accurate and up to 413x faster than top competitors.
【Keywords】: latent environments; user trajectory recommendation
【Paper Link】 【Pages】:707-719
【Authors】: Christopher J. Riederer ; Yunsung Kim ; Augustin Chaintreau ; Nitish Korula ; Silvio Lattanzi
【Abstract】: Linking accounts of the same user across datasets -- even when personally identifying information is removed or unavailable -- is an important open problem studied in many contexts. Beyond many practical applications (such as cross-domain analysis, recommendation, and link prediction), understanding this problem more generally informs us about the privacy implications of data disclosure. Previous work has typically addressed this question using either different portions of the same dataset or observing the same behavior across thematically similar domains. In contrast, the general cross-domain case where users have different profiles independently generated from a common but unknown pattern raises new challenges, including difficulties in validation, and remains under-explored. In this paper, we address the reconciliation problem for location-based datasets and introduce a robust method for this general setting. Location datasets are a particularly fruitful domain to study: such records are frequently produced by users in an increasing number of applications and are highly sensitive, especially when linked to other datasets. Our main contribution is a generic and self-tunable algorithm that leverages any pair of sporadic location-based datasets to determine the most likely matching between the users they contain. While making very general assumptions on the patterns of mobile users, we show that the maximum weight matching we compute is provably correct. Although true cross-domain datasets are a rarity, our experimental evaluation uses two entirely new data collections, including one we crawled, on an unprecedented scale. The method we design outperforms naive rules and prior heuristics. As it combines both sparse and dense properties of location-based data and accounts for probabilistic dynamics of observation, it can be shown to be robust even when data gets sparse.
【Keywords】: anonymization; entity resolution; linking; location data; privacy; privacy attacks; social networks
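A toy version of the final matching step, assuming user-to-user scores have already been computed from co-occurring (location, time-bin) visits: scipy's linear_sum_assignment solves the assignment problem (it minimises cost, so the scores are negated). The check-in data below are invented for illustration.

import numpy as np
from collections import Counter
from scipy.optimize import linear_sum_assignment

def cooccurrence_scores(users_a, users_b):
    """Score[i, j] = number of shared (location, time-bin) visits between user i
    of dataset A and user j of dataset B. Visits are iterables of hashable bins."""
    S = np.zeros((len(users_a), len(users_b)))
    for i, va in enumerate(users_a):
        ca = Counter(va)
        for j, vb in enumerate(users_b):
            S[i, j] = sum(min(ca[b], c) for b, c in Counter(vb).items())
    return S

# Toy check-in histories, already discretized into (cell, hour) bins.
A = [[("c1", 9), ("c2", 18)], [("c3", 12), ("c3", 13)]]
B = [[("c3", 12), ("c4", 20)], [("c1", 9), ("c2", 18), ("c2", 19)]]

S = cooccurrence_scores(A, B)
rows, cols = linear_sum_assignment(-S)          # maximize total score
print(list(zip(rows.tolist(), cols.tolist())))  # most likely A->B user matching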
【Paper Link】 【Pages】:725-735
【Authors】: Fuzheng Zhang ; Nicholas Jing Yuan ; Kai Zheng ; Defu Lian ; Xing Xie ; Yong Rui
【Abstract】: The wide adoption of location-based services provides the potential to understand people's mobility patterns at an unprecedented level, which can also enable the food-service industry to accurately predict consumers' dining behavior. In this paper, based on users' dining implicit feedback (restaurant visits via check-ins), explicit feedback (restaurant reviews) as well as some meta data (e.g., location, user demographics, restaurant attributes), we aim at recommending to each user a list of restaurants for their next dining occasion. Implicit and explicit feedback on dining behavior exhibit different characteristics of user preference. Therefore, in our work, a user's dining preference mainly contains two parts: implicit preference coming from check-in data (implicit feedback) and explicit preference coming from rating and review data (explicit feedback). For implicit preference, we first apply a probabilistic tensor factorization model (PTF) to capture preference in a latent subspace. Then, in order to incorporate contextual signals from meta data, we extend PTF by proposing an Implicit Preference Model (IPM), which can simultaneously capture user/restaurant/time preferences in collaborative filtering and dining preferences in a specific context (e.g., spatial distance preference, environmental preference). For explicit preference, we propose an Explicit Preference Model (EPM) by combining matrix factorization with topic modeling to discover the user preference embedded both in rating scores and text content. Finally, we design a unified model termed the Collective Implicit Explicit Preference Model (CIEPM) to combine implicit and explicit preference together for restaurant recommendation. To evaluate the performance of our system, we conduct extensive experiments with large-scale datasets covering hundreds of thousands of users and restaurants. The results reveal that our system is effective for restaurant recommendation.
【Keywords】: explicit preference; implicit preference; restaurant recommendation
【Paper Link】 【Pages】:737-747
【Authors】: Yasuko Matsubara ; Yasushi Sakurai ; Christos Faloutsos
【Abstract】: Given a large collection of time-evolving activities, such as Google search queries, which consist of d keywords/activities for m locations of duration n, how can we analyze temporal patterns and relationships among all these activities and find location-specific trends? How do we go about capturing non-linear evolutions of local activities and forecasting future patterns? For example, assume that we have the online search volume for multiple keywords, e.g., "Nokia/Nexus/Kindle" or "CNN/BBC" for 236 countries/territories, from 2004 to 2015. Our goal is to analyze a large collection of multi-evolving activities, and specifically, to answer the following questions: (a) Is there any sign of interaction/competition between two different keywords? If so, who competes with whom? (b) In which country is the competition strong? (c) Are there any seasonal/annual activities? (d) How can we automatically detect important world-wide (or local) events? We present COMPCUBE, a unifying non-linear model, which provides a compact and powerful representation of co-evolving activities; and also a novel fitting algorithm, COMPCUBE-FIT, which is parameter-free and scalable. Our method captures the following important patterns: (B)asic trends, i.e., non-linear dynamics of co-evolving activities, signs of (C)ompetition and latent interaction, e.g., Nokia vs. Nexus, (S)easonality, e.g., a Christmas spike for iPod in the U.S. and Europe, and (D)eltas, e.g., unrepeated local events such as the U.S. election in 2008. Thanks to its concise but effective summarization, COMPCUBE can also forecast long-range future activities. Extensive experiments on real datasets demonstrate that COMPCUBE consistently outperforms the best state-of-the-art methods in terms of both accuracy and execution speed.
【Keywords】: forecasting; non-linear; parameter-free; time-series
【Paper Link】 【Pages】:749-759
【Authors】: Jiawei Zhang ; Philip S. Yu
【Abstract】: People nowadays usually participate in multiple online social networks simultaneously to enjoy more social network services. Besides the common users, social networks providing similar services can also share many other kinds of information entities, e.g., locations, videos and products. However, these shared information entities in different networks are mostly isolated without any known corresponding connections. In this paper, we aim at inferring such potential corresponding connections linking multiple kinds of shared entities across networks simultaneously. Formally, the problem is referred to as the network "Partial Co-alignmenT" (PCT) problem. PCT is an important problem and can be the prerequisite for many concrete cross-network applications, like social network fusion, mutual information exchange and transfer. Meanwhile, the PCT problem is also very challenging to address due to various reasons, like (1) the heterogeneity of social networks, (2) lack of training instances to build models, and (3) the one-to-one constraint on the correspondence connections. To resolve these challenges, a novel unsupervised network alignment framework, UNICOAT (UNsupervIsed COncurrent AlignmenT), is introduced in this paper. Based on the heterogeneous information, UNICOAT transforms the PCT problem into a joint optimization problem. To solve the objective function, the one-to-one constraint on the corresponding relationships is relaxed, and the redundant non-existing corresponding connections introduced by such a relaxation are pruned with a novel network co-matching algorithm proposed in this paper. Extensive experiments conducted on real-world co-aligned social network datasets demonstrate the effectiveness of UNICOAT in addressing the PCT problem.
【Keywords】: data mining; multiple heterogeneous social networks; partial network co-alignment; unsupervised learning
【Paper Link】 【Pages】:761-770
【Authors】: Nicola Barbieri ; Fabrizio Silvestri ; Mounia Lalmas
【Abstract】: In this paper we focus on estimating the post-click engagement on native ads by predicting the dwell time on the corresponding ad landing pages. To infer relationships between features of the ads and dwell time we resort to the application of survival analysis techniques, which allow us to estimate the distribution of the length of time that the user will spend on the ad. This information is then integrated into the ad ranking function with the goal of promoting the rank of ads that are likely to be clicked and consumed by users (dwell time greater than a given threshold). The online evaluation over live traffic shows that considering post-click engagement consistently increases CTR, decreases the number of bounces and increases the average dwell time, hence leading to a better post-click user experience.
【Keywords】: ad quality; dwell time; mobile advertising; post-click experience; survival analysis framework
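For illustration, the survival curve of dwell time can be estimated with a hand-rolled Kaplan-Meier estimator, treating truncated or still-open sessions as right-censored; this is only the simplest ingredient of a survival-analysis framework, not the paper's full model, and the durations below are invented.

from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S(t))]: the Kaplan-Meier estimate of P(dwell time > t).
    `observed[i]` is False when the i-th dwell time is right-censored."""
    events = Counter(t for t, o in zip(durations, observed) if o)
    at_risk = len(durations)
    surv, s = [], 1.0
    for t in sorted(set(durations)):
        d = events.get(t, 0)
        if d:
            s *= 1.0 - d / at_risk
            surv.append((t, s))
        at_risk -= sum(1 for u in durations if u == t)
    return surv

# Dwell times in seconds; False marks censored observations.
dwell    = [3, 8, 12, 15, 15, 40, 70, 90]
observed = [True, True, True, False, True, True, True, False]
for t, s in kaplan_meier(dwell, observed):
    print(f"P(dwell > {t:>3}s) ~ {s:.2f}")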
【Paper Link】 【Pages】:771-782
【Authors】: Huan Sun ; Hao Ma ; Xiaodong He ; Wen-tau Yih ; Yu Su ; Xifeng Yan
【Abstract】: Tables are pervasive on the Web. Informative web tables range across a large variety of topics, which can naturally serve as a significant resource to satisfy user information needs. Driven by such observations, in this paper, we investigate an important yet largely under-addressed problem: Given millions of tables, how to precisely retrieve table cells to answer a user question. This work proposes a novel table cell search framework to attack this problem. We first formulate the concept of a relational chain which connects two cells in a table and represents the semantic relation between them. With the help of search engine snippets, our framework generates a set of relational chains pointing to potentially correct answer cells. We further employ deep neural networks to conduct more fine-grained inference on which relational chains best match the input question and finally extract the corresponding answer cells. Based on millions of tables crawled from the Web, we evaluate our framework in the open-domain question answering (QA) setting, using both the well-known WebQuestions dataset and user queries mined from Bing search engine logs. On WebQuestions, our framework is comparable to state-of-the-art QA systems based on knowledge bases (KBs), while on Bing queries, it outperforms other systems with a 56.7% relative gain. Moreover, when combined with results from our framework, KB-based QA performance can obtain a relative improvement of 28.1% to 66.7%, demonstrating that web tables supply rich knowledge that might not exist or is difficult to be identified in existing KBs.
【Keywords】: knowledge bases; question answering; table cell search; web search
【Paper Link】 【Pages】:783-793
【Authors】: Gilad Tsur ; Yuval Pinter ; Idan Szpektor ; David Carmel
【Abstract】: Vertical selection is the task of predicting relevant verticals for a Web query so as to enrich the Web search results with complementary vertical results. We investigate a novel variant of this task, where the goal is to detect queries with a question intent. Specifically, we address queries for which the user would like an answer with a human touch. We call these CQA-intent queries, since answers to them are typically found in community question answering (CQA) sites. A typical approach in vertical selection is using a vertical's specific language model of relevant queries and computing the query-likelihood for each vertical as a selective criterion. This works quite well for many domains like Shopping, Local and Travel. Yet, we claim that queries with CQA intent are harder to distinguish by modeling content alone, since they cover many different topics. We propose to also take the structure of queries into consideration, reasoning that queries with question intent have quite a different structure than other queries. We present a supervised classification scheme, random forest over word-clusters for variable length texts, which can model the query structure. Our experiments show that it substantially improves classification performance in the CQA-intent selection task compared to content-oriented based classification, especially as query length grows.
【Keywords】: question intent; vertical selection
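A minimal sklearn sketch of a random forest over word clusters for variable-length queries, with an invented word-to-cluster map standing in for learned clusters and toy CQA-intent labels; it only illustrates the feature construction, not the paper's trained classifier.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical word -> cluster-id map (in practice learned, e.g. from distributional clustering).
CLUSTERS = {"how": 0, "do": 0, "i": 1, "fix": 2, "my": 1, "bike": 3,
            "cheap": 4, "hotels": 5, "in": 0, "paris": 6, "best": 4,
            "way": 2, "to": 0, "learn": 2, "guitar": 3}
N_CLUSTERS = 8

def featurize(query):
    """Bag-of-word-clusters vector, normalized by query length, so that queries
    of different lengths map into the same fixed-size space."""
    v = np.zeros(N_CLUSTERS)
    toks = query.lower().split()
    for t in toks:
        if t in CLUSTERS:
            v[CLUSTERS[t]] += 1
    return v / max(len(toks), 1)

queries = ["how do i fix my bike", "cheap hotels in paris",
           "best way to learn guitar", "hotels in paris"]
labels = [1, 0, 1, 0]                      # 1 = CQA intent, 0 = other (toy labels)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(np.array([featurize(q) for q in queries]), labels)
print(clf.predict([featurize("how do i learn guitar")]))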
【Paper Link】 【Pages】:795-805
【Authors】: Ronan Cummins
【Abstract】: Recent research has shown that long documents are unfairly penalised by a number of current retrieval methods. In this paper, we formally analyse two important but distinct reasons for normalising documents with respect to length, namely verbosity and scope, and discuss the practical implications of not normalising accordingly. We review a number of language modelling approaches and a range of recently developed retrieval methods, and show that most do not correctly model both phenomena, thus limiting their retrieval effectiveness in certain situations. Furthermore, the retrieval characteristics of long natural language queries have not traditionally had the same attention as short keyword queries. We develop a new discriminative query language modelling approach that demonstrates improved performance on long verbose queries by appropriately weighting salient aspects of the query. When combined with query expansion, we show that our new approach yields state-of-the-art performance for long verbose queries.
【Keywords】: document length normalisation; query language models; scope; verbosity
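As a point of reference for the language-modelling approaches reviewed above, a minimal query-likelihood ranker with Dirichlet smoothing (where document length enters through the smoothing term) looks as follows; the paper's discriminative weighting of salient query terms is not reproduced here.

import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, collection, mu=2000):
    """log P(query | doc) under a Dirichlet-smoothed document language model:
    p(t|d) = (tf(t,d) + mu * p(t|C)) / (|d| + mu)."""
    tf = Counter(doc_terms)
    cf = Counter(collection)
    clen, dlen = len(collection), len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_c = cf[t] / clen if clen else 0.0
        p = (tf[t] + mu * p_c) / (dlen + mu)
        if p > 0:
            score += math.log(p)
    return score

docs = {"d1": "the verbose query is long and descriptive".split(),
        "d2": "short keyword query".split()}
collection = [t for d in docs.values() for t in d]
q = "long verbose query".split()
print(sorted(docs, key=lambda d: dirichlet_score(q, docs[d], collection), reverse=True))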
【Paper Link】 【Pages】:807-816
【Authors】: Kewen Liao ; Matthias Petri ; Alistair Moffat ; Anthony Wirth
【Abstract】: Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data and to allow storage costs to be minimized, while still providing some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary, is competitive relative to adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed from concatenations of substrings regularly sampled from the underlying document collection, then pruned in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics, based on effective construction, rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, with a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each "epoch" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed from the concatenation of these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with the best existing pruning method, CARE, our scheme has a similar construction time, but achieves better compression effectiveness. Over several multi-gigabyte document collections, there are relative gains of up to 27%.
【Keywords】: corpus compression; dictionary representation; document retrieval; optimization
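A toy rendering of the construction idea, under simplifying assumptions (exact k-mer counting instead of a reservoir sampler, fixed-length epochs and segments, and an illustrative coverage score counting distinct frequent k-mers in a segment):

from collections import Counter

def build_rlz_dictionary(text, k=4, epoch_len=200, seg_len=50, top=500):
    """Pick one high-coverage segment per epoch and concatenate them.
    Coverage score of a segment = number of distinct top k-mers it contains."""
    kmer_counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    top_kmers = {km for km, _ in kmer_counts.most_common(top)}

    dictionary = []
    for start in range(0, len(text), epoch_len):
        epoch = text[start:start + epoch_len]
        best_seg, best_cov = "", -1
        for i in range(0, max(1, len(epoch) - seg_len + 1), k):
            seg = epoch[i:i + seg_len]
            cov = len({seg[j:j + k] for j in range(len(seg) - k + 1)} & top_kmers)
            if cov > best_cov:
                best_seg, best_cov = seg, cov
        dictionary.append(best_seg)
    return "".join(dictionary)

text = "the quick brown fox jumps over the lazy dog " * 20
print(len(build_rlz_dictionary(text, epoch_len=220, seg_len=60)))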
【Paper Link】 【Pages】:817-827
【Authors】: Markus Rokicki ; Sergej Zerr ; Stefan Siersdorfer
【Abstract】: Many modern data analytics applications in areas such as crisis management, stock trading, and healthcare, rely on components capable of nearly real-time processing of streaming data produced at varying rates. In addition to automatic processing methods, many tasks involved in those applications require further human assessment and analysis. However, current crowdsourcing platforms and systems do not support stream processing with variable loads. In this paper, we investigate how incentive mechanisms in competition based crowdsourcing can be employed in such scenarios. More specifically, we explore techniques for stimulating workers to dynamically adapt to both anticipated and sudden changes in data volume and processing demand, and we analyze effects such as data processing throughput, peak-to-average ratios, and saturation effects. To this end, we study a wide range of incentive schemes and utility functions inspired by real world applications. Our large-scale experimental evaluation with more than 900 participants and more than 6200 hours of work spent by crowd workers demonstrates that our competition based mechanisms are capable of adjusting the throughput of online workers and lead to substantial on-demand performance boosts.
【Keywords】: competitions; crowdsourcing; stream processing
【Paper Link】 【Pages】:829-841
【Authors】: Samuel Barbosa ; Dan Cosley ; Amit Sharma ; Roberto M. Cesar Jr.
【Abstract】: Online communities provide a fertile ground for analyzing people's behavior and improving our understanding of social processes. Because both people and communities change over time, we argue that analyses of these communities that take time into account will lead to deeper and more accurate results. Using Reddit as an example, we study the evolution of users based on comment and submission data from 2007 to 2014. Even using one of the simplest temporal differences between users---yearly cohorts---we find wide differences in people's behavior, including comment activity, effort, and survival. Further, not accounting for time can lead us to misinterpret important phenomena. For instance, we observe that average comment length decreases over any fixed period of time, but comment length in each cohort of users steadily increases during the same period after an abrupt initial drop, an example of Simpson's Paradox. Dividing cohorts into sub-cohorts based on the survival time in the community provides further insights; in particular, longer-lived users start at a higher activity level and make more and shorter comments than those who leave earlier. These findings both give more insight into user evolution in Reddit in particular, and raise a number of interesting questions around studying online behavior going forward.
【Keywords】: cohorts; computational social science; reddit; temporal; user behavior
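The Simpson's-paradox pattern described above falls out of simple cohort bookkeeping; the pandas sketch below, on synthetic numbers, shows the overall yearly average comment length declining while every signup cohort's average increases.

import pandas as pd

# Synthetic comments: signup cohort, calendar year of the comment, comment length.
comments = pd.DataFrame([
    ("2008", 2008, 220), ("2008", 2009, 260), ("2008", 2010, 300),
    ("2009", 2009, 150), ("2009", 2010, 190),
    ("2010", 2010, 100),
], columns=["cohort", "year", "length"])

# Overall average length per calendar year (declines over time)...
print(comments.groupby("year")["length"].mean())

# ...while within every cohort the average length increases year over year.
print(comments.groupby(["cohort", "year"])["length"].mean().unstack())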
【Paper Link】 【Pages】:843-853
【Authors】: Panagiotis Mavridis ; David Gross-Amblard ; Zoltán Miklós
【Abstract】: Besides simple human intelligence tasks such as image labeling, crowdsourcing platforms propose more and more tasks that require very specific skills, especially in participative science projects. In this context, there is a need to reason about the required skills for a task and the set of available skills in the crowd, in order to increase the resulting quality. Most of the existing solutions rely on unstructured tags to model skills (a vector of skills). In this paper we propose to finely model tasks and participants using a skill tree, that is, a taxonomy of skills equipped with a similarity distance between skills. This model of skills makes it possible to map participants to tasks in a way that exploits the natural hierarchy among skills. We illustrate the effectiveness of our model and algorithms through extensive experimentation with synthetic and real data sets.
【Keywords】: crowdsourcing; skill modeling; task mapping
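A small sketch of the skill-tree idea, assuming a toy taxonomy given as a child-to-parent map: skill similarity is measured as path distance in the tree, and each task is greedily sent to the worker whose skill is closest; the greedy step merely stands in for the paper's task-mapping algorithms.

def depth_and_ancestors(skill, parent):
    path = [skill]
    while skill in parent:
        skill = parent[skill]
        path.append(skill)
    return path

def tree_distance(a, b, parent):
    """Number of edges on the taxonomy path between two skills."""
    pa, pb = depth_and_ancestors(a, parent), depth_and_ancestors(b, parent)
    ancestors_b = set(pb)
    common = next(x for x in pa if x in ancestors_b)   # lowest common ancestor
    return pa.index(common) + pb.index(common)

# Toy taxonomy: child -> parent.
parent = {"birds": "zoology", "insects": "zoology",
          "zoology": "biology", "botany": "biology", "biology": "science"}

tasks = ["birds", "botany"]
workers = {"w1": "insects", "w2": "biology"}

for t in tasks:   # greedily send each task to the closest-skilled worker
    best = min(workers, key=lambda w: tree_distance(t, workers[w], parent))
    print(t, "->", best, "(distance", tree_distance(t, workers[best], parent), ")")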
【Paper Link】 【Pages】:855-865
【Authors】: Djellel Eddine Difallah ; Gianluca Demartini ; Philippe Cudré-Mauroux
【Abstract】: Micro-task crowdsourcing has become a popular approach to effectively tackle complex data management problems such as data linkage, missing values, or schema matching. However, the backend crowdsourced operators of crowd-powered systems typically yield higher latencies than the machine-processable operators; this is mainly due to inherent efficiency differences between humans and machines. This problem can be further exacerbated by the lack of workers on the target crowdsourcing platform, or when the workers are shared unequally among a number of competing requesters, including concurrent users from the same organization who execute crowdsourced queries with different types, priorities and prices. Under such conditions, a crowd-powered system acts mostly as a proxy to the crowdsourcing platform, and hence it is very difficult to provide efficiency guarantees to its end-users. Scheduling is the traditional way of tackling such problems in computer science, by prioritizing access to shared resources. In this paper, we propose a new crowdsourcing system architecture that leverages scheduling algorithms to optimize task execution in a shared resources environment, in this case a crowdsourcing platform. Our study aims at assessing the efficiency of the crowd in settings where multiple types of tasks are run concurrently. We present extensive experimental results comparing i) different multi-tenant crowdsourcing jobs, including a workload derived from real traces, and ii) different scheduling techniques tested with real crowd workers. Our experimental results show that task scheduling can be leveraged to achieve fairness and reduce query latency in multi-tenant crowd-powered systems, although with very different tradeoffs compared to traditional settings not including human factors.
【Keywords】: crowd-powered system; crowdsourcing; scheduling
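【Illustrative Sketch】: One simple instance of the scheduling idea discussed above, with all requester names and queue contents invented: a round-robin policy that interleaves pending micro-tasks across competing requesters instead of dispatching them strictly in arrival order. The paper evaluates real scheduling techniques with real crowd workers; this is only a toy illustration of a scheduler sitting between requesters' queues and the worker pool.
```python
import collections
import itertools

# Toy pending micro-task queues, one per competing requester (invented data).
queues = {
    "requester_A": collections.deque(f"A{i}" for i in range(4)),
    "requester_B": collections.deque(f"B{i}" for i in range(2)),
    "requester_C": collections.deque(f"C{i}" for i in range(3)),
}

def round_robin(queues):
    """Interleave tasks fairly across requesters instead of serving them FIFO."""
    for requester in itertools.cycle(list(queues)):
        if not any(queues.values()):
            return
        if queues[requester]:
            yield requester, queues[requester].popleft()

for requester, task in round_robin(queues):
    print(f"dispatch {task} (from {requester}) to the next available worker")
```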
【Paper Link】 【Pages】:867-878
【Authors】: Gary Soeller ; Karrie Karahalios ; Christian Sandvig ; Christo Wilson
【Abstract】: Maps have long played a crucial role in enabling people to conceptualize and navigate the world around them. However, maps also encode the world-views of their creators. Disputed international borders are one example of this: governments may mandate that cartographers produce maps that conform to their view of a territorial dispute. Today, online maps maintained by private corporations have become the norm. However, these new maps are still subject to old debates. Companies like Google and Bing resolve these disputes by localizing their maps to meet government requirements and user preferences, i.e., users in different locations are shown maps with different international boundaries. We argue that this non-transparent personalization of maps may exacerbate nationalistic disputes by promoting divergent views of geopolitical realities. To address this problem, we present MapWatch, our system for detecting and cataloging personalization of international borders in online maps. Our system continuously crawls all map tiles from Google and Bing maps, and leverages crowdworkers to identify border personalization. In this paper, we present the architecture of MapWatch, and analyze the instances of border personalization on Google and Bing, including one border change that MapWatch identified live, as Google was rolling out the update.
【Keywords】: maps; measurement; personalization
【Paper Link】 【Pages】:879-889
【Authors】: Jiongqian Liang ; Deepak Ajwani ; Patrick K. Nicholson ; Alessandra Sala ; Srinivasan Parthasarathy
【Abstract】: An increasing number of applications are modeled and analyzed in network form, where nodes represent entities of interest and edges represent interactions or relationships between entities. Commonly, such relationship analysis tools assume homogeneity in both node type and edge type. Recent research has sought to redress the assumption of homogeneity and focused on mining heterogeneous information networks (HINs) where both nodes and edges can be of different types. Building on such efforts, in this work we articulate a novel approach for mining relationships across entities in such networks while accounting for user preference (prioritization) over relationship type and interestingness metric. We formalize the problem as a top-$k$ lightest paths problem, contextualized in a real-world communication network, and seek to find the $k$ most interesting path instances matching the preferred relationship type. Our solution, PROphetic HEuristic Algorithm for Path Searching (PRO-HEAPS), leverages a combination of novel graph preprocessing techniques, well designed heuristics and the venerable A* search algorithm. We run our algorithm on real-world large-scale graphs and show that our algorithm significantly outperforms a wide variety of baseline approaches with speedups as large as 100X. We also conduct a case study and demonstrate valuable applications of our algorithm.
【Keywords】: graph algorithms; heterogeneous information networks; semantic relationship queries
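【Illustrative Sketch】: PRO-HEAPS combines graph preprocessing, heuristics, and A* search; the snippet below is only a naive, hypothetical baseline for the underlying top-$k$ lightest (simple) paths problem, enumerating paths best-first from a heap over a toy graph. It is exponential in the worst case and carries none of the paper's optimizations, but it makes the problem statement concrete.
```python
import heapq

def k_lightest_paths(graph, source, target, k):
    """Enumerate the k lightest simple paths via best-first search.
    graph: {node: [(neighbor, weight), ...]} (toy baseline, not PRO-HEAPS)."""
    heap = [(0.0, [source])]
    results = []
    while heap and len(results) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            results.append((cost, path))
            continue
        for nxt, w in graph.get(node, []):
            if nxt not in path:                 # keep paths simple
                heapq.heappush(heap, (cost + w, path + [nxt]))
    return results

graph = {
    "a": [("b", 1.0), ("c", 2.5)],
    "b": [("c", 1.0), ("d", 4.0)],
    "c": [("d", 1.0)],
}
for cost, path in k_lightest_paths(graph, "a", "d", k=3):
    print(cost, " -> ".join(path))
```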
【Paper Link】 【Pages】:891-901
【Authors】: Aaron Cahn ; Scott Alfeld ; Paul Barford ; S. Muthukrishnan
【Abstract】: Web cookies are used widely by publishers and 3rd parties to track users and their behaviors. Despite the ubiquitous use of cookies, there is little prior work on their characteristics such as standard attributes, placement policies, and the knowledge that can be amassed via 3rd party cookies. In this paper, we present an empirical study of web cookie characteristics, placement practices and information transmission. To conduct this study, we implemented a lightweight web crawler that tracks and stores the cookies as it navigates to websites. We use this crawler to collect over 3.2M cookies from the two crawls, separated by 18 months, of the top 100K Alexa web sites. We report on the general cookie characteristics and add context via a cookie category index and website genre labels. We consider privacy implications by examining specific cookie attributes and placement behavior of 3rd party cookies. We find that 3rd party cookies outnumber 1st party cookies by a factor of two, and we illuminate the connection between domain genres and cookie attributes. We find that less than 1% of the entities that place cookies can aggregate information across 75% of web sites. Finally, we consider the issue of information transmission and aggregation by domains via 3rd party cookies. We develop a mathematical framework to quantify user information leakage for a broad class of users, and present findings using real world domains. In particular, we demonstrate the interplay between a domain's footprint across the Internet and the browsing behavior of users, which has significant impact on information transmission.
【Keywords】:
【Paper Link】 【Pages】:903-914
【Authors】: Amit K. Chopra ; Munindar P. Singh
【Abstract】: The overarching vision of social machines is to facilitate social processes by having computers provide administrative support. We conceive of a social machine as a sociotechnical system (STS): a software-supported system in which autonomous principals such as humans and organizations interact to exchange information and services. Existing approaches for social machines emphasize the technical aspects and inadequately support the meanings of social processes, leaving them informally realized in human interactions. We posit that a fundamental rethinking is needed to incorporate accountability, essential for addressing the openness of the Web and the autonomy of its principals. We introduce Interaction-Oriented Software Engineering (IOSE) as a paradigm expressly suited to capturing the social basis of STSs. Motivated by promoting openness and autonomy, IOSE focuses not on implementation but on social protocols, specifying how social relationships, characterizing the accountability of the concerned parties, progress as they interact. Motivated by providing computational support, IOSE adopts the accountability representation to capture the meaning of a social machine's states and transitions. We demonstrate IOSE via examples drawn from healthcare. We reinterpret the classical software engineering (SE) principles for the STS setting and show how IOSE is better suited than traditional software engineering for supporting social processes. The contribution of this paper is a new paradigm for STSs, evaluated via conceptual analysis.
【Keywords】: commitment; interaction protocols; norms; social computing; social machines; social protocols
【Paper Link】 【Pages】:915-925
【Authors】: Erdal Kuzey ; Vinay Setty ; Jannik Strötgen ; Gerhard Weikum
【Abstract】: Temporal expressions (TempEx's for short) are increasingly important in search, question answering, information extraction, and more. Techniques for identifying and normalizing explicit temporal expressions work well, but are not designed for and cannot cope with textual phrases that denote named events, such as "Clinton's term as secretary of state". This paper addresses the problem of detecting such temponyms, inferring their temporal scopes, and mapping them to events in a knowledge base if present there. We present methods for this kind of temponym resolution, using an entity- and TempEx-oriented document model and the Yago knowledge base for distant supervision. We develop a family of Integer Linear Programs for jointly inferring temponym mappings to the timeline and knowledge base. This enriches the document representation and also extends the knowledge base by obtaining new alias names for events. Experiments with three different corpora demonstrate the viability of our methods.
【Keywords】: temponym resolution; temporal knowledge; temporal tagging
【Paper Link】 【Pages】:927-938
【Authors】: Octavian-Eugen Ganea ; Marina Ganea ; Aurélien Lucchi ; Carsten Eickhoff ; Thomas Hofmann
【Abstract】: Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referenced as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e., linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned. Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods.
【Keywords】: approximate inference; entity disambiguation; entity linking; loopy belief propagation; probabilistic graphical models; wikification
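【Illustrative Sketch】: A compact, hypothetical example of collective disambiguation with loopy belief propagation, the inference technique named above. The mentions, candidate scores, and pairwise co-occurrence potentials below are invented; the paper's model derives them from data. Max-product messages are passed over a fully connected graph of three mentions with two candidate entities each, and each mention is linked to the candidate with the highest final belief.
```python
import numpy as np

# Toy collective disambiguation: 3 mentions, 2 candidate entities each.
# unary[i]    : local context scores of mention i's candidates (invented numbers)
# pair[(i,j)] : co-occurrence compatibility between the candidates of mentions i, j
unary = [np.array([0.6, 0.4]), np.array([0.5, 0.5]), np.array([0.3, 0.7])]
pair = {(0, 1): np.array([[0.9, 0.1], [0.1, 0.9]]),
        (0, 2): np.array([[0.9, 0.1], [0.1, 0.9]]),
        (1, 2): np.array([[0.9, 0.1], [0.1, 0.9]])}

# Max-product loopy belief propagation over the fully connected mention graph.
msgs = {(edge, node): np.ones(2) for edge in pair for node in edge}
for _ in range(20):
    for edge, phi in pair.items():
        i, j = edge
        for src, dst, mat in [(i, j, phi), (j, i, phi.T)]:
            incoming = unary[src].copy()
            for other in pair:                  # messages into src from other edges
                if src in other and other != edge:
                    incoming *= msgs[(other, src)]
            m = (mat * incoming[:, None]).max(axis=0)
            msgs[(edge, dst)] = m / m.sum()

for i in range(3):
    belief = unary[i] * np.prod([msgs[(e, i)] for e in pair if i in e], axis=0)
    print(f"mention {i}: link to candidate {int(belief.argmax())}")
```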
【Paper Link】 【Pages】:939-949
【Authors】: Alon Y. Halevy ; Natalya Fridman Noy ; Sunita Sarawagi ; Steven Euijong Whang ; Xiao Yu
【Abstract】: Recently, search engines have invested significant effort in answering entity-attribute queries from structured data, but have focused mostly on queries for frequent attributes. In parallel, several research efforts have demonstrated that there is a long tail of attributes, often thousands per class of entities, that are of interest to users. Researchers are beginning to leverage these new collections of attributes to expand the ontologies that power search engines and to recognize entity-attribute queries. Because of the sheer number of potential attributes, such tasks require us to impose some structure on this long and heavy tail of attributes. This paper introduces the problem of organizing the attributes by expressing the compositional structure of their names as a rule-based grammar. These rules offer a compact and rich semantic interpretation of multi-word attributes, while generalizing from the observed attributes to new unseen ones. The paper describes an unsupervised learning method to generate such a grammar automatically from a large set of attribute names. Experiments show that our method can discover a precise grammar over 100,000 attributes of Countries while providing a 40-fold compaction over the attribute names. Furthermore, our grammar enables us to increase the precision of attributes from 47% to more than 90% with only a minimal curation effort. Thus, our approach provides an efficient and scalable way to expand ontologies with attributes of user interest.
【Keywords】: attribute; grammar; ontology
【Paper Link】 【Pages】:951-961
【Authors】: Dawen Liang ; Laurent Charlin ; James McInerney ; David M. Blei
【Abstract】: Collaborative filtering analyzes user preferences for items (e.g., books, movies, restaurants, academic papers) by exploiting the similarity patterns across users. In implicit feedback settings, all the items, including the ones that a user did not consume, are taken into consideration. But this assumption does not accord with the common sense understanding that users have a limited scope and awareness of items. For example, a user might not have heard of a certain paper, or might live too far away from a restaurant to experience it. In the language of causal analysis (Imbens & Rubin, 2015), the assignment mechanism (i.e., the items that a user is exposed to) is a latent variable that may change for various user/item combinations. In this paper, we propose a new probabilistic approach that directly incorporates user exposure to items into collaborative filtering. The exposure is modeled as a latent variable and the model infers its value from data. In doing so, we recover one of the most successful state-of-the-art approaches as a special case of our model (Hu et al. 2008), and provide a plug-in method for conditioning exposure on various forms of exposure covariates (e.g., topics in text, venue locations). We show that our scalable inference algorithm outperforms existing benchmarks in four different domains both with and without exposure covariates.
【Keywords】: collaborative filtering; matrix factorization; recommender systems
【Paper Link】 【Pages】:963-973
【Authors】: Austin R. Benson ; Ravi Kumar ; Andrew Tomkins
【Abstract】: Multinomial logistic regression is a powerful tool to model choice from a finite set of alternatives, but it comes with an underlying model assumption called the independence of irrelevant alternatives, stating that any item added to the set of choices will decrease all other items' likelihood by an equal fraction. We perform statistical tests of this assumption across a variety of datasets and give results showing how often it is violated. When this axiom is violated, choice theorists will often invoke a richer model known as nested logistic regression, in which information about competition among items is encoded in a tree structure known as a nest. However, to our knowledge there are no known algorithms to induce the correct nest structure. We present the first such algorithm, which runs in quadratic time under an oracle model, and we pair it with a matching lower bound. We then perform experiments on synthetic and real datasets to validate the algorithm, and show that nested logit over learned nests outperforms traditional multinomial regression. Finally, in addition to automatically learning nests, we show how nests may be constructed by hand to test hypotheses about the data, and evaluated by their explanatory power.
【Keywords】: discrete choice; independence of irrelevant alternatives; logit
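【Illustrative Sketch】: A tiny worked example, with invented utilities, of the independence-of-irrelevant-alternatives assumption that the paper tests: under multinomial logit, removing an alternative rescales all remaining choice probabilities by the same factor, so the classic red-bus/blue-bus pattern (where the two buses should mostly steal share from each other) cannot be represented; nested logit exists precisely to relax this.
```python
import math

def mnl_probs(utilities):
    """Multinomial logit choice probabilities for a set of alternatives."""
    expu = {a: math.exp(u) for a, u in utilities.items()}
    z = sum(expu.values())
    return {a: e / z for a, e in expu.items()}

# Toy utilities (invented): car, blue bus, red bus.
full = mnl_probs({"car": 1.0, "blue_bus": 0.5, "red_bus": 0.5})
reduced = mnl_probs({"car": 1.0, "blue_bus": 0.5})

# Under IIA the ratio P(car)/P(blue_bus) is identical with or without the red bus,
# even though intuitively the two buses compete mostly with each other.
print(full["car"] / full["blue_bus"], reduced["car"] / reduced["blue_bus"])
```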
【Paper Link】 【Pages】:975-985
【Authors】: Ellery Wulczyn ; Robert West ; Leila Zia ; Jure Leskovec
【Abstract】: The different Wikipedia language editions vary dramatically in how comprehensive they are. As a result, most language editions contain only a small fraction of the sum of information that exists across all Wikipedias. In this paper, we present an approach to filling gaps in article coverage across different Wikipedia editions. Our main contribution is an end-to-end system for recommending articles for creation that exist in one language but are missing in another. The system involves identifying missing articles, ranking the missing articles according to their importance, and recommending important missing articles to editors based on their interests. We empirically validate our models in a controlled experiment involving 12,000 French Wikipedia editors. We find that personalizing recommendations increases editor engagement by a factor of two. Moreover, recommending articles increases their chance of being created by a factor of 3.2. Finally, articles created as a result of our recommendations are of comparable quality to organically created articles. Overall, our system leads to more engaged editors and faster growth of Wikipedia with no effect on its quality.
【Keywords】: recommendation systems; translation; wikipedia
【Paper Link】 【Pages】:987-997
【Authors】: Tobias Schnabel ; Paul N. Bennett ; Susan T. Dumais ; Thorsten Joachims
【Abstract】: In this paper, we study shortlists as an interface component for recommender systems with the dual goal of supporting the user's decision process, as well as improving implicit feedback elicitation for increased recommendation quality. A shortlist is a temporary list of candidates that the user is currently considering, e.g., a list of a few movies the user is currently considering for viewing. From a cognitive perspective, shortlists serve as digital short-term memory where users can off-load the items under consideration -- thereby decreasing their cognitive load. From a machine learning perspective, adding items to the shortlist generates a new implicit feedback signal as a by-product of exploration and decision making which can improve recommendation quality. Shortlisting therefore provides additional data for training recommendation systems without the increases in cognitive load that requesting explicit feedback would incur. We perform a user study with a movie recommendation setup to compare interfaces that offer shortlist support with those that do not. From the user study we conclude: (i) users make better decisions with a shortlist; (ii) users prefer an interface with shortlist support; and (iii) the additional implicit feedback from sessions with a shortlist improves the quality of recommendations by nearly a factor of two.
【Keywords】: decision making; digital memory; exploration; implicit feedback; interfaces; user engagement
【Paper Link】 【Pages】:999-1008
【Authors】: Nethanel Gelernter ; Amir Herzberg
【Abstract】: We present the malicious CAPTCHA attack, allowing a rogue website to trick users into unknowingly disclosing their private information. The rogue site displays the private information to the user in an obfuscated manner, as if it were a CAPTCHA challenge; the user is unaware that solving the CAPTCHA results in disclosing private information. This circumvents the Same Origin Policy (SOP), whose goal is to prevent rogue sites from accessing private information, by exploiting the fact that many websites allow private information to be displayed (to the user) upon requests from any (even rogue) website. Information so disclosed includes name, phone number, email and physical addresses, search history, preferences, partial credit card numbers, and more. The vulnerability is common and the attack works for many popular sites, including nine out of the ten most popular websites. We evaluated the attack using IRB-approved, ethical user experiments.
【Keywords】: attack; malicious captcha attack; web; web application attack; web application security
【Paper Link】 【Pages】:1009-1019
【Authors】: Frank Li ; Grant Ho ; Eric Kuan ; Yuan Niu ; Lucas Ballard ; Kurt Thomas ; Elie Bursztein ; Vern Paxson
【Abstract】: As miscreants routinely hijack thousands of vulnerable web servers weekly for cheap hosting and traffic acquisition, security services have turned to notifications both to alert webmasters of ongoing incidents as well as to expedite recovery. In this work we present the first large-scale measurement study on the effectiveness of combinations of browser, search, and direct webmaster notifications at reducing the duration a site remains compromised. Our study captures the life cycle of 760,935 hijacking incidents from July, 2014--June, 2015, as identified by Google Safe Browsing and Search Quality. We observe that direct communication with webmasters increases the likelihood of cleanup by over 50% and reduces infection lengths by at least 62%. Absent this open channel for communication, we find browser interstitials---while intended to alert visitors to potentially harmful content---correlate with faster remediation. As part of our study, we also explore whether webmasters exhibit the necessary technical expertise to address hijacking incidents. Based on appeal logs where webmasters alert Google that their site is no longer compromised, we find 80% of operators successfully clean up symptoms on their first appeal. However, a sizeable fraction of site owners do not address the root cause of compromise, with over 12% of sites falling victim to a new attack within 30 days. We distill these findings into a set of recommendations for improving web security and best practices for webmasters.
【Keywords】: compromised websites; large-scale measurement; security notifications; webmaster comprehension
【Paper Link】 【Pages】:1021-1032
【Authors】: Oleksii Starov ; Johannes Dahse ; Syed Sharique Ahmad ; Thorsten Holz ; Nick Nikiforakis
【Abstract】: Web shells are malicious scripts that attackers upload to a compromised web server in order to remotely execute arbitrary commands, maintain their access, and elevate their privileges. Despite their high prevalence in practice and heavy involvement in security breaches, web shells have never been the direct subject of any study. Instead, web shells have been treated as malicious black boxes that need to be detected and removed, rather than as malicious pieces of software that need to be analyzed and understood in detail. In this paper, we report on the first comprehensive study of web shells. By utilizing different static and dynamic analysis methods, we discover and quantify the visible and invisible features offered by popular malicious shells, and we discuss how attackers can take advantage of these features. For visible features, we find the presence of password bruteforcers, SQL database clients, portscanners, and checks for the presence of security software installed on the compromised server. In terms of invisible features, we find that about half of the analyzed shells contain an authentication mechanism, but this mechanism can be bypassed in a third of the cases. Furthermore, we find that about a third of the analyzed shells perform homephoning, i.e., the shells, upon execution, surreptitiously communicate to various third parties with the intent of revealing the location of new shell installations. By setting up honeypots, we quantify the number of third-party attackers benefiting from shell installations and show how an attacker, by merely registering the appropriate domains, can completely take over all installations of specific vulnerable shells.
【Keywords】: backdoor; expired domains; homephoning; malware; web shells
【Paper Link】 【Pages】:1033-1043
【Authors】: Mohammad Karami ; Youngsam Park ; Damon McCoy
【Abstract】: DDoS-for-hire services, also known as booters, have commoditized DDoS attacks and enabled abusive subscribers of these services to cheaply extort, harass and intimidate businesses and people by taking them offline. However, due to the underground nature of these booters, little is known about their underlying technical and business structure. In this paper, we empirically measure many facets of their technical and payment infrastructure. We also perform an analysis of leaked and scraped data from three major booters---Asylum Stresser, Lizard Stresser and VDO---which provides us with an in-depth view of their customers and victims. Finally, we conduct a large-scale payment intervention in collaboration with PayPal and evaluate its effectiveness as a deterrent to their operations. Based on our analysis, we show that these booters are responsible for hundreds of thousands of DDoS attacks and identify potentially promising methods to undermine these services by increasing their costs of operation.
【Keywords】: booter; ddos; measurement; paypal
【Paper Link】 【Pages】:1045-1055
【Authors】: Noriaki Kawamae
【Abstract】: Our proposal, $N$-gram over Context (NOC), is a nonparametric topic model that aims to aid our understanding of a given corpus and can be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses both on the topic structure as an internal linguistic structure, and on $N$-grams as an external linguistic structure. To improve the quality of topic-specific $N$-grams, NOC reveals a tree of topics that captures the semantic relationships between topics in a given corpus as context, and forms $N$-grams by offering power-law distributions for word frequencies on this topic tree. To obtain both linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain thematic coherence and to form $N$-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding $N$-grams, and that it well complements human experts and domain-specific knowledge. D-NOC can process large data sets while preserving full generative model performance, with the help of an open-source distributed machine learning framework.
【Keywords】: N-gram topic model; graphical models; latent variable models; mapreduce; nonparametric models; topic models
【Paper Link】 【Pages】:1057-1067
【Authors】: Jialu Liu ; Xiang Ren ; Jingbo Shang ; Taylor Cassidy ; Clare R. Voss ; Jiawei Han
【Abstract】: Many text mining approaches adopt bag-of-words or $n$-gram models to represent documents. Looking beyond just the words, i.e., the explicit surface forms, in a document can improve a computer's understanding of text. Being aware of this, researchers have proposed concept-based models that rely on a human-curated knowledge base to incorporate other related concepts in the document representation. But these methods are not desirable when applied to vertical domains (e.g., literature, enterprise, etc.) due to low coverage of in-domain concepts in the general knowledge base and interference from out-of-domain concepts. In this paper, we propose a data-driven model named Latent Keyphrase Inference (LAKI) that represents documents with a vector of closely related domain keyphrases instead of single words or existing concepts in the knowledge base. We show that given a corpus of in-domain documents, topical content units can be learned for each domain keyphrase, which enables a computer to do smart inference to discover latent document keyphrases, going beyond just explicit mentions. Compared with the state-of-the-art document representation approaches, LAKI fills the gap between bag-of-words and concept-based models by using domain keyphrases as the basic representation unit. It removes dependency on a knowledge base while providing, with keyphrases, readily interpretable representations. When evaluated against 8 other methods on two text mining tasks over two corpora, LAKI outperformed all of them.
【Keywords】: document representation; keyphrase extraction; keyphrase inference; noisy-or bayesian network
【Paper Link】 【Pages】:1069-1079
【Authors】: Christophe Van Gysel ; Maarten de Rijke ; Marcel Worring
【Abstract】: We introduce an unsupervised discriminative model for the task of retrieving experts in online document collections. We exclusively employ textual evidence and avoid explicit feature engineering by learning distributed word representations in an unsupervised way. We compare our model to state-of-the-art unsupervised statistical vector space and probabilistic generative approaches. Our proposed log-linear model achieves the retrieval performance levels of state-of-the-art document-centric methods with the low inference cost of so-called profile-centric approaches. It yields a statistically significant improved ranking over vector space and generative models in most cases, matching the performance of supervised methods on various benchmarks. That is, by using solely text we can do as well as methods that work with external evidence and/or relevance feedback. A contrastive analysis of rankings produced by discriminative and generative approaches shows that they have complementary strengths due to the ability of the unsupervised discriminative model to perform semantic matching.
【Keywords】: expertise retrieval; language models; semantic matching
【Paper Link】 【Pages】:1081-1091
【Authors】: Alexey Borisov ; Pavel Serdyukov ; Maarten de Rijke
【Abstract】: In web search, latent semantic models have been proposed to bridge the lexical gap between queries and documents that is due to the fact that searchers and content creators often use different vocabularies and language styles to express the same concept. Modern search engines simply use the outputs of latent semantic models as features for a so-called global ranker. We argue that this is not optimal, because a single value output by a latent semantic model may be insufficient to describe all aspects of the model's prediction, and thus some information captured by the model is not used effectively by the search engine. To increase the effectiveness of latent semantic models in web search, we propose to create metafeatures, i.e., feature vectors that describe the structure of the model's prediction for a given query-document pair, and to pass them to the global ranker along with the models' scores. We provide simple guidelines to represent the latent semantic model's prediction with more than a single number, and illustrate these guidelines using several latent semantic models. We test the impact of the proposed metafeatures on a web document ranking task using four latent semantic models. Our experiments show that (1) through the use of metafeatures, the performance of each individual latent semantic model can be improved by 10.2% and 4.2% in NDCG scores at truncation levels 1 and 10, respectively; and (2) through the use of metafeatures, the performance of a combination of latent semantic models can be improved by 7.6% and 3.8% in NDCG scores at truncation levels 1 and 10, respectively.
【Keywords】: latent semantic models; metafeatures; web search
【Paper Link】 【Pages】:1093-1102
【Authors】: Renato Paes Leme ; Martin Pál ; Sergei Vassilvitskii
【Abstract】: We study the question of setting and testing reserve prices in single item auctions when the bidders are not identical. At a high level, there are two generalizations of the standard second price auction: in the lazy version we first determine the winner, and then apply reserve prices; in the eager version we first discard the bidders not meeting their reserves, and then determine the winner among the rest. We show that the two versions have dramatically different properties: lazy reserves are easy to optimize, and A/B test in production, whereas eager reserves always lead to higher welfare, but their optimization is NP-complete, and naive A/B testing will lead to incorrect conclusions. Despite their different characteristics, we show that the overall revenue for the two scenarios is always within a factor of 2 of each other, even in the presence of correlated bids. Moreover, we prove that the eager auction dominates the lazy auction on revenue whenever the bidders are independent or symmetric. We complement our theoretical results with simulations on real world data that show that even suboptimally set eager reserve prices are preferred from a revenue standpoint.
【Keywords】: Theory
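【Illustrative Sketch】: A small simulation, under assumptions invented here (three bidders with uniform random bids and fixed per-bidder reserves), of the two second-price variants compared above: the lazy auction picks the winner first and then applies that bidder's reserve, while the eager auction first drops bidders below their reserves and runs second price among the rest. It is only meant to make the two mechanisms concrete, not to reproduce the paper's revenue analysis.
```python
import random

def lazy_revenue(bids, reserves):
    """Determine the winner first, then apply only that bidder's reserve."""
    winner = max(bids, key=bids.get)
    if bids[winner] < reserves[winner]:
        return 0.0                                   # item goes unsold
    second_highest = sorted(bids.values())[-2]
    return max(second_highest, reserves[winner])

def eager_revenue(bids, reserves):
    """Drop bidders below their reserves first, then run second price on the rest."""
    eligible = {b: v for b, v in bids.items() if v >= reserves[b]}
    if not eligible:
        return 0.0
    winner = max(eligible, key=eligible.get)
    others = [v for b, v in eligible.items() if b != winner]
    return max(max(others, default=0.0), reserves[winner])

random.seed(0)
reserves = {"x": 0.5, "y": 0.3, "z": 0.7}            # invented per-bidder reserves
trials = [{b: random.random() for b in reserves} for _ in range(10000)]
print("lazy  average revenue:",
      round(sum(lazy_revenue(t, reserves) for t in trials) / len(trials), 3))
print("eager average revenue:",
      round(sum(eager_revenue(t, reserves) for t in trials) / len(trials), 3))
```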
【Paper Link】 【Pages】:1103-1111
【Authors】: Dominic Coey ; Michael Bailey
【Abstract】: Identifying the same internet user across devices or over time is often infeasible. This presents a problem for online experiments, as it precludes person-level randomization. Randomization must instead be done using imperfect proxies for people, like cookies, email addresses, or device identifiers. Users may be partially treated and partially untreated as some of their cookies are assigned to the test group and some to the control group, complicating statistical inference. We show that the estimated treatment effect in a cookie-level experiment converges to a weighted average of the marginal effects of treating more of a user's cookies. If the marginal effects of cookie treatment exposure are positive and constant, it underestimates the true person-level effect by a factor equal to the number of cookies per person. Using two separate datasets---cookie assignment data from Atlas and advertising exposure and purchase data from Facebook---we empirically quantify the differences between cookie and person-level advertising effectiveness experiments. The effects are substantial: cookie tests underestimate the true person-level effects by a factor of about three, and require two to three times the number of people to achieve the same power as a test with perfect treatment assignment.
【Keywords】: advertising effectiveness; causal inference; cookies; experiments; online advertising
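【Illustrative Sketch】: A stylized simulation of the dilution effect described above, with all numbers invented: each person owns three cookies that are randomized independently, the person-level outcome is linear in the fraction of their cookies treated, and a cookie-level analysis attributes the person's outcome to each cookie. Under these assumptions the cookie-level estimate recovers roughly one third of the true person-level effect, matching the factor-of-(cookies-per-person) underestimation discussed in the paper for constant marginal effects.
```python
import random

# Stylized cookie-dilution simulation (all numbers invented, not Atlas/Facebook data).
random.seed(1)
COOKIES_PER_PERSON = 3
TRUE_EFFECT = 1.0            # lift in outcome when a person is fully treated
N_PEOPLE = 100000

treated, control = [], []
for _ in range(N_PEOPLE):
    # each of the person's cookies is randomized to test/control independently
    assignments = [random.random() < 0.5 for _ in range(COOKIES_PER_PERSON)]
    exposure = sum(assignments) / COOKIES_PER_PERSON
    outcome = random.gauss(0, 1) + TRUE_EFFECT * exposure   # linear in exposure
    # a cookie-level analysis credits the same person-level outcome to every cookie
    for assigned_to_test in assignments:
        (treated if assigned_to_test else control).append(outcome)

cookie_estimate = sum(treated) / len(treated) - sum(control) / len(control)
print("true person-level effect :", TRUE_EFFECT)
print("cookie-level estimate    :", round(cookie_estimate, 3))   # ~ TRUE_EFFECT / 3
```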
【Paper Link】 【Pages】:1113-1122
【Authors】: Maja R. Rudolph ; Joseph G. Ellis ; David M. Blei
【Abstract】: Many online companies sell advertisement space in second-price auctions with reserve. In this paper, we develop a probabilistic method to learn a profitable strategy to set the reserve price. We use historical auction data with features to fit a predictor of the best reserve price. This problem is delicate - the structure of the auction is such that a reserve price set too high is much worse than a reserve price set too low. To address this we develop objective variables, an approach for combining probabilistic modeling with optimal decision-making. Objective variables are "hallucinated observations" that transform the revenue maximization task into a regularized maximum likelihood estimation problem, which we solve with the EM algorithm. This framework enables a variety of prediction mechanisms to set the reserve price. As examples, we study objective variable methods with regression, kernelized regression, and neural networks on simulated and real data. Our methods outperform previous approaches both in terms of scalability and profit.
【Keywords】: graphical models; online auctions; second-price auctions
【Paper Link】 【Pages】:1123-1132
【Authors】: Cinar Kilcioglu ; Justin M. Rao
【Abstract】: The public cloud "infrastructure as a service" market possesses unique features that make it difficult to predict long-run economic behavior. On the one hand, major providers buy their hardware from the same manufacturers, operate in similar locations and offer a similar menu of products. On the other hand, the competitors use different proprietary "fabric" to manage virtualization, resource allocation and data transfer. The menus offered by each provider involve a discrete number of choices (virtual machine sizes) and allow providers to locate in different parts of the price-quality space. We document this differentiation empirically by running benchmarking tests. This allows us to calibrate a model of firm technology. Firm technology is an input into our theoretical model of price-quality competition. The monopoly case highlights the importance of competition in blocking a "bad equilibrium" where performance is intentionally slowed down or options are unduly limited. In duopoly, price competition is fierce, but prices do not converge to the same level because of price-quality differentiation. The model helps explain market trends, such as the healthy operating profit margin recently reported by Amazon Web Services. Our empirically calibrated model helps explain not only price-cutting behavior but also how providers can maintain a profit despite predictions that the market "should be" totally commoditized.
【Keywords】: cloud computing; game theory; pricing
【Paper Link】 【Pages】:1133-1144
【Authors】: Dhanya R. Krishnan ; Do Le Quoc ; Pramod Bhatotia ; Christof Fetzer ; Rodrigo Rodrigues
【Abstract】: Incremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output. Both paradigms rely on computing over a subset of data items instead of computing over the entire dataset, but they differ in their means for skipping parts of the computation. Incremental computing relies on the memoization of intermediate results of sub-computations, and reusing these memoized results across jobs. Approximate computing relies on representative sampling of the entire dataset to compute over a subset of data items. In this paper, we observe that these two paradigms are complementary, and can be married together! Our idea is quite simple: design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation using micro-benchmarks and real-world case-studies shows that IncApprox achieves the benefits of both incremental and approximate computing.
【Keywords】: approximate computation; dependence graph; error estimation; incremental computation; memoization; real-time processing; self-adjusting computation; stratified sampling; stream processing
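【Illustrative Sketch】: A schematic, hypothetical rendering of the core idea married above: within each stratum of the input, prefer items whose sub-results were memoized in a previous run, and top up with freshly sampled items. The stratum names, item ids, and cache contents below are invented, and the error-bounding machinery of IncApprox is omitted.
```python
import random

def biased_stratified_sample(strata, memoized, per_stratum):
    """Pick per_stratum items from each stratum, preferring memoized items so their
    cached sub-results can be reused; top up with freshly sampled items."""
    sample = []
    for name, items in strata.items():
        cached = [x for x in items if x in memoized]
        fresh = [x for x in items if x not in memoized]
        chosen_cached = cached[:per_stratum]
        still_needed = per_stratum - len(chosen_cached)
        sample.extend((name, x, True) for x in chosen_cached)
        sample.extend((name, x, False)
                      for x in random.sample(fresh, min(still_needed, len(fresh))))
    return sample

random.seed(0)
strata = {"US": list(range(0, 100)), "EU": list(range(100, 160))}   # invented strata
memoized = {3, 7, 11, 120, 125}     # items whose sub-results were cached last run
for stratum, item, reused in biased_stratified_sample(strata, memoized, per_stratum=5):
    print(stratum, item, "reuse cached result" if reused else "recompute")
```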
【Paper Link】 【Pages】:1145-1155
【Authors】: Avigdor Gal ; Haggai Roitman ; Tomer Sagi
【Abstract】: Ontology & schema matching predictors assess the quality of matchers in the absence of an exact match. We propose MCD (Match Competitor Deviation), a new diversity-based predictor that compares the strength of a matcher's confidence in the correspondence of a concept pair with respect to other correspondences that involve either concept. We also propose to use MCD as a regulator to optimally control the balance between Precision and Recall, and use it towards 1:1 matching by combining it with a similarity measure that is based on solving a maximum weight bipartite graph matching (MWBM) problem. Optimizing the combined measure is known to be an NP-Hard problem. Therefore, we propose CEM, an approximation of an optimal match that efficiently scans multiple possible matches using rare event estimation. Using a thorough empirical study over several benchmark real-world datasets, we show that MCD outperforms other state-of-the-art predictors and that CEM significantly outperforms existing matchers.
【Keywords】: cross-entropy; matching prediction; ontology alignment; schema matching
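【Illustrative Sketch】: The 1:1 matching component mentioned above reduces to maximum weight bipartite graph matching (MWBM); the snippet below solves a toy instance with SciPy's assignment-problem solver on an invented confidence matrix. The paper's MCD predictor and CEM approximation are not shown; this only illustrates the MWBM building block.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy matcher confidence matrix between source and target schema attributes
# (invented values); rows = source attributes, columns = target attributes.
source = ["name", "addr", "phone"]
target = ["full_name", "address", "telephone"]
confidence = np.array([[0.9, 0.2, 0.1],
                       [0.3, 0.8, 0.2],
                       [0.1, 0.3, 0.7]])

# Maximum weight bipartite matching gives the best 1:1 correspondence set.
rows, cols = linear_sum_assignment(confidence, maximize=True)
for r, c in zip(rows, cols):
    print(f"{source[r]} <-> {target[c]}  (confidence {confidence[r, c]:.2f})")
```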
【Paper Link】 【Pages】:1157-1167
【Authors】: Jessica Su ; Aneesh Sharma ; Sharad Goel
【Abstract】: Online social networks regularly offer users personalized, algorithmic suggestions of whom to connect to. Here we examine the aggregate effects of such recommendations on network structure, focusing on whether these recommendations increase the popularity of niche users or, conversely, those who are already popular. We investigate this issue by empirically and theoretically analyzing abrupt changes in Twitter's network structure around the mid-2010 introduction of its "Who to Follow" feature. We find that users across the popularity spectrum benefitted from the recommendations; however, the most popular users profited substantially more than average. We trace this "rich get richer" phenomenon to three intertwined factors. First, as is typical of network recommenders, the system relies on a "friend-of-friend"-style algorithm, which we show generally results in users being recommended proportional to their degree. Second, we find that the baseline growth rate of users is sublinear in degree. This mismatch between the recommender and the natural network dynamics thus alters the structural evolution of the network. Finally, we find that people are much more likely to respond positively to recommendations for popular users---perhaps because of their greater name recognition---further amplifying the cumulative advantage of well-known individuals.
【Keywords】: cumulative advantage; network evolution; social networks; twitter
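【Illustrative Sketch】: A toy, invented example of the "friend-of-friend"-style recommender discussed above: candidates are accounts followed by the accounts a user already follows, scored by how many of those friends follow them. Because high-degree accounts appear in many friends' follow lists, such a recommender tends to surface users roughly in proportion to their degree, which is the mechanism behind the rich-get-richer effect the paper documents.
```python
from collections import Counter

# Toy directed follow graph (invented): follows[u] = accounts u follows.
follows = {
    "u1": {"a", "b"},
    "u2": {"a", "c"},
    "a":  {"b", "c", "d"},
    "b":  {"d"},
    "c":  {"d", "e"},
}

def friend_of_friend(user, follows, k=3):
    """Recommend accounts followed by the accounts `user` follows,
    scored by how many friends follow each candidate."""
    counts = Counter()
    for friend in follows.get(user, ()):
        for candidate in follows.get(friend, ()):
            if candidate != user and candidate not in follows[user]:
                counts[candidate] += 1
    return counts.most_common(k)

print(friend_of_friend("u1", follows))
```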
【Paper Link】 【Pages】:1169-1179
【Authors】: Samuel F. Way ; Daniel B. Larremore ; Aaron Clauset
【Abstract】: Women are dramatically underrepresented in computer science at all levels in academia and account for just 15% of tenure-track faculty. Understanding the causes of this gender imbalance would inform both policies intended to rectify it and employment decisions by departments and individuals. Progress in this direction, however, is complicated by the complexity and decentralized nature of faculty hiring and the non-independence of hires. Using comprehensive data on both hiring outcomes and scholarly productivity for 2659 tenure-track faculty across 205 Ph.D.-granting departments in North America, we investigate the multi-dimensional nature of gender inequality in computer science faculty hiring through a network model of the hiring process. Overall, we find that hiring outcomes are most directly affected by (i) the relative prestige between hiring and placing institutions and (ii) the scholarly productivity of the candidates. After including these, and other features, the addition of gender did not significantly reduce modeling error. However, gender differences do exist, e.g., in scholarly productivity, postdoctoral training rates, and in career movements up the rankings of universities, suggesting that the effects of gender are indirectly incorporated into hiring decisions through gender's covariates. Furthermore, we find evidence that more highly ranked departments recruit female faculty at higher than expected rates, which appears to inhibit similar efforts by lower ranked departments. These findings illustrate the subtle nature of gender inequality in faculty hiring networks and provide new insights to the underrepresentation of women in computer science.
【Keywords】: data science; employment networks; modeling; gender; network analysis; social dynamics
【Paper Link】 【Pages】:1181-1190
【Authors】: Beidou Wang ; Martin Ester ; Jiajun Bu ; Yu Zhu ; Ziyu Guan ; Deng Cai
【Abstract】: Email is one of the most important communication tools today, but email overload resulting from the large number of unimportant or irrelevant emails causes trillions in economic losses every year. Personalized email prioritization algorithms are therefore urgently needed. Despite much previous effort on this topic, broadcast email, an important type of email, has been overlooked in the literature. Broadcast emails differ significantly from normal emails, introducing both new challenges and opportunities. On one hand, the lack of real senders and the limited user interactions invalidate the key features exploited by traditional email prioritization algorithms; on the other hand, the thousands of receivers of a single broadcast email create the opportunity to predict importance through collaborative filtering. However, broadcast emails face a severe cold-start problem that hinders the direct application of collaborative filtering. In this paper, we propose the first framework for broadcast email prioritization by designing a novel active learning model that considers the collaborative filtering, implicit feedback, and time-sensitive responsiveness features of broadcast emails. Our method is thoroughly evaluated on a large-scale, real-world industrial dataset from Samsung Electronics. It proves highly effective and outperforms state-of-the-art personalized email prioritization methods.
【Keywords】: active learning; email prioritization; recommendation system
【Paper Link】 【Pages】:1191-1202
【Authors】: Bichen Shi ; Georgiana Ifrim ; Neil J. Hurley
【Abstract】: We address the problem of real-time recommendation of streaming Twitter hashtags to an incoming stream of news articles. The technical challenge can be framed as large scale topic classification where the set of topics (i.e., hashtags) is huge and highly dynamic. Our main applications come from digital journalism, e.g., promoting original content to Twitter communities and social indexing of news to enable better retrieval and story tracking. In contrast to the state-of-the-art that focuses on topic modelling approaches, we propose a learning-to-rank approach for modelling hashtag relevance. This enables us to deal with the dynamic nature of the problem, since a relevance model is stable over time, while a topic model needs to be continuously retrained. We present the data collection and processing pipeline, as well as our methodology for achieving low latency, high precision recommendations. Our empirical results show that our method outperforms the state-of-the-art, delivering more than 80% precision. Our techniques are implemented in a real-time system that is currently under user trial with a big news organisation.
【Keywords】: dynamic topics; hashtag recommendation; learning-to-rank; news; social indexing
【Paper Link】 【Pages】:1203-1213
【Authors】: Aditya Pal ; Amaç Herdagdelen ; Sourav Chatterji ; Sumit Taank ; Deepayan Chakrabarti
【Abstract】: Instagram has more than 400 million monthly active accounts who share more than 80 million pictures and videos daily. This large volume of user-generated content is the application's notable strength, but also makes the problem of finding the authoritative users for a given topic challenging. Discovering topical authorities can be useful for providing relevant recommendations to the users. In addition, it can aid in building a catalog of topics and top topical authorities in order to engage new users, and hence provide a solution to the cold-start problem. In this paper, we present a novel approach that we call the Authority Learning Framework (ALF) to find topical authorities in Instagram. ALF is based on the self-described interests of the follower base of popular accounts. We infer regular users' interests from their self-reported biographies that are publicly available and use Wikipedia pages to ground these interests as fine-grained, disambiguated concepts. We propose a generalized label propagation algorithm to propagate the interests over the follower graph to the popular accounts. We show that even if biography-based interests are sparse at an individual user level they provide strong signals to infer the topical authorities and let us obtain a high precision authority list per topic. Our experiments demonstrate that ALF performs significantly better at user recommendation task compared to fine-tuned and competitive methods, via controlled experiments, in-the-wild tests, and over an expert-curated list of topical authorities.
【Keywords】: instagram; topical authorities; user recommendation
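【Illustrative Sketch】: A one-step, hypothetical simplification of the interest-propagation idea behind ALF: self-reported interests parsed from followers' biographies are pushed onto the popular accounts they follow and aggregated into a per-account topic ranking. The accounts, users, and interests below are invented, and the paper's generalized label propagation iterates over the full follower graph rather than doing a single pass.
```python
from collections import Counter, defaultdict

# followers[account] = set of regular users following that popular account (toy data)
followers = {
    "@chef_account":   {"u1", "u2", "u3"},
    "@travel_account": {"u2", "u4"},
}
# Sparse self-reported interests parsed from follower biographies (toy data).
bio_interests = {"u1": {"cooking"}, "u2": {"cooking", "travel"}, "u4": {"travel"}}

# One propagation step: push follower interests onto the accounts they follow,
# then rank topics per account.
topic_scores = defaultdict(Counter)
for account, fans in followers.items():
    for fan in fans:
        for topic in bio_interests.get(fan, ()):
            topic_scores[account][topic] += 1

for account, scores in topic_scores.items():
    total = sum(scores.values())
    ranked = [(topic, round(count / total, 2)) for topic, count in scores.most_common()]
    print(account, ranked)
```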
【Paper Link】 【Pages】:1215-1224
【Authors】: Milad Shokouhi ; Umut Ozertem ; Nick Craswell
【Abstract】: Web search via voice is becoming increasingly popular, taking advantage of recent advances in automatic speech recognition. Speech recognition systems are trained using audio transcripts, which can be generated by a paid annotator listening to some audio and manually transcribing it. This paper considers an alternative source of training data for speech recognition, called implicit transcription. This is based on Web search clicks and reformulations, which can be interpreted as validating or correcting the recognition done during a real Web search. This can give a large amount of free training data that matches the exact characteristics of real incoming voice searches and the implicit transcriptions can better reflect the needs of real users because they come from the user who generated the audio. On an overall basis we demonstrate that the new training data has value in improving speech recognition. We further show that the in-context feedback from real users can allow the speech recognizer to exploit contextual signals, and reduce the recognition error rate further by up to 23%.
【Keywords】: personalized search; speech recognition; speech retrieval
【Paper Link】 【Pages】:1225-1235
【Authors】: Sandro Bauer ; Filip Radlinski ; Ryen W. White
【Abstract】: People commonly need to purchase things in person, from large garden supplies to home decor. Although modern search systems are very effective at finding online products, little research attention has been paid to helping users find places that sell a specific product offline. For instance, users searching for an apron are not typically directed to a nearby kitchen store by a standard search engine. In this paper, we investigate "where can I buy"-style queries related to in-person purchases of products and services. Answering these queries is challenging since little is known about the range of products sold in many stores, especially those which are smaller in size. To better understand this class of queries, we first present an in-depth analysis of typical offline purchase needs as observed by a major search engine, producing an ontology of such needs. We then propose ranking features for this new problem, and learn a ranking function that returns stores most likely to sell a queried item or service, even if there is very little information available online about some of the stores. Our final contribution is a new evaluation framework that combines distance with store relevance in measuring the effectiveness of such a search system. We evaluate our method using this approach and show that it outperforms a modern web search engine.
【Keywords】: local search; location search; map search; offline purchases; offline retail locations; shopping
【Paper Link】 【Pages】:1237-1247
【Authors】: Roi Blanco ; Matteo Catena ; Nicola Tonellotto
【Abstract】: Carbon dioxide emissions resulting from the combustion of fossil fuels (brown energy) are the main cause of global warming due to the greenhouse effect. Large IT companies have recently increased their efforts in reducing the carbon dioxide footprint originating from their data center electricity consumption. On one hand, better infrastructure and modern hardware allow for a more efficient usage of electric resources. On the other hand, data centers can be powered by renewable sources (green energy) that are both environmentally friendly and economically convenient. In this paper, we tackle the problem of targeting the usage of green energy to minimize the expenditure of running multi-center Web search engines, i.e., systems composed of multiple, geographically remote, computing facilities. We propose a mathematical model to minimize the operational costs of multi-center Web search engines by exploiting renewable energies whenever available at different locations. Using this model, we design an algorithm which decides what fraction of the incoming query load arriving at one processing facility must be forwarded to be processed at different sites to use green energy sources. We experiment using real traffic from a large search engine and compare our model against state-of-the-art baselines for query forwarding. Our experimental results show that the proposed solution maintains a high query throughput, while reducing the energy operational costs of multi-center search engines by up to ~25%. Additionally, our algorithm can reduce the brown energy consumption by almost 6% when energy-proportional servers are employed.
【Keywords】: green energy; query forwarding; web search engines
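【Illustrative Sketch】: A greedy, purely hypothetical caricature of the query-forwarding decision described above (the paper formulates it as a mathematical optimization model): each site serves what it can on local green energy, forwards part of its excess load to sites with spare green capacity subject to a crude latency guard, and pays the brown-energy price for whatever remains. Site names, loads, capacities, and prices are invented.
```python
# Each site: incoming query load, green capacity, brown cost per query (invented).
sites = {
    "dublin":   {"load": 120, "green": 80,  "brown_cost": 0.020},
    "helsinki": {"load": 40,  "green": 90,  "brown_cost": 0.018},
    "texas":    {"load": 100, "green": 30,  "brown_cost": 0.015},
}
MAX_FORWARD_FRACTION = 0.5   # latency guard: forward at most half of a site's load

spare = {s: max(0, v["green"] - v["load"]) for s, v in sites.items()}
brown_cost = 0.0
for name, v in sites.items():
    excess = max(0, v["load"] - v["green"])
    forwardable = min(excess, MAX_FORWARD_FRACTION * v["load"])
    # forward to the sites with the most spare green capacity first
    for other in sorted(spare, key=spare.get, reverse=True):
        if other == name or forwardable <= 0:
            continue
        moved = min(forwardable, spare[other])
        spare[other] -= moved
        forwardable -= moved
        excess -= moved
    brown_cost += excess * v["brown_cost"]   # remaining excess stays on brown energy
print("brown-energy cost after forwarding:", round(brown_cost, 3))
```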
【Paper Link】 【Pages】:1249-1259
【Authors】: Giridhari Venkatadri ; Oana Goga ; Changtao Zhong ; Bimal Viswanath ; Krishna P. Gummadi ; Nishanth R. Sastry
【Abstract】: On most current websites untrustworthy or spammy identities are easily created. Existing proposals to detect untrustworthy identities rely on reputation signals obtained by observing the activities of identities over time within a single site or domain; thus, there is a time lag before which websites cannot easily distinguish attackers and legitimate users. In this paper, we investigate the feasibility of leveraging information about identities that is aggregated across multiple domains to reason about their trustworthiness. Our key insight is that while honest users naturally maintain identities across multiple domains (where they have proven their trustworthiness and have acquired reputation over time), attackers are discouraged by the additional effort and costs to do the same. We propose a flexible framework to transfer trust between domains that can be implemented in today's systems without significant loss of privacy or significant implementation overheads. We demonstrate the potential for inter-domain trust assessment using extensive data collected from Pinterest, Facebook, and Twitter. Our results show that newer domains such as Pinterest can benefit by transferring trust from more established domains such as Facebook and Twitter by being able to declare more users as likely to be trustworthy much earlier on (approx. one year earlier).
【Keywords】: online identities; online social networks; reputation; security; sybil attacks; trust
【Paper Link】 【Pages】:1261-1270
【Authors】: Yang Li ; Shulong Tan ; Huan Sun ; Jiawei Han ; Dan Roth ; Xifeng Yan
【Abstract】: Named Entity Disambiguation is the task of disambiguating named entity mentions in natural language text and linking them to their corresponding entries in a reference knowledge base (e.g., Wikipedia). Such disambiguation can help add semantics to plain text and distinguish homonymous entities. Previous research has tackled this problem by making use of two types of context-aware features derived from the reference knowledge base, namely, the context similarity and the semantic relatedness. Both features heavily rely on the cross-document hyperlinks within the knowledge base: the semantic relatedness feature is directly measured via those hyperlinks, while the context similarity feature implicitly makes use of those hyperlinks to expand entity candidates' descriptions and then compares them against the query context. Unfortunately, cross-document hyperlinks are rarely available in many closed-domain knowledge bases, and it is very expensive to manually add such links. Therefore few algorithms can work well on linkless knowledge bases. In this work, we propose the challenging problem of Named Entity Disambiguation with Linkless Knowledge Bases (LNED) and tackle it by leveraging the useful disambiguation evidences scattered across the reference knowledge base. We propose a generative model to automatically mine such evidences out of noisy information. The mined evidences can mimic the role of the missing links and help boost the LNED performance. Experimental results show that our proposed method substantially improves the disambiguation accuracy over the baseline approaches.
【Keywords】: entity disambiguation; evidence mining; generative model; linkless knowledge bases
【Paper Link】 【Pages】:1271-1281
【Authors】: Zongcheng Ji ; Aixin Sun ; Gao Cong ; Jialong Han
【Abstract】: Many users casually reveal their locations such as restaurants, landmarks, and shops in their tweets. Recognizing such fine-grained locations from tweets and then linking the location mentions to well-defined location profiles (e.g., with formal name, detailed address, and geo-coordinates etc.) offer a tremendous opportunity for many applications. Different from existing solutions which perform location recognition and linking as two sub-tasks sequentially in a pipeline setting, in this paper, we propose a novel joint framework to perform location recognition and location linking simultaneously in a joint search space. We formulate this end-to-end location linking problem as a structured prediction problem and propose a beam-search based algorithm. Based on the concept of multi-view learning, we further enable the algorithm to learn from unlabeled data to alleviate the dearth of labeled data. Extensive experiments are conducted to recognize locations mentioned in tweets and link them to location profiles in Foursquare. Experimental results show that the proposed joint learning algorithm outperforms the state-of-the-art solutions, and learning from unlabeled data improves both the recognition and linking accuracy.
【Keywords】: POI; beam search; location linking; location recognition; multi-view learning; structured prediction; tweet; twitter
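【Illustrative Sketch】: A bare-bones beam search over joint recognize-and-link decisions for a single toy tweet, in the spirit of the structured-prediction formulation above. The candidate spans, the linked profile, and all scores are hard-coded and invented; in the paper these scores are learned (and improved further with multi-view learning from unlabeled data).
```python
# Toy tweet and hard-coded candidate decisions (all invented for illustration).
tweet = ["dinner", "at", "times", "square", "tonight"]

def candidate_actions(position):
    """Return (tokens consumed, linked location profile or None, score)."""
    if position == 2:                                    # span "times square"
        return [(2, "Times Square, New York (location profile)", 2.0), (1, None, 0.0)]
    return [(1, None, 0.0)]

def beam_search(n_tokens, beam_width=3):
    beams = [(0.0, 0, [])]                               # (score, position, decisions)
    while any(pos < n_tokens for _, pos, _ in beams):
        expanded = []
        for score, pos, decisions in beams:
            if pos >= n_tokens:                          # already a complete hypothesis
                expanded.append((score, pos, decisions))
                continue
            for length, profile, s in candidate_actions(pos):
                expanded.append((score + s, pos + length,
                                 decisions + [(pos, length, profile)]))
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams[0]

best_score, _, best_decisions = beam_search(len(tweet))
print(best_score, [d for d in best_decisions if d[2] is not None])
```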
【Paper Link】 【Pages】:1283-1292
【Authors】: Isabel Mette Kloumann ; Chenhao Tan ; Jon M. Kleinberg ; Lillian Lee
【Abstract】: Despite the existence of highly successful Internet collaborations on complex projects, including open-source software, little is known about how Internet collaborations work for solving "extremely" difficult problems, such as open-ended research questions. We quantitatively investigate a series of efforts known as the Polymath projects, which tackle mathematical research problems through open online discussion. A key analytical insight is that we can contrast the polymath projects with mini-polymaths -- spinoffs that were conducted in the same manner as the polymaths but aimed at addressing math Olympiad questions, which, while quite difficult, are known to be feasible. Our comparative analysis shifts between three elements of the projects: the roles and relationships of the authors, the temporal dynamics of how the projects evolved, and the linguistic properties of the discussions themselves. We find interesting differences between the two domains through each of these analyses, and present these analyses as a template to facilitate comparison between Polymath and other domains for collaboration and communication. We also develop models that have strong performance in distinguishing research-level comments based on any of our groups of features. Finally, we examine whether comments representing research breakthroughs can be recognized more effectively based on their intrinsic features, or by the (re-)actions of others, and find good predictive power in linguistic features.
【Keywords】: collaboration; highlights; polymath
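To make the comment-classification task concrete, here is a minimal sketch of the kind of model the abstract alludes to: a text classifier separating research-level (Polymath) comments from mini-polymath comments. The tiny corpus is invented, and the paper's actual feature groups (author roles, temporal signals, richer linguistic features) are much broader than plain bag-of-words.

```python
# Hypothetical comment classifier; toy data, simple TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "perhaps we can reduce the density increment argument to a lemma",
    "this bound seems too weak, maybe a probabilistic construction helps",
    "the answer to part (a) is 2011, just check the small cases",
    "try induction on n, the olympiad problem falls out quickly",
]
labels = [1, 1, 0, 0]  # 1 = research-level Polymath, 0 = mini-polymath

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(comments, labels)
print(clf.predict(["maybe a density argument gives a weaker but general bound"]))
```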
【Paper Link】 【Pages】:1293-1303
【Authors】: Ming Yin ; Mary L. Gray ; Siddharth Suri ; Jennifer Wortman Vaughan
【Abstract】: Since its inception, crowdsourcing has been considered a black-box approach to solicit labor from a crowd of workers. Furthermore, the "crowd" has been viewed as a group of independent workers dispersed all over the world. Recent studies based on in-person interviews have opened up the black box and shown that the crowd is not a collection of independent workers, but instead that workers communicate and collaborate with each other. Put another way, prior work has shown the existence of edges between workers. We build on and extend this discovery by mapping the entire communication network of workers on Amazon Mechanical Turk, a leading crowdsourcing platform. We execute a task in which over 10,000 workers from across the globe self-report their communication links to other workers, thereby mapping the communication network among workers. Our results suggest that while a large percentage of workers indeed appear to be independent, there is a rich network topology over the rest of the population. That is, there is a substantial communication network within the crowd. We further examine how online forum usage relates to network topology, how workers communicate with each other via this network, how workers' experience levels relate to their network positions, and how U.S. workers differ from international workers in their network characteristics. We conclude by discussing the implications of our findings for requesters, workers, and platform providers like Amazon.
【Keywords】: crowdsourcing; mechanical turk; networks; online forums
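The network mapping described above can be summarized with a short sketch, assuming the self-reported links have already been collected as worker-ID pairs (the IDs below are invented). It reproduces the style of summary the study reports: how many workers appear independent and how large the connected components of the communication network are.

```python
# Build the self-reported communication network and summarize its topology.
import networkx as nx

workers = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
reported_links = [("w1", "w2"), ("w2", "w3"), ("w5", "w6")]

G = nx.Graph()
G.add_nodes_from(workers)          # include workers who reported no links
G.add_edges_from(reported_links)

isolated = [n for n in G.nodes if G.degree(n) == 0]
components = sorted((len(c) for c in nx.connected_components(G)), reverse=True)

print(f"{len(isolated)}/{len(workers)} workers appear independent")
print("component sizes:", components)
```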
【Paper Link】 【Pages】:1305-1315
【Authors】: Javad Nejati ; Aruna Balasubramanian
【Abstract】: Mobile page load times are an order of magnitude slower than those of non-mobile pages. It is not clear what causes the poor performance: the slower network, the slower computational speeds, or other reasons. Further, most Web optimizations are designed for non-mobile browsers and do not translate well to the mobile browser. Towards understanding mobile Web page load times, in this paper we: (1) perform an in-depth pairwise comparison of loading a page on a mobile versus a non-mobile browser, and (2) characterize the bottlenecks in the mobile browser vis-à-vis non-mobile browsers. To this end, we build a testbed that allows us to directly compare the low-level page load activities and bottlenecks when loading a page on a mobile versus a non-mobile browser. We find that computation is the main bottleneck when loading a page on mobile browsers. This is in contrast to non-mobile browsers, where networking is the main bottleneck. We also find that the composition of the critical path during page load differs between the mobile and the non-mobile browser. A key takeaway of our work is that we need to fundamentally rethink optimizations for mobile browsers.
【Keywords】: bottleneck; browser; computation; controlled environment; dependency; mobile; networking; performance; web
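The critical-path notion used in this comparison can be illustrated with a small sketch: given a DAG of page-load activities with durations (the activities and numbers below are invented, not measurements from the paper), the critical path is the dependency chain with the largest total duration, and the page load time is the finish time of the last activity.

```python
# Compute page load time and critical path over a toy dependency DAG.
from functools import lru_cache

durations = {          # activity -> duration in ms (hypothetical)
    "html_parse": 40, "css_fetch": 120, "css_eval": 30,
    "js_fetch": 150, "js_eval": 200, "layout": 80, "paint": 20,
}
deps = {               # activity -> activities it must wait for
    "css_eval": ["css_fetch"], "js_eval": ["js_fetch", "html_parse"],
    "layout": ["html_parse", "css_eval", "js_eval"], "paint": ["layout"],
}

@lru_cache(maxsize=None)
def finish_time(activity):
    start = max((finish_time(d) for d in deps.get(activity, [])), default=0)
    return start + durations[activity]

def critical_path(activity):
    preds = deps.get(activity, [])
    if not preds:
        return [activity]
    return critical_path(max(preds, key=finish_time)) + [activity]

print("page load time:", finish_time("paint"), "ms")
print("critical path:", " -> ".join(critical_path("paint")))
```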
【Paper Link】 【Pages】:1317-1327
【Authors】: Mahanth Gowda ; Ashutosh Dhekne ; Romit Roy Choudhury
【Abstract】: This paper explores the possibility of injecting mobility into wireless network infrastructure. We envision WiFi access points on wheels that move to optimize user performance. Movements need not be all around the floor, nor do the devices have to operate on batteries. As a first step, WiFi APs at home could remain tethered to power and Ethernet outlets while moving within small areas (perhaps under the couch). If such systems prove successful, perhaps future buildings and cities could offer explicit support for network infrastructure mobility. This paper begins with a higher-level discussion of robotic wireless networks -- the opportunities and the hurdles -- and then pivots by developing a smaller slice of the vision through a system called iMob. With iMob, a WiFi AP is mounted on a Roomba robot and made to periodically move within a 2x2 sqft region. The core research questions pertain to finding the best location to move to, such that the SNRs from its clients are strong and the interference from other APs is weak. Our measurements show that the richness of wireless multipath offers significant opportunities -- even within a 2x2 sqft region, locations exist that are 1.7x better than the average location in terms of throughput. When multiple APs in a neighborhood coordinate, the gains can be even higher. In sum, although infrastructure mobility has been discussed in the context of Google Balloons, ad hoc networks, and delay-tolerant networks, we believe that the possibility of moving our personal devices in homes and offices is relatively unexplored, and could open doors to new kinds of innovation.
【Keywords】: infrastructure; measurement; robotic networks; wireless
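At its core, the iMob decision described above is a search over candidate positions inside the small movement region. The sketch below is only a toy illustration of that loop: measure_throughput() is a hypothetical stand-in for real probing, and the grid and client names are invented.

```python
# Probe candidate positions and park the AP where client throughput is best.
import random

def measure_throughput(position, clients):
    """Hypothetical probe: average client throughput (Mbps) at this position."""
    random.seed(hash((position, tuple(clients))) % 2**32)  # repeatable toy numbers
    return sum(random.uniform(20, 90) for _ in clients) / len(clients)

def best_position(grid_positions, clients):
    scored = {pos: measure_throughput(pos, clients) for pos in grid_positions}
    best = max(scored, key=scored.get)
    return best, scored[best]

# 4x4 probe grid inside a 2x2 sqft region (6-inch spacing), two clients.
grid = [(x * 0.5, y * 0.5) for x in range(4) for y in range(4)]
pos, tput = best_position(grid, clients=["laptop", "phone"])
print(f"move AP to {pos}, expected avg throughput ~{tput:.1f} Mbps")
```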
【Paper Link】 【Pages】:1329-1338
【Authors】: Yao Liu ; Mengbai Xiao ; Ming Zhang ; Xin Li ; Mian Dong ; Zhan Ma ; Zhenhua Li ; Songqing Chen
【Abstract】: During Internet streaming, a significant portion of the battery power on mobile devices is consumed by the display panel. To reduce display power consumption, backlight scaling, a scheme that intelligently dims the backlight, has been proposed. To maintain perceived video appearance under backlight scaling, a computationally intensive luminance compensation process is required. However, this step, if performed by the CPU as existing schemes suggest, could easily offset the power savings gained from backlight scaling. Furthermore, computing the optimal backlight scaling values requires per-frame luminance information, which is typically too energy-intensive for mobile devices to compute. Thus, existing schemes require such information to be available in advance, and this offline approach makes them impractical. To address these challenges, in this paper we design and implement GoCAD, a GPU-assisted Online Content-Adaptive Display power saving scheme for mobile devices in Internet streaming sessions. In GoCAD, we employ the mobile device's GPU rather than the CPU to reduce power consumption during the luminance compensation phase. Furthermore, we compute the optimal backlight scaling values for small batches of video frames in an online fashion using a dynamic programming algorithm. Lastly, we make novel use of the widely available video storyboard, a pre-computed set of thumbnails associated with a video, to intelligently decide whether or not to apply our backlight scaling scheme for a given video. For example, when the GPU power consumption would offset the savings from dimming the backlight, no backlight scaling is conducted. To evaluate the performance of GoCAD, we implement a prototype within an Android application and use a Monsoon power monitor to measure the real power consumption. Experiments are conducted on more than 460 randomly selected YouTube videos. Results show that GoCAD can effectively produce power savings without affecting rendered video quality.
【Keywords】: backlight scaling; gpu; internet mobile streaming; liquid-crystal display (lcd); power consumption; storyboard
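A dynamic-programming sketch in the spirit of the per-batch backlight selection mentioned above: choose one backlight level per frame that minimizes total backlight power, subject to a per-frame minimum level (so luminance compensation stays acceptable) and a flicker constraint on level changes between consecutive frames. The cost model and all numbers are hypothetical, not GoCAD's actual model.

```python
LEVELS = list(range(30, 101, 10))        # candidate backlight levels (%)
MAX_STEP = 20                            # max level change between frames

def plan_backlight(min_levels):
    INF = float("inf")
    # cost[i][L] = minimal total power for frames 0..i ending at level L
    cost = [{L: (L if L >= min_levels[0] else INF) for L in LEVELS}]
    parent = [{L: None for L in LEVELS}]
    for i in range(1, len(min_levels)):
        cur, par = {}, {}
        for L in LEVELS:
            if L < min_levels[i]:
                cur[L], par[L] = INF, None
                continue
            feas = [P for P in LEVELS
                    if abs(P - L) <= MAX_STEP and cost[i - 1][P] < INF]
            best = min(feas, key=lambda P: cost[i - 1][P], default=None)
            cur[L] = cost[i - 1][best] + L if best is not None else INF
            par[L] = best
        cost.append(cur)
        parent.append(par)
    # backtrack from the cheapest final level
    L = min(LEVELS, key=lambda x: cost[-1][x])
    plan = [L]
    for i in range(len(min_levels) - 1, 0, -1):
        L = parent[i][L]
        plan.append(L)
    return plan[::-1]

print(plan_backlight([40, 40, 70, 90, 60]))   # per-frame minimum levels (toy)
```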
【Paper Link】 【Pages】:1339-1349
【Authors】: Le Chen ; Alan Mislove ; Christo Wilson
【Abstract】: The rise of e-commerce has unlocked practical applications for algorithmic pricing (also called dynamic pricing algorithms), where sellers set prices using computer algorithms. Travel websites and large, well-known e-retailers have already adopted algorithmic pricing strategies, but the tools and techniques are now available to small-scale sellers as well. While algorithmic pricing can make merchants more competitive, it also creates new challenges. Examples have emerged of competing pieces of algorithmic pricing software interacting in unexpected ways and producing unpredictable prices, as well as of algorithms intentionally designed to implement price fixing. Unfortunately, the public currently lacks comprehensive knowledge about the prevalence and behavior of algorithmic pricing in the wild. In this study, we develop a methodology for detecting algorithmic pricing, and use it to empirically analyze its prevalence and behavior on Amazon Marketplace. We gather four months of data covering all merchants selling any of 1,641 best-seller products. Using this dataset, we uncover the algorithmic pricing strategies adopted by over 500 sellers. We explore the characteristics of these sellers and characterize the impact of their strategies on the dynamics of the marketplace.
【Keywords】: algorithmic pricing; dynamic pricing algorithms; e-commerce
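One intuition behind detection, sketched below under invented data: a seller whose price tracks a marketplace target (here, the lowest competing price) with a near-constant offset and very high correlation is a likely algorithmic pricer. This is a simplification of the paper's methodology, offered only to make the signal concrete.

```python
# Flag a seller whose price series closely tracks a target series.
from statistics import correlation, mean   # statistics.correlation needs Python 3.10+

lowest_price = [19.99, 18.50, 21.00, 20.25, 17.80, 22.40, 19.10]
seller_price = [20.99, 19.50, 22.00, 21.25, 18.80, 23.40, 20.10]  # lowest + $1

def looks_algorithmic(seller, target, min_corr=0.95):
    corr = correlation(seller, target)
    offset = mean(s - t for s, t in zip(seller, target))
    return corr >= min_corr, corr, offset

flag, corr, offset = looks_algorithmic(seller_price, lowest_price)
print(f"algorithmic? {flag}  corr={corr:.3f}  avg offset=${offset:.2f}")
```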
【Paper Link】 【Pages】:1351-1362
【Authors】: Huoran Li ; Wei Ai ; Xuanzhe Liu ; Jian Tang ; Gang Huang ; Feng Feng ; Qiaozhu Mei
【Abstract】: Smartphone users have adopted an explosive number of mobile applications (a.k.a. apps) in recent years. App marketplaces for the iOS, Android, and Windows Phone platforms host millions of apps, which have been downloaded more than 100 billion times. Investigating how people manage mobile apps in their everyday lives creates a unique opportunity to understand the behavior and preferences of mobile users, to infer the quality of apps, and to improve the user experience. The existing literature provides very limited knowledge about app management activities, due to the lack of user behavioral data at scale. This paper takes the initiative to analyze a very large app management log collected through a leading Android app marketplace. The dataset covers five months of detailed downloading, updating, and uninstallation activities, involving 17 million anonymized users and one million apps. We present a surprising finding that the metrics commonly used by app stores to rank apps do not truly reflect users' real attitudes towards the apps. We then identify useful patterns from the app management activities that much more accurately predict user preference for an app, even when no user rating is available.
【Keywords】: app management activities; behavior analysis; mobile apps
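A small sketch of the kind of signal such a management log supports: the fraction of users who uninstall an app after downloading it, which can be a better proxy for real user attitude than raw download counts. The log records below are invented, and this metric is only an illustration, not necessarily one of the paper's exact features.

```python
# Aggregate a toy management log into per-app download counts and uninstall rates.
from collections import defaultdict

log = [  # (user, app, action) tuples from a management log
    ("u1", "appA", "download"), ("u2", "appA", "download"),
    ("u3", "appA", "download"), ("u1", "appA", "uninstall"),
    ("u2", "appA", "uninstall"), ("u1", "appB", "download"),
    ("u2", "appB", "download"), ("u2", "appB", "update"),
]

downloads, uninstalls = defaultdict(set), defaultdict(set)
for user, app, action in log:
    if action == "download":
        downloads[app].add(user)
    elif action == "uninstall":
        uninstalls[app].add(user)

for app in downloads:
    rate = len(uninstalls[app] & downloads[app]) / len(downloads[app])
    print(f"{app}: {len(downloads[app])} downloads, uninstall rate {rate:.0%}")
```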
【Paper Link】 【Pages】:1363-1372
【Authors】: Hao Ma ; Xueqing Liu ; Zhihong Shen
【Abstract】: Many aspects and properties of recommender systems have been well studied in the past decade; however, the impact of user fatigue has been mostly ignored in the literature. User fatigue refers to the phenomenon in which a user quickly loses interest in a recommended item if the same item has been presented to that user multiple times before. The direct impact of user fatigue is a dramatic decrease in the click-through rate (CTR, i.e., the ratio of clicks to impressions). In this paper, we present a comprehensive study of user fatigue in online recommender systems. By analyzing user behavioral logs from Bing Now news recommendation, we find that user fatigue is a severe problem that greatly affects the user experience. We also notice that different users engage differently with repeated recommendations. Depending on users' previous interactions with repeated recommendations, we illustrate that under certain conditions previously seen items should be demoted, while at other times they should be promoted. We demonstrate how statistics from the analysis of user fatigue can be incorporated into ranking algorithms for personalized recommendations. Our experimental results indicate that significant gains can be achieved by introducing features that reflect users' interaction with previously seen recommendations (up to a 15% enhancement on all users and a 34% improvement on heavy users).
【Keywords】: click prediction; news recommendation; recommender systems; user fatigue; user modeling
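A toy sketch of fatigue-aware re-ranking in the spirit of the abstract: demote items a user has already seen several times without clicking, and promote items the user has clicked before. The decay constants and base relevance scores are hypothetical, not the paper's learned features.

```python
# Adjust ranking scores with simple fatigue (demotion) and engagement (promotion) terms.
def fatigue_adjusted_score(base_score, impressions_without_click, clicked_before,
                           demotion=0.15, promotion=0.10):
    score = base_score * (1.0 - demotion) ** impressions_without_click
    if clicked_before:
        score *= (1.0 + promotion)
    return score

candidates = {  # item -> (base relevance, prior impressions w/o click, clicked?)
    "news_1": (0.80, 3, False),
    "news_2": (0.72, 0, False),
    "news_3": (0.70, 1, True),
}
ranked = sorted(candidates,
                key=lambda i: fatigue_adjusted_score(*candidates[i]),
                reverse=True)
print(ranked)   # news_1 drops below fresher items despite its higher base score
```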
【Paper Link】 【Pages】:1373-1384
【Authors】: Chenwei Zhang ; Wei Fan ; Nan Du ; Philip S. Yu
【Abstract】: Text queries are naturally encoded with user intentions. An intention detection task tries to model and discover the intentions that users encode in text queries. Unlike conventional text classification tasks, where the label of a text is highly correlated with some topic-specific words, words from different topic categories tend to co-occur in medical-related queries. Besides the existence of topic-specific words and word order, word correlations and the way words are organized into sentences are crucial to intention detection tasks. In this paper, we present a neural-network-based joint modeling approach to model and capture user intentions in medical-related text queries. Regardless of the exact words in text queries, the proposed method incorporates two types of heterogeneous information: 1) pairwise word feature correlations and 2) part-of-speech tags of a sentence to jointly model user intentions. Variable-length text queries are first handled inherently by a fixed-size pairwise feature correlation matrix. Moreover, convolution and pooling operations are applied to the feature correlations to fully exploit latent semantic structure within the query. Sentence rephrasing is finally introduced as a data augmentation technique to improve model generalization ability during model training. Experimental results on real-world medical queries show that the proposed method is able to extract complete and precise user intentions from text queries.
【Keywords】: intention detection; jointly modeling; neural network; text query
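The fixed-size pairwise correlation matrix mentioned above can be sketched as follows: take the first k tokens of a query, look up word embeddings, and fill a k x k matrix of cosine similarities that a CNN could then convolve and pool over. The embeddings below are random placeholders rather than trained vectors, and the construction details are an assumption, not the paper's exact recipe.

```python
# Build a fixed-size pairwise word-correlation matrix from (toy) embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in
         "what medicine should i take for a sore throat".split()}

def correlation_matrix(query, k=8, dim=50):
    tokens = query.lower().split()[:k]
    vecs = np.zeros((k, dim))
    for i, tok in enumerate(tokens):
        if tok in vocab:
            vecs[i] = vocab[tok]
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero for padding rows
    unit = vecs / norms
    return unit @ unit.T               # k x k cosine-similarity matrix

M = correlation_matrix("what medicine should I take for sore throat")
print(M.shape)                          # (8, 8), fixed regardless of query length
```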
【Paper Link】 【Pages】:1385-1394
【Authors】: Giovanni Quattrone ; Davide Proserpio ; Daniele Quercia ; Licia Capra ; Mirco Musolesi
【Abstract】: Sharing economy platforms have become extremely popular in the last few years, and they have changed the way in which we commute, travel, and borrow, among many other activities. Despite their popularity among consumers, such companies are poorly regulated. For example, Airbnb, one of the most successful examples of a sharing economy platform, is often criticized by regulators and policy makers. While, in theory, municipalities should regulate the emergence of Airbnb through evidence-based policy making, in practice they engage in a false dichotomy: some municipalities allow the business without imposing any regulation, while others ban it altogether. That is because there is no evidence upon which to draft policies. Here we propose to gather evidence from the Web. After crawling Airbnb data for the entire city of London, we find out where and when Airbnb listings are offered and, by matching such listing information with census and hotel data, we determine the socio-economic conditions of the areas that actually benefit from the hospitality platform. The reality is more nuanced than one would expect, and it has changed over the years. Airbnb demand and offerings have changed over time, and traditional regulations have not been able to respond to those changes. That is why, finally, we rely on our data analysis to envision regulations that are responsive to real-time demands, contributing to the emerging idea of "algorithmic regulation".
【Keywords】: sharing economy; regulation; policy
【Paper Link】 【Pages】:1395-1406
【Authors】: Rui Yan ; Cheng-Te Li ; Hsun-Ping Hsieh ; Po Hu ; Xiaohua Hu ; Tingting He
【Abstract】: In recent years, online social networks have been among the most popular websites worldwide, attracting high page views (PV), as they have renewed the way information is discovered and distributed. Millions of users have registered on these websites and generate a formidable amount of user-generated content every day. Social networks have become "giants", seemingly able to support almost any research task. However, these giants still suffer from their "Achilles' heel": extreme sparsity. Compared with the extremely large data over the whole collection, individual postings such as microblogs are too sparse to make a difference under various research scenarios, even though these postings do differ from one another. In this paper we propose to tackle the Achilles' heel of social networks by smoothing the language model via influence propagation. Extending our previously proposed work on the sparsity issue, we augment socialized language model smoothing with bi-directional influence learned from propagation. Intuitively, it is insufficient to treat the influence propagated between an information source and its target as undirected. Hence, we formulate a bi-directional socialized factor graph model, which utilizes both the textual correlations between document pairs and the socialized augmentation networks behind the documents, such as user relationships and social interactions. These factors are modeled as attributes and dependencies among documents and their corresponding users, and are then distinguished at the direction level. We propose an effective algorithm to learn the proposed factor graph model with directions. Finally, we propagate term counts to smooth documents based on the estimated influence. We run experiments on two datasets, from Twitter and Weibo. The results validate the effectiveness of the proposed model. By incorporating direction information into socialized language model smoothing, our approach improves over several alternative methods on both intrinsic and extrinsic evaluations, measured in terms of perplexity, nDCG, and MAP.
【Keywords】: bi-directional influence propagation; language model smoothing; social networks
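A small sketch of the smoothing idea itself: augment a short document's term counts with counts propagated from socially related documents, weighted by an influence score, before estimating its language model. The influence weights below are hand-set placeholders; in the paper they would come from the learned directed factor graph.

```python
# Socialized count propagation followed by a simple smoothed language model.
from collections import Counter

doc = Counter("great talk on graph models".split())
neighbors = [
    (0.6, Counter("factor graph models for social influence".split())),
    (0.2, Counter("lunch was great today".split())),
]

def smoothed_lm(doc_counts, neighbor_counts, mu=2.0):
    smoothed = Counter(doc_counts)
    for weight, counts in neighbor_counts:
        for term, c in counts.items():
            smoothed[term] += weight * c            # propagate weighted counts
    total = sum(smoothed.values())
    # estimate probabilities over the augmented counts, with a small floor
    return {t: (c + mu / len(smoothed)) / (total + mu) for t, c in smoothed.items()}

lm = smoothed_lm(doc, neighbors)
print(sorted(lm.items(), key=lambda kv: -kv[1])[:5])
```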
【Paper Link】 【Pages】:1407-1418
【Authors】: Yu Sun ; Nicholas Jing Yuan ; Xing Xie ; Kieran McDonald ; Rui Zhang
【Abstract】: Mobile digital assistants such as Microsoft Cortana and Google Now currently offer appealing proactive experiences to users, which aim to deliver the right information at the right time. To achieve this goal, it is crucial to precisely predict users' real-time intent. Intent is closely related to context, which includes not only spatial-temporal information but also users' current activities that can be sensed by mobile devices. The relationship between intent and context is highly dynamic and exhibits chaotic sequential correlation. The context itself is often sparse and heterogeneous. The dynamics and co-movement among contextual signals are also elusive and complicated. Traditional recommendation models cannot be applied directly to proactive experiences because they fail to tackle the above challenges. Inspired by the nowcasting practice in meteorology and macroeconomics, we propose an innovative collaborative nowcasting model to effectively resolve these challenges. The proposed model successfully addresses the sparsity and heterogeneity of contextual signals. It also effectively models the convoluted correlation within contextual signals and between context and intent. Specifically, the model first extracts collaborative latent factors, which summarize shared temporal structural patterns in contextual signals, and then exploits the collaborative Kalman filter to generate serially correlated personalized latent factors, which are utilized to monitor each user's real-time intent. Extensive experiments with real-world datasets from a commercial digital assistant demonstrate the effectiveness of the collaborative nowcasting model. The studied problem and model provide inspiring implications for new paradigms of recommendation on mobile intelligent devices.
【Keywords】: intent monitoring; nowcasting; recommendation
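To ground the Kalman-filter component, here is a minimal scalar Kalman filter tracking a single latent factor from noisy observations. The transition and noise parameters are invented, and the paper's collaborative Kalman filter additionally shares structure across users; this is only the textbook building block.

```python
# One-dimensional Kalman filter: predict, then update with a noisy observation.
def kalman_step(x, P, z, a=0.9, q=0.05, h=1.0, r=0.3):
    """One predict + update step for a scalar latent state x with variance P."""
    x_pred = a * x                              # predict state
    P_pred = a * a * P + q                      # predict variance
    K = P_pred * h / (h * h * P_pred + r)       # Kalman gain
    x_new = x_pred + K * (z - h * x_pred)       # correct with observation z
    P_new = (1.0 - K * h) * P_pred
    return x_new, P_new

x, P = 0.0, 1.0                  # initial latent estimate and uncertainty
for z in [0.2, 0.5, 0.9, 0.8]:   # noisy contextual signal summaries (toy)
    x, P = kalman_step(x, P, z)
    print(f"latent estimate {x:.3f} (variance {P:.3f})")
```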
【Paper Link】 【Pages】:1419-1428
【Authors】: Thomas Pellissier Tanon ; Denny Vrandecic ; Sebastian Schaffert ; Thomas Steiner ; Lydia Pintscher
【Abstract】: Collaborative knowledge bases that make their data freely available in a machine-readable form are central for the data strategy of many projects and organizations. The two major collaborative knowledge bases are Wikimedia's Wikidata and Google's Freebase. Due to the success of Wikidata, Google decided in 2014 to offer the content of Freebase to the Wikidata community. In this paper, we report on the ongoing transfer efforts and data mapping challenges, and provide an analysis of the effort so far. We describe the Primary Sources Tool, which aims to facilitate this and future data migrations. Throughout the migration, we have gained deep insights into both Wikidata and Freebase, and share and discuss detailed statistics on both knowledge bases.
【Keywords】: crowdsourcing systems; freebase; semantic web; wikidata
【Paper Link】 【Pages】:1429-1439
【Authors】: Yeye He ; Kaushik Chakrabarti ; Tao Cheng ; Tomasz Tylenda
【Abstract】: Attribute synonyms are important ingredients for keyword-based search systems. For instance, web search engines recognize queries that seek the value of an entity on a specific attribute (referred to as e+a queries) and provide direct answers for them using a combination of knowledge bases, web tables, and documents. However, users often refer to an attribute in their e+a query differently from how it is referred to in the web table or text passage. In such cases, search engines may fail to return relevant answers. To address this problem, we propose to automatically discover all the alternate ways of referring to the attributes of a given class of entities (referred to as attribute synonyms) in order to improve search quality. The state-of-the-art approach, which relies on attribute name co-occurrence in web tables, suffers from low precision. Our main insight is to combine positive evidence of attribute synonymity from query click logs with negative evidence from web table attribute name co-occurrences. We formalize the problem as an optimization problem on a graph, with attribute names as vertices and the positive and negative evidence from query logs and web table schemas as weighted edges. We develop a linear-programming-based algorithm to solve the problem with bi-criteria approximation guarantees. Our experiments on real-life datasets show that our approach achieves significantly higher precision and recall than the state-of-the-art.
【Keywords】: attribute synonyms; clustering; web search
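A toy illustration of the graph formulation: attribute names are vertices, positive weights come from query-log evidence, and negative weights from web-table co-occurrence. The greedy merging below is only a simple stand-in for the paper's LP-based algorithm with bi-criteria guarantees, and all weights are invented.

```python
# Cluster attribute names by greedily merging on positive net evidence.
edges = {  # (attribute_a, attribute_b) -> net evidence (positive = synonyms)
    ("population", "number of people"): 2.5,
    ("population", "pop."): 1.8,
    ("population", "area"): -3.0,        # co-occur in tables -> not synonyms
    ("area", "land area"): 2.1,
}

def greedy_synonym_clusters(edges):
    cluster = {}                              # attribute -> cluster id
    for (a, b), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        ca, cb = cluster.setdefault(a, a), cluster.setdefault(b, b)
        if w > 0 and ca != cb:                # merge only on positive evidence
            for k, v in cluster.items():
                if v == cb:
                    cluster[k] = ca
    groups = {}
    for attr, cid in cluster.items():
        groups.setdefault(cid, []).append(attr)
    return list(groups.values())

print(greedy_synonym_clusters(edges))
```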