Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018. ACM 【DBLP Link】
【Paper Link】 【Pages】:1-11
【Authors】: Rajan Vaish ; Shirish Goyal ; Amin Saberi ; Sharad Goel
【Abstract】: There has been a marked shift towards learning and consuming information through video. Most academic research, however, is still distributed only in text form, as researchers often have limited time, resources, and incentives to create video versions of their work. To address this gap, we propose, deploy, and evaluate a scalable, end-to-end system for crowdsourcing the creation of short, 5-minute research videos based on academic papers. Doing so requires solving complex coordination and collaborative video production problems. To assist coordination, we designed a structured workflow that enables efficient delegation of tasks, while also motivating the crowd through a collaborative learning environment. To facilitate video production, we developed an online tool with which groups can make micro-audio recordings that are automatically stitched together to create a complete talk. We tested this approach with a group of volunteers recruited from 52 countries through an open call. This distributed crowd produced over 100 video talks in 12 languages based on papers from top-tier computer science conferences. The produced talks consistently received high ratings from a diverse group of non-experts and experts, including the authors of the original papers. These results indicate that our crowdsourcing approach is a promising method for producing high-quality research talks at scale, increasing the distribution and accessibility of scientific knowledge.
【Keywords】: crowdsourcing; online creative collaboration; peer video production; scientific knowledge and learning at scale
【Paper Link】 【Pages】:13-22
【Authors】: Chenglin Miao ; Qi Li ; Lu Su ; Mengdi Huai ; Wenjun Jiang ; Jing Gao
【Abstract】: As an effective way to solicit useful information from the crowd, crowdsourcing has emerged as a popular paradigm to solve challenging tasks. However, the data provided by the participating workers are not always trustworthy. In real world, there may exist malicious workers in crowdsourcing systems who conduct the data poisoning attacks for the purpose of sabotage or financial rewards. Although data aggregation methods such as majority voting are conducted on workers» labels in order to improve data quality, they are vulnerable to such attacks as they treat all the workers equally. In order to capture the variety in the reliability of workers, the Dawid-Skene model, a sophisticated data aggregation method, has been widely adopted in practice. By conducting maximum likelihood estimation (MLE) using the expectation maximization (EM) algorithm, the Dawid-Skene model can jointly estimate each worker»s reliability and conduct weighted aggregation, and thus can tolerate the data poisoning attacks to some degree. However, the Dawid-Skene model still has weakness. In this paper, we study the data poisoning attacks against such crowdsourcing systems with the Dawid-Skene model empowered. We design an intelligent attack mechanism, based on which the attacker can not only achieve maximum attack utility but also disguise the attacking behaviors. Extensive experiments based on real-world crowdsourcing datasets are conducted to verify the desirable properties of the proposed mechanism.
【Keywords】: crowdsourcing; data poisoning; expectation maximization
【Paper Link】 【Pages】:23-32
【Authors】: Jie Yang ; Thomas Drake ; Andreas Damianou ; Yoelle Maarek
【Abstract】: This paper presents a generic Bayesian framework that enables any deep learning model to actively learn from targeted crowds. Our framework inherits from recent advances in Bayesian deep learning, and extends existing work by considering the targeted crowdsourcing approach, where multiple annotators with unknown expertise contribute an uncontrolled amount (often limited) of annotations. Our framework leverages the low-rank structure in annotations to learn individual annotator expertise, which then helps to infer the true labels from noisy and sparse annotations. It provides a unified Bayesian model to simultaneously infer the true labels and train the deep learning model in order to reach an optimal learning efficacy. Finally, our framework exploits the uncertainty of the deep learning model during prediction as well as the annotators» estimated expertise to minimize the number of required annotations and annotators for optimally training the deep learning model. We evaluate the effectiveness of our framework for intent classification in Alexa (Amazon»s personal assistant), using both synthetic and real-world datasets. Experiments show that our framework can accurately learn annotator expertise, infer true labels, and effectively reduce the amount of annotations in model training as compared to state-of-the-art approaches. We further discuss the potential of our proposed framework in bridging machine learning and crowdsourcing towards improved human-in-the-loop systems.
【Keywords】: conversational agents; crowdsourcing; deep active learning
【Paper Link】 【Pages】:33-43
【Authors】: Xiao Ma ; Megan Cackett ; Leslie Park ; Eric Chien ; Mor Naaman
【Abstract】: We build on the increasing availability of Virtual Reality (VR) devices and Web technologies to conduct behavioral experiments in VR using crowdsourcing techniques. A new recruiting and validation method allows us to create a panel of eligible experiment participants recruited from Amazon Mechanical Turk. Using this panel, we ran three different crowdsourced VR experiments, each reproducing one of three VR illusions: place illusion, embodiment illusion, and plausibility illusion. Our experience and worker feedback on these experiments show that conducting Web-based VR experiments using crowdsourcing is already feasible, though some challenges---including scale---remain. Such crowdsourced VR experiments on the Web have the potential to finally support replicable VR experiments with diverse populations at a low cost.
【Keywords】: crowdsourcing; experiments; virtual reality
【Paper Link】 【Pages】:45-54
【Authors】: Mário Almeida ; Muhammad Bilal ; Alessandro Finamore ; Ilias Leontiadis ; Yan Grunenberger ; Matteo Varvello ; Jeremy Blackburn
【Abstract】: While developing mobile apps is becoming easier, testing and characterizing their behavior is still hard. On the one hand, the de facto testing tool, called "Monkey," scales well due to being based on random inputs, but fails to gather inputs useful in understanding things like user engagement and attention. On the other hand, gathering inputs and data from real users requires distributing instrumented apps, or even phones with pre-installed apps, an expensive and inherently unscaleable task. To address these limitations we present CHIMP, a system that integrates automated tools and large-scale crowdsourced inputs. CHIMP is different from previous approaches in that it runs apps in a virtualized mobile environment that thousands of users all over the world can access via a standard Web browser. CHIMP is thus able to gather the full range of real-user inputs, detailed run-time traces of apps, and network traffic. We thus describe CHIMP»s design and demonstrate the efficiency of our approach by testing thousands of apps via thousands of crowdsourced users. We calibrate CHIMP with a large-scale campaign to understand how users approach app testing tasks. Finally, we show how CHIMP can be used to improve both traditional app testing tasks, as well as more novel tasks such as building a traffic classifier on encrypted network flows.
【Keywords】: crowdsourcing; mobile; testing; virtualization
【Paper Link】 【Pages】:55-64
【Authors】: Evgeny Krivosheev ; Fabio Casati ; Boualem Benatallah
【Abstract】: Systematic literature reviews (SLRs) are one of the most common and useful form of scientific research and publication. Tens of thousands of SLRs are published each year, and this rate is growing across all fields of science. Performing an accurate, complete and unbiased SLR is however a difficult and expensive endeavor. This is true in general for all phases of a literature review, and in particular for the paper screening phase, where authors filter a set of potentially in-scope papers based on a number of exclusion criteria. To address the problem, in recent years the research community has began to explore the use of the crowd to allow for a faster, accurate, cheaper and unbiased screening of papers. Initial results show that crowdsourcing can be effective, even for relatively complex reviews. In this paper we derive and analyze a set of strategies for crowd-based screening, and show that an adaptive strategy, that continuously re-assesses the statistical properties of the problem to minimize the number of votes needed to take decisions for each paper, significantly outperforms a number of non-adaptive approaches in terms of cost and accuracy. We validate both applicability and results of the approach through a set of crowdsourcing experiments, and discuss properties of the problem and algorithms that we believe to be generally of interest for classification problems where items are classified via a series of successive tests (as it often happens in medicine).
【Keywords】: crowdsourcing; human computation; literature reviews
【Paper Link】 【Pages】:65-75
【Authors】: Himel Dev ; Chase Geigle ; Qingtao Hu ; Jiahui Zheng ; Hari Sundaram
【Abstract】: In this paper, we interpret the community question answering websites on the StackExchange platform as knowledge markets, and analyze how and why these markets can fail at scale. A knowledge market framing allows site operators to reason about market failures, and to design policies to prevent them. Our goal is to provide insights on large-scale knowledge market failures through an interpretable model. We explore a set of interpretable economic production models on a large empirical dataset to analyze the dynamics of content generation in knowledge markets. Amongst these, the Cobb-Douglas model best explains empirical data and provides an intuitive explanation for content generation through the concepts of elasticity and diminishing returns. Content generation depends on user participation and also on how specific types of content (e.g. answers) depends on other types (e.g. questions). We show that these factors of content generation have constant elasticity and a percentage increase in any of the inputs leads to a constant percentage increase in the output. Furthermore, markets exhibit diminishing returns-the marginal output decreases as the input is incrementally increased. Knowledge markets also vary on their returns to scale-the increase in output resulting from a proportionate increase in all inputs. Importantly, many knowledge markets exhibit diseconomies of scale-measures of market health (e.g., the percentage of questions with an accepted answer) decrease as a function of the number of participants. The implications of our work are two-fold: site operators ought to design incentives as a function of system size (number of participants); the market lens should shed insight into complex dependencies amongst different content types and participant actions in general social networks.
【Keywords】: community question answering; content generation; diseconomies of scale; knowledge market
【Paper Link】 【Pages】:77-86
【Authors】: Sunil Mohan ; Nicolas Fiorini ; Sun Kim ; Zhiyong Lu
【Abstract】: Publications in the life sciences are characterized by a large technical vocabulary, with many lexical and semantic variations for expressing the same concept. Towards addressing the problem of relevance in biomedical literature search, we introduce a deep learning model for the relevance of a document's text to a keyword style query. Limited by a relatively small amount of training data, the model uses pre-trained word embeddings. With these, the model first computes a variable-length Delta matrix between the query and document, representing a difference between the two texts, which is then passed through a deep convolution stage followed by a deep feed-forward network to compute a relevance score. This results in a fast model suitable for use in an online search engine. The model is robust and outperforms comparable state-of-the-art deep learning approaches.
【Keywords】: biomedical information retrieval; deep learning; learning to rank; search
【Paper Link】 【Pages】:87-96
【Authors】: Bin Zou ; Vasileios Lampos ; Ingemar J. Cox
【Abstract】: We investigate the utility of multi-task learning to disease surveillance using Web search data. Our motivation is two-fold. Firstly, we assess whether concurrently training models for various geographies - inside a country or across different countries - can improve accuracy. We also test the ability of such models to assist health systems that are producing sporadic disease surveillance reports that reduce the quantity of available training data. We explore both linear and nonlinear models, specifically a multi-task expansion of elastic net and a multi-task Gaussian Process, and compare them to their respective single task formulations. We use influenza-like illness as a case study and conduct experiments on the United States (US) as well as England, where both health and Google search data were obtained. Our empirical results indicate that multi-task learning improves regional as well as national models for the US. The percentage of improvement on mean absolute error increases up to 14.8% as the historical training data is reduced from 5 to 1 year(s), illustrating that accurate models can be obtained, even by training on relatively short time intervals. Furthermore, in simulated scenarios, where only a few health reports (training data) are available, we show that multi-task learning helps to maintain a stable performance across all the affected locations. Finally, we present results from a cross-country experiment, where data from the US improves the estimates for England. As the historical training data for England is reduced, the benefits of multi-task learning increase, reducing mean absolute error by up to 40%.
【Keywords】: disease surveillance; gaussian processes; multi-task learning; regularized regression; user-generated content; web search
【Paper Link】 【Pages】:97-106
【Authors】: Junxiang Wang ; Liang Zhao
【Abstract】: Detection of vaccine adverse events is crucial to the discovery and improvement of problematic vaccines. To achieve it, traditionally formal reporting systems like VAERS support accurate but delayed surveillance, while recently social media have been mined for timely but noisy observations. Utilizing the complementary strengths of these two domains to boost the detection performance looks good but cannot be effectively achieved by existing methods due to significant differences between their data characteristics, including: 1) formal language v.s. informal language, 2) single-message per user v.s. multi-messages per user, and 3) one class v.s. binary class. In this paper, we propose a novel generic framework named Multi-instance Domain Adaptation (MIDA) to maximize the synergy between these two domains in the vaccine adverse event detection task for social media users. Specifically, we propose a generalized Maximum Mean Discrepancy (MMD) criterion to measure the semantic distances between the heterogeneous messages from these two domains in their shared latent semantic space. Then these message-level generalized MMD distances are synthesized by newly proposed mixed instance kernels to user-level distances. We finally minimize the distances between the samples of the partially-matched classes from these two domains. In order to solve the non-convex optimization problem, an efficient Alternating Direction Method of Multipliers (ADMM) based algorithm combined with the Convex-Concave Procedure (CCP) is developed to optimize parameters accurately. Extensive experiments demonstrated that our model outperformed the baselines by a large margin under six metrics. Case studies showed that formal reports and extracted adverse-relevant tweets by MIDA shared a similarity of keyword and description patterns.
【Keywords】: adverse event detection; multi-instance learning; transfer learning
【Paper Link】 【Pages】:107-116
【Authors】: Emma Pierson ; Tim Althoff ; Jure Leskovec
【Abstract】: Cycles are fundamental to human health and behavior. Examples include mood cycles, circadian rhythms, and the menstrual cycle. However, modeling cycles in time series data is challenging because in most cases the cycles are not labeled or directly observed and need to be inferred from multidimensional measurements taken over time. Here, we present Cyclic Hidden Markov Models (CyHMMs) for detecting and modeling cycles in a collection of multidimensional heterogeneous time series data. In contrast to previous cycle modeling methods, CyHMMs deal with a number of challenges encountered in modeling real-world cycles: they can model multivariate data with both discrete and continuous dimensions; they explicitly model and are robust to missing data; and they can share information across individuals to accommodate variation both within and between individual time series. Experiments on synthetic and real-world health-tracking data demonstrate that CyHMMs infer cycle lengths more accurately than existing methods, with 58% lower error on simulated data and 63% lower error on real-world data compared to the best-performing baseline. CyHMMs can also perform functions which baselines cannot: they can model the progression of individual features/symptoms over the course of the cycle, identify the most variable features, and cluster individual time series into groups with distinct characteristics. Applying CyHMMs to two real-world health-tracking datasets -- of human menstrual cycle symptoms and physical activity tracking data -- yields important insights including which symptoms to expect at each point during the cycle. We also find that people fall into several groups with distinct cycle patterns, and that these groups differ along dimensions not provided to the model. For example, by modeling missing data in the menstrual cycles dataset, we are able to discover a medically relevant group of birth control users even though information on birth control is not given to the model.
【Keywords】:
【Paper Link】 【Pages】:117-126
【Authors】: Shaika Chowdhury ; Chenwei Zhang ; Philip S. Yu
【Abstract】: Social media has grown to be a crucial information source for pharmacovigilance studies where an increasing number of people post adverse reactions to medical drugs that are previously unreported. Aiming to effectively monitor various aspects of Adverse Drug Reactions (ADRs) from diversely expressed social medical posts, we propose a multi-task neural network framework that learns several tasks associated with ADR monitoring with different levels of supervisions collectively. Besides being able to correctly classify ADR posts and accurately extract ADR mentions from online posts, the proposed framework is also able to further understand reasons for which the drug is being taken, known as »indications», from the given social media post. A coverage-based attention mechanism is adopted in our framework to help the model properly identify »phrasal» ADRs and Indications that are attentive to multiple words in a post. Our framework is applicable in situations where limited parallel data for different pharmacovigilance tasks are available. We evaluate the proposed framework on real-world Twitter datasets, where the proposed model outperforms the state-of-the-art alternatives of each individual task consistently.
【Keywords】: adverse drug reaction; attention mechanism; coverage; multi-task learning; pharmacovigilance; recurrent neural network; social media
【Paper Link】 【Pages】:137-146
【Authors】: Payam Karisani ; Eugene Agichtein
【Abstract】: Millions of users share their experiences on social media sites, such as Twitter, which in turn generate valuable data for public health monitoring, digital epidemiology, and other analyses of population health at global scale. The first, critical, task for these applications is classifying whether a personal health event was mentioned, which we call the (PHM) problem. This task is challenging for many reasons, including typically short length of social media posts, inventive spelling and lexicons, and figurative language, including hyperbole using diseases like "heart attack»» or "cancer»» for emphasis, and not as a health self-report. This problem is even more challenging for rarely reported, or frequent but ambiguously expressed conditions, such as "stroke»». To address this problem, we propose a general, robust method for detecting PHMs in social media, which we call WESPAD, that combines lexical, syntactic, word embedding-based, and context-based features. WESPAD is able to generalize from few examples by automatically distorting the word embedding space to most effectively detect the true health mentions. Unlike previously proposed state-of-the-art supervised and deep-learning techniques, WESPAD requires relatively little training data, which makes it possible to adapt, with minimal effort, to each new disease and condition. We evaluate WESPAD on both an established publicly available Flu detection benchmark, and on a new dataset that we have constructed with mentions of multiple health conditions. Our experiments show that WESPAD outperforms the baselines and state-of-the-art methods, especially in cases when the number and proportion of true health mentions in the training data is small.
【Keywords】: health tracking in social media; representation learning for text classification; social media classification
【Paper Link】 【Pages】:147-156
【Authors】: Chikara Hashimoto ; Manabu Sassano
【Abstract】: Intelligent assistants, such as Siri, are expected to converse comprehensibly with users. To facilitate improvement of their conversational ability, we have developed a method that detects absurd conversations recorded in intelligent assistant logs by identifying user feedback utterances that indicate users' favorable and unfavorable evaluations of intelligent assistant responses; e.g., "great!" is favorable, whereas "what are you talking about?" is unfavorable. Assuming that absurd/comprehensible conversations tend to be followed by unfavorable/favorable utterances, our method extracts some absurd/comprehensible conversations from the log to train a conversation classifier that sorts all the conversations recorded in the log as either absurd or not. The challenge is that user feedback utterances are often ambiguous; e.g., a user may give an unfavorable utterance (e.g., "don't be silly!") to a comprehensible conversation in which the intelligent assistant was attempting to make a joke. An utterance classifier is thus used to score the feedback utterances in accordance with how unambiguously they indicate absurdity. Experiments showed that our method significantly outperformed methods that lacked a conversation and/or utterance classifier, indicating the effectiveness of the two classifiers. Our method only requires user feedback utterances, which would be independent of domains. Experiments focused on chitchat, web search, and weather domains indicated that our method is likely domain-independent.
【Keywords】: absurdity detection; intelligent assistant; natural language processing; user feedback
【Paper Link】 【Pages】:157-166
【Authors】: Xiang Li ; Ben Kao ; Siqiang Luo ; Martin Ester
【Abstract】: We investigate the effectiveness of spectral methods in clustering multi-scale data, which is data whose clusters are of various sizes and densities. We review existing spectral methods that are designed to handle multi-scale data and propose an alternative approach that is orthogonal to existing methods. We put forward the algorithm ROSC, which computes an affinity matrix that takes into account both objects' feature similarity and reachability similarity. We perform extensive experiments comparing ROSC against 9 other methods on both real and synthetic datasets. Our results show that ROSC performs very well against the competitors. In particular, it is very robust in that it consistently performs well over all the datasets tested. Also, it outperforms others by wide margins for datasets that are highly multi-scale.
【Keywords】: multi-scale data; robustness; spectral clustering
【Paper Link】 【Pages】:167-176
【Authors】: Guanjie Zheng ; Fuzheng Zhang ; Zihan Zheng ; Yang Xiang ; Nicholas Jing Yuan ; Xing Xie ; Zhenhui Li
【Abstract】: In this paper, we propose a novel Deep Reinforcement Learning framework for news recommendation. Online personalized news recommendation is a highly challenging problem due to the dynamic nature of news features and user preferences. Although some online recommendation models have been proposed to address the dynamic nature of news recommendation, these methods have three major issues. First, they only try to model current reward (e.g., Click Through Rate). Second, very few studies consider to use user feedback other than click / no click labels (e.g., how frequent user returns) to help improve recommendation. Third, these methods tend to keep recommending similar news to users, which may cause users to get bored. Therefore, to address the aforementioned challenges, we propose a Deep Q-Learning based recommendation framework, which can model future reward explicitly. We further consider user return pattern as a supplement to click / no click label in order to capture more user feedback information. In addition, an effective exploration strategy is incorporated to find new attractive news for users. Extensive experiments are conducted on the offline dataset and online production environment of a commercial news recommendation application and have shown the superior performance of our methods.
【Keywords】: deep Q-Learning; news recommendation; reinforcement learning
【Paper Link】 【Pages】:177-186
【Authors】: Huijun Wu ; Chen Wang ; Jie Yin ; Kai Lu ; Liming Zhu
【Abstract】: Despite outperforming humans in many tasks, deep neural network models are also criticized for the lack of transparency and interpretability in decision making. The opaqueness results in uncertainty and low confidence when deploying such a model in model sharing scenarios, where the model is developed by a third party. For a supervised machine learning model, sharing training process including training data is a way to gain trust and to better understand model predictions. However, it is not always possible to share all training data due to privacy and policy constraints. In this paper, we propose a method to disclose a small set of training data that is just sufficient for users to get the insight into a complicated model. The method constructs a boundary tree using selected training data and the tree is able to approximate the complicated deep neural network models with high fidelity. We show that data point pairs in the tree give users significantly better understanding of the model decision boundaries and paves the way for trustworthy model sharing.
【Keywords】: decision boundary; deep neural networks; interpretability; model sharing
【Paper Link】 【Pages】:187-196
【Authors】: Haowen Xu ; Wenxiao Chen ; Nengwen Zhao ; Zeyan Li ; Jiahao Bu ; Zhihan Li ; Ying Liu ; Youjian Zhao ; Dan Pei ; Yang Feng ; Jie Chen ; Zhaogang Wang ; Honglin Qiao
【Abstract】: To ensure undisrupted business, large Internet companies need to closely monitor various KPIs (e.g., Page Views, number of online users, and number of orders) of its Web applications, to accurately detect anomalies and trigger timely troubleshooting/mitigation. However, anomaly detection for these seasonal KPIs with various patterns and data quality has been a great challenge, especially without labels. In this paper, we proposed Donut, an unsupervised anomaly detection algorithm based on VAE. Thanks to a few of our key techniques, Donut greatly outperforms a state-of-arts supervised ensemble approach and a baseline VAE approach, and its best F-scores range from 0.75 to 0.9 for the studied KPIs from a top global Internet company. We come up with a novel KDE interpretation of reconstruction for Donut, making it the first VAE-based anomaly detection algorithm with solid theoretical explanation.
【Keywords】: anomaly detection; seasonal KPI; variational auto-encoder
【Paper Link】 【Pages】:197-206
【Authors】: Diego Perino ; Matteo Varvello ; Claudio Soriente
【Abstract】: Free web proxies promise anonymity and censorship circumvention at no cost. Several websites publish lists of free proxies organized by country, anonymity level, and performance. These lists index hundreds of thousand of hosts discovered via automated tools and crowd-sourcing. A complex free proxy ecosystem has been forming over the years, of which very little is known. In this paper we shed light on this ecosystem via ProxyTorrent, a distributed measurement platform that leverages both active and passive measurements. Active measurements discover free proxies, assess their performance, and detect potential malicious activities. Passive measurements relate to proxy performance and usage in the wild, and are collected by free proxies users via a Chrome plugin we developed. ProxyTorrent has been running since January 2017, monitoring up to 180,000 free proxies and totaling more than 1,500 users over a 10 months period. Our analysis shows that less than 2% of the proxies announced on the Web indeed proxy traffic on behalf of users; further, only half of these proxies have decent performance and can be used reliably. Around 10% of the working proxies exhibit malicious behaviors, e.g., ads injection and TLS interception, and these proxies are also the ones providing the best performance. Through the analysis of more than 2 Terabytes of proxied traffic, we show that web browsing is the primary user activity. Geo-blocking avoidance is not a prominent use-case, with the exception of proxies located in countries hosting popular geo-blocked content.
【Keywords】: network measurements; proxies; web security and privacy
【Paper Link】 【Pages】:207-216
【Authors】: Timothy Libert
【Abstract】: A dominant regulatory model for web privacy is "notice and choice". In this model, users are notified of data collection and provided with options to control it. To examine the efficacy of this approach, this study presents the first large-scale audit of disclosure of third-party data collection in website privacy policies. Data flows on one million websites are analyzed and over 200,000 websites' privacy policies are audited to determine if users are notified of the names of the companies which collect their data. Policies from 25 prominent third-party data collectors are also examined to provide deeper insights into the totality of the policy environment. Policies are additionally audited to determine if the choice expressed by the "Do Not Track" browser setting is respected. Third-party data collection is wide-spread, but fewer than 15% of attributed data flows are disclosed. The third-parties most likely to be disclosed are those with consumer services users may be aware of, those without consumer services are less likely to be mentioned. Policies are difficult to understand and the average time requirement to read both a given site»s policy and the associated third-party policies exceeds 84 minutes. Only 7% of first-party site policies mention the Do Not Track signal, and the majority of such mentions are to specify that the signal is ignored. Among third-party policies examined, none offer unqualified support for the Do Not Track signal. Findings indicate that current implementations of "notice and choice" fail to provide notice or respect choice.
【Keywords】: internet policy; internet regulation; web privacy; web security
【Paper Link】 【Pages】:217-226
【Authors】: Yuxi Wu ; Panya Gupta ; Miranda Wei ; Yasemin Acar ; Sascha Fahl ; Blase Ur
【Abstract】: All major web browsers include a private browsing mode that does not store browsing history, cookies, or temporary files across browsing sessions. Unfortunately, users have misconceptions about what this mode does. Many factors likely contribute to these misconceptions. In this paper, we focus on browsers» disclosures, or their in-browser explanations of private browsing mode. In a 460-participant online study, each participant saw one of 13 different disclosures (the desktop and mobile disclosures of six popular browsers, plus a control). Based on the disclosure they saw, participants answered questions about what would happen in twenty browsing scenarios capturing previously documented misconceptions. We found that browsers» disclosures fail to correct the majority of the misconceptions we tested. These misconceptions included beliefs that private browsing mode would prevent geolocation, advertisements, viruses, and tracking by both the websites visited and the network provider. Furthermore, participants who saw certain disclosures were more likely to have misconceptions about private browsing»s impact on targeted advertising, the persistence of lists of downloaded files, and tracking by ISPs, employers, and governments.
【Keywords】: incognito mode; private browsing; private browsing mode; usable privacy; user study; web browser privacy
【Paper Link】 【Pages】:227-236
【Authors】: Oleksii Starov ; Yuchen Zhou ; Xiao Zhang ; Najmeh Miramirkhani ; Nick Nikiforakis
【Abstract】: To better understand the demographics of their visitors and their paths through their websites, the vast majority of modern website owners make use of third-party analytics platforms, such as, Google Analytics and ClickTale. Given that all the clients of a third-party analytics platform report to the same server, the tracking requests need to contain identifiers that allow the analytics server to differentiate between their clients. In this paper, we analyze the analytics identifiers utilized by eighteen different third-party analytics platforms and show that these identifiers enable the clustering of seemingly unrelated websites as part of a common third-party analytics account (i.e. websites whose analytics are managed by a single person or team). We focus our attention on malicious websites that also utilize third-party web analytics and show that threat analysts can utilize web analytics to both discover previously unknown malicious pages in a threat-agnostic fashion, as well as to cluster malicious websites into campaigns. We build a system for automatically identifying, isolating, and querying analytics identifiers from malicious pages and use it to discover an additional 11K live domains that use analytics associated with malicious pages. We show how our system can be used to improve the coverage of existing blacklists, discover previously unknown phishing campaigns, identify malicious binaries and Android apps, and even aid in attribution of malicious domains with protected WHOIS information.
【Keywords】:
【Paper Link】 【Pages】:237-246
【Authors】: Sajjad Arshad ; Seyed Ali Mirheidari ; Tobias Lauinger ; Bruno Crispo ; Engin Kirda ; William K. Robertson
【Abstract】: Relative Path Overwrite (RPO) is a recent technique to inject style directives into sites even when no style sink or markup injection vulnerability is present. It exploits differences in how browsers and web servers interpret relative paths (i.e., path confusion) to make a HTML page reference itself as a stylesheet; a simple text injection vulnerability along with browsers» leniency in parsing CSS resources results in an attacker»s ability to inject style directives that will be interpreted by the browser. Even though style injection may appear less serious a threat than script injection, it has been shown that it enables a range of attacks, including secret exfiltration. In this paper, we present the first large-scale study of the Web to measure the prevalence and significance of style injection using RPO. Our work shows that around 9% of the sites in the Alexa Top 10,000 contain at least one vulnerable page, out of which more than one third can be exploited. We analyze in detail various impediments to successful exploitation, and make recommendations for remediation. In contrast to script injection, relatively simple countermeasures exist to mitigate style injection. However, there appears to be little awareness of this attack vector as evidenced by a range of popular Content Management Systems (CMSes) that we found to be exploitable.
【Keywords】: relative path overwrite; scriptless attack; style injection
【Paper Link】 【Pages】:247-256
【Authors】: Abner Mendoza ; Phakpoom Chinprutthiwong ; Guofei Gu
【Abstract】: The paradigm shift to a mobile-first economy has seen a drastic increase in mobile-optimized websites that in many cases are derived from their desktop counterparts. Mobile website design is often focused on performance optimization rather than security, and possibly developed by different teams of developers. This has resulted in a number of subtle but critical inconsistencies in terms of security guarantees provided on the web platform, such as protection mechanisms against common web attacks. In this work, we have conducted the first systematic measurement study of inconsistencies between mobile and desktop HTTP security response configuration in the top 70,000 websites. We show that HTTP security configuration inconsistencies between mobile and desktop versions of the same website can lead to vulnerabilities. Our study compares data snapshots collected one year apart to garner insights into the longitudinal trends of mobile versus desktop inconsistencies in websites. To complement our measurement study, we present a threat analysis that explores some possible attack scenarios that can leverage the inconsistencies found on real websites. We systematically analyze the security impact of the inconsistent implementations between the mobile and desktop versions of a website and show how it can lead to real-world exploits. We present several case studies of popular websites to show real-world impact of how these inconsistencies are leveraged to compromise security and privacy of web users. Our results show little to no improvements across our datasets, which highlight the continued pervasiveness of subtle inconsistencies affecting even some high profile websites.
【Keywords】: http headers; inconsistencies; mobile web security
【Paper Link】 【Pages】:257-266
【Authors】: Najmeh Miramirkhani ; Timothy Barron ; Michael Ferdman ; Nick Nikiforakis
【Abstract】: An event that is rarely considered by technical users and laymen alike is that of a domain name expiration. The massive growth in the registration of domain names is matched by massive numbers of domain expirations, after which domains are made available for registration again. While the vast majority of expiring domains are of no value, among the hundreds of thousands of daily expirations, there exist domains that are clearly valuable, either because of their lexical composition, or because of their residual trust. In this paper, we investigate the dynamics of domain dropcatching where companies, on behalf of users, compete to register the most desirable domains as soon as they are made available and then auction them off to the highest bidder. Using a data-driven approach, we monitor the expiration of 28 million domains over the period of nine months, collecting domain features, WHOIS records, and crawling the registered domains on a regular basis to uncover the purpose for which they were re-registered (caught). Among others, we find that on average, only 10% of the expired (dropped) domains are caught with the vast majority of the re-registrations happening on the day they are released. We investigate the features that make some domains more likely to be caught than others and discover that a domain that was malicious at the time of its expiration is twice as likely to be caught than the average domain. Moreover, previously-malicious domains are significantly more likely to be reused for malicious purposes than previously benign domains. We identify three types of users who are interested in purchasing dropped domains, ranging from freelancers who purchase one or two domains to professionals who invest more than $115K purchasing dropped domains in only three months. Finally, we observe that less than 11% were used to host web content with the remaining domains used either by speculators, or by malicious actors.
【Keywords】: deleted domain; dns; drop catch; expired domain
【Paper Link】 【Pages】:267-276
【Authors】: Rahat Masood ; Dinusha Vatsalan ; Muhammad Ikram ; Mohamed Ali Kâafar
【Abstract】: Users leave a trail of their personal data, interests, and intents while surfing or sharing information on the Web. Web data could therefore reveal some private/sensitive information about users based on inference analysis. The possible identification of information corresponding to a single individual by an inference attack holds true even if the user identifiers are encoded or removed in the Web data. Several works have been done on improving privacy of Web data through obfuscation methods~\citeHow09,Dom09,Sha05,Che14. However, these methods are neither comprehensive, generic to be applicable to any Web data, nor effective against adversarial attacks. To this end, we propose a privacy-aware obfuscation method for Web data addressing these identified drawbacks of existing methods. We use probabilistic methods to predict privacy risk of Web data that incorporates all key privacy aspects, which are uniqueness, uniformity, and linkability of Web data. The Web data with high predicted risk are then obfuscated by our method to minimize the privacy risk using semantically similar data. Our method is resistant against adversary who has knowledge about the datasets and model learned risk probabilities using differential privacy-based noise addition. Experimental study conducted on two real Web datasets validates the significance and efficacy of our method. Our results indicate that the average privacy risk reaches to 100% with a minimum of 10 sensitive Web entries, while at most 0% privacy risk could be attained with our obfuscation method at the cost of average utility loss of 64.3%.
【Keywords】: adversarial machine learning; data obfuscation; privacy risk evaluation; probabilistic model; semantic similarity; web data privacy
【Paper Link】 【Pages】:277-286
【Authors】: Martin Dittus ; Joss Wright ; Mark Graham
【Abstract】: Does recent growth of darknet markets signify a slow reorganisation of the illicit drug trade? Where are darknet markets situated in the global drug supply chain? In principle, these platforms allow producers to sell directly to end users, bypassing traditional trafficking routes. And yet, there is evidence that many offerings originate from a small number of highly active consumer countries, rather than from countries that are primarily known for drug production. In a large-scale empirical study, we determine the darknet trading geography of three plant-based drugs across four of the largest darknet markets, and compare it to the global footprint of production and consumption for these drugs. We present strong evidence that cannabis and cocaine vendors are primarily located in a small number of consumer countries, rather than producer countries, suggesting that darknet trading happens at the »last mile», possibly leaving old trafficking routes intact. A model to explain trading volumes of opiates is inconclusive. We cannot find evidence for significant production-side offerings across any of the drug types or marketplaces. Our evidence further suggests that the geography of darknet market trades is primarily driven by existing consumer demand, rather than new demand fostered by individual markets.
【Keywords】: cryptomarkets; darknet markets; economic geography; information geography; online crime; platforms
【Paper Link】 【Pages】:287-296
【Authors】: Yang Zhang ; Mathias Humbert ; Tahleen Rahman ; Cheng-Te Li ; Jun Pang ; Michael Backes
【Abstract】: Hashtag has emerged as a widely used concept of popular culture and campaigns, but its implications on people»s privacy have not been investigated so far. In this paper, we present the first systematic analysis of privacy issues induced by hashtags. We concentrate in particular on location, which is recognized as one of the key privacy concerns in the Internet era. By relying on a random forest model, we show that we can infer a user»s precise location from hashtags with accuracy of 70% to 76%, depending on the city. To remedy this situation, we introduce a system called Tagvisor that systematically suggests alternative hashtags if the user-selected ones constitute a threat to location privacy. Tagvisor realizes this by means of three conceptually different obfuscation techniques and a semantics-based metric for measuring the consequent utility loss. Our findings show that obfuscating as little as two hashtags already provides a near-optimal trade-off between privacy and utility in our dataset. This in particular renders Tagvisor highly time-efficient, and thus, practical in real-world settings.
【Keywords】: hashtag; location privacy; online social networks
【Paper Link】 【Pages】:297-307
【Authors】: I Luk Kim ; Weihang Wang ; Yonghwi Kwon ; Yunhui Zheng ; Yousra Aafer ; Weijie Meng ; Xiangyu Zhang
【Abstract】: In this paper, we present a new ad budget draining attack. By repeatedly pulling ads from targeted advertisers using crafted browsing profiles, we are able to reduce the chance of showing their ads to real-human visitors and trash the ad budget. From the advertiser profiles collected by an automated crawler, we infer advertising strategies, train satisfying browsing profiles and launch large-scale attacks. We evaluate our methods on 291 public advertisers selected from Alexa Top 500, where we successfully reveal the targeting strategies used by 87% of the advertisers we considered. We also executed a series of attacks against a controlled advertiser and 3 real-world advertisers within the ethical and legal boundary. The results show that we are able to fetch 40,958 ads and drain up to $155.89 from the targeted advertisers within an hour.
【Keywords】: ad fraud; budget draining attack; online advertising
【Paper Link】 【Pages】:309-318
【Authors】: Alejandro Gómez-Boix ; Pierre Laperdrix ; Benoit Baudry
【Abstract】: Browser fingerprinting is a stateless technique, which consists in collecting a wide range of data about a device through browser APIs. Past studies have demonstrated that modern devices present so much diversity that fingerprints can be exploited to identify and track users online. With this work, we want to evaluate if browser fingerprinting is still effective at uniquely identifying a large group of users when analyzing millions of fingerprints over a few months. We collected 2,067,942 browser fingerprints from one of the top 15 French websites. The analysis of this novel dataset sheds a new light on the ever-growing browser fingerprinting domain. The key insight is that the percentage of unique fingerprints in our dataset is much lower than what was reported in the past: only 33.6% of fingerprints are unique by opposition to over 80% in previous studies. We show that non-unique fingerprints tend to be fragile. If some features of the fingerprint change, it is very probable that the fingerprint will become unique. We also confirm that the current evolution of web technologies is benefiting users» privacy significantly as the removal of plugins brings down substantively the rate of unique desktop machines.
【Keywords】: browser fingerprinting; privacy; software diversity
【Paper Link】 【Pages】:319-328
【Authors】: Bharat Srinivasan ; Athanasios Kountouras ; Najmeh Miramirkhani ; Monjur Alam ; Nick Nikiforakis ; Manos Antonakakis ; Mustaque Ahamad
【Abstract】: Technical Support Scams (TSS), which combine online abuse with social engineering over the phone channel, have persisted despite several law enforcement actions. Although recent research has provided important insights into TSS, these scams have now evolved to exploit ubiquitously used online services such as search and sponsored advertisements served in response to search queries. We use a data-driven approach to understand search-and-ad abuse by TSS to gain visibility into the online infrastructure that facilitates it. By carefully formulating tech support queries with multiple search engines, we collect data about both the support infrastructure and the websites to which TSS victims are directed when they search online for tech support resources. We augment this with a DNS-based amplification technique to further enhance visibility into this abuse infrastructure. By analyzing the collected data, we provide new insights into search-and-ad abuse by TSS and reinforce some of the findings of earlier research. Further, we demonstrate that tech support scammers are (1) successful in getting major as well as custom search engines to return links to websites controlled by them, and (2) they are able to get ad networks to serve malicious advertisements that lead to scam pages. Our study period of approximately eight months uncovered over 9,000 TSS domains, of both passive and aggressive types, with minimal overlap between sets that are reached via organic search results and sponsored ads. Also, we found over 2,400 support domains which aid the TSS domains in manipulating organic search results. Moreover, to our surprise, we found very little overlap with domains that are reached via abuse of domain parking and URL-shortening services which was investigated previously. Thus, investigation of search-and-ad abuse provides new insights into TSS tactics and helps detect previously unknown abuse infrastructure that facilitates these scams.
【Keywords】: advertisement abuse; cyber crime; omni/cross-channel scams; search engine abuse; social engineering; tech support; telephony fraud
【Paper Link】 【Pages】:329-338
【Authors】: Pedro Moreno-Sanchez ; Navin Modi ; Raghuvir Songhela ; Aniket Kate ; Sonia Fahmy
【Abstract】: The Ripple credit network has emerged as a payment backbone with key advantages for financial institutions and the remittance industry. Its path-based IOweYou (IOU) settlements across different (crypto)currencies conceptually distinguishes the Ripple blockchain from cryptocurrencies (such as Bitcoin and altcoins), and makes it highly suitable to an orthogonal yet vast set of applications in the remittance world for cross-border transactions and beyond. This work studies the structure and evolution of the Ripple network since its inception, and investigates its vulnerability to devilry attacks that affect the IOU credit of linnet users» wallets. We find that about 13M USD are at risk in the current Ripple network due to inappropriate configuration of the rippling flag on credit links, facilitating undesired redistribution of credit across those links. Although the Ripple network has grown around a few highly connected hub (gateway) wallets that constitute the core of the network and provide high liquidity to users, such a credit link distribution results in a user base of around 112,000 wallets that can be financially isolated by as few as 10 highly connected gateway wallets. Indeed, today about 4.9M USD cannot be withdrawn by their owners from the Ripple network due to PayRoutes, a gateway tagged as faulty by the Ripple community. Finally, we observe that stale exchange offers pose a real problem, and exchanges (market makers) have not always been vigilant about periodically updating their exchange offers according to current real-world exchange rates. For example, stale offers were used by 84 Ripple wallets to gain more than 4.5M USD from mid-July to mid-August 2017. Our findings should prompt the Ripple community to improve the health of the network by educating its users on increasing their connectivity, and by appropriately maintaining the credit limits, rippling flags, and exchange offers on their IOU credit links.
【Keywords】: credit devilry; faulty gateways; ioweyou (iou); ripple credit network; rippling; stale exchange offers
【Paper Link】 【Pages】:339-348
【Authors】: Young Bae Jeon ; Minchul Kim ; Hyunsoo Kim ; Hyoungshick Kim ; Jun Ho Huh ; Ji Won Yoon
【Abstract】: Electrical network frequency (ENF) signals have common patterns that can be used as signatures for identifying recorded time and location of videos and sound. To enable cost-efficient, reliable and scalable location inference, we created a reference map of ENF signals representing hundreds of locations world wide -- extracting real-world ENF signals from online multimedia streaming services (e.g., YouTube and Explore). Based on this reference map of ENF signals, we propose a novel side-channel attack that can identify the physical location of where a target video or sound was recorded or streamed from. Our attack does not require any expensive ENF signal receiver nor any software to be installed on a victim»s device -- all we need is the recorded video or sound files to perform the attack and they are collected from world wide web. The evaluation results show that our attack can infer the intra-grid location of the recorded audio files with an accuracy of $76$% when those files are $5$ minutes or longer. We also showed that our proposed attack works well even when video and audio data are processed within a certain distortion range with audio codecs used in real VoIP applications.
【Keywords】: electrical network frequency; location tracking; side channel analysis
【Paper Link】 【Pages】:349-358
【Authors】: Klaudia Krawiecka ; Arseny Kurnikov ; Andrew Paverd ; Mohammad Mannan ; N. Asokan
【Abstract】: Passwords are by far the most widely-used mechanism for authenticating users on the web, out-performing all competing solutions in terms of deployability (e.g. cost and compatibility). However, two critical security concerns are phishing and theft of password databases. These are exacerbated by users» tendency to reuse passwords across different services. Current solutions typically address only one of the two concerns, and do not protect passwords against rogue servers. Furthermore, they do not provide any verifiable evidence of their (server-side) adoption to users, and they face deployability challenges in terms of ease-of-use for end users, and/or costs for service providers. We present SafeKeeper, a novel and comprehensive solution to ensure secrecy of passwords in web authentication systems. Unlike previous approaches, SafeKeeper protects users» passwords against very strong adversaries, including external phishers as well as corrupted (rogue) servers. It is relatively inexpensive to deploy as it (i) uses widely available hardware-based trusted execution environments like Intel SGX, (ii) requires only minimal changes for integration into popular web platforms like WordPress, and (iii) imposes negligible performance overhead. We discuss several challenges in designing and implementing such a system, and how we overcome them. Via an 86-participant user study, systematic analysis and experiments, we show the usability, security and deployability of SafeKeeper, which is available as open-source.
【Keywords】: intel sgx; passwords; phishing; trusted execution environment
【Paper Link】 【Pages】:359-368
【Authors】: Yupeng Gu ; Yizhou Sun ; Yanen Li ; Yang Yang
【Abstract】: Network embedding algorithms that map nodes in a network into a low-dimensional vector space are prevalent in recent years, due to their superior performance in many network-based tasks, such as clustering, classification, and link prediction. The main assumption of existing algorithms is that the learned latent representation for nodes should preserve the structure of the network, in terms of first-order or higher-order connectivity. In other words, nodes that are more similar will have higher probability to connect to each other. This phenomena is typically explained as homophily in network science. However, there is another factor usually neglected by the existing embedding algorithms, which is the popularity of a node. For example, celebrities in a social network usually receive numerous followers, which cannot be fully explained by the similarity of the two users. We denote this factor with the terminology "social rank»». We then propose a network embedding model that considers both of the two factors in link generation, and learn proximity-based embedding and social rank-based embedding separately. Rather than simply treating these two factors independent with each other, a carefully designed link generation model is proposed, which explicitly models the interdependency between these two types of embeddings. Experiments on several real-world datasets across different domains demonstrate the superiority of our novel network embedding model over the state-of-the-art methods.
【Keywords】: network embedding; representation learning; social rank
【Paper Link】 【Pages】:369-378
【Authors】: Cameron Musco ; Christopher Musco ; Charalampos E. Tsourakakis
【Abstract】: The rise of social media and online social networks has been a disruptive force in society. Opinions are increasingly shaped by interactions on online social media, and social phenomena including disagreement and polarization are now tightly woven into everyday life. In this work we initiate the study of the following question: \beginquotation \noindent Given n agents, each with its own initial opinion that reflects its core value on a topic, and an opinion dynamics model, what is the structure of a social network that minimizes \em disagreementand \em controversy simultaneously? \endquotation \noindent This question is central to recommender systems: should a recommender system prefer a link suggestion between two online users with similar mindsets in order to keep disagreement low, or between two users with different opinions in order to expose each to the others viewpoint of the world, and decrease overall levels of polarization and controversy? Such decisions have an important global effect on society \citewilliams2007social. Our contributions include a mathematical formalization of this question as an optimization problem and an exact, time-efficient algorithm. We also prove that there always exists a graph with $O(n/ε^2)$ edges that is a $(1+ε)$ approximation to the optimum. Our formulation is an instance of optimization over \em graph topologies, see also \citeboyd2004fastest,daitch2009fitting,sun2006fastest. Furthermore, for a given graph, we show how to optimize the same objective over the agents» innate opinions in polynomial time. Finally, we perform an empirical study of our proposed methods on synthetic and real-world data that verify their value as mining tools to better understand the trade-off between of disagreement and polarization. We find that there is a lot of space to reduce both controversy and disagreement in real-world networks; for instance, on a Reddit network where users exchange comments on politics, our methods achieve a reduction in controversy and disagreement of the order $6.2 \times 10^4$.
【Keywords】: opinion dynamics; optimization; polarization; social networks
【Paper Link】 【Pages】:379-388
【Authors】: Abhimanyu Das ; Sreenivas Gollapudi ; Anthony Kim ; Debmalya Panigrahi ; Chaitanya Swamy
【Abstract】: Motivated by the popularity of online ride and delivery services, we study natural variants of classical multi-vehicle minimum latency problems where the objective is to route a set of vehicles located at depots to serve requests located on a metric space so as to minimize the total latency. In this paper, we consider point-to-point requests that come with source-destination pairs and release-time constraints that restrict when each request can be served. The point-to-point requests and release-time constraints model taxi rides and deliveries. For all the variants considered, we show constant-factor approximation algorithms based on a linear programming framework. To the best of our knowledge, these are the first set of results for the aforementioned variants of the minimum latency problems. Furthermore, we provide an empirical study of heuristics based on our theoretical algorithms on a real data set of taxi rides.
【Keywords】: minimum latency problem; online ride services; vehicle routing
【Paper Link】 【Pages】:389-398
【Authors】: Abram N. Magner ; Jithin K. Sreedharan ; Ananth Y. Grama ; Wojciech Szpankowski
【Abstract】: Inferring the node arrival sequence from a snapshot of a dynamic network is an important problem, with applications ranging from identifying sources of contagion to flow of capital in financial transaction networks. Variants of this problem have received significant recent research attention, including results on infeasibility of solution for prior formulations. We present a new formulation of the problem that admits probabilistic solutions for broad classes of dynamic network models. Instantiating our framework for a preferential attachment model, we present effectively computable and practically tight bounds on the tradeoff curve between optimal achievable precision and density/recall. We also present efficient algorithms for partial recovery of node arrival orders and derive theoretical and empirical performance bounds on the precision and density/recall of our methods in comparison to the best possible. We validate our methods through experiments on both synthetic and real networks to show that their performance is robust to model changes, and that they yield excellent results in practice. We also demonstrate their utility in the context of a novel application in analysis of the human brain connectome to draw new insights into the functional and structural organization and evolution of the human brain.
【Keywords】: integer programming; linear extension; partial order; preferential attachment graphs
【Paper Link】 【Pages】:399-408
【Authors】: Yizhou Zhang ; Yun Xiong ; Xiangnan Kong ; Shanshan Li ; Jinhong Mi ; Yangyong Zhu
【Abstract】: Collective classification has attracted considerable attention in the last decade, where the labels within a group of instances are correlated and should be inferred collectively, instead of independently. Conventional approaches on collective classification mainly focus on exploiting simple relational features (such ascount andexists aggregators on neighboring nodes). However, many real-world applications involve complex dependencies among the instances, which are obscure/hidden in the networks. To capture these dependencies in collective classification, we need to go beyond simple relational features and extract deep dependencies between the instances. In this paper, we study the problem of deep collective classification inHeterogeneous Information Networks (HINs), which involves different types of autocorrelations, from simple to complex relations, among the instances. Different from conventional autocorrelations, which are given explicitly by the links in the network, complex autocorrelations are obscure/hidden in HINs, and should be inferred from existing links in a hierarchical order. This problem is highly challenging due to the multiple types of dependencies among the nodes and the complexity of the relational features. In this study, we proposed a deep convolutional collective classification method, called GraphInception to learn the deep relational features in HINs. The proposed method can automatically generate a hierarchy of relational features with different complexities. Extensive experiments on four real-world networks demonstrate that our approach can improve the collective classification performance by considering deep relational features in HINs.
【Keywords】: collective classification; deep learning; graph convolution; graph mining; heterogeneous information networks
【Paper Link】 【Pages】:409-418
【Authors】: Minji Yoon ; Woojeong Jin ; U. Kang
【Abstract】: Given a time-evolving graph, how can we track similarity between nodes in a fast and accurate way, with theoretical guarantees on the convergence and the error? Random Walk with Restart (RWR) is a popular measure to estimate the similarity between nodes and has been exploited in numerous applications. Many real-world graphs are dynamic with frequent insertion/deletion of edges; thus, tracking RWR scores on dynamic graphs in an efficient way has aroused much interest among data mining researchers. Recently, dynamic RWR models based on the propagation of scores across a given graph have been proposed, and have succeeded in outperforming previous other approaches to compute RWR dynamically. However, those models fail to guarantee exactness and convergence time for updating RWR in a generalized form. In this paper, we propose OSP, a fast and accurate algorithm for computing dynamic RWR with insertion/deletion of nodes/edges in a directed/undirected graph. When the graph is updated, OSP first calculates offset scores around the modified edges, propagates the offset scores across the updated graph, and then merges them with the current RWR scores to get updated RWR scores. We prove the exactness of OSP and introduce OSP-T, a version of OSP which regulates a trade-off between accuracy and computation time by using error tolerance?. Given restart probability c, OSP-T guarantees to return RWR scores with O(∋/c) error in O(log(∋/2)/log(1-c)) iterations. Through extensive experiments, we show that OSP tracks RWR exactly up to 4605x faster than existing static RWR method on dynamic graphs, and OSP-T requires up to 15x less time with 730x lower L1 norm error and 3.3x lower rank error than other state-of-the-art dynamic RWR methods.
【Keywords】: online algorithm; random walk with restart; time-evolving graphs
【Paper Link】 【Pages】:419-428
【Authors】: Marian-Andrei Rizoiu ; Swapnil Mishra ; Quyu Kong ; Mark James Carman ; Lexing Xie
【Abstract】: Among the statistical tools for online information diffusion modeling, both epidemic models and Hawkes point processes are popular choices. The former originate from epidemiology, and consider information as a viral contagion which spreads into a population of online users. The latter have roots in geophysics and finance, view individual actions as discrete events in continuous time, and modulate the rate of events according to the self-exciting nature of event sequences. Here, we establish a novel connection between these two frameworks. Namely, the rate of events in an extended Hawkes model is identical to the rate of new infections in the Susceptible-Infected-Recovered (SIR) model after marginalizing out recovery events -- which are unobserved in a Hawkes process. This result paves the way to apply tools developed for SIR to Hawkes, and vice versa. It also leads to HawkesN, a generalization of the Hawkes model which accounts for a finite population size. Finally, we derive the distribution of cascade sizes for HawkesN, inspired by methods in stochastic SIR. Such distributions provide nuanced explanations to the general unpredictability of popularity: the distribution for diffusion cascade sizes tends to have two modes, one corresponding to large cascade sizes and another one around zero.
【Keywords】: diffusion size prediction; diffusions in finite population; epidemic models; temporal point processes
【Paper Link】 【Pages】:429-438
【Authors】: Yongzheng Jia ; Xue Liu ; Wei Xu
【Abstract】: Mobile dating applications such as Coffee Meets Bagel, Tantan, and Tinder, have become significant for young adults to meet new friends and discover romantic relationships. From a system designer's perspective, in order to achieve better user experience in these applications, we should take both the efficiency and fairness of a dating market into consideration, so as to increase the overall satisfaction for all users. Towards this goal, we investigate the nature of diminishing marginal returns for online dating markets (i.e., captured by the submodularity), and trade-off between the efficiency and fairness of the market with Nash social welfare. We further design effective online algorithms to the apps. We verify our models and algorithms through sound theoretical analysis and empirical studies by using real data and show that our algorithms can significantly improve the ecosystems of the online dating applications.
【Keywords】:
【Paper Link】 【Pages】:439-448
【Authors】: Nate Veldt ; David F. Gleich ; Anthony Wirth
【Abstract】: Graph clustering, or community detection, is the task of identifying groups of closely related objects in a large network. In this paper we introduce a new community detection framework called LambdaCC that is based on a specially weighted version of correlation clustering. A key component in our methodology is a clustering resolution parameter, lambda, which implicitly controls the size and structure of clusters formed by our framework. We show that, by increasing this parameter, our objective effectively interpolates between two different strategies in graph clustering: finding a sparse cut and forming dense subgraphs. Our methodology unifies and generalizes a number of other important clustering quality functions including modularity, sparsest cut, and cluster deletion, and places them all within the context of an optimization problem that has been well studied from the perspective of approximation algorithms. Our approach to clustering is particularly relevant in the regime of finding dense clusters, as it leads to a 2-approximation for the cluster deletion problem. We use our approach to cluster several graphs, including large collaboration networks and social networks.
【Keywords】: cluster deletion; community detection; correlation clustering; graph clustering; network analysis; sparsest cut
【Paper Link】 【Pages】:449-458
【Authors】: Talya Eden ; Shweta Jain ; Ali Pinar ; Dana Ron ; C. Seshadhri
【Abstract】: The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on graph sampling, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. Estimating the degree distribution of real-world graphs poses a significant challenge, due to their heavy-tailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of sublinear algorithms. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the h-index and the z-index. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most $1%$ of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than $10%$ of the vertices to give comparable results. We also observe that the h and z-indices of real graphs are large, validating our theoretical analysis.
【Keywords】: degree distribution; graphs; sampling; sublinear
【Paper Link】 【Pages】:459-468
【Authors】: Sofia Maria Nikolakaki ; Charalampos Mavroforakis ; Alina Ene ; Evimaria Terzi
【Abstract】: The proliferation of online social networks and the spread of smart mobile devices enable the collection of information related to a multitude of users' activities. These networks, where every node is associated with a type of action and a frequency, are usually referred to as activity networks. Examples of such networks include road networks, where the nodes are intersections and the edges are road segments. Each node is associated with a number of geolocated actions that users of an online platform took in its vicinity. In these networks, we define a prize-collecting subgraph to be a connected set of nodes, which exhibits high levels of activity, and is compact, i.e., the nodes are close to each other. The k-PCSubgraphs problem we address in this paper is defined as follows: given an activity network and an integer k, identify k non-overlapping and connected subgraphs of the network such that the nodes of each subgraph are close to each other, and the total number of actions they are associated with is high. Here, we define and study two new variants of the k-PCSubgraphs problem, where the subgraphs of interest are tours and paths. Since both these problems are NP-hard, we provide approximate and heuristic algorithms that run in time nearly-linear to the number of edges. In our experiments, we use real activity networks obtained by combining road networks and projecting on them user activity from Twitter and Flickr. Our experimental results demonstrate both the efficiency and the practical utility of our methods.
【Keywords】: activity networks; paths; prize-collecting Steiner tree; tours
【Paper Link】 【Pages】:469-478
【Authors】: Jingchao Ni ; Shiyu Chang ; Xiao Liu ; Wei Cheng ; Haifeng Chen ; Dongkuan Xu ; Xiang Zhang
【Abstract】: Network embedding aims to learn a low-dimensional vector representation for each node in the social and information networks, with the constraint to preserve network structures. Most existing methods focus on single network embedding, ignoring the relationship between multiple networks. In many real-world applications, however, multiple networks may contain complementary information, which can lead to further refined node embeddings. Thus, in this paper, we propose a novel multi-network embedding method, DMNE. DMNE is flexible. It allows different networks to have different sizes, to be (un)weighted and (un)directed. It leverages multiple networks via cross-network relationships between nodes in different networks, which may form many-to-many node mappings, and be associated with weights. To model the non-linearity of the network data, we develop DMNE to have a new deep learning architecture, which coordinates multiple neural networks (one for each input network data) with a co-regularized loss function. With multiple layers of non-linear mappings, DMNE progressively transforms each input network to a highly non-linear latent space, and in the meantime, adapts different spaces to each other through a co-regularized learning schema. Extensive experimental results on real-life datasets demonstrate the effectiveness of our method.
【Keywords】: multi-network; network embedding; representation learning
【Paper Link】 【Pages】:479-488
【Authors】: Linchuan Xu ; Xiaokai Wei ; Jiannong Cao ; Philip S. Yu
【Abstract】: There are increasing interests in learning low-dimensional and dense node representations from the network structure which is usually high-dimensional and sparse. However, most existing methods fail to consider semantic meanings of links. Different links may have different semantic meanings because the similarities between two nodes can be different, e.g., two nodes share common neighbors and two nodes share similar interests which are demonstrated in node-generated content. In this paper, the former type of links are referred to as structure-close links while the latter type are referred to as content-close links. These two types of links naturally indicate there are two types of characteristics that nodes expose in a social network. Hence, we propose to learn two representations for each node, and render each representation responsible for encoding the corresponding type of node characteristics, which is achieved by jointly embedding the network structure and inferring the type of each link. In the experiments, the proposed method is demonstrated to be more effective than five recent methods on four social networks through applications including visualization, link prediction and multi-label classification.
【Keywords】: data mining; network embedding; social networks
【Paper Link】 【Pages】:489-498
【Authors】: Xiaofeng Yang ; Deepak Ajwani ; Wolfgang Gatterbauer ; Patrick K. Nicholson ; Mirek Riedewald ; Alessandra Sala
【Abstract】: Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to find the top-k matches according to a ranking function over edge and node weights. For users, it is difficult to select value k. We therefore propose the novel notion of an any-k ranking algorithm: for a given time budget, return as many of the top-ranked results as possible. Then, given additional time, produce the next lower-ranked results quickly as well. It can be stopped anytime, but may have to continue until all results are returned. This paper focuses on acyclic patterns over arbitrary labeled graphs. We are interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular selective constraints on labels, and (2) that the users often explore only a fraction of the top-ranked results. Our solution, KARPET, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong non-trivial time and space guarantees, which is generally considered very hard for this type of graph search problem. Through experimental studies we show that KARPET achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges.
【Keywords】: anytime algorithm; labeled graphs; top-k
【Paper Link】 【Pages】:499-508
【Authors】: Chenyi Zhuang ; Qiang Ma
【Abstract】: The problem of extracting meaningful data through graph analysis spans a range of different fields, such as the internet, social networks, biological networks, and many others. The importance of being able to effectively mine and learn from such data continues to grow as more and more structured data become available. In this paper, we present a simple and scalable semi-supervised learning method for graph-structured data in which only a very small portion of the training data are labeled. To sufficiently embed the graph knowledge, our method performs graph convolution from different views of the raw data. In particular, a dual graph convolutional neural network method is devised to jointly consider the two essential assumptions of semi-supervised learning: (1) local consistency and (2) global consistency. Accordingly, two convolutional neural networks are devised to embed the local-consistency-based and global-consistency-based knowledge, respectively. Given the different data transformations from the two networks, we then introduce an unsupervised temporal loss function for the ensemble. In experiments using both unsupervised and supervised loss functions, our method outperforms state-of-the-art techniques on different datasets.
【Keywords】: adjacency matrix; graph convolutional networks; graph diffusion; pointwise mutual information; semi-supervised learning
【Paper Link】 【Pages】:509-518
【Authors】: Junghwan Kim ; Haekyu Park ; Ji-Eun Lee ; U. Kang
【Abstract】: Given a signed directed network, how can we learn node representations which fully encode structural information of the network including sign and direction of edges? Node representation learning or network embedding learns a mapping of each node to a vector. The mapping encodes structural information on network, providing low-dimensional dense node features for general machine learning and data mining frameworks. Since many social networks allow trust (friend) and distrust (enemy) relationships described by signed and directed edges, generalizing network embedding method to learn from sign and direction information in networks is crucial. In addition, social theories are critical tool in signed network analysis. However, none of the existing methods supports all of the desired properties: considering sign, direction, and social theoretical interpretation. In this paper, we propose SIDE, a general network embedding method that represents both sign and direction of edges in the embedding space. SIDE carefully formulates and optimizes likelihood over both direct and indirect signed connections. We provide socio-psychological interpretation for each component of likelihood function. We prove linear scalability of our algorithm and propose additional optimization techniques to reduce the training time and improve accuracy. Through extensive experiments on real-world signed directed networks, we show that SIDE effectively encodes structural information into the learned embedding.
【Keywords】: network embedding; representation learning; signed directed network
【Paper Link】 【Pages】:519-528
【Authors】: Arlei Silva ; Ambuj K. Singh ; Ananthram Swami
【Abstract】: The sparsest cut problem consists of identifying a small set of edges that breaks the graph into balanced sets of vertices. The normalized cut problem balances the total degree, instead of the size, of the resulting sets. Applications of graph cuts include community detection and computer vision. However, cut problems were originally proposed for static graphs, an assumption that does not hold in many modern applications where graphs are highly dynamic. In this paper, we introduce sparsest and normalized cuts in temporal graphs, which generalize their standard definitions by enforcing the smoothness of cuts over time. We propose novel formulations and algorithms for computing temporal cuts using spectral graph theory, divide-and-conquer and low-rank matrix approximation. Furthermore, we extend temporal cuts to dynamic graph signals, where vertices have attributes. Experiments show that our solutions are accurate and scalable, enabling the discovery of dynamic communities and the analysis of dynamic graph processes.
【Keywords】: graph mining; spectral theory
【Paper Link】 【Pages】:529-538
【Authors】: Srishti Gupta ; Abhinav Khattar ; Arpit Gogia ; Ponnurangam Kumaraguru ; Tanmoy Chakraborty
【Abstract】: Cybercriminals have leveraged the popularity of a large user base available on Online Social Networks~(OSNs) to spread spam campaigns by propagating phishing URLs, attaching malicious contents, etc. However, another kind of spam attacks using phone numbers has recently become prevalent on OSNs, where spammers advertise phone numbers to attract users» attention and convince them to make a call to these phone numbers. The dynamics of phone number based spam is different from URL-based spam due to an inherent trust associated with a phone number. While previous work has proposed strategies to mitigate URL-based spam attacks, phone number based spam attacks have received less attention. In this paper, we aim to detect spammers that use phone numbers to promote campaigns on Twitter. To this end, we collected information (tweets, user meta-data, etc.) about 3,370 campaigns spread by 670,251 users. We model the Twitter dataset as a heterogeneous network by leveraging various interconnections between different types of nodes present in the dataset. In particular, we make the following contributions -- (i) We propose a simple yet effective metric, called Hierarchical Meta-Path Score (HMPS) to measure the proximity of an unknown user to the other known pool of spammers. (ii) We design a feedback-based active learning strategy and show that it significantly outperforms three state-of-the-art baselines for the task of spam detection. Our method achieves 6.9% and 67.3% higher F1-score and AUC, respectively compared to the best baseline method. (iii) To overcome the problem of less training instances for supervised learning, we show that our proposed feedback strategy achieves 25.6% and 46% higher F1-score and AUC respectively than other oversampling strategies. Finally, we perform a case study to show how our method is capable of detecting those users as spammers who have not been suspended by Twitter (and other baselines) yet.
【Keywords】: heterogeneous network; meta-path; online social networks; phone number; spam campaign; twitter
【Paper Link】 【Pages】:539-548
【Authors】: Anton Tsitsulin ; Davide Mottin ; Panagiotis Karras ; Emmanuel Müller
【Abstract】: Embedding a web-scale information network into a low-dimensional vector space facilitates tasks such as link prediction, classification, and visualization. Past research has addressed the problem of extracting such embeddings by adopting methods from words to graphs, without defining a clearly comprehensible graph-related objective. Yet, as we show, the objectives used in past works implicitly utilize similarity measures among graph nodes. In this paper, we carry the similarity orientation of previous works to its logical conclusion; we propose VERtex Similarity Embeddings (VERSE), a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network. While its default, scalable version does so via sampling similarity information, we also develop a variant using the full information per vertex. Our experimental study on standard benchmarks and real-world datasets demonstrates that VERSE, instantiated with diverse similarity measures, outperforms state-of-the-art methods in terms of precision and recall in major data mining tasks and supersedes them in time and space efficiency, while the scalable sampling-based variant achieves equally good result as the non-scalable full variant.
【Keywords】: feature learning; graph embedding; graph representations; information networks; node embedding; vertex similarity
【Paper Link】 【Pages】:549-558
【Authors】: Abir De ; Sourangshu Bhattacharya ; Niloy Ganguly
【Abstract】: The networked opinion diffusion in online social networks (OSN) is governed by the two genres of opinions-endogenous opinions that are driven by the influence of social contacts between users, and exogenous opinions which are formed by external effects like news, feeds etc. Such duplex opinion dynamics is led by users belonging to two categories- organic users who generally post endogenous opinions and extrinsic users who are susceptible to externalities, and mostly post the exogenous messages. Precise demarcation of endogenous and exogenous messages offers an important cue to opinion modeling, thereby enhancing its predictive performance. On the other hand, accurate user selection aids to detect extrinsic users, which in turn helps in opinion shaping. In this paper, we design CherryPick, a novel learning machinery that classifies the opinions and users by solving a joint inference task in message and user set, from a temporal stream of sentiment messages. Furthermore, we validate the efficacy of our proposal from both modeling and shaping perspectives. Moreover, for the latter, we formulate the opinion shaping problem in a novel framework of stochastic optimal control, in which the selected extrinsic users optimally post exogenous messages so as to guide the opinions of others in a desired way. On five datasets crawled from Twitter, CherryPick offers a significant accuracy boost in terms of opinion forecasting, against several competitors. Furthermore, it can precisely determine the quality of a set of control users, which together with the proposed online shaping strategy, consistently steers the opinion dynamics more effectively than several state-of-the-art baselines.
【Keywords】: diffusion process; opinion dynamics; social networks
【Paper Link】 【Pages】:559-568
【Authors】: Chen Avin ; Avi Cohen ; Pierre Fraigniaud ; Zvi Lotker ; David Peleg
【Abstract】: This paper demonstrates that the Preferential Attachment rule naturally emerges in the context of evolutionary network formation, as the unique Nash equilibrium of a simple social network game. In this game, each node aims at maximizing its degree in the future, representing its social capital in the "society" formed by the nodes and their connections. This result provides additional formal support to the commonly used Preferential Attachment model, initially designed to capture the "rich get richer" aphorism. In the process of establishing our result, we expose new connections between Preferential Attachment, random walks, and Young»s Lattice.
【Keywords】: network formation games; preferential attachment; random walks; social networks; young's lattice
【Paper Link】 【Pages】:569-578
【Authors】: Haim Mendelson ; Ken Moon
【Abstract】: While apps on the whole increasingly drive and shape present-day Web usage, they individually thrive and fall based on rich patterns of user adoption and long-term engagement. In studying these patterns at the population level, researchers remain curtailed by practical difficulties in accessing consistently detailed data covering large, representative cross-sections of apps. We address this challenge by proposing an empirical framework for analyzing an app»s patterns of adoption and engagement. Importantly, our approach requires obtaining only an app»s daily, weekly, and monthly usership time-series data, which are popular metrics tracked and made available for many apps. Modeling how the nuanced co-dynamics of an app»s daily, weekly, and monthly usership measures (DAU, WAU, and MAU) reliably reveal user adoption and repeat engagement, we extend and demonstrably improve on the predictive performance of prior install- or DAU-based methods. To further study the mechanisms by which successful apps cultivate their userbases, we position apps along the dual axes of viral adoption and retentive engagement as mechanisms to potentially explain success. Applying our approach to data on Facebook apps, we show that these apps over time became less viral and more engaging. Interestingly, despite their diminished virality, their improved user retention subtly and counter-intuitively boosted these apps' rates of new user adoptions through user-activity-based word of mouth, because an engaged user contributes to word of mouth over a longer period of time. In a final case application, we introduce evidence that developers of apps that learn to successfully retain their users carry this valuable experience over into their new apps.
【Keywords】: apps; diffusion; experience curve; hidden markov; time series; user adoption; user engagement and retention
【Paper Link】 【Pages】:579-587
【Authors】: T.-H. Hubert Chan ; Arnaud Guerqin ; Mauro Sozio
【Abstract】: Static and dynamic clustering algorithms are a fundamental tool in any machine learning library. Most of the efforts in developing dynamic machine learning and data mining algorithms have been focusing on the sliding window model (where at any given point in time only the most recent data items are retained) or more simplistic models. However, in many real-world applications one might need to deal with arbitrary deletions and insertions. For example, one might need to remove data items that are not necessarily the oldest ones, because they have been flagged as containing inappropriate content or due to privacy concerns. Clustering trajectory data might also require to deal with more general update operations. We develop a (2+ε)-approximation algorithm for the k-center clustering problem with "small»» amortized cost under the fully dynamic adversarial model. In such a model, points can be added or removed arbitrarily, provided that the adversary does not have access to the random choices of our algorithm. The amortized cost of our algorithm is poly-logarithmic when the ratio between the maximum and minimum distance between any two points in input is bounded by a polynomial, while k and epsilon are constant. Our theoretical results are complemented with an extensive experimental evaluation on dynamic data from Twitter, Flickr, as well as trajectory data, demonstrating the effectiveness of our approach.
【Keywords】:
【Paper Link】 【Pages】:589-598
【Authors】: Maximilien Danisch ; Oana Denisa Balalau ; Mauro Sozio
【Abstract】: Motivated by recent studies in the data mining community which require to efficiently list all k-cliques, we revisit the iconic algorithm of Chiba and Nishizeki and develop the most efficient parallel algorithm for such a problem. Our theoretical analysis provides the best asymptotic upper bound on the running time of our algorithm for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs shows that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any k) in graphs containing up to tens of millions of edges as well as all $10$-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. Finally, we show how our algorithm can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraphs in very large real-world graphs.
【Keywords】: k-clique listing and counting; real-world graph algorithms
【Paper Link】 【Pages】:599-608
【Authors】: Weiren Yu ; Fan Wang
【Abstract】: In real Web applications, CoSimRank has been proposed as a powerful measure of node-pair similarity based on graph topologies. However, existing work on CoSimRank is restricted to static graphs. When the graph is updated with new edges arriving over time, it is cost-inhibitive to recompute all CoSimRank scores from scratch, which is impractical. In this study, we propose a fast dynamic scheme, \DCoSim for accurate CoSimRank search over evolving graphs. Based on \DCoSim, we also propose a fast scheme, \FCoSim, that greatly accelerates CoSimRank search over static graphs. Our theoretical analysis shows that \DCoSim and \FCoSim guarantee the exactness of CoSimRank scores. On the static graph G, to efficiently retrieve CoSimRank scores $\mathbfS $, \FCoSim is based on three ideas: (i) It first finds a "spanning polytree»» T over G. (ii) On T, a fast algorithm is designed to compute the CoSimRank scores $\mathbfS (T)$ over the "spanning polytree»» T. (iii) On G, \DCoSim is employed to compute the changes of $\mathbfS (T)$ in response to the delta graph $(G øminus T)$. Experimental evaluations verify the superiority of \DCoSim over evolving graphs, and the fast speedup of \FCoSim on large-scale static graphs against its competitors, without any loss of accuracy.
【Keywords】: cosimrank similarity; graph algorithm; hyperlink analysis
【Paper Link】 【Pages】:609-618
【Authors】: Ricky Laishram ; Ahmet Erdem Sariyüce ; Tina Eliassi-Rad ; Ali Pinar ; Sucheta Soundarajan
【Abstract】: The concept of k-cores is important for understanding the global structure of networks, as well as for identifying central or important nodes within a network. It is often valuable to understand the resilience of the k-cores of a network to attacks and dropped edges (i.e., damaged communications links). We provide a formal definition of a network»s core resilience, and examine the problem of characterizing core resilience in terms of the network»s structural features: in particular, which structural properties cause a network to have high or low core resilience? To measure this, we introduce two novel node properties,Core Strength andCore Influence, which measure the resilience of individual nodes» core numbers and their influence on other nodes» core numbers. Using these properties, we propose theMaximize Resilience of k-Core algorithm to add edges to improve the core resilience of a network. We consider two attack scenarios - randomly deleted edges and randomly deleted nodes. Through experiments on a variety of technological and infrastructure network datasets, we verify the efficacy of our node-based resilience measures at predicting the resilience of a network, and evaluate MRKC at the task of improving a network»s core resilience. We find that on average, for edge deletion attacks, MRKC improves the resilience of a network by 11.1% over the original network, as compared to the best baseline method, which improves the resilience of a network by only 2%. For node deletion attacks, MRKC improves the core resilience of the original network by 19.7% on average, while the best baseline improves it by only 3%.
【Keywords】: graphs; k-core; resilience
【Paper Link】 【Pages】:619-628
【Authors】: Huda Nassar ; Nate Veldt ; Shahin Mohammadi ; Ananth Grama ; David F. Gleich
【Abstract】: Network alignment or graph matching is the classic problem of finding matching vertices between two graphs with applications in network de-anonymization and bioinformatics. There exist a wide variety of algorithms for it, but a challenging scenario for all of the algorithms is aligning two networks without any information about which nodes might be good matches. In this case, the vast majority of principled algorithms demand quadratic memory in the size of the graphs. We show that one such method---the recently proposed and theoretically grounded EigenAlign algorithm---admits a novel implementation which requires memory that is linear in the size of the graphs. The key step to this insight is identifying low-rank structure in the node-similarity matrix used by EigenAlign for determining matches. With an exact, closed-form low-rank structure, we then solve a maximum weight bipartite matching problem on that low-rank matrix to produce the matching between the graphs. For this task, we show a new, a-posteriori, approximation bound for a simple algorithm to approximate a maximum weight bipartite matching problem on a low-rank matrix. The combination of our two new methods then enables us to tackle much larger network alignment problems than previously possible and to do so quickly. Problems that take hours with existing methods take only seconds with our new algorithm. We thoroughly validate our low-rank algorithm against the original EigenAlign approach. We also compare a variety of existing algorithms on problems in bioinformatics and social networks. Our approach can also be combined with existing algorithms to improve their performance and speed.
【Keywords】: graph matching; low rank matrix; low-rank bipartite matching; network alignment
【Paper Link】 【Pages】:629-638
【Authors】: Manish Raghavan ; Ashton Anderson ; Jon M. Kleinberg
【Abstract】: The surge in political information, discourse, and interaction has been one of the most important developments in social media over the past several years. There is rich structure in the interaction among different viewpoints on the ideological spectrum. However, we still have only a limited analytical vocabulary for expressing the ways in which these viewpoints interact. In this paper, we develop network-based methods that operate on the ways in which users share content; we construct invocation graphs on Web domains showing the extent to which pages from one domain are invoked by users to reply to posts containing pages from other domains. When we locate the domains on a political spectrum induced from the data, we obtain an embedded graph showing how these interaction links span different distances on the spectrum. The structure of this embedded network, and its evolution over time, helps us derive macro-level insights about how political interaction unfolded through 2016, leading up to the US Presidential election. In particular, we find that the domains invoked in replies spanned increasing distances on the spectrum over the months approaching the election, and that there was clear asymmetry between the left-to-right and right-to-left patterns of linkage.
【Keywords】: political interaction
【Paper Link】 【Pages】:639-648
【Authors】: Zhiyong Cheng ; Ying Ding ; Lei Zhu ; Mohan S. Kankanhalli
【Abstract】: Although latent factor models (e.g., matrix factorization) achieve good accuracy in rating prediction, they suffer from several problems including cold-start, non-transparency, and suboptimal recommendation for local users or items. In this paper, we employ textual review information with ratings to tackle these limitations. Firstly, we apply a proposed aspect-aware topic model (ATM) on the review text to model user preferences and item features from different aspects, and estimate the aspect importance of a user towards an item. The aspect importance is then integrated into a novel aspect-aware latent factor model (ALFM), which learns user's and item's latent factors based on ratings. In particular, ALFM introduces a weighted matrix to associate those latent factors with the same set of aspects discovered by ATM, such that the latent factors could be used to estimate aspect ratings. Finally, the overall rating is computed via a linear combination of the aspect ratings, which are weighted by the corresponding aspect importance. To this end, our model could alleviate the data sparsity problem and gain good interpretability for recommendation. Besides, an aspect rating is weighted by an aspect importance, which is dependent on the targeted user's preferences and targeted item's features. Therefore, it is expected that the proposed method can model a user's preferences on an item more accurately for each user-item pair locally. Comprehensive experimental studies have been conducted on 19 datasets from Amazon and Yelp 2017 Challenge dataset. Results show that our method achieves significant improvement compared with strong baseline methods, especially for users with only few ratings. Moreover, our model could interpret the recommendation results in depth.
【Keywords】: aspect-aware; matrix factorization; recommendation; review-aware; topic model
【Paper Link】 【Pages】:649-658
【Authors】: Wenhui Yu ; Huidi Zhang ; Xiangnan He ; Xu Chen ; Li Xiong ; Zheng Qin
【Abstract】: Recently, product images have gained increasing attention in clothing recommendation since the visual appearance of clothing products has a significant impact on consumers» decision. Most existing methods rely on conventional features to represent an image, such as the visual features extracted by convolutional neural networks (CNN features) and the scale-invariant feature transform algorithm (SIFT features), color histograms, and so on. Nevertheless, one important type of features, the aesthetic features, is seldom considered. It plays a vital role in clothing recommendation since a users» decision depends largely on whether the clothing is in line with her aesthetics, however the conventional image features cannot portray this directly. To bridge this gap, we propose to introduce the aesthetic information, which is highly relevant with user preference, into clothing recommender systems. To achieve this, we first present the aesthetic features extracted by a pre-trained neural network, which is a brain-inspired deep structure trained for the aesthetic assessment task. Considering that the aesthetic preference varies significantly from user to user and by time, we then propose a new tensor factorization model to incorporate the aesthetic features in a personalized manner. We conduct extensive experiments on real-world datasets, which demonstrate that our approach can capture the aesthetic preference of users and significantly outperform several state-of-the-art recommendation methods.
【Keywords】: aesthetic features; clothing recommendation; dynamic collaborative filtering; side information; tensor factorization
【Paper Link】 【Pages】:659-668
【Authors】: Tomasz Kusmierczyk ; Manuel Gomez-Rodriguez
【Abstract】: A wide variety of online platforms use digital badges to encourage users to take certain types of desirable actions. However, despite their growing popularity, their causal effect on users» behavior is not well understood. This is partly due to the lack of counterfactual data and the myriad of complex factors that influence users» behavior over time. As a consequence, their design and deployment lacks general principles. In this paper, we focus on first-time badges, which are awarded after a user takes a particular type of action for the first time, and study their causal effect by harnessing the delayed introduction of several badges in a popular Q&A website. In doing so, we introduce a novel causal inference framework for first-time badges whose main technical innovations are a robust survival-based hypothesis testing procedure, which controls for the heterogeneity in the benefit users obtain from taking an action, and a bootstrap difference-in-differences method, which controls for the random fluctuations in users» behavior over time. Our results suggest that first-time badges steer users» behavior if the initial benefit a user obtains from taking the corresponding action is sufficiently low, otherwise, we do not find significant effects. Moreover, for badges that successfully steered user behavior, we perform a counterfactual analysis and show that they significantly improved the functioning of the site at a community level.
【Keywords】: badges; bootstrap; causality; confounders; counterfactual world; difference-in-differences; natural experiment; point processes; social platform; statistical testing
【Paper Link】 【Pages】:669-678
【Authors】: Surabhi Punjabi ; Priyanka Bhatt
【Abstract】: Factorization machines (FMs) are a state-of-the-art model class for user response prediction in the computational advertising domain. Rapid growth of internet and mobile device usage has given rise to multiple customer touchpoints. This coupled with factors like high cookie churn rate results in a fragmented view of user activity at the advertiser»s end. Current literature assumes procured user signals as the absolute truth, which is contested by the absence of deterministic identity linkage across a user's multiple avatars. In this work, we characterize the data uncertainty using Robust Optimization (RO) paradigm to design approaches that are immune against perturbations. We propose two novel algorithms: robust factorization machine (RFM) and its field-aware variant (RFFM), under interval uncertainty. These formulations are generic and can find applicability in any classification setting under noise. We provide a distributed and scalable Spark implementation using parallel stochastic gradient descent. In the experiments conducted on three real-world datasets, the robust counterparts outperform the baselines significantly under perturbed settings. Our experimental findings reveal interesting connections between choice of uncertainty set and the noise-proofness of resulting models.
【Keywords】: computational advertising; factorization machines; field aware factorization machines; interval uncertainty; response prediction; robust optimization
【Paper Link】 【Pages】:679-687
【Authors】: Vivek Sembium ; Rajeev Rastogi ; Lavanya Tekumalla ; Atul Saroop
【Abstract】: Lack of calibrated product sizing in popular categories such as apparel and shoes leads to customers purchasing incorrect sizes, which in turn results in high return rates due to fit issues. We address the problem of product size recommendations based on customer purchase and return data. We propose a novel approach based on Bayesian logit and probit regression models with ordinal categories Small, Fit, Largeto model size fits as a function of the difference between latent sizes of customers and products. We propose posterior computation based on mean-field variational inference, leveraging the Polya-Gamma augmentation for the logit prior, that results in simple updates, enabling our technique to efficiently handle large datasets. Our Bayesian approach effectively deals with issues arising from noise and sparsity in the data providing robust recommendations. Offline experiments with real-life shoe datasets show that our model outperforms the state-of-the-art in 5 of 6 datasets. and leads to an improvement of 17-26% in AUC over baselines when predicting size fit outcomes.
【Keywords】: bayesian; personalization; recommendation; polya-gamma; probit; variational inference; gibbs sampling
【Paper Link】 【Pages】:689-698
【Authors】: Dawen Liang ; Rahul G. Krishnan ; Matthew D. Hoffman ; Tony Jebara
【Abstract】: We extend variational autoencoders (VAEs) to collaborative filtering for implicit feedback. This non-linear probabilistic model enables us to go beyond the limited modeling capacity of linear factor models which still largely dominate collaborative filtering research.We introduce a generative model with multinomial likelihood and use Bayesian inference for parameter estimation. Despite widespread use in language modeling and economics, the multinomial likelihood receives less attention in the recommender systems literature. We introduce a different regularization parameter for the learning objective, which proves to be crucial for achieving competitive performance. Remarkably, there is an efficient way to tune the parameter using annealing. The resulting model and learning algorithm has information-theoretic connections to maximum entropy discrimination and the information bottleneck principle. Empirically, we show that the proposed approach significantly outperforms several state-of-the-art baselines, including two recently-proposed neural network approaches, on several real-world datasets. We also provide extended experiments comparing the multinomial likelihood with other commonly used likelihood functions in the latent factor collaborative filtering literature and show favorable results. Finally, we identify the pros and cons of employing a principled Bayesian inference approach and characterize settings where it provides the most significant improvements.
【Keywords】: bayesian models; collaborative filtering; implicit feedback; recommender systems; variational autoencoder
【Paper Link】 【Pages】:699-707
【Authors】: Alexander Peysakhovich ; Dean Eckles
【Abstract】: Scientific and business practices are increasingly resulting in large collections of randomized experiments. Analyzed together multiple experiments can tell us things that individual experiments cannot. We study how to learn causal relationships between variables from the kinds of collections faced by modern data scientists: the number of experiments is large, many experiments have very small effects, and the analyst lacks metadata (e.g., descriptions of the interventions). We use experimental groups as instrumental variables (IV) and show that a standard method (two-stage least squares) is biased even when the number of experiments is infinite. We show how a sparsity-inducing l0 regularization can (in a reversal of the standard bias--variance tradeoff) reduce bias (and thus error) of interventional predictions. We are interested in estimating causal effects, rather than just predicting outcomes, so we also propose a modified cross-validation procedure (IVCV) to feasibly select the regularization parameter. We show, using a trick from Monte Carlo sampling, that IVCV can be done using summary statistics instead of raw data. This makes our full procedure simple to use in many real-world applications.
【Keywords】: causality; experimentation; instrumental variables; machine learning
【Paper Link】 【Pages】:709-718
【Authors】: Chuxu Zhang ; Chao Huang ; Lu Yu ; Xiangliang Zhang ; Nitesh V. Chawla
【Abstract】: In this paper, we study the problem of author identification in big scholarly data, which is to effectively rank potential authors for each anonymous paper by using historical data. Most of the existing de-anonymization approaches predict relevance score of paper-author pair via feature engineering, which is not only time and storage consuming, but also introduces irrelevant and redundant features or miss important attributes. Representation learning can automate the feature generation process by learning node embeddings in academic network to infer the correlation of paper-author pair. However, the learned embeddings are often for general purpose (independent of the specific task), or based on network structure only (without considering the node content). To address these issues and make a further progress in solving the author identification problem, we propose Camel, a content-aware and meta-path augmented metric learning model. Specifically, first, the directly correlated paper-author pairs are modeled based on distance metric learning by introducing a push loss function. Next, the paper content embedding encoded by the gated recurrent neural network is integrated into the distance loss. Moreover, the historical bibliographic data of papers is utilized to construct an academic heterogeneous network, wherein a meta-path guided walk integrative learning module based on the task-dependent and content-aware Skipgram model is designed to formulate the correlations between each paper and its indirect author neighbors, and further augments the model. Extensive experiments demonstrate that Camel outperforms the state-of-the-art baselines. It achieves an average improvement of 6.3% over the best baseline method.
【Keywords】: author identification; deep learning; heterogeneous networks; metric learning; representation learning
【Paper Link】 【Pages】:719-728
【Authors】: Moshe Lichman ; Padhraic Smyth
【Abstract】: In this paper we address the problem of building user models that can predict the rate at which individuals consume items from a finite set, including items they have consumed in the past and items that are new. This combination of repeat and new item consumption is common in applications such as listening to music, visiting web sites, and purchasing products. We use zero-inflated Poisson (ZIP) regression models as the basis for our modeling approach, leading to a general framework for modeling user-item consumption rates over time. We show that these models are more flexible in capturing user behavior than alternatives such as well-known latent factor models based on matrix factorization. We compare the performance of ZIP regression and latent factor models on three different data sets involving music, restaurant reviews, and social media. The ZIP regression models are systematically more accurate across all three data sets and across different prediction metrics.
【Keywords】: consumption rate modeling; explore-exploit; repeat consumption; zero-inflated poisson
【Paper Link】 【Pages】:729-739
【Authors】: Yi Tay ; Luu Anh Tuan ; Siu Cheung Hui
【Abstract】: This paper proposes a new neural architecture for collaborative ranking with implicit feedback. Our model, LRML (Latent Relational Metric Learning) is a novel metric learning approach for recommendation. More specifically, instead of simple push-pull mechanisms between user and item pairs, we propose to learn latent relations that describe each user item interaction. This helps to alleviate the potential geometric inflexibility of existing metric learning approaches. This enables not only better performance but also a greater extent of modeling capability, allowing our model to scale to a larger number of interactions. In order to do so, we employ a augmented memory module and learn to attend over these memory blocks to construct latent relations. The memory-based attention module is controlled by the user-item interaction, making the learned relation vector specific to each user-item pair. Hence, this can be interpreted as learning an exclusive and optimal relational translation for each user-item interaction. The proposed architecture demonstrates the state-of-the-art performance across multiple recommendation benchmarks. LRML outperforms other metric learning models by 6%-7.5% in terms of [email protected] and [email protected] on large datasets such as Netflix and MovieLens20M. Moreover, qualitative studies also demonstrate evidence that our proposed model is able to infer and encode explicit sentiment, temporal and attribute information despite being only trained on implicit feedback. As such, this ascertains the ability of LRML to uncover hidden relational structure within implicit datasets.
【Keywords】: attention mechanism; collaborative ranking; deep learning; implicit feedback; information retrieval; neural networks; recommender system; collaborative filtering
【Paper Link】 【Pages】:741-751
【Authors】: Dongsheng Li ; Chao Chen ; Qin Lv ; Hansu Gu ; Tun Lu ; Li Shang ; Ning Gu ; Stephen M. Chu
【Abstract】: Gradient-based learning methods such as stochastic gradient descent are widely used in matrix approximation-based collaborative filtering algorithms to train recommendation models based on observed user-item ratings. One major difficulty in existing gradient-based learning methods is determining proper learning rates, since model convergence would be inaccurate or very slow if the learning rate is too large or too small, respectively. This paper proposes AdaError, an adaptive learning rate method for matrix approximation-based collaborative filtering. AdaError eliminates the need of manually tuning the learning rates by adaptively adjusting the learning rates based on the noisiness level of user-item ratings, using smaller learning rates for noisy ratings so as to reduce their impact on the learned models. Our theoretical and empirical analysis shows that AdaError can improve the generalization performance of the learned models. Experimental studies on the MovieLens and Netflix datasets also demonstrate that AdaError outperforms state-of-the-art adaptive learning rate methods in matrix approximation-based collaborative filtering. Furthermore, by applying AdaError to the standard matrix approximation method, we can achieve statistically significant improvements over state-of-the-art collaborative filtering methods in both rating prediction accuracy and top-N recommendation accuracy.
【Keywords】: collaborative filtering; matrix approximation
【Paper Link】 【Pages】:753-762
【Authors】: Brit Youngmann ; Elad Yom-Tov
【Abstract】: People seeking information through search engines are assumed to behave similarly, regardless of the topic which they are searching. Here we use mouse tracking, which is correlated with gaze, to show that the information seeking patterns of people differ dramatically depending on their level of anxiety at the time of the search. We investigate the behavior of people during searches for medical symptoms, ranging from benign indications, where users are not usually anxious, to ones which could harbinger life-threatening conditions, where extreme anxiety is expected. We show that for the latter, 90% of people never saw more than the top 67% of the screen, compared to over 95% scanned by people seeking information on benign symptoms, even though relevant documents are similarly distributed in the results pages to these queries. Based on this observation, we develop a model which can predict the level of anxiety experienced by a user, using attributes derived from mouse tracking data and other user interactions. The model achieves Kendall's Tau of 0.48 with the medical severity of the symptoms searched. We show the importance of using information about the users? level of anxiety as predicted by the model, when measuring search engine performance. Our results prove that ignoring this information can lead to significant over-estimation of performance. Additionally, we show the utility of the model in three special instances: where multiple symptoms are searched concurrently; where the searcher has an underlying medical condition; and when users seek information on ways to commit suicide. In the latter, our results demonstrate the importance of help-line notices, and emphasize the need to measure the effective number of results seen by the user. Our results indicate that measures of relevance which use anxiety information can lead to more accurate understanding of the quality of search results, especially when delivering potentially life-saving information to users.
【Keywords】: health; medicine.; mouse tracking; relevance
【Paper Link】 【Pages】:763-772
【Authors】: Zhenpeng Chen ; Xuan Lu ; Wei Ai ; Huoran Li ; Qiaozhu Mei ; Xuanzhe Liu
【Abstract】: Based on a large data set of emoji using behavior collected from smartphone users over the world, this paper investigates gender-specific usage of emojis. We present various interesting findings that evidence a considerable difference in emoji usage by female and male users. Such a difference is significant not just in a statistical sense; it is sufficient for a machine learning algorithm to accurately infer the gender of a user purely based on the emojis used in their messages. In real world scenarios where gender inference is a necessity, models based on emojis have unique advantages over existing models that are based on textual or contextual information. Emojis not only provide language-independent indicators, but also alleviate the risk of leaking private user information through the analysis of text and metadata.
【Keywords】: emojis; gender; language-independent; user profifiling
【Paper Link】 【Pages】:773-782
【Authors】: Yichao Lu ; Ruihai Dong ; Barry Smyth
【Abstract】: Collaborative filtering (CF) is a common recommendation approach that relies on user-item ratings. However, the natural sparsity of user-item rating data can be problematic in many domains and settings, limiting the ability to generate accurate predictions and effective recommendations. Moreover, in some CF approaches latent features are often used to represent users and items, which can lead to a lack of recommendation transparency and explainability. User-generated, customer reviews are now commonplace on many websites, providing users with an opportunity to convey their experiences and opinions of products and services. As such, these reviews have the potential to serve as a useful source of recommendation data, through capturing valuable sentiment information about particular product features. In this paper, we present a novel deep learning recommendation model, which co-learns user and item information from ratings and customer reviews, by optimizing matrix factorization and an attention-based GRU network. Using real-world datasets we show a significant improvement in recommendation performance, compared to a variety of alternatives. Furthermore, the approach is useful when it comes to assigning intuitive meanings to latent features to improve the transparency and explainability of recommender systems.
【Keywords】: collaborative filtering; deep learning; natural language processing; recommender systems; user modeling
【Paper Link】 【Pages】:783-792
【Authors】: Youngnam Lee ; Sang-Wook Kim ; Sunju Park ; Xing Xie
【Abstract】: Data sparsity is one of the biggest problems faced by collaborative filtering used in recommender systems. Data imputation alleviates the data sparsity problem by inferring missing ratings and imputing them to the original rating matrix. In this paper, we identify the limitations of existing data imputation approaches and suggest three new claims that all data imputation approaches should follow to achieve high recommendation accuracy. Furthermore, we propose a deep-learning based approach to compute imputed values that satisfies all three claims. Based on our hypothesis that most pre-use preferences (e.g., impressions) on items lead to their post-use preferences (e.g., ratings), our approach tries to understand via deep learning how pre-use preferences lead to post-use preferences differently depending on the characteristics of users and items. Through extensive experiments on real-world datasets, we verify our three claims and hypothesis, and also demonstrate that our approach significantly outperforms existing state-of-the-art approaches.
【Keywords】: collaborative filtering; data imputation; data sparsity; recommender systems
【Paper Link】 【Pages】:793-802
【Authors】: Gael Lederrey ; Robert West
【Abstract】: As online shopping becomes ever more prevalent, customers rely increasingly on product rating websites for making purchase decisions. The reliability of online ratings, however, is potentially compromised by the so-called herding effect: when rating a product, customers may be biased to follow other customers' previous ratings of the same product. This is problematic because it skews long-term customer perception through haphazard early ratings. The study of herding poses methodological challenges. In particular, observational studies are impeded by the lack of counterfactuals: simply correlating early with subsequent ratings is insufficient because we cannot know what the subsequent ratings would have looked like had the first ratings been different. The methodology introduced here exploits a setting that comes close to an experiment, although it is purely observational---a natural experiment. Our key methodological device consists in studying the same product on two separate rating sites, focusing on products that received a high first rating on one site, and a low first rating on the other. This largely controls for confounds such as a product»s inherent quality, advertising, and producer identity, and lets us isolate the effect of the first rating on subsequent ratings. In a case study, we focus on beers as products and jointly study two beer rating sites, but our method applies to any pair of sites across which products can be matched. We find clear evidence of herding in beer ratings. For instance, if a beer receives a very high first rating, its second rating is on average half a standard deviation higher, compared to a situation where the identical beer receives a very low first rating. Moreover, herding effects tend to last a long time and are noticeable even after 20 or more ratings. Our results have important implications for the design of better rating systems.
【Keywords】: herding; natural experiments; observational studies; product ratings; product reviews; social influence
【Paper Link】 【Pages】:803-812
【Authors】: Takeshi Kurashima ; Tim Althoff ; Jure Leskovec
【Abstract】: Mobile health applications, including those that track activities such as exercise, sleep, and diet, are becoming widely used. Accurately predicting human actions in the real world is essential for targeted recommendations that could improve our health and for personalization of these applications. However, making such predictions is extremely difficult due to the complexities of human behavior, which consists of a large number of potential actions that vary over time, depend on each other, and are periodic. Previous work has not jointly modeled these dynamics and has largely focused on item consumption patterns instead of broader types of behaviors such as eating, commuting or exercising. In this work, we develop a novel statistical model, called TIPAS, for Time-varying, Interdependent, and Periodic Action Sequences. Our approach is based on personalized, multivariate temporal point processes that model time-varying action propensities through a mixture of Gaussian intensities. Our model captures short-term and long-term periodic interdependencies between actions through Hawkes process-based self-excitations. We evaluate our approach on two activity logging datasets comprising 12 million real-world actions (e.g., eating, sleep, and exercise) taken by 20 thousand users over 17 months. We demonstrate that our approach allows us to make successful predictions of future user actions and their timing. Specifically, TIPAS improves predictions of actions, and their timing, over existing methods across multiple datasets by up to 156%, and up to 37%, respectively. Performance improvements are particularly large for relatively rare and periodic actions such as walking and biking, improving over baselines by up to 256%. This demonstrates that explicit modeling of dependencies and periodicities in real-world behavior enables successful predictions of future actions, with implications for modeling human behavior, app personalization, and targeting of health interventions.
【Keywords】: activity logging; activity tracking; human action sequence; mobile health; periodic behavior; point process; quantified self; real-world actions; real-world behavior; user modeling
【Paper Link】 【Pages】:813-821
【Authors】: Ben Miroglio ; David Zeber ; Jofish Kaye ; Rebecca Weiss
【Abstract】: Web users are increasingly turning to ad blockers to avoid ads, which are often perceived as annoying or an invasion of privacy. While there has been significant research into the factors driving ad blocker adoption and the detrimental effect to ad publishers on the Web, the resulting effects of ad blocker usage on Web users' browsing experience is not well understood. To approach this problem, we conduct a retrospective natural field experiment using Firefox browser usage data, with the goal of estimating the effect of adblocking on user engagement with the Web. We focus on new users who installed an ad blocker after a baseline observation period, to avoid comparing different populations. Their subsequent browser activity is compared against that of a control group, whose members do not use ad blockers, over a corresponding observation period, controlling for prior baseline usage. In order to estimate causal effects, we employ propensity score matching on a number of other features recorded during the baseline period. In the group that installed an ad blocker, we find significant increases in both active time spent in the browser (+28% over control) and the number of pages viewed (+15% over control), while seeing no change in the number of searches. Additionally, by reapplying the same methodology to other popular Firefox browser extensions, we show that these effects are specific to ad blockers. We conclude that ad blocking has a positive impact on user engagement with the Web, suggesting that any costs of using ad blockers to users' browsing experience are largely drowned out by the utility that they offer.
【Keywords】: ad blocking; matching; natural field experiment; propensity scoring; user studies
【Paper Link】 【Pages】:823-831
【Authors】: Nabeel Gillani ; Ann Yuan ; Martin Saveski ; Soroush Vosoughi ; Deb Roy
【Abstract】: Homophily - our tendency to surround ourselves with others who share our perspectives and opinions about the world - is both a part of human nature and an organizing principle underpinning many of our digital social networks. However, when it comes to politics or culture, homophily can amplify tribal mindsets and produce "echo chambers" that degrade the quality, safety, and diversity of discourse online. While several studies have empirically proven this point, few have explored how making users aware of the extent and nature of their political echo chambers influences their subsequent beliefs and actions. In this paper, we introduce Social Mirror, a social network visualization tool that enables a sample of Twitter users to explore the politically-active parts of their social network. We use Social Mirror to recruit Twitter users with a prior history of political discourse to a randomized experiment where we evaluate the effects of different treatments on participants' i) beliefs about their network connections, ii) the political diversity of who they choose to follow, and iii) the political alignment of the URLs they choose to share. While we see no effects on average political alignment of shared URLs, we find that recommending accounts of the opposite political ideology to follow reduces participants» beliefs in the political homogeneity of their network connections but still enhances their connection diversity one week after treatment. Conversely, participants who enhance their belief in the political homogeneity of their Twitter connections have less diverse network connections 2-3 weeks after treatment. We explore the implications of these disconnects between beliefs and actions on future efforts to promote healthier exchanges in our digital public spheres.
【Keywords】: political polarization; randomized experiment; social networks
【Paper Link】 【Pages】:833-842
【Authors】: Daizong Ding ; Mi Zhang ; Xudong Pan ; Duocai Wu ; Pearl Pu
【Abstract】: Location-based embedding is a fundamental problem to solve in location-based social networks (LBSN). In this paper, we propose a geographical convolutional neural tensor network (GeoCNTN) as a generic embedding model. GeoCNTN first takes the raw location data and extracts from it a more well-conditioned representation by our proposed Geo-CMeans algorithm. We then use a convolutional neural network (CNN) and an embedding structure to extract individual latent structural patterns from the preprocessed data. Finally, we apply a neural tensor network (NTN) to craft the implicitly related features we have obtained into a unified geographical feature. The advantages of our GeoCNTN mainly come from its novel neural network structure, which intrinsically offers a mechanism to extract latent structural features from the geographical data, as well as its wide applicability in various LBSN-related tasks. From two case studies, i.e. link prediction and entity classification in user-group LBSN, we evaluate the embedding efficacy of our model. Results show that GeoCNTN significantly performs better on at least two tasks, with improvement by 9% w.r.t. NDCG and 11% w.r.t. F1 score respectively, using the Meetup-USA dataset.
【Keywords】: deep learning; feature embedding; location-based social networks
【Paper Link】 【Pages】:843-852
【Authors】: Anna Samoilenko ; Florian Lemmerich ; Maria Zens ; Mohsen Jadidi ; Mathieu Génois ; Markus Strohmaier
【Abstract】: In this paper we present a large-scale quantitative comparison between expert- and crowdsourced writing of history by analysing articles from the English Wikipedia and Britannica. In order to quantify attention to particular periods, we extract mentioned year numbers and utilise them to study historical timelines of nations stretched over the last thousand years. By combining this temporal analysis with lexical analysis of both encyclopedic corpora we can identify distinctive historiographic points of view in each encyclopedia. We find that Britannica focuses on social and cultural phenomena, e.g. religion, as well as the geographical characteristics of states, while Wikipedia puts emphasis on political aspects, concentrating on wars and violent conflicts, and events of high popularity. Finally, both encyclopedias exhibit characteristics of English Academic prose, with Britannica being slightly less readable compared to Wikipedia, according to several readability scores.
【Keywords】: Britannica; collective memory; computational history; focal points; natural language processing; null model; readability; wikipedia
【Paper Link】 【Pages】:853-862
【Authors】: Emmanouil Krasanakis ; Eleftherios Spyromitros Xioufis ; Symeon Papadopoulos ; Yiannis Kompatsiaris
【Abstract】: Machine learning bias and fairness have recently emerged as key issues due to the pervasive deployment of data-driven decision making in a variety of sectors and services. It has often been argued that unfair classifications can be attributed to bias in training data, but previous attempts to 'repair' training data have led to limited success. To circumvent shortcomings prevalent in data repairing approaches, such as those that weight training samples of the sensitive group (e.g. gender, race, financial status) based on their misclassification error, we present a process that iteratively adapts training sample weights with a theoretically grounded model. This model addresses different kinds of bias to better achieve fairness objectives, such as trade-offs between accuracy and disparate impact elimination or disparate mistreatment elimination. We show that, compared to previous fairness-aware approaches, our methodology achieves better or similar trades-offs between accuracy and unfairness mitigation on real-world and synthetic datasets.
【Keywords】: algorithm bias; classification fairness; reweighting
【Paper Link】 【Pages】:863-872
【Authors】: Shan Jiang ; Le Chen ; Alan Mislove ; Christo Wilson
【Abstract】: Ridesharing services such as Uber and Lyft have become an important part of the Vehicle For Hire (VFH) market, which used to be dominated by taxis. Unfortunately, ridesharing services are not required to share data like taxi services, which has made it challenging to compare the competitive dynamics of these services, or assess their impact on cities. In this paper, we comprehensively compare Uber, Lyft, and taxis with respect to key market features (supply, demand, price, and wait time) in San Francisco and New York City. Based on point pattern statistics, we develop novel statistical techniques to validate our measurement methods. Using spatial lag models, we investigate the accessibility of VFH services, and find that transportation infrastructure and socio-economic features have substantial effects on VFH market features.
【Keywords】: Lyft; Uber; ridesharing; sharing economy; taxi; vehicle for hire
【Paper Link】 【Pages】:873-882
【Authors】: Allen Yilun Lin ; Joshua Ford ; Eytan Adar ; Brent J. Hecht
【Abstract】: Data visualizations in news articles (e.g., maps, line graphs, bar charts) greatly enrich the content of news articles and result in well-established improvements to reader comprehension. However, existing systems that generate news data visualiza-tions either require substantial manual effort or are limited to very specific types of data visualizations, thereby greatly re-stricting the number of news articles that can be enhanced. To address this issue, we define a new problem: given a news ar-ticle, retrieve relevant visualizations that already exist on the web. We show that this problem is tractable through a new system, VizByWiki, that mines contextually relevant data visualizations from Wikimedia Commons, the central file reposi-tory for Wikipedia. Using a novel ground truth dataset, we show that VizByWiki can successfully augment as many as 48% of popular online news articles with news visualizations. We also demonstrate that VizByWiki can automatically rank visualizations according to their usefulness with reasonable accuracy ([email protected] of 0.82). To facilitate further advances on our "news visualization retrieval problem", we release our ground truth dataset and make our system and its source code publicly available.
【Keywords】: data visualizations; news articles; peer production; user-generated content; wikimedia commons; wikipedia
【Paper Link】 【Pages】:883-892
【Authors】: Eric Malmi ; Aristides Gionis ; Arno Solin
【Abstract】: Genealogical networks, also known as family trees or population pedigrees, are commonly studied by genealogists wanting to know about their ancestry, but they also provide a valuable resource for disciplines such as digital demography, genetics, and computational social science. These networks are typically constructed by hand through a very time-consuming process, which requires comparing large numbers of historical records manually. We develop computational methods for automatically inferring large-scale genealogical networks. A comparison with human-constructed networks attests to the accuracy of the proposed methods. To demonstrate the applicability of the inferred large-scale genealogical networks, we present a longitudinal analysis on the mating patterns observed in a network. This analysis shows a consistent tendency of people choosing a spouse with a similar socioeconomic status, a phenomenon known as assortative mating. Interestingly, we do not observe this tendency to consistently decrease (nor increase) over our study period of 150 years.
【Keywords】: assortative mating; family tree; genealogy; homogamy; pedigree; population reconstruction; probabilistic record linkage; social stratification
【Paper Link】 【Pages】:893-902
【Authors】: Haewoon Kwak ; Jisun An ; Joni Salminen ; Soon-Gyo Jung ; Bernard J. Jansen
【Abstract】: We investigate the alignment of international attention of news media organizations within 193 countries with the expressed international interests of the public within those same countries from March 7, 2016 to April 14, 2017. We collect fourteen months of longitudinal data of online news from Unfiltered News and web search volume data from Google Trends and build a multiplex network of media attention and public attention in order to study its structural and dynamic properties. Structurally, the media attention and the public attention are both similar and different depending on the resolution of the analysis. For example, we find that 63.2% of the country-specific media and the public pay attention to different countries, but local attention flow patterns, which are measured by network motifs, are very similar. We also show that there are strong regional similarities with both media and public attention that is only disrupted by significantly major worldwide incidents (e.g., Brexit). Using Granger causality, we show that there are a substantial number of countries where media attention and public attention are dissimilar by topical interest. Our findings show that the media and public attention toward specific countries are often at odds, indicating that the public within these countries may be ignoring their country-specific news outlets and seeking other online sources to address their media needs and desires.
【Keywords】: multiplex network; news coverage; web search
【Paper Link】 【Pages】:903-912
【Authors】: Nina Grgic-Hlaca ; Elissa M. Redmiles ; Krishna P. Gummadi ; Adrian Weller
【Abstract】: As algorithms are increasingly used to make important decisions that affect human lives, ranging from social benefit assignment to predicting risk of criminal recidivism, concerns have been raised about the fairness of algorithmic decision making. Most prior works on algorithmic fairness normatively prescribe how fair decisions ought to be made. In contrast, here, we descriptively survey users for how they perceive and reason about fairness in algorithmic decision making. A key contribution of this work is the framework we propose to understand why people perceive certain features as fair or unfair to be used in algorithms. Our framework identifies eight properties of features, such as relevance, volitionality and reliability, as latent considerations that inform people»s moral judgments about the fairness of feature use in decision-making algorithms. We validate our framework through a series of scenario-based surveys with 576 people. We find that, based on a person»s assessment of the eight latent properties of a feature in our exemplar scenario, we can accurately (> 85%) predict if the person will judge the use of the feature as fair. Our findings have important implications. At a high-level, we show that people»s unfairness concerns are multi-dimensional and argue that future studies need to address unfairness concerns beyond discrimination. At a low-level, we find considerable disagreements in people»s fairness judgments. We identify root causes of the disagreements, and note possible pathways to resolve them.
【Keywords】:
【Paper Link】 【Pages】:913-922
【Authors】: Kiran Garimella ; Gianmarco De Francisci Morales ; Aristides Gionis ; Michael Mathioudakis
【Abstract】: Echo chambers, i.e., situations where one is exposed only to opinions that agree with their own, are an increasing concern for the political discourse in many democratic countries. This paper studies the phenomenon of political echo chambers on social media. We identify the two components in the phenomenon: the opinion that is shared, and the »chamber» (i.e., the social network) that allows the opinion to »echo» (i.e., be re-shared in the network) -- and examine closely at how these two components interact. We define a production and consumption measure for social-media users, which captures the political leaning of the content shared and received by them. By comparing the two, we find that Twitter users are, to a large degree, exposed to political opinions that agree with their own. We also find that users who try to bridge the echo chambers, by sharing content with diverse leaning, have to pay a »price of bipartisanship» in terms of their network centrality and content appreciation. In addition, we study the role of »gatekeepers,» users who consume content with diverse leaning but produce partisan content (with a single-sided leaning), in the formation of echo chambers. Finally, we apply these findings to the task of predicting partisans and gatekeepers from social and content features. While partisan users turn out relatively easy to identify, gatekeepers prove to be more challenging.
【Keywords】: echo chambers; filter bubbles; media bias; polarization
【Paper Link】 【Pages】:923-932
【Authors】: Ana-Andreea Stoica ; Christopher J. Riederer ; Augustin Chaintreau
【Abstract】: As social recommendations such as friend suggestions and people to follow become increasingly popular and influential on the growth of social media, we find that prominent social recommendation algorithms can exacerbate the under-representation of certain demographic groups at the top of the social hierarchy. To study this imbalance in online equal opportunities, we leverage new Instagram data and offer for the first time an analysis that studies the effect of gender, homophily and growth dynamics under social recommendations. Our mathematical analysis demonstrates the existence of an algorithmic glass ceiling that exhibits all the properties of the metaphorical social barrier that hinders groups like women or people of color from attaining equal representation. What raises concern is that our proof shows that under fixed minority and homophily parameters the algorithmic effect is systematically larger than the glass ceiling generated by the spontaneous growth of social networks. We discuss ways to address this concern in future design.
【Keywords】: fairness; homophily; random walks; social recommender
【Paper Link】 【Pages】:933-943
【Authors】: Srijan Kumar ; William L. Hamilton ; Jure Leskovec ; Dan Jurafsky
【Abstract】: Users organize themselves into communities on web platforms. These communities can interact with one another, often leading to conflicts and toxic interactions. However, little is known about the mechanisms of interactions between communities and how they impact users. Here we study intercommunity interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community. We show that such conflicts tend to be initiated by a handful of communities---less than 1% of communities start 74% of conflicts. While conflicts tend to be initiated by highly active community members, they are carried out by significantly less active members. We find that conflicts are marked by formation of echo chambers, where users primarily talk to other users from their own community. In the long-term, conflicts have adverse effects and reduce the overall activity of users in the targeted communities. Our analysis of user interactions also suggests strategies for mitigating the negative impact of conflicts---such as increasing direct engagement between attackers and defenders. Further, we accurately predict whether a conflict will occur by creating a novel LSTM model that combines graph embeddings, user, community, and text features. This model can be used to create an early-warning system for community moderators to prevent conflicts. Altogether, this work presents a data-driven view of community interactions and conflict, and paves the way towards healthier online communities.
【Keywords】: antisocial behavior; community; conflict; interaction; intercommunity; society; web
【Paper Link】 【Pages】:945-954
【Authors】: Chenhao Tan ; Hao Peng ; Noah A. Smith
【Abstract】: Political speeches and debates play an important role in shaping the images of politicians, and the public often relies on media outlets to select bits of political communication from a large pool of utterances. It is an important research question to understand what factors impact this selection process. To quantitatively explore the selection process, we build a three- decade dataset of presidential debate transcripts and post-debate coverage. We first examine the effect of wording and propose a binary classification framework that controls for both the speaker and the debate situation. We find that crowdworkers can only achieve an accuracy of 60% in this task, indicating that media choices are not entirely obvious. Our classifiers outperform crowdworkers on average, mainly in primary debates. We also compare important factors from crowdworkers» free-form explanations with those from data-driven methods and find interesting differences. Few crowdworkers mentioned that "context matters", whereas our data show that well-quoted sentences are more distinct from the previous utterance by the same speaker than less-quoted sentences. Finally, we examine the aggregate effect of media preferences towards different wordings to understand the extent of fragmentation among media outlets. By analyzing a bipartite graph built from quoting behavior in our data, we observe a decreasing trend in bipartisan coverage.
【Keywords】: conversations; media bias; presidential debates; quotations; wording
【Paper Link】 【Pages】:955-965
【Authors】: Ronald E. Robertson ; David Lazer ; Christo Wilson
【Abstract】: Search engines are a primary means through which people obtain information in today»s connected world. Yet, apart from the search engine companies themselves, little is known about how their algorithms filter, rank, and present the web to users. This question is especially pertinent with respect to political queries, given growing concerns about filter bubbles, and the recent finding that bias or favoritism in search rankings can influence voting behavior. In this study, we conduct a targeted algorithm audit of Google Search using a dynamic set of political queries. We designed a Chrome extension to survey participants and collect the Search Engine Results Pages (SERPs) and autocomplete suggestions that they would have been exposed to while searching our set of political queries during the month after Donald Trump»s Presidential inauguration. Using this data, we found significant differences in the composition and personalization of politically-related SERPs by query type, subjects» characteristics, and date.
【Keywords】: autocomplete search suggestions; filter bubble; political personalization; search engine manipulation; search engine results; search ranking bias
【Paper Link】 【Pages】:967-976
【Authors】: Yang Yang ; Zongtao Liu ; Chenhao Tan ; Fei Wu ; Yueting Zhuang ; Yafeng Li
【Abstract】: In China, 260 million people migrate to cities to realize their urban dreams. Despite that these migrants play an important role in the rapid urbanization process, many of them fail to settle down and eventually leave the city. The integration process of migrants thus raises an important issue for scholars and policymakers. In this paper, we use Shanghai as an example to investigate migrants' behavior in their first weeks and in particular, how their behavior relates to early departure. Our dataset consists of a one-month complete dataset of 698 telecommunication logs between 54 million users, plus a novel and publicly available housing price data for 18K real estates in Shanghai. We find that migrants who end up leaving early tend to neither develop diverse connections in their first weeks nor move around the city. Their active areas also have higher housing prices than that of staying migrants. We formulate a churn prediction problem to determine whether a migrant is going to leave based on her behavior in the first few days. The prediction performance improves as we include data from more days. Interestingly, when using the same features, the classifier trained from only the first few days is already as good as the classifier trained using full data, suggesting that the performance difference mainly lies in the difference between features.
【Keywords】: churn prediction; migrant integration; urban migrants
【Paper Link】 【Pages】:983-992
【Authors】: Xiaoshi Zhong ; Erik Cambria
【Abstract】: We find from four datasets that time expressions are formed by loose structure and the words used to express time information can differentiate time expressions from common text. The findings drive us to design a learning method named TOMN to model time expressions. TOMN defines a time-related tagging scheme named TOMN scheme with four tags, namely \tomnT,\tomnO, \tomnM,and \tomnN, indicating the constituents of time expression, namely \tomnT ime token, \tomnM odifier, \tomnN umeral, and the words \tomnO utside time expression. In modeling, TOMN assigns a word with a TOMN tag under conditional random fields with minimal features. Essentially, our constituent-based TOMN scheme overcomes the problem of inconsistent tag assignment that is caused by the conventional position-based tagging schemes (\eg BIO scheme and BILOU scheme). Experiments show that TOMN is equally or more effective than state-of-the-art methods on various datasets, and much more robust on cross-datasets. Moreover, our analysis can explain many empirical observations in other works about time expression recognition and named entity recognition.
【Keywords】: constituent-based tagging scheme; inconsistent tag assignment; named entity recognition; position-based tagging scheme; time expression recognition
【Paper Link】 【Pages】:993-1002
【Authors】: Yashoteja Prabhu ; Anil Kag ; Shrutendra Harsola ; Rahul Agrawal ; Manik Varma
【Abstract】: This paper develops the Parabel algorithm for extreme multi-label learning where the objective is to learn classifiers that can annotate each data point with the most relevant subset of labels from an extremely large label set. The state-of-the-art 1-vs-All based DiSMEC and PPDSparse algorithms are the most accurate but can take upto months for training and prediction as they learn and apply an independent linear classifier per label. Consequently, they do not scale to large datasets with millions of labels. Parabel addresses both limitations by learning a balanced label hierarchy such that: (a) the 1-vs-All classifiers in the leaf nodes of the label hierarchy can be trained on a small subset of the training set thereby reducing the training time to a few hours on a single core of a standard desktop and (b) novel points can be classified by traversing the learned hierarchy in logarithmic time and applying the 1-vs-All classifiers present in just the leaf thereby reducing the prediction time to a few milliseconds per test point. This allows Parabel to scale to tasks considered infeasible for DiSMEC and PPDSparse such as predicting the subset of 7 million Bing queries that might lead to a click on a given ad-landing page for dynamic search advertising. Experiments on multiple benchmark datasets revealed that Parabel could be almost as accurate as PPDSparse and DiSMEC while being upto 1,000x faster at training and upto 40x-10,000x faster at prediction. Furthermore, Parabel was demonstrated to significantly improve dynamic search advertising on Bing by more than doubling the ad recall and improving the click-through rate by 20%. Source code for Parabel can be downloaded from [1].
【Keywords】: dynamic search advertising; extreme classification
【Paper Link】 【Pages】:1003-1011
【Authors】: Maja R. Rudolph ; David M. Blei
【Abstract】: Word embeddings are a powerful approach for unsupervised analysis of language. Recently, Rudolph et al. developed exponential family embeddings, which cast word embeddings in a probabilistic framework. Here, we develop dynamic embeddings, building on exponential family embeddings to capture how the meanings of words change over time. We use dynamic embeddings to analyze three large collections of historical texts: the U.S. Senate speeches from 1858 to 2009, the history of computer science ACM abstracts from 1951 to 2014, and machine learning papers on the ArXiv from 2007 to 2015. We find dynamic embeddings provide better fits than classical embeddings and capture interesting patterns about how language changes.
【Keywords】: dynamic modeling; exponential family embeddings; probabilistic modeling; semantic change; word embeddings
【Paper Link】 【Pages】:1013-1022
【Authors】: Patrick Ernst ; Amy Siu ; Gerhard Weikum
【Abstract】: Text-based knowledge extraction methods for populating knowledge bases have focused on binary facts: relationships between two entities. However, in advanced domains such as health, it is often crucial to consider ternary and higher-arity relations. An example is to capture which drug is used for which disease at which dosage (e.g. 2.5 mg/day) for which kinds of patients (e.g., children vs. adults). In this work, we present an approach to harvest higher-arity facts from textual sources. Our method is distantly supervised by seed facts, and uses the fact-pattern duality principle to gather fact candidates with high recall. For high precision, we devise a constraint-based reasoning method to eliminate false candidates. A major novelty is in coping with the difficulty that higher-arity facts are often expressed only partially in texts and strewn across multiple sources. For example, one sentence may refer to a drug, a disease and a group of patients, whereas another sentence talks about the drug, its dosage and the target group without mentioning the disease. Our methods cope well with such partially observed facts, at both pattern-learning and constraint-reasoning stages. Experiments with health-related documents and with news articles demonstrate the viability of our method.
【Keywords】: distant supervision; health; higher-arity relation extraction; knowledge base construction; knowledge graphs; partial facts; text-based knowledge harvesting; tree pattern learning
【Paper Link】 【Pages】:1023-1032
【Authors】: Qiao Liu ; Haibin Zhang ; Yifu Zeng ; Ziqi Huang ; Zufeng Wu
【Abstract】: Aspect based sentiment classification is a crucial task for sentiment analysis. Recent advances in neural attention models demonstrate that they can be helpful in aspect based sentiment classification tasks, which can help identify the focus words in human. However, according to our empirical study, prevalent content attention mechanisms proposed for aspect based sentiment classification mostly focus on identifying the sentiment words or shifters, without considering the relevance of such words with respect to the given aspects in the sentence. Therefore, they are usually insufficient for dealing with multi-aspect sentences and the syntactically complex sentence structures. To solve this problem, we propose a novel content attention based aspect based sentiment classification model, with two attention enhancing mechanisms: sentence-level content attention mechanism is capable of capturing the important information about given aspects from a global perspective, whiles the context attention mechanism is responsible for simultaneously taking the order of the words and their correlations into account, by embedding them into a series of customized memories. Experimental results demonstrate that our model outperforms the state-of-the-art, in which the proposed mechanisms play a key role.
【Keywords】: aspect based; attention mechanism; sentiment analysis
【Paper Link】 【Pages】:1033-1042
【Authors】: Tobias Grubenmann ; Abraham Bernstein ; Dmitry Moor ; Sven Seuken
【Abstract】: The World Wide Web is a massive network of interlinked documents. One of the reasons the World Wide Web is so successful is the fact that most content is available free of any charge. Inspired by the success of the World Wide Web, the Web of Data applies the same strategy of interlinking to data. To this point, most of data in the Web of Data is also free of charge. The fact that the data is freely available raises the question of financing these services, however. As we will discuss in this paper, advertisement and donations cannot easily be applied to this new setting. To create incentives to subsidize data providers, we propose that sponsors should pay the providers to promote sponsored data. In return, sponsored data will be privileged over non-sponsored data. Since it is not possible to enforce a certain ordering on the data the user will receive, we propose to split up the data into different batches and deliver these batches with different delays. In this way, we can privilege sponsored data without withholding any non-sponsored data from the user. In this paper, we introduce a new concept of a delayed-answer auction, where sponsors can pay to prioritize their data. We introduce a new model which captures the particular situation when a user access data in the Web of Data. We show how the weighted Vickrey-Clarke-Groves auction mechanism can be applied to our scenario and we discuss how certain parameters can influence the nature of our auction. With our new concept, we build a first step to a free yet financial sustainable Web of Data.
【Keywords】: delay; slot auctions; sponsored search results; vickrey-clarke-groves mechanism; web of data
【Paper Link】 【Pages】:1043-1052
【Authors】: Giorgio Stefanoni ; Boris Motik ; Egor V. Kostylev
【Abstract】: Estimating the cardinality (i.e., the number of answers) of conjunctive queries is particularly difficult in RDF systems: queries over RDF data are navigational and thus tend to involve many joins. We present a new, principled cardinality estimation technique based on graph summarisation. We interpret a summary of an RDF graph using a possible world semantics and formalise the estimation problem as computing the expected cardinality over all RDF graphs represented by the summary, and we present a closed-form formula for computing the expectation of arbitrary queries. We also discuss approaches to RDF graph summarisation. Finally, we show empirically that our cardinality technique is more accurate and more consistent, often by orders of magnitude, than the state of the art.
【Keywords】: RDF; databases; graph summarisation; query cardinality estimation
【Paper Link】 【Pages】:1053-1062
【Authors】: Abdalghani Abujabal ; Rishiraj Saha Roy ; Mohamed Yahya ; Gerhard Weikum
【Abstract】: Translating natural language questions to semantic representations such as SPARQL is a core challenge in open-domain question answering over knowledge bases (KB-QA). Existing methods rely on a clear separation between an offline training phase, where a model is learned, and an online phase where this model is deployed. Two major shortcomings of such methods are that (i) they require access to a large annotated training set that is not always readily available and (ii) they fail on questions from before-unseen domains. To overcome these limitations, this paper presents NEQA, a continuous learning paradigm for KB-QA. Offline, NEQA automatically learns templates mapping syntactic structures to semantic ones from a small number of training question-answer pairs. Once deployed, continuous learning is triggered on cases where templates are insufficient. Using a semantic similarity function between questions and by judicious invocation of non-expert user feedback, NEQA learns new templates that capture previously-unseen syntactic structures. This way, NEQA gradually extends its template repository. NEQA periodically re-trains its underlying models, allowing it to adapt to the language used after deployment. Our experiments demonstrate NEQA's viability, with steady improvement in answering quality over time, and the ability to answer questions from new domains.
【Keywords】: never-ending learning; question answering; user feedback
【Paper Link】 【Pages】:1063-1072
【Authors】: Hao Peng ; Jianxin Li ; Yu He ; Yaopeng Liu ; Mengjiao Bao ; Lihong Wang ; Yangqiu Song ; Qiang Yang
【Abstract】: Text classification to a hierarchical taxonomy of topics is a common and practical problem. Traditional approaches simply use bag-of-words and have achieved good results. However, when there are a lot of labels with different topical granularities, bag-of-words representation may not be enough. Deep learning models have been proven to be effective to automatically learn different levels of representations for image data. It is interesting to study what is the best way to represent texts. In this paper, we propose a graph-CNN based deep learning model to first convert texts to graph-of-words, and then use graph convolution operations to convolve the word graph. Graph-of-words representation of texts has the advantage of capturing non-consecutive and long-distance semantics. CNN models have the advantage of learning different level of semantics. To further leverage the hierarchy of labels, we regularize the deep architecture with the dependency among labels. Our results on both RCV1 and NYTimes datasets show that we can significantly improve large-scale hierarchical text classification over traditional hierarchical text classification and existing deep models.
【Keywords】: convolutional neural networks; deep learning; graph-of-words; hierarchical text classification; recursive regularization
【Paper Link】 【Pages】:1073-1081
【Authors】: Kaja Zupanc ; Jesse Davis
【Abstract】: Currently, there are many large, automatically constructed knowledge bases (KBs). One interesting task is learning from a knowledge base to generate new knowledge either in the form of inferred facts or rules that define regularities. One challenge for learning is that KBs are necessarily open world: we cannot assume anything about the truth values of tuples not included in the KB. When a KB only contains facts (i.e., true statements), which is typically the case, we lack negative examples, which are often needed by learning algorithms. To address this problem, we propose a novel score function for evaluating the quality of a first-order rule learned from a KB. Our metric attempts to include information about the tuples not in the KB when evaluating the quality of a potential rule. Empirically, we find that our metric results in more precise predictions than previous approaches.
【Keywords】: ILP; knowledge base; open world assumption; rule mining
【Paper Link】 【Pages】:1083-1093
【Authors】: Thijs Scheepers ; Evangelos Kanoulas ; Efstratios Gavves
【Abstract】: We present an in-depth analysis of four popular word embeddings (Word2Vec, GloVe, fastText and Paragram) in terms of their semantic compositionality. In addition, we propose a method to tune these embeddings towards better compositionality. We find that training the existing embeddings to compose lexicographic definitions improves their performance in this task significantly, while also getting similar or better performance in both word similarity and sentence embedding evaluations. Our method tunes word embeddings using a simple neural network architecture with definitions and lemmas from WordNet. Since dictionary definitions are semantically similar to their associated lemmas, they are the ideal candidate for our tuning method, as well as evaluating for compositionality. Our architecture allows for the embeddings to be composed using simple arithmetic operations, which makes these embeddings specifically suitable for production applications such as web search and data mining. We also explore more elaborate and involved compositional models. In our analysis, we evaluate original embeddings, as well as tuned embeddings, using existing word similarity and sentence embedding evaluation methods. Aside from these evaluation methods used in related work, we also evaluate embeddings using a ranking method which tests composed vectors using the lexicographic definitions already mentioned. In contrast to other evaluation methods, ours is not invariant to the magnitude of the embedding vector, which we show is important for composition. We consider this new evaluation method, called CompVecEval, to be a key contribution.
【Keywords】: compositionality; dictionaries; lexicographic definitions; semantic composition; word embeddings
【Paper Link】 【Pages】:1095-1104
【Authors】: Ruslan R. Fayzrakhmanov ; Emanuel Sallinger ; Ben Spencer ; Tim Furche ; Georg Gottlob
【Abstract】: Most modern web scrapers use an embedded browser to render web pages and to simulate user actions. Such scrapers (or wrappers) are therefore expensive to execute, in terms of time and network traffic. In contrast, it is magnitudes more resource-efficient to use a "browserless" wrapper which directly accesses a web server through HTTP requests, and takes the desired data directly from the raw replies. However, creating and maintaining browserless wrappers of high precision requires specialists, and is prohibitively labor-intensive at scale. In this paper, we demonstrate the principal feasibility of automatically translating browser-based wrappers into "browserless" wrappers. We present the first algorithm and system performing such an automated translation on suitably restricted types of web sites. This system works in the vast majority of test cases and produces very fast and extremely resource-efficient wrappers. We discuss research challenges for extending our approach to a general method applicable to a yet larger number of cases.
【Keywords】: HTTP; ajax; deep web; scraping; web data extraction
【Paper Link】 【Pages】:1105-1114
【Authors】: Tian Shi ; Kyeongpil Kang ; Jaegul Choo ; Chandan K. Reddy
【Abstract】: Being a prevalent form of social communications on the Internet, billions of short texts are generated everyday. Discovering knowledge from them has gained a lot of interest from both industry and academia. The short texts have a limited contextual information, and they are sparse, noisy and ambiguous, and hence, automatically learning topics from them remains an important challenge. To tackle this problem, in this paper, we propose a semantics-assisted non-negative matrix factorization (SeaNMF) model to discover topics for the short texts. It effectively incorporates the word-context semantic correlations into the model, where the semantic relationships between the words and their contexts are learned from the skip-gram view of the corpus. The SeaNMF model is solved using a block coordinate descent algorithm. We also develop a sparse variant of the SeaNMF model which can achieve a better model interpretability. Extensive quantitative evaluations on various real-world short text datasets demonstrate the superior performance of the proposed models over several other state-of-the-art methods in terms of topic coherence and classification accuracy. The qualitative semantic analysis demonstrates the interpretability of our models by discovering meaningful and consistent topics. With a simple formulation and the superior performance, SeaNMF can be an effective standard topic model for short texts.
【Keywords】: non-negative matrix factorization; short texts; topic modeling; word embedding
【Paper Link】 【Pages】:1115-1124
【Authors】: Jonathan Lajus ; Fabian M. Suchanek
【Abstract】: An attribute is obligatory for a class in a Knowledge Base (KB), if all instances of the class have the attribute in the real world. For example, hasBirthDate is an obligatory attribute for the class Person, while has Spouse is not. In this paper, we propose a new way to model incompleteness in KBs. From this model, we derive a method to automatically determine obligatory attributes -- using only the data from the KB. Our algorithm can detect such attributes with a precision of up to 90%.
【Keywords】: attributes; classes; completeness; knowledge bases
【Paper Link】 【Pages】:1125-1134
【Authors】: Jacob Levy Abitbol ; Márton Karsai ; Jean-Philippe Magué ; Jean-Pierre Chevrot ; Eric Fleury
【Abstract】: Our usage of language is not solely reliant on cognition but is arguably determined by myriad external factors leading to a global variability of linguistic patterns. This issue, which lies at the core of sociolinguistics and is backed by many small-scale studies on face-to-face communication, is addressed here by constructing a dataset combining the largest French Twitter corpus to date with detailed socioeconomic maps obtained from national census in France. We show how key linguistic variables measured in individual Twitter streams depend on factors like socioeconomic status, location, time, and the social network of individuals. We found that (i) people of higher socioeconomic status, active to a greater degree during the daytime, use a more standard language; (ii) the southern part of the country is more prone to use more standard language than the northern one, while locally the used variety or dialect is determined by the spatial distribution of socioeconomic status; and (iii) individuals connected in the social network are closer linguistically than disconnected ones, even after the effects of status homophily have been removed. Our results inform sociolinguistic theory and may inspire novel learning methods for the inference of socioeconomic status of people from the way they tweet.
【Keywords】: computational sociolinguistics; social network analysis; socioeconomic status inference; spatiotemporal data; twitter data
【Paper Link】 【Pages】:1135-1144
【Authors】: Chenwei Ran ; Wei Shen ; Jianyong Wang
【Abstract】: The rapid expansion of Twitter has attracted worldwide attention. With more than 500 million tweets posted per day, Twitter becomes an invaluable information and knowledge source. Many Twitter related tasks have been studied, such as event extraction, hashtag recommendation, and topic detection. A critical step in understanding and mining information from Twitter is to disambiguate entities in tweets, i.e., tweet entity linking. It is a challenging task because tweets are short, noisy, and fresh. Many tweet-specific signals have been found to solve the tweet entity linking problem, such as user interest, temporal popularity, location information and so on. However, two common weaknesses exist in previous work. First, most proposed models are not flexible and extendable to fit new signals. Second, their scalability is not good enough to handle the large-scale social network like Twitter. In this work, we formalize the tweet entity linking problem into a factor graph model which has shown its effectiveness and efficiency in many other applications. We also propose selective attention over entities to increase the scalability of our model, which brings linear complexity. To adopt the attention mechanism in the factor graph, we propose a new type of nodes called pseudo-variable nodes to solve the asymmetry attention problem caused by the undirected characteristic of the factor graph. We evaluated our model on two different manually annotated tweet datasets. The experimental results show that our model achieves better performance in terms of both effectiveness and efficiency compared with the state-of-the-art approaches.
【Keywords】: attention-based model; entity linking; factor graph model; knowledge graph; twitter
【Paper Link】 【Pages】:1145-1154
【Authors】: Yunhao Jiao ; Cheng Li ; Fei Wu ; Qiaozhu Mei
【Abstract】: How to improve the quality of conversations in online communities has attracted considerable attention recently. Having engaged, civil, and reactive online conversations has a critical effect on the social life of Internet users. In this study, we are particularly interested in identifying a post in a multi-party conversation that is unlikely to be further replied to, which therefore kills that thread of the conversation. For this purpose, we propose a deep learning model called the ConverNet. ConverNet is attractive due to its capability of modeling the internal structure of a long conversation and its appropriate encoding of the contextual information of the conversation, through effective integration of attention mechanisms. Empirical experiments on real-world datasets demonstrate the effectiveness of the proposed model. For the widely concerned topic, our analysis also offers implications for how to improve the quality and user experience of online conversations, or how to engage users in a conversation with a chatbot.
【Keywords】: conversation prediction; deep learning; online conversations
【Paper Link】 【Pages】:1155-1164
【Authors】: Olaf Hartig ; Jorge Pérez
【Abstract】: GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. This has prevented the formal study of the main properties of the language. We embark on the formalization and study of GraphQL. To this end, we first formalize the semantics of GraphQL queries based on a labeled-graph data model. Thereafter, we analyze the language and show that it admits really efficient evaluation methods. In particular, we prove that the complexity of the GraphQL evaluation problem is NL-complete. Moreover, we show that the enumeration problem can be solved with constant delay. This implies that a server can answer a GraphQL query and send the response byte-by-byte while spending just a constant amount of time between every byte sent. Despite these positive results, we prove that the size of a GraphQL response might be prohibitively large for an internet scenario. We present experiments showing that current practical implementations suffer from this issue. We provide a solution to cope with this problem by showing that the total size of a GraphQL response can be computed in polynomial time. Our results on polynomial-time size computation plus the constant-delay enumeration can help developers to provide more robust GraphQL interfaces on the Web.
【Keywords】: GraphQL; JSON; query language; web queries
【Paper Link】 【Pages】:1165-1174
【Authors】: Yequan Wang ; Aixin Sun ; Jialong Han ; Ying Liu ; Xiaoyan Zhu
【Abstract】: In this paper, we propose RNN-Capsule, a capsule model based on Recurrent Neural Network (RNN) for sentiment analysis. For a given problem, one capsule is built for each sentiment category e.g., 'positive' and 'negative'. Each capsule has an attribute, a state, and three modules: representation module, probability module, and reconstruction module. The attribute of a capsule is the assigned sentiment category. Given an instance encoded in hidden vectors by a typical RNN, the representation module builds capsule representation by the attention mechanism. Based on capsule representation, the probability module computes the capsule's state probability. A capsule's state is active if its state probability is the largest among all capsules for the given instance, and inactive otherwise. On two benchmark datasets (i.e., Movie Review and Stanford Sentiment Treebank) and one proprietary dataset (i.e., Hospital Feedback), we show that RNN-Capsule achieves state-of-the-art performance on sentiment classification. More importantly, without using any linguistic knowledge, RNN-Capsule is capable of outputting words with sentiment tendencies reflecting capsules' attributes. The words well reflect the domain specificity of the dataset.
【Keywords】: attention; capsule; recurrent neural network; sentiment analysis
【Paper Link】 【Pages】:1175-1184
【Authors】: Larry González ; Aidan Hogan
【Abstract】: In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (aka. characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which allows to compute diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), to add lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use-case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.
【Keywords】: FCA; dynamics; knowledge graph; schema; semantic web
【Paper Link】 【Pages】:1185-1194
【Authors】: Richong Zhang ; Junpeng Li ; Jiajie Mei ; Yongyi Mao
【Abstract】: The knowledge base (KB) completion problem is usually formulated as a link prediction problem. Such formulation is incapable of capturing certain application scenarios when the KB contains multi-fold relations. In this paper, we present a new formulation of KB completion, called instance reconstruction. Unlike its link-prediction counterpart, which has linear complexity in the size of the KB, this problem has its complexity behave as a high-degree polynomial. This presents a significant challenge in developing scalable instance reconstruction algorithms. In this paper, we present a novel knowledge embedding model (RAE) and build on it an instance reconstruction algorithm (SIR). The SIR algorithm utilizes schema-based filtering as well as "relatedness" filtering for complexity reduction. Here relatedness refers to the likelihood that two entities co-participate in a common instance, and the relatedness metric is learned from the RAE model. We show experimentally that SIR significantly reduces computation complexity without sacrificing reconstruction performance. The complexity reduction corresponds to reducing the KB size by 100 to 1000 folds.
【Keywords】: knowledge base representation; link prediction; multi-fold relation
【Paper Link】 【Pages】:1195-1204
【Authors】: Andrew T. Schneider ; Arjun Mukherjee ; Eduard C. Dragut
【Abstract】: Many data-intensive applications collect (structured) data from a variety of sources. A key task in this process is record linkage, which is the problem of determining the records from these sources that refer to the same real-world entities. Traditional approaches use the record representation of entities to accomplish this task. With the nascence of social media, entities on the Web are now accompanied by user generated content. We present a method for record linkage that uses this hitherto untapped source of entity information. We use document-based distances, with an emphasis on word embedding document distances, to determine if two entities match. Our rationale is that user evaluations of entities converge in semantic content, and hence in the word embedded space, as the number of user evaluations grows. We analyze the effectiveness of the proposed method both as a stand-alone method and in combination with record-based record linkage methods. Experimental results using real-world reviews demonstrate the high effectiveness of our approach. To our knowledge, this is the first work exploring the use of user generated content accompanying entities in the record linkage task.
【Keywords】: record linkage; word embeddings; word mover's distance
【Paper Link】 【Pages】:1205-1214
【Authors】: Yue Zhang ; Arti Ramesh ; Jennifer Golbeck ; Dhanya Sridhar ; Lise Getoor
【Abstract】: Alcoholism, also known as Alcohol Use Disorder (AUD), is a serious problem affecting millions of people worldwide. Recovery from AUD is known to be challenging and often leads to relapse at various points after enrolling in a rehabilitation program such as Alcoholics Anonymous (AA). In this work, we take a structured approach to understand recovery and relapse from AUD using social media data. To do so, we combine linguistic and psychological attributes of users with relational features that capture useful structure in the user interaction network. We evaluate our models on AA-attending users extracted from the Twitter social network and predict recovery at two different points---90 days and 1 year after the user joins AA, respectively. Our experiments reveal that our structured approach is helpful in predicting recovery in these users. We perform extensive quantitative analysis of different groups of features and dependencies among them. Our analysis sheds light on the role of each feature group and how they combine to predict recovery and relapse. Finally, we present a qualitative analysis of the different reasons behind users relapsing to AUD. Our models and analysis are helpful in making meaningful predictions in scenarios where only a subset of features are available and can potentially be helpful in identifying and preventing relapse early.
【Keywords】: alcoholics anonymous; probabilistic graphical models; recovery from alcoholism; social media analysis
【Paper Link】 【Pages】:1215-1224
【Authors】: Riccardo Porrini ; Matteo Palmonari ; Isabel F. Cruz
【Abstract】: Faceted interfaces are omnipresent on the web to support data exploration and filtering. A facet is a triple: a domain (e.g., Book), a property (e.g., author, language), and a set of property values (e.g., Austen, Beauvoir, Coelho, Dostoevsky, Eco, Kerouac, Suskind, ..., French, English, German, Italian, Portuguese, Russian, ... ). Given a property (e.g., language), selecting one or more of its values (English and Italian) returns the domain entities (of type Book) that match the given values (the books that are written in English or Italian). To implement faceted interfaces in a way that is scalable to very large datasets, it is necessary to automate facet extraction. Prior work associates a facet domain with a set of homogeneous values, but does not annotate the facet property. In this paper, we annotate the facet property with a predicate from a reference Knowledge Base (KB) so as to maximize the semantic similarity between the property and the predicate. We define semantic similarity in terms of three new metrics: specificity, coverage, and frequency. Our experimental evaluation uses the DBpedia and YAGO KBs and shows that for the facet annotation problem, we obtain better results than a state-of-the-art approach for the annotation of web tables as modified to annotate a set of values.
【Keywords】: data lifting; data semantics; ecommerce; facet annotation; faceted search; table annotation
【Paper Link】 【Pages】:1225-1235
【Authors】: Abhimanyu Dubey ; Esteban Moro ; Manuel Cebrián ; Iyad Rahwan
【Abstract】: The analysis of the creation, mutation, and propagation of social media content on the Internet is an essential problem in computational social science, affecting areas ranging from marketing to political mobilization. A first step towards understanding the evolution of images online is the analysis of rapidly modifying and propagating memetic imagery or "memes". However, a pitfall in proceeding with such an investigation is the current incapability to produce a robust semantic space for such imagery, capable of understanding differences in Image Macros. In this study, we provide a first step in the systematic study of image evolution on the Internet, by proposing an algorithm based on sparse representations and deep learning to decouple various types of content in such images and produce a rich semantic embedding. We demonstrate the benefits of our approach on a variety of tasks pertaining to memes and Image Macros, such as image clustering, image retrieval, topic prediction and virality prediction, surpassing the existing methods on each. In addition to its utility on quantitative tasks, our method opens up the possibility of obtaining the first large-scale understanding of the evolution and propagation of memetic imagery.
【Keywords】: content understanding; embeddings; feature extraction; image macros; image virality; social network analysis; sparse representation
【Paper Link】 【Pages】:1237-1246
【Authors】: Bang Liu ; Ting Zhang ; Fred X. Han ; Di Niu ; Kunfeng Lai ; Yu Xu
【Abstract】: Semantic matching of natural language sentences or identifying the relationship between two sentences is a core research problem underlying many natural language tasks. Depending on whether training data is available, prior research has proposed both unsupervised distance-based schemes and supervised deep learning schemes for sentence matching. However, previous approaches either omit or fail to fully utilize the ordered, hierarchical, and flexible structures of language objects, as well as the interactions between them. In this paper, we propose Hierarchical Sentence Factorization---a technique to factorize a sentence into a hierarchical representation, with the components at each different scale reordered into a "predicate-argument" form. The proposed sentence factorization technique leads to the invention of: 1) a new unsupervised distance metric which calculates the semantic distance between a pair of text snippets by solving a penalized optimal transport problem while preserving the logical relationship of words in the reordered sentences, and 2) new multi-scale deep learning models for supervised semantic training, based on factorized sentence hierarchies. We apply our techniques to text-pair similarity estimation and text-pair relationship classification tasks, based on multiple datasets such as STSbenchmark, the Microsoft Research paraphrase identification (MSRP) dataset, the SICK dataset, etc. Extensive experiments show that the proposed hierarchical sentence factorization can be used to significantly improve the performance of existing unsupervised distance-based metrics as well as multiple supervised deep learning models based on the convolutional neural network (CNN) and long short-term memory (LSTM).
【Keywords】: abstract meaning representation; hierarchical sentence factorization; ordered word mover's distance; semantic matching; sentence reordering
【Paper Link】 【Pages】:1247-1256
【Authors】: Kuldeep Singh ; Arun Sethupat Radhakrishna ; Andreas Both ; Saeedeh Shekarpour ; Ioanna Lytra ; Ricardo Usbeck ; Akhilesh Vyas ; Akmal Khikmatullaev ; Dharmen Punjani ; Christoph Lange ; Maria-Esther Vidal ; Jens Lehmann ; Sören Auer
【Abstract】: Modern question answering (QA) systems need to flexibly integrate a number of components specialised to fulfil specific tasks in a QA pipeline. Key QA tasks include Named Entity Recognition and Disambiguation, Relation Extraction, and Query Building. Since a number of different software components exist that implement different strategies for each of these tasks, it is a major challenge to select and combine the most suitable components into a QA system, given the characteristics of a question. We study this optimisation problem and train classifiers, which take features of a question as input and have the goal of optimising the selection of QA components based on those features. We then devise a greedy algorithm to identify the pipelines that include the suitable components and can effectively answer the given question. We implement this model within Frankenstein, a QA framework able to select QA components and compose QA pipelines. We evaluate the effectiveness of the pipelines generated by Frankenstein using the QALD and LC-QuAD benchmarks. These results not only suggest that Frankenstein precisely solves the QA optimisation problem but also enables the automatic composition of optimised QA pipelines, which outperform the static Baseline QA pipeline. Thanks to this flexible and fully automated pipeline generation process, new QA components can be easily included in Frankenstein, thus improving the performance of the generated pipelines.
【Keywords】: QA framework; question answering; semantic search; semantic web; software reusability
【Paper Link】 【Pages】:1257-1266
【Authors】: Meng Qu ; Xiang Ren ; Yu Zhang ; Jiawei Han
【Abstract】: Extracting relations from text corpora is an important task with wide applications. However, it becomes particularly challenging when focusing on weakly-supervised relation extraction, that is, utilizing a few relation instances (i.e., a pair of entities and their relation) as seeds to extract from corpora more instances of the same relation. Existing distributional approaches leverage the corpus-level co-occurrence statistics of entities to predict their relations, and require a large number of labeled instances to learn effective relation classifiers. Alternatively, pattern-based approaches perform boostrapping or apply neural networks to model the local contexts, but still rely on a large number of labeled instances to build reliable models. In this paper, we study the integration of distributional and pattern-based methods in a weakly-supervised setting such that the two kinds of methods can provide complementary supervision for each other to build an effective, unified model. We propose a novel co-training framework with a distributional module and a pattern module. During training, the distributional module helps the pattern module discriminate between the informative patterns and other patterns, and the pattern module generates some highly-confident instances to improve the distributional module. The whole framework can be effectively optimized by iterating between improving the pattern module and updating the distributional module. We conduct experiments on two tasks: knowledge base completion with text corpora and corpus-level relation extraction. Experimental results prove the effectiveness of our framework over many competitive baselines.
【Keywords】: co-training; relation extraction
【Paper Link】 【Pages】:1267-1276
【Authors】: Marius Pasca
【Abstract】: A lightweight method distinguishes articles within Wikipedia that are classes (Novel, Book) from other articles (Three Men in a Boat, Diary of a Pilgrimage). It exploits clues available within the article text and within categories associated with articles in Wikipedia, while not requiring any linguistic preprocessing tools such as part of speech taggers, named entity recognizers or syntactic parsers. Experimental results show that classes can be identified among Wikipedia articles in multiple languages, at aggregate precision and recall generally above 0.9 and 0.6 respectively.
【Keywords】: classes; concepts; knowledge acquisition; open-domain information extraction; topic classification; unstructured text
【Paper Link】 【Pages】:1277-1286
【Authors】: Wei Zhang ; Wen Wang ; Jun Wang ; Hongyuan Zha
【Abstract】: Popularity prediction for the growing social images has opened unprecedented opportunities for wide commercial applications, such as precision advertising and recommender system. While a few studies have explored this significant task, little research has addressed its unstructured properties of both visual and textual modalities, and further considered to learn effective representation from multi-modalities for popularity prediction. To this end, we propose a model named User-guided Hierarchical Attention Network (UHAN) with two novel user-guided attention mechanisms to hierarchically attend both visual and textual modalities. It is capable of not only learning effective representation for each modality, but also fusing them to obtain an integrated multi-modal representation under the guidance of user embedding. As no benchmark dataset exists, we extend a publicly available social image dataset by adding the descriptions of images. The comprehensive experiments have demonstrated the rationality of our proposed UHAN and its better performance than several strong alternatives.
【Keywords】: attention network; multi-modal analysis; social image popularity
【Paper Link】 【Pages】:1287-1296
【Authors】: Ehsan Kamalloo ; Davood Rafiei
【Abstract】: Toponym Resolution, the task of assigning a location mention in a document to a geographic referent (i.e., latitude/longitude), plays a pivotal role in analyzing location-aware content. However, the ambiguities of natural language and a huge number of possible interpretations for toponyms constitute insurmountable hurdles for this task. In this paper, we study the problem of toponym resolution with no additional information other than a gazetteer and no training data. We demonstrate that a dearth of large enough annotated data makes supervised methods less capable of generalizing. Our proposed method estimates the geographic scope of documents and leverages the connections between nearby place names as evidence to resolve toponyms. We explore the interactions between multiple interpretations of mentions and the relationships between different toponyms in a document to build a model that finds the most coherent resolution. Our model is evaluated on three news corpora, two from the literature and one collected and annotated by us; then, we compare our methods to the state-of-the-art unsupervised and supervised techniques. We also examine three commercial products including Reuters OpenCalais, Yahoo! YQL Placemaker, and Google Cloud Natural Language API. The evaluation shows that our method outperforms the unsupervised technique as well as Reuters OpenCalais and Google Cloud Natural Language API on all three corpora; also, our method shows a performance close to that of the state-of-the art supervised method and outperforms it when the test data has 40% or more toponyms that are not seen in the training data.
【Keywords】: context-bound hypotheses; geolocation extraction; spatial hierarchies; toponym resolution; unsupervised disambiguation
【Paper Link】 【Pages】:1297-1306
【Authors】: Nicolas Tempelmeier ; Elena Demidova ; Stefan Dietze
【Abstract】: Embedded markup of Web pages has seen widespread adoption throughout the past years driven by standards such as RDFa and Microdata and initiatives such as schema.org, where recent studies show an adoption by 39% of all Web pages already in 2016. While this constitutes an important information source for tasks such as Web search, Web page classification or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, from 26 million nodes describing events within the Common Crawl in 2016, 59% of nodes provide less than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that provide missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data constitutes potential training data for inferring missing information to significantly augment sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show a performance of 79% and 83% F1 score correspondingly, significantly outperforming existing baselines.
【Keywords】: information inferring; supervised learning; web markup
【Paper Link】 【Pages】:1307-1316
【Authors】: Matteo Cannaviccio ; Denilson Barbosa ; Paolo Merialdo
【Abstract】: Tables and structured lists on Web pages are a potential source of valuable information, and several methods have been proposed to annotate them with semantics that can be leveraged for search, question answering and information extraction. This paper is concerned with the specific problem of finding and ranking relations from a given Knowledge Graph (KG) that hold over pairs of entities juxtaposed in a table or structured list. The state-of-the-art for this task is to attempt to link the entities mentioned in the table cells to objects in the KG and rank the relations that hold for those linked objects. As a result, these methods are hampered by the incompleteness and uneven coverage in even the best knowledge graphs available today. The alternative described here does not require entity linking, relying instead on ranking relations using generative language models derived from Web-scale corpora. As such, it can produce quality results even when the entities in the table are missing in the KG. The experimental validation, designed to expose the challenges posed by KG incompleteness, shows that our approach is robust and effective in practice.
【Keywords】: generative language models; knowledge graphs; web table understanding
【Paper Link】 【Pages】:1317-1327
【Authors】: Shikhar Vashishth ; Prince Jain ; Partha Talukdar
【Abstract】: Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manually-defined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI) -- a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness.
【Keywords】: canonicalization; knowledge graph embeddings; knowledge graphs; open knowledge bases
【Paper Link】 【Pages】:1329-1338
【Authors】: Patrick Hummel ; Uri Nadav
【Abstract】: This paper analyzes a mechanism for selling items in auctions in which the auctioneer specifies a cap on the ratio between the maximum and minimum bids that bidders may use in the different auctions. Such a mechanism is widely used in online advertising through the caps that companies impose on the minimum and maximum bid multipliers that advertisers may use in targeting. When bidders» values are independent and identically distributed, using this mechanism results in higher revenue than allowing bidders to condition their bids on the targeting information in an arbitrary way and also almost always results in higher revenue than not allowing bidders to target. Choosing the optimal cap on the ratio between the maximum bid and the minimum bid can also be more important than introducing additional competition in the auction. However, if bidders» values are not identically distributed, pure-strategy equilibria may fail to exist.
【Keywords】: bid multipliers; mechanism design; online auctions; targeting
【Paper Link】 【Pages】:1339-1348
【Authors】: Qingpeng Cai ; Aris Filos-Ratsikas ; Pingzhong Tang ; Yiwei Zhang
【Abstract】: We study the problem of allocating impressions to sellers in e-commerce websites, such as Amazon, eBay or Taobao, aiming to maximize the total revenue generated by the platform. We employ a general framework of reinforcement mechanism design, which uses deep reinforcement learning to design efficient algorithms, taking the strategic behaviour of the sellers into account. Specifically, we model the impression allocation problem as a Markov decision process, where the states encode the history of impressions, prices, transactions and generated revenue and the actions are the possible impression allocations in each round. To tackle the problem of continuity and high-dimensionality of states and actions, we adopt the ideas of the DDPG algorithm to design an actor-critic policy gradient algorithm which takes advantage of the problem domain in order to achieve convergence and stability. We evaluate our proposed algorithm, coined IA(GRU), by comparing it against DDPG, as well as several natural heuristics, under different rationality models for the sellers - we assume that sellers follow well-known no-regret type strategies which may vary in their degree of sophistication. We find that IA(GRU) outperforms all algorithms in terms of the total revenue.
【Keywords】: e-commerce; impression allocation; mechanism design; reinforcement learning
【Paper Link】 【Pages】:1349-1357
【Authors】: Junwei Pan ; Jian Xu ; Alfonso Lobos Ruiz ; Wenliang Zhao ; Shengjun Pan ; Yu Sun ; Quan Lu
【Abstract】: Click-through rate (CTR) prediction is a critical task in online display advertising. The data involved in CTR prediction are typically multi-field categorical data, i.e., every feature is categorical and belongs to one and only one field. One of the interesting characteristics of such data is that features from one field often interact differently with features from different other fields. Recently, Field-aware Factorization Machines (FFMs) have been among the best performing models for CTR prediction by explicitly modeling such difference. However, the number of parameters in FFMs is in the order of feature number times field number, which is unacceptable in the real-world production systems. In this paper, we propose Field-weighted Factorization Machines (FwFMs) to model the different feature interactions between different fields in a much more memory-efficient way. Our experimental evaluations show that FwFMs can achieve competitive prediction performance with only as few as 4% parameters of FFMs. When using the same number of parameters, FwFMs can bring 0.92% and 0.47% AUC lift over FFMs on two real CTR prediction data sets.
【Keywords】: ctr prediction; display advertising; factorization machines
【Paper Link】 【Pages】:1359-1368
【Authors】: Vahab S. Mirrokni ; Renato Paes Leme ; Rita Ren ; Song Zuo
【Abstract】: Dynamic mechanisms are a powerful technique in designing revenue-maximizing repeated auctions. Despite their strength, these types of mechanisms have not been widely adopted in practice for several reasons, e.g., for their complexity, and for their sensitivity to the accuracy of predicting buyers» value distributions. In this paper, we aim to address these shortcomings and develop simple dynamic mechanisms that can be implemented efficiently, and provide theoretical guidelines for decreasing the sensitivity of dynamic mechanisms on prediction accuracy of buyers» value distributions. We prove that the dynamic mechanism we propose is provably dynamic incentive compatible, and introduce a notion of buyers» regret in dynamic mechanisms, and show that our mechanism achieves bounded regret while improving revenue and social welfare compared to a static reserve pricing policy. Finally, we confirm our theoretical analysis via an extensive empirical study of our dynamic auction on real data sets from online adverting. For example, we show our dynamic mechanisms can provide a +17% revenue lift with relative regret less than 0.2%.
【Keywords】: bank account mechanisms; dynamic auctions; dynamic mechanism design; dynamic second price auction
【Paper Link】 【Pages】:1369-1378
【Authors】: Alessandro Epasto ; Mohammad Mahdian ; Vahab S. Mirrokni ; Song Zuo
【Abstract】: In a typical learning problem, one key step is to use training data to pick one model from a collection of models that optimizes an objective function. In many multi-agent settings, the training data is generated through the actions of the agents, and the model is used to make a decision (e.g., how to sell an item) that affects the agents. An illustrative example of this is the problem of learning the reserve price in an auction. In such cases, the agents have an incentive to influence the training data (e.g., by manipulating their bids in the case of an auction) to game the system and achieve a more favorable outcome. In this paper, we study such incentive-aware learning problem in a general setting and show that it is possible to approximately optimize the objective function under two assumptions: (i) each individual agent is a "small" (part of the market); and (ii) there is a cost associated with manipulation. For our illustrative application, this nicely translates to a mechanism for setting approximately optimal reserve prices in auctions where no individual agent has significant market share. For this application, we also show that the second assumption (that manipulations are costly) is not necessary since we can "perturb" any auction to make it costly for the agents to manipulate.
【Keywords】: ad auctions; incentive-aware learning; large markets; learning with strategic agents
【Paper Link】 【Pages】:1379-1388
【Authors】: Yakov Babichenko ; Oren Dean ; Moshe Tennenholtz
【Abstract】: Our work bridges the literature on incentive-compatible mechanism design and the literature on diffusion algorithms. We introduce the study of finding an incentive-compatible (strategy-proof) mechanism for selecting an influential vertex in a directed graph (e.g. Twitter»s network). The goal is to devise a mechanism with a bounded ratio between the maximal influence and the influence of the selected user, and in which no user can improve its probability of being selected by following or unfollowing other users. We introduce the Two Path mechanism which is based on the idea of selecting the vertex that is the first intersection of two independent random walks in the network. The Two Path mechanism is incentive compatible on directed acyclic graphs (DAGs), and has a finite approximation ratio on natural subfamilies of DAGs. Simulations indicate that this mechanism is suitable for practical uses.
【Keywords】: diffusion; strategy-proof
【Paper Link】 【Pages】:1389-1398
【Authors】: Lily Hu ; Yiling Chen
【Abstract】: The persistence of racial inequality in the U.S. labor market against a general backdrop of formal equality of opportunity is a troubling phenomenon that has significant ramifications on the design of hiring policies. In this paper, we show that current group disparate outcomes may be immovable even when hiring decisions are bound by an input-output notion of "individual fairness." Instead, we construct a dynamic reputational model of the labor market that illustrates the reinforcing nature of asymmetric outcomes resulting from groups» divergent accesses to resources and as a result, investment choices. To address these disparities, we adopt a dual labor market composed of a Temporary Labor Market (TLM), in which firms» hiring strategies are constrained to ensure statistical parity of workers granted entry into the pipeline, and a Permanent Labor Market (PLM), in which firms hire top performers as desired. Individual worker reputations produce externalities for their group; the corresponding feedback loop raises the collective reputation of the initially disadvantaged group via a TLM fairness intervention that need not be permanent. We show that such a restriction on hiring practices induces an equilibrium that, under particular market conditions, Pareto-dominates those arising from strategies that statistically discriminate or employ a "group-blind" criterion. The enduring nature of equilibria that are both inequitable and Pareto suboptimal suggests that fairness interventions beyond procedural checks of hiring decisions will be of critical importance in a world where machines play a greater role in the employment process.
【Keywords】: dynamic game; fairness; game theory; labor market
【Paper Link】 【Pages】:1399-1408
【Authors】: Florin Constantin ; Christopher Harris ; Samuel Ieong ; Aranyak Mehta ; Xi Tan
【Abstract】: In-app advertising is a complex market worth billions of dollars per year, yet it has been studied significantly less than traditional web display ads. In this paper we study an important but often overlooked feature of ads in mobile apps (mostly absent in traditional web ads), that of ad refreshes : A user is shown a stream of banner ads during the app session, in which each ad is displayed in the ad slot for a certain amount of time (the refresh rate) before the ad-slot is refreshed to the next ad. Data analysis on our large-scale experiments that vary refresh rates reveals a surprising result, that cannot be explained by existing user click models: Varying ads» refresh almost preserves total number of clicks. We propose a new, natural, "two-phase" click model for this setting that explains this independence, as well as our measurements of the click-through rate as a function of the impression»s time-on-screen and of ad-repeat counts. The new click model leads to a clean formulation of the problem of auctioning the entire user-session: i.e., determining online, both the sequence of winning ads as well as the amount of time to display each one. We complement the theoretical auction design with results from a live-traffic experiment with its implementation. Our experiments and analysis provide the theoretical foundation for AdMob»s "Google-optimized refresh rate" feature, used by many mobile apps for better monetization of ads shown to millions of users.
【Keywords】: mobile app advertising; refresh rate; user behavior
【Paper Link】 【Pages】:1409-1418
【Authors】: Weili Chen ; Zibin Zheng ; Jiahui Cui ; Edith Ngai ; Peilin Zheng ; Yuren Zhou
【Abstract】: Blockchain technology becomes increasingly popular. It also attracts scams, for example, Ponzi scheme, a classic fraud, has been found making a notable amount of money on Blockchain, which has a very negative impact. To help dealing with this issue, this paper proposes an approach to detect Ponzi schemes on blockchain by using data mining and machine learning methods. By verifying smart contracts on Ethereum, we first extract features from user accounts and operation codes of the smart contracts and then build a classification model to detect latent Ponzi schemes implemented as smart contracts. The experimental results show that the proposed approach can achieve high accuracy for practical use. More importantly, the approach can be used to detect Ponzi schemes even at the moment of its creation. By using the proposed approach, we estimate that there are more than 400 Ponzi schemes running on Ethereum. Based on these results, we propose to build a uniform platform to evaluate and monitor every created smart contract for early warning of scams.
【Keywords】: blockchain; ethereum; ponzi schemes; smart contract
【Paper Link】 【Pages】:1419-1428
【Authors】: Sébastien Lahaie ; Andrés Muñoz Medina ; Balasubramanian Sivan ; Sergei Vassilvitskii
【Abstract】: Consider a buyer participating in a repeated auction, such as those prevalent in display advertising. How would she test whether the auction is incentive compatible? To bid effectively, she is interested in whether the auction is single-shot incentive compatible---a pure second-price auction, with fixed reserve price---and also dynamically incentive compatible---her bids are not used to set future reserve prices. In this work we develop tests based on simple bid perturbations that a buyer can use to answer these questions, with a focus on dynamic incentive compatibility. There are many potential A/B testing setups that one could use, but we find that many natural experimental designs are, in fact, flawed. For instance, we show that additive perturbations can lead to paradoxical results, where higher bids lead to lower optimal reserve prices. We precisely characterize this phenomenon and show that reserve prices are only guaranteed to be monotone for distributions satisfying the Monotone Hazard Rate (MHR) property. The experimenter must also decide how to split traffic to apply systematic perturbations. It is tempting to have this split be randomized, but we demonstrate empirically that unless the perturbations are aligned with the partitions used by the seller to compute reserve prices, the results are guaranteed to be inconclusive. We validate our results with experiments on real display auction data and show that a buyer can quantify both single-shot and dynamic incentive compatibility even under realistic conditions where only the cost of the impression is observed (as opposed to the exact reserve price). We analyze the cost of running such experiments, exposing trade-offs between test accuracy, cost, and underlying market dynamics.
【Keywords】: ad auctions; incentive compatibility
【Paper Link】 【Pages】:1429-1438
【Authors】: Amy Greenwald ; Takehiro Oyakawa ; Vasilis Syrgkanis
【Abstract】: We study an optimal contest design problem where contributors abilities are private, their costs are convex as a function of their effort, and the designer seeks to maximize their total productivity. We address the design of approximately-optimal mechanisms that are robust, in that they are independent of the ability distribution and the precise form of the cost function. We show that a very simple all-pay contest where the prize is distributed equally among the top quartile of contributors is a constant-factor approximation to the optimal, for a large class of convex cost functions, when the number of contributors is larger than some constant. This result stands in contrast to contests with linear costs, where awarding a prize to a single top contributor ("winner-takes-all»») is approximately-optimal; when costs are convex, winner-takes-all is far from optimal. We validate the performance of our approximately-optimal contest designs via simulation experiments, which uncover much better empirical performance than the worst-case guarantees. Our results are enabled by novel results in the space of optimal mechanism design with convex costs, which could be of independent interest.
【Keywords】: contest; prior-free; simple vs optimal
【Paper Link】 【Pages】:1439-1448
【Authors】: Chen Luo ; Anshumali Shrivastava
【Abstract】: Anomaly detection is one of the frequent and important subroutines deployed in large-scale data processing applications. Even being a well-studied topic, existing techniques for unsupervised anomaly detection require storing significant amounts of data, which is prohibitive from memory, latency and privacy perspectives, especially for small mobile devices which has ultra-low memory budget and limited computational power. In this paper, we propose ACE (Arrays of (locality-sensitive) Count Estimators) algorithm that can be 60x faster than most state-of-the-art unsupervised anomaly detection algorithms. In addition, ACE has appealing privacy properties. Our experiments show that ACE algorithm has significantly smaller memory footprints (∠ 4MB in our experiments) which can exploit Level 3 cache of any modern processor. At the core of the ACE algorithm, there is a novel statistical estimator which is derived from the sampling view of Locality Sensitive Hashing (LSH). This view is significantly different and efficient than the widely popular view of LSH for near-neighbor search. We show the superiority of ACE algorithm over 11 popular baselines on 3 benchmark datasets, including the KDD-Cup99 data which is the largest available public benchmark comprising of more than half a million entries with ground truth anomaly labels.
【Keywords】: anomaly detection; locality-sensitive hashing; lsh sampling
【Paper Link】 【Pages】:1449-1458
【Authors】: John P. Rula ; James Newman ; Fabián E. Bustamante ; Arash Molavi Kakhki ; David R. Choffnes
【Abstract】: In-Flight Communication (IFC), available on a growing number of commercial flights, is often received by consumers with both awe for its mere availability and harsh criticism for its poor performance. Indeed, IFC provides Internet connectivity in some of the most challenging conditions with aircraft traveling at speeds in excess of 500 mph at 30,000 feet above the ground. Yet, while existing services do provide basic Internet \em accessibility, anecdotal reports rank their quality of service as, at best, poor. In this paper, we present the first characterization of deployed IFC systems. Using over 45 flight-hours of measurements, we profile the performance of IFC across the two dominant access technologies -- direct air-to-ground communication (DA2GC) and mobile satellite service (MSS). We show that IFC QoS is in large part determined by the high latencies inherent to DA2GC and MSS, with RTTs averaging 200ms and 750ms, respectively, and that these high latencies directly impact the performance of common applications such as web browsing. While each IFC technology is based on well studied wireless communication technologies, our findings reveal that IFC links experience further degraded link performance than their technological antecedents. We find median loss rates of 7%, and nearly 40% loss at the 90th percentile for MSS, 6.8x larger than recent characterizations of residential satellite networks. We extend our IFC study exploring the potential of the newly released HTTP/2 and QUIC protocols in an emulated IFC environment, finding that QUIC is able to improve page load times by as much as 7.9 times. In addition, we find that HTTP/2»s use of multiplexing multiple requests onto a single TCP connection performs up to 4.8x \em worse than HTTP/1.1 when faced with large numbers of objects. We use network emulation to explore proposed technological improvements to existing IFC systems finding that high link losses, and not bandwidth, account for the largest factor of performance degradation with applications such as web browsing.
【Keywords】: in-flight conenctivity
【Paper Link】 【Pages】:1459-1468
【Authors】: Jie Feng ; Yong Li ; Chao Zhang ; Funing Sun ; Fanchao Meng ; Ang Guo ; Depeng Jin
【Abstract】: Human mobility prediction is of great importance for a wide spectrum of location-based applications. However, predicting mobility is not trivial because of three challenges: 1) the complex sequential transition regularities exhibited with time-dependent and high-order nature; 2) the multi-level periodicity of human mobility; and 3) the heterogeneity and sparsity of the collected trajectory data. In this paper, we propose DeepMove, an attentional recurrent network for mobility prediction from lengthy and sparse trajectories. In DeepMove, we first design a multi-modal embedding recurrent neural network to capture the complicated sequential transitions by jointly embedding the multiple factors that govern the human mobility. Then, we propose a historical attention model with two mechanisms to capture the multi-level periodicity in a principle way, which effectively utilizes the periodicity nature to augment the recurrent neural network for mobility prediction. We perform experiments on three representative real-life mobility datasets, and extensive evaluation results demonstrate that our model outperforms the state-of-the-art models by more than 10%. Moreover, compared with the state-of-the-art neural network models, DeepMove provides intuitive explanations into the prediction and sheds light on interpretable mobility prediction.
【Keywords】: attention; human mobility; recurrent neural network
【Paper Link】 【Pages】:1469-1478
【Authors】: Yun Ma ; Ziniu Hu ; Yunxin Liu ; Tao Xie ; Xuanzhe Liu
【Abstract】: Compared to the Web where each web page has a global URL for external access, a specific 'page' inside a mobile app cannot be easily accessed unless the user performs several steps from the landing page of this app. Recently, the concept of 'deep link' is expected to be a promising solution and has been advocated by major service providers to enable targeting and opening a specific page of an app externally with an accessible uniform resource identifier. In this paper, we present a large-scale empirical study to investigate how deep links are really adopted, over 25,000 Android apps. To our surprise, we find that deep links have quite low coverage, e.g., more than 70% and 90% of the apps do not have deep links on app stores Wandoujia and Google Play, respectively. One underlying reason is the mandatory and non-trivial manual efforts of app developers to provide APIs for deep links. We then propose the Aladdin approach along with its supporting tool to help developers practically automate the release of deep-link APIs to access locations inside their apps. Aladdin includes a novel cooperative framework by synthesizing the static analysis and the dynamic analysis while minimally engaging developers» inputs and configurations, without requiring any coding efforts or additional deployment efforts. We evaluate Aladdin with 579 popular apps and demonstrate its effectiveness and performance.
【Keywords】: android apps; deep link; program analysis
【Paper Link】 【Pages】:1479-1489
【Authors】: Panagiotis Papadopoulos ; Nicolas Kourtellis ; Evangelos P. Markatos
【Abstract】: Digital advertisements are delivered in the form of static images, animations or videos, with the goal to promote a product, a service or an idea to desktop or mobile users. Thus, the advertiser pays a monetary cost to buy ad-space in a content provider»s medium (e.g., website) to place their advertisement in the consumer»s display. However, is it only the advertiser who pays for the ad delivery? Unlike traditional advertisements in mediums such as newspapers, TV or radio, in the digital world, the end-users are also paying a cost for the advertisement delivery. Whilst the cost on the advertiser»s side is clearly monetary, on the end-user, it includes both quantifiable costs, such as network requests and transferred bytes, and qualitative costs such as privacy loss to the ad ecosystem. In this study, we aim to increase user awareness regarding the hidden costs of digital advertisement in mobile devices, and compare the user and advertiser views. Specifically, we built OpenDAMP, a transparency tool that passively analyzes users» web traffic and estimates the costs in both sides. We use a year-long dataset of 1270 real mobile users and by juxtaposing the costs of both sides, we identify a clear imbalance: the advertisers pay several times less to deliver ads, than the cost paid by the users to download them. In addition, the majority of users experience a significant privacy loss, through the personalized ad delivery mechanics.
【Keywords】: cost of mobile advertising; personalized advertising; user privacy
【Paper Link】 【Pages】:1491-1500
【Authors】: Aravindh Raman ; Gareth Tyson ; Nishanth Sastry
【Abstract】: The era of live-broadcast is back but with two major changes. First, unlike traditional TV broadcasts, content is now streamed over the Internet enabling it to reach a wider audience. Second, due to various user-generated content platforms it has become possible for anyone to get involved, streaming their own content to the world. This emerging trend of going live usually happens via social platforms, where users perform live social broadcasts predominantly from their mobile devices, allowing their friends (and the general public) to engage with the stream in real-time. With the growing popularity of such platforms, the burden on the current Internet infrastructure is therefore expected to multiply. With this in mind, we explore one such prominent platform - Facebook Live. We gather 3TB of data, representing one month of global activity and explore the characteristics of live social broadcast. From this, we derive simple yet effective principles which can decrease the network burden. We then dissect global and hyper-local properties of the video while on-air, by capturing the geography of the broadcasters or the users who produce the video and the viewers or the users who interact with it. Finally, we study the social engagement while the video is live and distinguish the key aspects when the same video goes on-demand. A common theme throughout the paper is that, despite its name, many attributes of Facebook Live deviate from both the concepts of live and broadcast.
【Keywords】: edge computing; social live broadcast; user behaviour
【Paper Link】 【Pages】:1501-1511
【Authors】: Zhiyuan Lin ; Tim Althoff ; Jure Leskovec
【Abstract】: Mobile health applications that track activities, such as exercise, sleep, and diet, are becoming widely used. While these activity tracking applications have the potential to improve our health, user engagement and retention are critical factors for their success. However, long-term user engagement patterns in real-world activity tracking applications are not yet well understood. Here we study user engagement patterns within a mobile physical activity tracking application consisting of 115 million logged activities taken by over a million users over 31 months. Specifically, we show that over 75% of users return and re-engage with the application after prolonged periods of inactivity, no matter the duration of the inactivity. We find a surprising result that the re-engagement usage patterns resemble those of the start of the initial engagement period, rather than being a simple continuation of the end of the initial engagement period. This evidence points to a conceptual model of multiple lives of user engagement, extending the prevalent single life view of user activity. We demonstrate that these multiple lives occur because the users have a variety of different primary intents or goals for using the app. These primary intents are associated with how long each life lasts and how likely the user is to re-engage for a new life. We find evidence for users being more likely to stop using the app once they achieved their primary intent or goal (e.g., weight loss). However, these users might return once their original intent resurfaces (e.g., wanting to lose newly gained weight). We discuss implications of the multiple life paradigm and propose a novel prediction task of predicting the number of lives of a user. Based on insights developed in this work, including a marker of improved primary intent performance, our prediction models achieve 71% ROC AUC. Overall, our research has implications for modeling user re-engagement in health activity tracking applications and has consequences for how notifications, recommendations as well as gamification can be used to increase engagement.
【Keywords】: activity logging; activity tracking; lifespan; lifetime; mobile health; multiple lives; quantified self; reengagement; user engagement; user modeling
【Paper Link】 【Pages】:1513-1522
【Authors】: Xin Wang ; Wenwu Zhu ; Chun Chen ; Martin Ester
【Abstract】: The problem of social event organization (SEO) rises with the advent of online web services and plays an important role in helping users discover new offline events. Existing work on SEO only assumes that different users have different preferences towards different events, ignoring the fact that each event (its organizer) may have a separate preference towards every user. In this paper, we investigate joint user- and event- driven SEO by simultaneously considering user preferences (towards events) and event preferences (towards users). A risen challenging problem is that this joint consideration may suffer instabilities between users and events which are NP-hard to handle in SEO. Stability is a desired property that needs to be maintained in SEO, otherwise participants will incline towards changing to other events and trust less the organizer. To the best of our knowledge, we are the first to study SEO with both preferences from users to events and preferences from events to users being considered. In this work, we formulate the stable social event organization (Stable-SEO) problem and prove its NP-hardness. We then propose an efficient greedy heuristic algorithm to solve Stable-SEO, taking both user preferences and event preferences into account. The proposed approach is able to find an assignment of a group of users to a set of events for an event organization, in which there is a minimized number of users involved in user-event pairs such that both the user and event in one such pair will be better off when reassigning the user to this event. Our experiments on two real-world datasets demonstrate the strong superiority of our proposed approach over existing methods.
【Keywords】: event organization; organizer happiness; stability; user happiness
【Paper Link】 【Pages】:1523-1532
【Authors】: Fuli Feng ; Xiangnan He ; Yiqun Liu ; Liqiang Nie ; Tat-Seng Chua
【Abstract】: Graph-based learning methods explicitly consider the relations between two entities (i.e., vertices) for learning the prediction function. They have been widely used in semi-supervised learning, manifold ranking, and clustering, among other tasks. Enhancing the expressiveness of simple graphs, hypergraphs formulate an edge as a link to multiple vertices, so as to model the higher-order relations among entities. For example, hyperedges in a hypergraph can be used to encode the similarity among vertices. To the best of our knowledge, all existing hypergraph structures represent the hyperedge as an unordered set of vertices, without considering the possible ordering relationship among vertices. In real-world data, ordering relations commonly exist, such as in graded categorical features (e.g., users» ratings on movies) and numerical features (e.g., monthly income of customers). When constructing a hypergraph, ignoring such ordering relations among entities will lead to severe information loss, resulting in suboptimal performance of the subsequent learning algorithms. In this work, we address the inherent limitation of existing hypergraphs by proposing a new data structure named Partial-Order Hypergraph, which specifically injects the partially ordering relations among vertices into a hyperedge. We develop regularization-based learning theories for partial-order hypergraphs, generalizing conventional hypergraph learning by incorporating logical rules that encode the partial-order relations. We apply our proposed method to two applications: university ranking from Web data and popularity prediction of online content. Extensive experiments demonstrate the superiority of our proposed partial-order hypergraphs, which consistently improve over conventional hypergraph methods.
【Keywords】: graph-based learning; hypergraph; partial-order hypergraph; popularity prediction; university ranking
【Paper Link】 【Pages】:1533-1542
【Authors】: Mengyang Liu ; Yiqun Liu ; Jiaxin Mao ; Cheng Luo ; Min Zhang ; Shaoping Ma
【Abstract】: User satisfaction has been paid much attention to in recent Web search evaluation studies. Although satisfaction is often considered as an important symbol of search success, it doesn»t guarantee success in many cases, especially for complex search task scenarios. In this study, we investigate the differences between user satisfaction and search success, and try to adopt the findings to predict search success in complex search tasks. To achieve these research goals, we conduct a laboratory study in which search success and user satisfaction are annotated by domain expert assessors and search users, respectively. We find that both "Satisfaction with Failure" and "Unsatisfied Success" cases happen in these search tasks and together they account for as many as 40.3% of all search sessions. The factors (e.g. document readability and credibility) that lead to the inconsistency of search success and user satisfaction are also investigated and adopted to predict whether one search task is successful. Experimental results show that our proposed prediction method is effective in predicting search success.
【Keywords】: search evaluation; search success; user satisfaction
【Paper Link】 【Pages】:1543-1552
【Authors】: Xiang Wang ; Xiangnan He ; Fuli Feng ; Liqiang Nie ; Tat-Seng Chua
【Abstract】: While collaborative filtering is the dominant technique in personalized recommendation, it models user-item interactions only and cannot provide concrete reasons for a recommendation. Meanwhile, the rich side information affiliated with user-item interactions (e.g., user demographics and item attributes), which provide valuable evidence that why a recommendation is suitable for a user, has not been fully explored in providing explanations. On the technical side, embedding-based methods, such as Wide&Deep and neural factorization machines, provide state-of-the-art recommendation performance. However, they work like a black-box, for which the reasons underlying a prediction cannot be explicitly presented. On the other hand, tree-based methods like decision trees predict by inferring decision rules from data. While being explainable, they cannot generalize to unseen feature interactions thus fail in collaborative filtering applications. In this work, we propose a novel solution named Tree-enhanced Embedding Method that combines the strengths of embedding-based and tree-based models. We first employ a tree-based model to learn explicit decision rules (aka. cross features) from the rich side information. We next design an embedding model that can incorporate explicit cross features and generalize to unseen cross features on user ID and item ID. At the core of our embedding method is an easy-to-interpret attention network, making the recommendation process fully transparent and explainable. We conduct experiments on two datasets of tourist attraction and restaurant recommendation, demonstrating the superior performance and explainability of our solution.
【Keywords】: embedding-based model; explainable recommendation; neural attention network; tree-based model
【Paper Link】 【Pages】:1553-1562
【Authors】: Shuo Zhang ; Krisztian Balog
【Abstract】: We introduce and address the problem of ad hoc table retrieval: answering a keyword query with a ranked list of tables. This task is not only interesting on its own account, but is also being used as a core component in many other table-based information access scenarios, such as table completion or table mining. The main novel contribution of this work is a method for performing semantic matching between queries and tables. Specifically, we (i) represent queries and tables in multiple semantic spaces (both discrete sparse and continuous dense vector representations) and (ii) introduce various similarity measures for matching those semantic representations. We consider all possible combinations of semantic representations and similarity measures and use these as features in a supervised learning model. Using a purpose-built test collection based on Wikipedia tables, we demonstrate significant and substantial improvements over a state-of-the-art baseline.
【Keywords】: semantic matching; semantic representations; semantic similarity; table retrieval; table search
【Paper Link】 【Pages】:1563-1571
【Authors】: Bin Wu ; Chenyan Xiong ; Maosong Sun ; Zhiyuan Liu
【Abstract】: This paper presents Feedback Memory Network (\textttFMN) which models user interactions with the search engine for query suggestion. Besides modeling the queries issued by the user, \textttFMN also considers user feedback on the search results. It converts user browsing and click actions to the attention over the top-ranked documents and combines them into the feedback memories of the query, thus better models the underlying information needs. The feedback memories and the query sequence are then combined to suggest queries by the sequence-to-sequence neural network. Modeling user feedback makes it possible to suggest diverse queries for the same query sequence, if users have preferred different search results that indicate different information needs. Our experiments on the search log from a Chinese commercial search engine showed the stable and robust advantages of \textttFMN. Especially when the feedback is richer or more informative, \textttFMN provides more diverse and accurate suggestions, which is exceptionally helpful for ambiguous sessions where more information is required to infer the search intents.
【Keywords】: feedback memory network; query suggestion; user modeling
【Paper Link】 【Pages】:1573-1582
【Authors】: Vineeth Rakesh ; Weicong Ding ; Aman Ahuja ; Nikhil Rao ; Yifan Sun ; Chandan K. Reddy
【Abstract】: Online reviews have become an inevitable part of a consumer's decision making process, where the likelihood of purchase not only depends on the product's overall rating, but also on the description of its aspects. Therefore, e-commerce websites such as Amazon and Walmart constantly encourage users to write good quality re- views and categorically summarize different facets of the products. However, despite such attempts, it takes a significant effort to skim through thousands of reviews and look for answers that address the query of consumers. For example, a gamer might be interested in buying a monitor with fast refresh rates and support for Gsync and Freesync technologies, while a photographer might be interested in aspects such as color depth and accuracy. To address these chal- lenges, in this paper, we propose a generative aspect summarization model called APSUM that is capable of providing fine-grained sum- maries of online reviews. To overcome the inherent problem of aspect sparsity, we impose dual constraints: (a) a spike-and-slab prior over the document-topic distribution and (b) a linguistic su- pervision over the word-topic distribution. Using a rigorous set of experiments, we show that the proposed model is capable of out- performing the state-of-the-art aspect summarization model over a variety of datasets and deliver intuitive fine-grained summaries that could simplify the purchase decisions of consumers.
【Keywords】: aspect summarization; information re- trieval; probabilistic generative models; topic models
【Paper Link】 【Pages】:1583-1592
【Authors】: Chong Chen ; Min Zhang ; Yiqun Liu ; Shaoping Ma
【Abstract】: Reviews information is dominant for users to make online purchasing decisions in e-commerces. However, the usefulness of reviews is varied. We argue that less-useful reviews hurt model's performance, and are also less meaningful for user's reference. While some existing models utilize reviews for improving the performance of recommender systems, few of them consider the usefulness of reviews for recommendation quality. In this paper, we introduce a novel attention mechanism to explore the usefulness of reviews, and propose a Neural Attentional Regression model with Review-level Explanations (NARRE) for recommendation. Specifically, NARRE can not only predict precise ratings, but also learn the usefulness of each review simultaneously. Therefore, the highly-useful reviews are obtained which provide review-level explanations to help users make better and faster decisions. Extensive experiments on benchmark datasets of Amazon and Yelp on different domains show that the proposed NARRE model consistently outperforms the state-of-the-art recommendation approaches, including PMF, NMF, SVD++, HFT, and DeepCoNN in terms of rating prediction, by the proposed attention model that takes review usefulness into consideration. Furthermore, the selected reviews are shown to be effective when taking existing review-usefulness ratings in the system as ground truth. Besides, crowd-sourcing based evaluations reveal that in most cases, NARRE achieves equal or even better performances than system's usefulness rating method in selecting reviews. And it is flexible to offer great help on the dominant cases in real e-commerce scenarios when the ratings on review-usefulness are not available in the system.
【Keywords】: explainable recommendation; neural attention network; recommender systems; review usefulness
【Paper Link】 【Pages】:1593-1602
【Authors】: Chun-Ta Lu ; Lifang He ; Hao Ding ; Bokai Cao ; Philip S. Yu
【Abstract】: Real-world relations among entities can often be observed and determined by different perspectives/views. For example, the decision made by a user on whether to adopt an item relies on multiple aspects such as the contextual information of the decision, the item»s attributes, the user»s profile and the reviews given by other users. Different views may exhibit multi-way interactions among entities and provide complementary information. In this paper, we introduce a multi-tensor-based approach that can preserve the underlying structure of multi-view data in a generic predictive model. Specifically, we propose structural factorization machines (SFMs) that learn the common latent spaces shared by multi-view tensors and automatically adjust the importance of each view in the predictive model. Furthermore, the complexity of SFMs is linear in the number of parameters, which make SFMs suitable to large-scale problems. Extensive experiments on real-world datasets demonstrate that the proposed SFMs outperform several state-of-the-art methods in terms of prediction accuracy and computational cost.
【Keywords】: multi-view learning; multi-way interaction; tensor factorization
【Paper Link】 【Pages】:1603-1612
【Authors】: Xin Luo ; Ye Wu ; Xin-Shun Xu
【Abstract】: Supervised hashing methods have attracted much attention in these years. However, most existing supervised hashing algorithms have some of the following problems. First, most of them leverage the pairwise similarity matrix, whose size is quadratic to the number of training samples, to supervise the learning of hash codes. Thus, they are not scalable when dealing with large data. Second, most of them relax the discrete constraints for easy optimization and then quantize the learnt real-valued solution to binary hash codes. Therefore, the quantization error caused by the relaxation may lead to a decline of retrieval performance. To address these issues and make the supervised method scalable to large datasets, we present a novel hashing method, named Scalable Supervised Discrete Hashing (SSDH). Specifically, based on a new loss function, SSDH bypasses the direct optimization on the n by n pairwise similarity matrix. In addition, SSDH adopts no relaxation optimization scheme in the learning procedure and avoids the large quantization error problem. Moreover, during learning, it leverages both the pairwise similarity matrix and label matrix; thus, more semantic information can be embedded to the learning of hash codes. Extensive experiments are conducted on six benchmark datasets including two large-scale datasets, i.e., NUS-WIDE and ImageNet. The results show that SSDH can outperform state-of-the-art baselines on these datasets, demonstrating its effectiveness and efficiency.
【Keywords】: discrete optimization; learning-to-hash; scalable search; supervised hashing
【Paper Link】 【Pages】:1613-1622
【Authors】: Zemin Liu ; Vincent W. Zheng ; Zhou Zhao ; Hongxia Yang ; Kevin Chen-Chuan Chang ; Minghui Wu ; Jing Ying
【Abstract】: Semantic user search is an important task on heterogeneous social networks. Its core problem is to measure the proximity between two user objects in the network w.r.t. certain semantic user relation. State-of-the-art solutions often take a path-based approach, which uses the sequences of objects connecting a query user and a target user to measure their proximity. Despite their success, we assert that path as a low-order structure is insufficient to capture the rich semantics between two users. Therefore, in this paper we introduce a new concept of subgraph-augmented path for semantic user search. Specifically, we consider sampling a set of object paths from a query user to a target user; then in each object path, we replace the linear object sequence between its every two neighboring users with their shared subgraph instances. Such subgraph-augmented paths are expected to leverage both path»s distance awareness and subgraph»s high-order structure. As it is non-trivial to model such subgraph-augmented paths, we develop a Subgraph-augmented Path Embedding (SPE) framework to accomplish the task. We evaluate our solution on six semantic user relations in three real-world public data sets, and show that it outperforms the baselines.
【Keywords】: heterogeneous network; subgraph-augmented path embedding
【Paper Link】 【Pages】:1623-1632
【Authors】: Denghao Ma ; Yueguo Chen ; Kevin Chen-Chuan Chang ; Xiaoyong Du ; Chuanfei Xu ; Yi Chang
【Abstract】: Ad-hoc entity search, which is to retrieve a ranked list of relevant entities in response to a query of natural language question, has been widely studied. It has been shown that category matching of entities, especially when matching to fine-grained entity types/categories, is critical to the performance of entity search. However, the potentials of the fine-grained Wikipedia entity categories, has not been well exploited by existing studies. Based on the observation of how people describe entities of a specific type, we propose a headword-and-modifier model to deeply interpret both queries and fine-grained entity types/categories. Probabilistic generative models are designed to effectively estimate the relevance of headwords and modifiers as a pattern-based matching problem, taking the Wikipedia type taxonomy as an important input to address the ad-hoc representations of concepts/entities in queries. Extensive experimental results on three widely-used test sets: INEX-XER 2009, SemSearch-LS and TREC-Entity, show that our method achieves a significant improvement of the entity search performance over the state-of-the-art methods.
【Keywords】: category matching; entity search; language model
【Paper Link】 【Pages】:1633-1642
【Authors】: Xiao Lin ; Wenpeng Zhang ; Min Zhang ; Wenwu Zhu ; Jian Pei ; Peilin Zhao ; Junzhou Huang
【Abstract】: Factorization Machine (FM) is a supervised learning approach with a powerful capability of feature engineering. It yields state-of-the-art performances in various batch learning tasks where all the training data is made available prior to the training. However, in real-world applications where the data arrives sequentially in a streaming manner, the high cost of re-training with batch learning algorithms has posed formidable challenges in the online learning scenario. The initial challenge is that no prior formulations of FM could directly fulfill the requirements in Online Convex Optimization (OCO) -- the paramount framework for online learning algorithm design. To address this aforementioned challenge, we invent a new convexification scheme leading to a Compact Convexified FM (CCFM) that seamlessly meets the requirements in OCO. However for learning Compact Convexified FM (CCFM) in the online learning settings, most existing algorithms suffer from expensive projection operations. To address this subsequent challenge, we follow the general projection-free algorithmic framework of Online Conditional Gradient and propose an Online Compact Convex Factorization Machine (OCCFM) algorithm that eschews the projection operation with efficient linear optimization steps. In support of the proposed OCCFM in terms of its theoretical foundation, we prove that the developed algorithm achieves a sub-linear regret bound. To evaluate the empirical performance of OCCFM, we conduct extensive experiments on 6 real-world datasets for online regression and online classification tasks. The experimental results show that OCCFM outperforms the state-of-art online learning methods for FM.
【Keywords】: factorization machine; online conditional gradient; online convex optimization; online learning
【Paper Link】 【Pages】:1643-1652
【Authors】: Jundong Li ; Jiliang Tang ; Yilin Wang ; Yali Wan ; Yi Chang ; Huan Liu
【Abstract】: Reciprocity in directed networks points to user»s willingness to return favors in building mutual interactions. High reciprocity has been widely observed in many directed social media networks such as following relations in Twitter and Tumblr. Therefore, reciprocal relations between users are often regarded as a basic mechanism to create stable social ties and play a crucial role in the formation and evolution of networks. Each reciprocity relation is formed by two parasocial links in a back-and-forth manner with a time delay. Hence, understanding the delay can help us gain better insights into the underlying mechanisms of network dynamics. Meanwhile, the accurate prediction of delay has practical implications in advancing a variety of real-world applications such as friend recommendation and marketing campaign. For example, by knowing when will users follow back, service providers can focus on the users with a potential long reciprocal delay for effective targeted marketing. This paper presents the initial investigation of the time delay in reciprocal relations. Our study is based on a large-scale directed network from Tumblr that consists of 62.8 million users and 3.1 billion user following relations with a timespan of multiple years (from 31 Oct 2007 to 24 Jul 2013). We reveal a number of interesting patterns about the delay that motivate the development of a principled learning model to predict the delay in reciprocal relations. Experimental results on the above mentioned dynamic networks corroborate the effectiveness of the proposed delay prediction model.
【Keywords】: dynamic networks; reciprocal relations; time delay
【Paper Link】 【Pages】:1653-1662
【Authors】: Hongshen Chen ; Zhaochun Ren ; Jiliang Tang ; Yihong Eric Zhao ; Dawei Yin
【Abstract】: Dialogue systems help various real applications interact with humans in an intelligent natural way. In dialogue systems, the task of dialogue generation aims to generate utterances given previous utterances as contexts. Among various spectrums of dialogue generation approaches, end-to-end neural generation models have received an increase of attention. These end-to-end neural generation models are capable of generating natural-sounding sentences with a unified neural encoder-decoder network structure. The end-to-end structure sequentially encodes each word in an input context and generates the response word-by-word deterministically during decoding. However, lack of variation and limited ability in capturing long-term dependencies between utterances still challenge existing approaches. In this paper, we propose a novel hierarchical variational memory network (HVMN), by adding the hierarchical structure and the variational memory network into a neural encoder-decoder network. By emulating human-to-human dialogues, our proposed method can capture both the high-level abstract variations and long-term memories during dialogue tracking, which enables the random access of relevant dialogue histories. Extensive experiments conducted on three large real-world datasets verify a significant improvement of our proposed model against state-of-the-art baselines for dialogue generation.
【Keywords】: dialogue generation; hierarchical variational memory network; recurrent encoder-decoder model
【Paper Link】 【Pages】:1663-1672
【Authors】: Sanket Kumar Singh ; Davood Rafiei
【Abstract】: Many applications that use geographical databases (a.k.a. gazetteers) rely on the accuracy of the information in the database. However, poor data quality is an issue when data is integrated from multiple sources with different quality constraints and sometimes with little information about the sources. One major consequence of this is that the geographical scope of a location and/or its position may not be known or may not be accurate. In this paper, we study the problem of detecting the scope of locations in a geographical database and its applications in identifying inconsistencies and improving the quality of a gazetteer. We develop novel strategies, including probabilistic and geometric approaches, to accurately derive the geographical scope of places based on the spatial hierarchy of a gazetteer as well as other public information (such as area) that may be available. We show how the boundary information derived here can be useful in identifying inconsistencies, enhancing the location hierarchy and improving the applications that rely on gazetteers. Our experimental evaluation on two public-domain gazetteers reveals that the proposed approaches significantly outperform, in terms of the accuracy of the geographical bounding boxes, a baseline that is based on the parent-child relationship of a gazetteer. Among applications, we show that the boundary information derived here can move more than 20% of locations in a public gazetteer to better positions in the hierarchy and that the accuracy of those moves is over 90%.
【Keywords】: gazetteer improvement; geographical scoping; geotagging
【Paper Link】 【Pages】:1673-1682
【Authors】: Ning Su ; Yiqun Liu ; Zhao Li ; Yuli Liu ; Min Zhang ; Shaoping Ma
【Abstract】: "Add to Favorites" is a popular function in online shopping sites which helps users to make a record of potentially interesting items for future purchases. It is usually regarded as a type of explicit feedback signal for item popularity and therefore also adopted as a ranking signal by many shopping search engines. With the increasing usage of crowdsourcing platforms, some malicious online sellers also organize crowdturfing activities to increase the numbers of "Add to Favorites" for their items. By this means, they expect the items to gain higher positions in search ranking lists and therefore boost sales. This kind of newly-appeared malicious activity proposes challenges to traditional search spam detection efforts because it involves the participation of many crowd workers who are normal online shopping users in most of the times, and these activities are composed of a series of behaviors including search, browse, click and add to favorites. To shed light on this research question, we are among the first to investigate this particular spamming activity by looking into both the task organization information in crowdsourcing platforms and the user behavior information from online shopping sites. With a comprehensive analysis of some ground truth spamming activities from the perspective of behavior, user and item, we propose a factor graph based model to identify this kind of spamming activity. Experimental results based on data collected in practical shopping search environment show that our model helps detect malicious "Add to Favorites" activities effectively.
【Keywords】: crowdsourcing manipulation; online shopping; spam detection
【Paper Link】 【Pages】:1683-1692
【Authors】: Yashar Moshfeghi ; Frank E. Pollick
【Abstract】: Search is one of the most performed activities on the World Wide Web. Various conceptual models postulate that the search process can be broken down into distinct emotional and cognitive states of searchers while they engage in a search process. These models significantly contribute to our understanding of the search process. However, they are typically based on self-report measures, such as surveys, questionnaire, etc. and therefore, only indirectly monitor the brain activity that supports such a process. With this work, we take one step further and directly measure the brain activity involved in a search process. To do so, we break down a search process into five time periods: a realisation of Information Need, Query Formulation, Query Submission, Relevance Judgment and Satisfaction Judgment. We then investigate the brain activity between these time periods. Using functional Magnetic Resonance Imaging (fMRI), we monitored the brain activity of twenty-four participants during a search process that involved answering questions carefully selected from the TREC-8 and TREC 2001 Q/A Tracks. This novel analysis that focuses on transitions rather than states reveals the contrasting brain activity between time periods - which enables the identification of the distinct parts of the search process as the user moves through them. This work, therefore, provides an important first step in representing the search process based on the transitions between neural states. Discovering more precisely how brain activity relates to different parts of the search process will enable the development of brain-computer interactions that better support search and search interactions, which we believe our study and conclusions advance.
【Keywords】: brain activity; fmri study; functional brain network; information need; neural state; query formulation; query submission; relevance judgment; satisfaction; search process
【Paper Link】 【Pages】:1693-1703
【Authors】: Ziyu Yao ; Daniel S. Weld ; Wei-Peng Chen ; Huan Sun
【Abstract】: Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to have low quality. In this paper, we investigate a new problem of systematically mining question-code pairs from Stack Overflow (in contrast to heuristically collecting them). It is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network which can capture both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in Python and SQL domain, our framework substantially outperforms heuristic methods with at least 15% higher F1 and accuracy. Furthermore, we present StaQC (Stack Overflow Question-Code pairs), the largest dataset to date of ~148K Python and ~120K SQL question-code pairs, automatically mined from SO using our framework. Under various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language
【Keywords】: deep neural networks; natural language question answering; question-code pairs; stack overflow
【Paper Link】 【Pages】:1705-1714
【Authors】: Branislav Kveton ; S. Muthukrishnan ; Hoa T. Vu ; Yikun Xian
【Abstract】: Modern data streams typically have high dimensionality. For example, digital analytics streams consist of user online activities (e.g., web browsing activity, commercial site activity, apps and social behavior, and response to ads). An important problem is to find frequent joint values (heavy hitters) of subsets of dimensions. Formally, the data stream consists of d-dimensional items and a \em k-dimensional subcube T is a subset of k distinct coordinates. Given a theshold γ, a \em subcube heavy hitter query $\rm Query (T,v)$ outputs YES if $f_T(v) \geq γ$ and NO if $f_T(v) ∠ γ/4$ where $f_T$ is the ratio of the number of stream items whose coordinates T have joint values v. The \em all subcube heavy hitters query $\rm AllQuery(T)$ outputs all joint values v that return YES to $\rm Query (T,v)$. The problem is to answer these queries correctly for all T and v. We present a simple one-pass sampling algorithm to solve the subcube heavy hitters problem in $\tildeO (kd/γ)$ space. $\tildeO(\cdot)$ suppresses polylogarithmic factors. This is optimal up to polylogarithmic factors based on the lower bound of Liberty et al. In the worst case, this bound becomes Θ(d^2/γ)$ which is prohibitive for large d. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass, $\tildeO (d/γ)$-space algorithm for our problem, and a fast algorithm for answering $\rm AllQuery (T)$ in $\tO((k/γ)^2)$ time. We demonstrate the effectiveness of our approach on a synthetic dataset as well as real datasets from Adobe and Yandex. Our work shows the potential of model-based approach to data streams.
【Keywords】: algorithms; data streams; graphical models; heavy hitters
【Paper Link】 【Pages】:1715-1724
【Authors】: Gary Ren ; Xiaochuan Ni ; Manish Malik ; Qifa Ke
【Abstract】: Understanding conversations is crucial to enabling conversational search in technologies such as chatbots, digital assistants, and smart home devices that are becoming increasingly popular. Conventional search engines are powerful at answering open domain queries but are mostly capable of stateless search. In this paper, we define a conversational query as a query that depends on the context of the current conversation, and we formulate the conversational query understanding problem as context-aware query reformulation, where the goal is to reformulate the conversational query into a search engine friendly query in order to satisfy users» information needs in conversational settings. Such context-aware query reformulation problem lends itself to sequence to sequence modeling. We present a large scale open domain dataset of conversational queries and various sequence to sequence models that are learned from this dataset. The best model correctly reformulates over half of all conversational queries, showing the potential of sequence to sequence modeling for this task.
【Keywords】: conversations; deep learning; query understanding; search; sequence to sequence
【Paper Link】 【Pages】:1725-1734
【Authors】: Daniel Agun ; Jinjin Shao ; Shiyu Ji ; Stefano Tessaro ; Tao Yang
【Abstract】: This paper proposes a private ranking scheme with linear additive scoring for efficient top K keyword search on modest-sized cloud datasets. This scheme strikes for tradeoffs between privacy and efficiency by proposing single-round client-server collaboration with server-side partial ranking based on blinded feature weights with random masks. Client-side preprocessing includes query decomposition with chunked postings to facilitate earlier range intersection and fast access of server-side key-value stores. Server-side query processing deals with feature vector sparsity through optional feature matching and enables result filtering with query-dependent chunk-wide random masks for queries that yield too many matched documents. This paper provides details on indexing and run-time conjunctive query processing and presents an evaluation that assesses the accuracy, efficiency, and privacy tradeoffs of this scheme through five datasets with various sizes.
【Keywords】: additive rank scoring; privacy-preserving query processing; top k search
【Paper Link】 【Pages】:1735-1744
【Authors】: Shangsong Liang ; Ilya Markov ; Zhaochun Ren ; Maarten de Rijke
【Abstract】: We address the task of fusing ranked lists of documents that are retrieved in response to a query. Past work on this task of rank aggregation often assumes that documents in the lists being fused are independent and that only the documents that are ranked high in many lists are likely to be relevant to a given topic. We propose manifold learning aggregation approaches, ManX and v-ManX, that build on the cluster hypothesis and exploit inter-document similarity information. ManX regularizes document fusion scores, so that documents that appear to be similar within a manifold, receive similar scores, whereas v-ManX first generates virtual adversarial documents and then regularizes the fusion scores of both original and virtual adversarial documents. Since aggregation methods built on the cluster hypothesis are computationally expensive, we adopt an optimization method that uses the top-k documents as anchors and considerably reduces the computational complexity of manifold-based methods, resulting in two efficient aggregation approaches, a-ManX and a-v-ManX. We assess the proposed approaches experimentally and show that they significantly outperform the state-of-the-art aggregation approaches, while a-ManX and a-v-ManX run faster than ManX, v-ManX, respectively.
【Keywords】: ad hoc retrieval; manifold learning; rank aggregation
【Paper Link】 【Pages】:1745-1754
【Authors】: Nir Grinberg
【Abstract】: Prior work established the benefits of server-recorded user engagement measures (e.g. clickthrough rates) for improving the results of search engines and recommendation systems. Client-side measures of post-click behavior received relatively little attention despite the fact that publishers have now the ability to measure how millions of people interact with their content at a fine resolution using client-side logging. In this study, we examine patterns of user engagement in a large, client-side log dataset of over 7.7 million page views (including both mobile and non-mobile devices) of 66,821 news articles from seven popular news publishers. For each page view we use three summary statistics: dwell time, the furthest position the user reached on the page, and the amount of interaction with the page through any form of input (touch, mouse move, etc.). We show that simple transformations on these summary statistics reveal six prototypical modes of reading that range from scanning to extensive reading and persist across sites. Furthermore, we develop a novel measure of information gain in text to capture the development of ideas within the body of articles and investigate how information gain relates to the engagement with articles. Finally, we show that our new measure of information gain is particularly useful for predicting reading of news articles before publication, and that the measure captures unique information not available otherwise.
【Keywords】: information gain; online news; post-click engagement; reading; user engagement
【Paper Link】 【Pages】:1755-1764
【Authors】: Maarten Wijnants ; Robin Marx ; Peter Quax ; Wim Lamotte
【Abstract】: Web performance is a hot topic, as many studies have shown a strong correlation between slow webpages and loss of revenue due to user dissatisfaction. Front and center in Page Load Time (PLT) optimization is the order in which resources are downloaded and processed. The new HTTP/2 specification includes dedicated resource prioritization provisions, to be used in tandem with resource multiplexing over a single, well-filled TCP connection. However, little is yet known about its application by browsers and its impact on page load performance. This article details an extensive survey of modern User Agent implementations, with the conclusion that the major vendors all approach HTTP/2 prioritization in widely different ways, from naive (Safari, IE, Edge) to complex (Chrome, Firefox). We investigate the performance effect of these discrepancies with a full-factorial experimental evaluation involving eight prioritization algorithms, two off-the-shelf User Agents, 40 realistic webpages, and five heterogeneous (emulated) network conditions. We find that in general the complex approaches yield the best results, while naive schemes can lead to over 25% slower median visual load times. Also, prioritization is found to matter most for heavy-weight pages. Finally, it is ascertained that achieving PLT optimizations via generic server-side HTTP/2 re-prioritization schemes is a non-trivial task and that their performance is influenced by the implementation intricacies of individual browsers.
【Keywords】: experimental evaluation; http/2; page load time (plt); prioritization; resource loading; web performance optimization (wpo)
【Paper Link】 【Pages】:1765-1774
【Authors】: Kijung Shin ; Mahdi Shafiei ; Myunghwan Kim ; Aastha Jain ; Hema Raghavan
【Abstract】: User engagement is a key factor for the success of web services. Studying the following questions will help establishing business strategies leading to their success: How do the behaviors of users in a web service evolve over time? To reach a certain engagement level, what are the common stages that many users go through? How can we represent the stage that each individual user lies in? To answer these questions, we propose a behavior model that discovers the progressions of users' behaviors from a given starting point - such as a new subscription or first experience of certain features - to a particular target stage such as a predefined engagement level of interest. Under our model, transitions over stages represent progression of users where each stage in our model is characterized by probability distributions over types of actions, frequencies of actions, and next stages to move. Each user performs actions and moves to a next stage following the probability distributions characterizing the current stage. We also develop a fast and memory-efficient algorithm that fits our model to trillions of behavioral logs. Our algorithm scales linearly with the size of data. Especially, its distributed version implemented in the MapReduce framework successfully handles petabyte-scale data with one trillion actions. Lastly, we show the effectiveness of our model and algorithm by applying them to real-world data from LinkedIn. We discover meaningful stages that LinkedIn users go through leading to predefined target goals. In addition, our trained models are shown to be useful for downstream tasks such as prediction of future actions.
【Keywords】: behavior log; behavior modeling; mapreduce; online social network; progression; stage
【Paper Link】 【Pages】:1775-1784
【Authors】: Chantat Eksombatchai ; Pranav Jindal ; Jerry Zitao Liu ; Yuchen Liu ; Rahul Sharma ; Charles Sugnet ; Mark Ulrich ; Jure Leskovec
【Abstract】: User experience in modern content discovery applications critically depends on high-quality personalized recommendations. However, building systems that provide such recommendations presents a major challenge due to a massive pool of items, a large number of users, and requirements for recommendations to be responsive to user actions and generated on demand in real-time. Here we present Pixie, a scalable graph-based real-time recommender system that we developed and deployed at Pinterest. Given a set of user-specific pins as a query, Pixie selects in real-time from billions of possible pins those that are most related to the query. To generate recommendations, we develop Pixie Random Walk algorithm that utilizes the Pinterest object graph of 3 billion nodes and 17 billion edges. Experiments show that recommendations provided by Pixie lead up to 50% higher user engagement when compared to the previous Hadoop-based production system. Furthermore, we develop a graph pruning strategy at that leads to an additional 58% improvement in recommendations. Last, we discuss system aspects of Pixie, where a single server executes 1,200 recommendation requests per second with 60 millisecond latency. Today, systems backed by Pixie contribute to more than 80% of all user engagement on Pinterest.
【Keywords】:
【Paper Link】 【Pages】:1785-1794
【Authors】: Hong Huang ; Bo Zhao ; Hao Zhao ; Zhou Zhuang ; Zhenxuan Wang ; Xiaoming Yao ; Xinggang Wang ; Hai Jin ; Xiaoming Fu
【Abstract】: The proliferation of mobile devices especially smart phones brings remarkable opportunities for both industry and academia. In particular, the massive data generated from users» usage logs provide the possibilities for stakeholders to know better about consumer behaviors with the aid of data mining. In this paper, we examine the consumer behaviors across multiple platforms based on a large-scale mobile Internet dataset from a major telecom operator, which covers 9.8 million users from two regions among which 1.4 million users have visited e-commerce platforms within one week of our study. We make several interesting observations and examine users» cultural differences from different regions. Our analysis shows among the multiple e-commerce platforms available, most mobile users are loyal to their favorable sites; people (60%) tend to make quick decisions to buy something online, which usually takes less than half an hour. Furthermore, we find that people in residential areas are much easier to perform purchases than in business districts and purchases take place during non-work time. Meanwhile, people with medium socioeconomic status like browsing and purchasing on e-commerce platforms, while people with high and low socioeconomic status are much easier to conduct purchases online. We also show the predictability of cross-platform shopping behaviors with extensive experiments on the basis of our observed data. Our discoveries could be a good guide for e-commerce future strategy making.
【Keywords】: consumer behavior; data mining; e-commerce; mobile usage; predictability
【Paper Link】 【Pages】:1795-1804
【Authors】: Yixuan Cao ; Hongwei Li ; Ping Luo ; Jiaquan Yao
【Abstract】: Verbal descriptions over the numerical relationships among some objective measures widely exist in the published documents on Web, especially in the financial fields. However, due to large volumes of documents and limited time for manual cross-check, these claims might be inconsistent with the original structured data of the related indicators even after official publishing. Such errors can seriously affect investors' assessment of the company and may cause them to undervalue the firm even if the mistakes are made unintentionally instead of deliberately. It creates an opportunity for automated Numerical Cross-Checking (NCC) systems. This paper introduces the key component of such a system, formula extractor, which extracts formulas from verbal descriptions of numerical claims. Specifically, we formulate this task as a DAG-structure prediction problem, and propose an iterative relation extraction model to address it. In our model, we apply a bi-directional LSTM followed by a DAG-structured LSTM to extract formulas layer by layer iteratively. Then, the model is built using a human-labeled dataset of tens of thousands of sentences. The evaluation shows that this model is effective in formula extraction. At the relation level, the model achieves a 97.78% precision and 98.33% recall. At the sentence level, the predictions over 92.02% of sentences are perfect. Overall, the project for NCC has received wide recognition in the Chinese financial community.
【Keywords】: formula extraction; information extraction; iterative relation extraction; numerical cross-checking; relation extraction
【Paper Link】 【Pages】:1805-1814
【Authors】: Bijaya Adhikari ; Parikshit Sondhi ; Wenke Zhang ; Mohit Sharma ; B. Aditya Prakash
【Abstract】: Customer Interaction Networks (CINs) are a natural framework for representing and mining customer interactions with E-Commerce search engines. Customer interactions begin with the submission of a query formulated based on an initial product intent, followed by a sequence of product engagement and query reformulation actions. Engagement with a product (e.g. clicks) indicates its relevance to the customer»s product intent. Reformulation to a new query indicates either dissatisfaction with current results, or an evolution in the customer»s product intent. Analyzing such interactions within and across sessions, enables us to discover various query-query and query-product relationships. In this work, we begin by studying the properties of CINs developed using Walmart.com»s product search logs. We observe that the properties exhibited by CINs make it possible to mine intent relationships between queries based purely on their structural information. We show how these relations can be exploited for a) clustering queries based on intents, b) significantly improve search quality for poorly performing queries, and c) identify the most influential (aka. »critical») queries whose performance have the highest impact on performance of other queries.
【Keywords】: customer interaction networks; e-commerce; query relation mining
【Paper Link】 【Pages】:1815-1824
【Authors】: Yusan Lin ; Peifeng Yin ; Wang-Chien Lee
【Abstract】: The often fierce competition on crowdfunding markets can significantly affect project success. While various factors have been considered in predicting the success of crowdfunding projects, to the best knowledge of the authors, the phenomenon of competition has not been investigated. In this paper, we study the competition on crowdfunding markets through data analysis, and propose a probabilistic generative model, Dynamic Market Competition (DMC) model, to capture the competitiveness of projects in crowdfunding. Through an empirical evaluation using the pledging history of past crowdfunding projects, our approach has shown to capture the competitiveness of projects very well, and significantly outperforms several baseline approaches in predicting the daily collected funds of crowdfunding projects, reducing errors by 31.73% to 45.14%. In addition, our analyses on the correlations between project competitiveness, project design factors, and project success indicate that highly competitive projects, while being winners under various setting of project design factors, are particularly impressive with high pledging goals and high price rewards, comparing to medium and low competitive projects. Finally, the competitiveness of projects learned by DMC is shown to be very useful in applications of predicting final success and days taken to hit pledging goal, reaching 85% accuracy and error of less than 7 days, respectively, with limited information at early pledging stage.
【Keywords】: business competitiveness; crowdfunding; market competition; probabilistic generative model
【Paper Link】 【Pages】:1825-1834
【Authors】: Eunjo Lee ; Jiyoung Woo ; Hyoungshick Kim ; Huy Kang Kim
【Abstract】: Online game involves a very large number of users who are interconnected and interact with each other via the Internet. We studied the characteristics of exchanging virtual goods with real money through the processes called "real money trading (RMT)". This exchange might influence online game user behaviors and cause damage to the reputation of game companies. We examined in-game transactions to reveal RMT by constructing a social graph of virtual goods exchanges in an online game and identifying network communities of users. We analyzed approximately 6,000,000 transactions in a popular online game and inferred RMT transactions by comparing the RMT transactions crawled from an out-game market. Our findings are summarized as follows: (1) the size of the RMT market could be approximately estimated; (2) professional RMT providers typically form a specific network structure (either star-shape or chain) in the trading network, which can be used as a clue for tracing RMT transactions; and (3) the observed RMT market has evolved over time into a monopolized market with a small number of large-sized virtual goods providers.
【Keywords】: community detection; network analysis; online game black market; real money trading
【Paper Link】 【Pages】:1835-1844
【Authors】: Hongwei Wang ; Fuzheng Zhang ; Xing Xie ; Minyi Guo
【Abstract】: Online news recommender systems aim to address the information explosion of news and make personalized recommendation for users. In general, news language is highly condensed, full of knowledge entities and common sense. However, existing methods are unaware of such external knowledge and cannot fully discover latent knowledge-level connections among news. The recommended results for a user are consequently limited to simple patterns and cannot be extended reasonably. To solve the above problem, in this paper, we propose a deep knowledge-aware network (DKN) that incorporates knowledge graph representation into news recommendation. DKN is a content-based deep recommendation framework for click-through rate prediction. The key component of DKN is a multi-channel and word-entity-aligned knowledge-aware convolutional neural network (KCNN) that fuses semantic-level and knowledge-level representations of news. KCNN treats words and entities as multiple channels, and explicitly keeps their alignment relationship during convolution. In addition, to address users» diverse interests, we also design an attention module in DKN to dynamically aggregate a user»s history with respect to current candidate news. Through extensive experiments on a real online news platform, we demonstrate that DKN achieves substantial gains over state-of-the-art deep recommendation models. We also validate the efficacy of the usage of knowledge in DKN.
【Keywords】: attention model; deep neural networks; knowledge graph representation; news recommendation
【Paper Link】 【Pages】:1845-1854
【Authors】: Sergio Pastrana ; Daniel R. Thomas ; Alice Hutchings ; Richard Clayton
【Abstract】: Underground forums allow criminals to interact, exchange knowledge, and trade in products and services. They also provide a pathway into cybercrime, tempting the curious to join those already motivated to obtain easy money. Analysing these forums enables us to better understand the behaviours of offenders and pathways into crime. Prior research has been valuable, but limited by a reliance on datasets that are incomplete or outdated. More complete data, going back many years, allows for comprehensive research into the evolution of forums and their users. We describe CrimeBot, a crawler designed around the particular challenges of capturing data from underground forums. CrimeBot is used to update and maintain CrimeBB, a dataset of more than 48m posts made from 1m accounts in 4 different operational forums over a decade. This dataset presents a new opportunity for large-scale and longitudinal analysis using up-to-date information. We illustrate the potential by presenting a case study using CrimeBB, which analyses which activities lead new actors into engagement with cybercrime. CrimeBB is available to other academic researchers under a legal agreement, designed to prevent misuse and provide safeguards for ethical research.
【Keywords】: crimebb; crimebot; cybercrime; data sharing; ethics; money laundering; underground forums; web crawling
【Paper Link】 【Pages】:1855-1864
【Authors】: Hongchang Gao ; Deguang Kong ; Miao Lu ; Xiao Bai ; Jian Yang
【Abstract】: Click-through rate (CTR) is a critical problem in online advertising. Most existing researches only focus on the user-level CTR prediction. However, advertiser-level CTR forecasting also plays a very important role because advertisers typically decide how much they would like to bid for advertisements to achieve the maximum clicks given their budget based on CTR forecasting. Over-forecasting will make the advertiser to pay more than necessary but get less return on investment (ROI). Under-forecasting will make the advertiser to spend less money on campaigns but they cannot achieve the desired ROI goals. In this paper, we focus on the advertiser-level CTR forecasting and formulate it as a time series forecasting problem based on the historical CTR record. This is a very challenging problem due to the heavy fluctuation and highly non-linearity of time series. Furthermore, advertisers usually provide useful contextual information for their campaigns, such as text descriptions, targeting locations and devices, which has high correlation with CTR but has not yet been used for CTR forecasting. Thus, we propose a novel context-aware attention convolutional neural network (CACNN), which can capture the high non-linearity and local information of the time series, as well as the underlying correlation between the time series of CTR and the contextual information. To the best of our knowledge, this is the first work employing convolutional neural network and incorporating heterogeneous information to perform CTR forecasting at advertiser level. We implement the system on Yahoo TensorFlowOnSpark platform which enables distributed deep learning on a cluster of GPU and CPU servers, and achieves faster learning speed and data access on HDFS when available. The effectiveness of CACNN model has been demonstrated in real-world Yahoo advertising dataset, and therefore deployed in production with daily rolling of the model.
【Keywords】: advertising; attention; click-through rate; convolutional neural network; forecasting; sponsored search; time series
【Paper Link】 【Pages】:1865-1874
【Authors】: Navneet Potti ; James Bradley Wendt ; Qi Zhao ; Sandeep Tata ; Marc Najork
【Abstract】: A vast majority of the emails received by people today are machine-generated by businesses communicating with consumers. While some emails originate as a result of a transaction (e.g., hotel or restaurant reservation confirmations, online purchase receipts, shipping notifications, etc.), a large fraction are commercial emails promoting an offer (a special sale, free shipping, available for a limited time, etc.). The sheer number of these promotional emails makes it difficult for users to read all these emails and decide which ones are actually interesting and actionable. In this paper, we tackle the problem of extracting information from commercial emails promoting an offer to the user. This information enables an email platform to build several new experiences that can unlock the value in these emails without the user having to navigate and read all of them. For instance, we can highlight offers that are expiring soon, or display a notification when there»s an unexpired offer from a merchant if your phone recognizes that you are at that merchant»s store. A key challenge in extracting information from such commercial emails is that they are often image-rich and contain very little text. Training a machine learning (ML) model on a rendered image-rich email and applying it to each incoming email can be prohibitively expensive. In this paper, we describe a cost-effective approach for extracting signals from both the text and image content of commercial emails in the context of Gmail, an email platform that serves over a billion users around the world. The key insight is to leverage the template structure of emails, and use off-the-shelf OCR techniques to obtain the text from images to augment the existing text features offline. Compared to a text-only approach, we show that we are able to identify 9.12% more email templates corresponding to ~5% more emails being identified as offers. Interestingly, our analysis shows that this 5% improvement in coverage is across the board, irrespective of whether the emails were sent by large merchants or small local merchants, allowing us to deliver an improved experience for everyone.
【Keywords】: email; information extraction; wrapper induction
【Paper Link】 【Pages】:1875-1884
【Authors】: Conglong Li ; David G. Andersen ; Qiang Fu ; Sameh Elnikety ; Yuxiong He
【Abstract】: To maximize profit and connect users to relevant products and services, search advertising systems use sophisticated machine learning algorithms to estimate the revenue expectations of thousands of matching ad listings per query. These machine learning computations constitute a substantial part of the operating cost, e.g., 10% to 30% of the total gross revenues. It is desirable to cache and reuse previous computation results to reduce this cost, but caching introduces approximation which comes with potential revenue loss. To maximize cost savings while minimizing the overall revenue impact, an intelligent refresh policy is required to decide when to refresh the cached computation results. The state-of-the-art manually-tuned refresh heuristic uses revenue history to assign different refresh frequencies. Using the gradient boosting regression tree algorithm with well selected features, we introduce a rapid prediction framework that provides refresh decisions at higher accuracy compared to the heuristic. This enables us to build a prediction-based refresh policy and a cache achieving higher profit without manual parameter tuning. Simulations conducted on the logs from a major commercial search advertising system show that our proposed cache design reduces the negative revenue impact (0.07x), and improves the cost savings (1.41x) and the net profit (1.50~1.70x) compared to the state-of-the-art manually-tuned heuristic-based cache design.
【Keywords】: cache systems; machine learning; sponsored search
【Paper Link】 【Pages】:1885-1892
【Authors】: Zachary Nichols ; Adam Stein
【Abstract】: Measuring the causal effect of advertising on driving desired behavior is an important problem to the digital publishing industry (the "attribution" problem). It is common to use observational methods for attribution, due to the high cost and difficulty of employing randomized controlled trials (RCTs). However, recent results have shown that even current sophisticated observational methods may be inaccurate, yielding incorrect estimates of the true effect of advertising. Here, we present a new observational attribution method based on a successful model of neural spiking that learns the temporal interactions between event-based time series. We train this model on data from several RCT marketing experiments, and show that it can accurately recover the true causal attribution.
【Keywords】: digital advertising
【Paper Link】 【Pages】:1893-1899
【Authors】: Ning Ye ; Ariel Fuxman ; Vivek Ramavajjala ; Sergey Nazarov ; J. Patrick McGregor ; Sujith Ravi
【Abstract】: We introduce the problem of automatically suggesting conversational responses to photos and present an intelligent assistant called PhotoReply that solves the problem in the context of a messaging application. For example, when a user receives a photo showing a dog, PhotoReply suggests responses such as "Aww»» and "Cute terrier!»» This simplifies composing responses on constrained mobile keyboards, and delights users with uncanny insights into the photos they receive. PhotoReply is an integral part of the Allo chat application, being its predictive assistance feature with highest click-through rate. We formalize the problem of suggesting responses to images as an instance of multimodal learning which is akin to caption generation models, a topic that has recently received significant attention \citeKarpathy2015,Vinyals2015,Xu2015. We then present a system that "translates»» image pixels to text responses. The system includes a conditioned language model, based on an LSTM, which, given an embedding of image pixels and the previous predicted words, calculates the probability of all words in a vocabulary of being the next word in the generated response and a triggering model trained with image embeddings and concept labels from a large concept taxonomy. We describe training of the models and a thorough experimental evaluation based on crowdsourced datasets and live traffic.
【Keywords】: generative model; image understanding; messaging applications; mobile applications; natural language
【Paper Link】 【Pages】:1901-1908
【Authors】: Carlos Faham ; Grace Tang ; Yuefeng Li ; Edward Goldstein ; Yuji Kosuga ; Jenelle Bray ; David Mandell Freeman
【Abstract】:
【Keywords】:
【Paper Link】 【Pages】:1909-1918
【Authors】: Washington Luiz ; Felipe Viegas ; Rafael Odon de Alencar ; Fernando Mourão ; Thiago Salles ; Dárlinton Carvalho ; Marcos André Gonçalves ; Leonardo C. da Rocha
【Abstract】: In this paper, we propose a general framework that allows developers to filter, summarize and analyze user reviews written about applications on App Stores. Our framework extracts automatically relevant features from reviews of apps (e.g., information about functionalities, bugs, requirements, etc) and analyzes the sentiment associated with each of them. Our framework has three main building blocks, namely, (i) topic modeling, (ii) sentiment analysis and (iii) summarization interface. The topic modeling block aims at finding semantic topics from textual comments, extracting the target features based on the most relevant words of each discovered topic. The sentiment analysis block detects the sentiment associated with each discovered feature. The summarization interface provides to developers an intuitive visualization of the features (i.e., topics) and their associated sentiment, providing richer information than a 'star rating' strategy. Our evaluation shows that the topic modeling block is able to organize information provided by users in subcategories that facilitate the understanding of which features more positively/negatively impact the overall evaluation of the application. Regarding user satisfaction, we can observe that, in spite of the star rating being a good measure of evaluation, the Sentiment Analysis technique is more accurate in capturing the sentiment transmitted by the user by means of a comment.
【Keywords】: analysis of online reviews; sentiment analysis; topic model
【Paper Link】 【Pages】:1919-1928
【Authors】: Su Yan ; Wei Lin ; Tianshu Wu ; Daorui Xiao ; Xu Zheng ; Bo Wu ; Kaipeng Liu
【Abstract】: In most sponsored search platforms, advertisers bid on some keywords for their advertisements (ads). Given a search request, ad retrieval module rewrites the query into bidding keywords, and uses these keywords as keys to select Top N ads through inverted indexes. In this way, an ad will not be retrieved even if queries are related when the advertiser does not bid on corresponding keywords. Moreover, most ad retrieval approaches regard rewriting and ad-selecting as two separated tasks, and focus on boosting relevance between search queries and ads. Recently, in e-commerce sponsored search more and more personalized information has been introduced, such as user profiles, long-time and real-time clicks. Personalized information makes ad retrieval able to employ more elements (e.g. real-time clicks) as search signals and retrieval keys, however it makes ad retrieval more difficult to measure ads retrieved through different signals. To address these problems, we propose a novel ad retrieval framework beyond keywords and relevance in e-commerce sponsored search. Firstly, we employ historical ad click data to initialize a hierarchical network representing signals, keys and ads, in which personalized information is introduced. Then we train a model on top of the hierarchical network by learning the weights of edges. Finally we select the best edges according to the model, boosting RPM/CTR. Experimental results on our e-commerce platform demonstrate that our ad retrieval framework achieves good performance.
【Keywords】: ad retrieval; e-commerce sponsored search; personalization
【Paper Link】 【Pages】:1929-1938
【Authors】: Selin Chun ; Daejin Choi ; Jinyoung Han ; Huy Kang Kim ; Taekyoung Kwon
【Abstract】: Understanding socio-economic systems in MMORPGs can provide an important implication on how people participate in the economy and how people interact with each other. In this paper, we model the socio-economic system of an Aion, a popular MMORPG, as a multi-layer graph. Using the dataset consisting of 94,870 users and their activity records spanning three months, we examine how economic activities are associated with social interactions, and find that social interactions like participating in a party or exchanging messages are highly correlated with the trade activities. We also find that virtual economy in Aion is heavily inclined to a small number of upper-class userswho play a crucial role in virtual economy. Our analysis on the upper-class users reveals that a significant portion of them reach at the max-level and tend to either (i) have many social interactions with others or (ii) play extremely much time with no social activity. We also reveal that there are some low-level upper-class users who gain much money but hardly socialize with others. Lastly, we show how upper-class users who are at low-levels, play the game extremely much more than others, or rarely interact with other users, are associated with the Real Money Trade (RMT), which may be an illegal behavior that gathers in-game money for exchanging into real-world money. We reveal that more than half of total money exchanged through the trade are associated with the upper-class users who involve in the RMT.
【Keywords】: aion; mmorpg; real money trade; socio-economic analysis
【Paper Link】 【Pages】:1939-1948
【Authors】: Jun Feng ; Heng Li ; Minlie Huang ; Shichen Liu ; Wenwu Ou ; Zhirong Wang ; Xiaoyan Zhu
【Abstract】: Ranking is a fundamental and widely studied problem in scenarios such as search, advertising, and recommendation. However, joint optimization for multi-scenario ranking, which aims to improve the overall performance of several ranking strategies in different scenarios, is rather untouched. Separately optimizing each individual strategy has two limitations. The first one is lack of collaboration between scenarios meaning that each strategy maximizes its own objective but ignores the goals of other strategies, leading to a sub-optimal overall performance. The second limitation is the inability of modeling the correlation between scenarios meaning that independent optimization in one scenario only uses its own user data but ignores the context in other scenarios. In this paper, we formulate multi-scenario ranking as a fully cooperative, partially observable, multi-agent sequential decision problem. We propose a novel model named Multi-Agent Recurrent Deterministic Policy Gradient (MA-RDPG) which has a communication component for passing messages, several private actors (agents) for making actions for ranking, and a centralized critic for evaluating the overall performance of the co-working actors. Each scenario is treated as an agent (actor). Agents collaborate with each other by sharing a global action-value function (the critic) and passing messages that encodes historical information across scenarios. The model is evaluated with online settings on a large E-commerce platform. Results show that the proposed model exhibits significant improvements against baselines in terms of the overall performance.
【Keywords】: joint optimization; learning to rank; multi-agent learning; reinforcement learning