29th WWW 2020: Taipei, Taiwan

WWW '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020. ACM / IW3C2

Paper Num: 320 || Session Num: 4

Session: Full Paper 219

1. Relation Adversarial Network for Low Resource Knowledge Graph Completion.

【Pages】:1-12

【Authors】: Ningyu Zhang ; Shumin Deng ; Zhanlin Sun ; Jiaoyan Chen ; Wei Zhang ; Huajun Chen

【Abstract】: Knowledge Graph Completion (KGC) has been proposed to improve Knowledge Graphs by filling in missing connections via link prediction or relation extraction. One of the main difficulties for KGC is a low resource problem. Previous approaches assume sufficient training triples to learn versatile vectors for entities and relations, or a satisfactory number of labeled sentences to train a competent relation extraction model. However, low resource relations are very common in KGs, and those newly added relations often do not have many known samples for training. In this work, we aim at predicting new facts under a challenging setting where only limited training instances are available. We propose a general framework called Weighted Relation Adversarial Network, which utilizes an adversarial procedure to help adapt knowledge/features learned from high resource relations to different but related low resource relations. Specifically, the framework takes advantage of a relation discriminator to distinguish between samples from different relations, and help learn relation-invariant features more transferable from source relations to target relations. Experimental results show that the proposed approach outperforms previous methods regarding low resource settings for both link prediction and relation extraction.

【Keywords】:

2. Learning to Classify: A Flow-Based Relation Network for Encrypted Traffic Classification.

【Pages】:13-22

【Authors】: Wenbo Zheng ; Chao Gou ; Lan Yan ; Shaocong Mo

【Abstract】: As the size and source of network traffic increase, so does the challenge of monitoring and analyzing network traffic. The challenging problems of classifying encrypted traffic are the imbalanced property of network data, the generalization on an unseen dataset, and over-dependence on data size. In this paper, we propose an application of a meta-learning approach to address these problems in encrypted traffic classification, named Flow-Based Relation Network (RBRN). The RBRN is an end-to-end classification model that learns representative features from the raw flows and then classifies them in a unified framework. Moreover, we design a “hallucinator” to produce additional training samples for the imbalanced classification, and then focus on meta-learning to classify unseen categories from few labeled samples. We validate the effectiveness of the RBRN on the real-world network traffic dataset, and the experimental results demonstrate that the RBRN can achieve an excellent classification performance and outperform the state-of-the-art methods on encrypted traffic classification. More interestingly, our model trained on the real-world dataset can generalize very well to unseen datasets, outperforming multiple state-of-the-art methods.

【Keywords】: Computing methodologies; Machine learning

3. FiDo: Ubiquitous Fine-Grained WiFi-based Localization for Unlabelled Users via Domain Adaptation.

【Pages】:23-33

【Authors】: Xi Chen ; Hang Li ; Chenyi Zhou ; Xue Liu ; Di Wu ; Gregory Dudek

【Abstract】: To fully support the emerging location-aware applications, location information with meter-level resolution (or even higher) is required anytime and anywhere. Unfortunately, most of the current location sources (e.g., GPS and check-in data) either are unavailable indoor or provide only house-level resolutions. To fill the gap, this paper utilizes the ubiquitous WiFi signals to establish a (sub)meter-level localization system, which employs WiFi propagation characteristics as location fingerprints. However, an unsolved issue of these WiFi fingerprints lies in their inconsistency across different users. In other words, WiFi fingerprints collected from one user may not be used to localize another user. To address this issue, we propose a WiFi-based Domain-adaptive system FiDo, which is able to localize many different users with labelled data from only one or two example users. FiDo contains two modules: 1) a data augmenter that introduces data diversity using a Variational Autoencoder (VAE); and 2) a domain-adaptive classifier that adjusts itself to newly collected unlabelled data using a joint classification-reconstruction structure. Compared to the state of the art, FiDo increases average F1 score by 11.8% and improves the worst-case accuracy by 20.2%.

【Keywords】:

4. An Empirical Study of the Use of Integrity Verification Mechanisms for Web Subresources.

【Pages】:34-45

【Authors】: Bertil Chapuis ; Olamide Omolola ; Mauro Cherubini ; Mathias Humbert ; Kévin Huguenin

【Abstract】: Web developers can (and do) include subresources such as scripts, stylesheets and images in their webpages. Such subresources might be stored on content delivery networks (CDNs). This practice creates security and privacy risks, should a subresource be corrupted. The subresource integrity (SRI) recommendation, released in mid-2016 by the W3C, enables developers to include digests in their webpages in order for web browsers to verify the integrity of subresources before loading them. In this paper, we conduct the first large-scale longitudinal study of the use of SRI on the Web by analyzing massive crawls (≈ 3B URLs) of the Web over the last 3.5 years. Our results show that the adoption of SRI is modest (≈), but grows at an increasing rate and is highly influenced by the practices of popular library developers (e.g., Bootstrap) and CDN operators (e.g., jsDelivr). We complement our analysis about SRI with a survey of web developers (N=): It shows that a substantial proportion of developers know SRI and understand its basic functioning, but most of them ignore important aspects of the recommendation. The results of the survey also show that the integration of SRI by developers is mostly manual – hence not scalable and error prone. This calls for a better integration of SRI in build tools.

【Keywords】:
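
The digests that the abstract of entry 4 refers to can be illustrated with a minimal sketch. Under the W3C SRI recommendation, an `integrity` value is the hash algorithm name, a hyphen, and the base64-encoded digest of the subresource bytes. The function name `sri_digest` and the sample script content are assumptions made for illustration; they do not come from the paper.

```python
import base64
import hashlib

# Hash algorithms permitted for SRI integrity metadata.
ALLOWED_ALGORITHMS = ("sha256", "sha384", "sha512")


def sri_digest(content: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI integrity value such as 'sha384-<base64 digest>'.

    `sri_digest` is a hypothetical helper name used only in this sketch.
    """
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unsupported SRI algorithm: {algorithm}")
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


if __name__ == "__main__":
    # Illustrative script body; in practice this would be the exact bytes
    # served by the CDN.
    script_body = b"console.log('hello');"
    value = sri_digest(script_body)
    print(f'<script src="lib.js" integrity="{value}" crossorigin="anonymous"></script>')
```

A build tool could run a helper of this kind over each bundled asset and write the result into the generated HTML, which is the sort of automation the abstract calls for.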

5. Power-Law Graphs Have Minimal Scaling of Kemeny Constant for Random Walks.

【Pages】:46-56

【Authors】: Wanyue Xu ; Yibin Sheng ; Zuobai Zhang ; Haibin Kan ; Zhongzhi Zhang

【Abstract】: The mean hitting time from a node i to a node j selected randomly according to the stationary distribution of random walks is called the Kemeny constant, which has found various applications. It was proved that over all graphs with N vertices, complete graphs have the exact minimum Kemeny constant, growing linearly with N. Here we study numerically or analytically the Kemeny constant on many sparse real-world and model networks with scale-free small-world topology, and show that their Kemeny constant also behaves linearly with N. Thus, sparse networks with scale-free and small-world topology are favorable architectures with optimal scaling of Kemeny constant. We then present a theoretically guaranteed estimation algorithm, which approximates the Kemeny constant for a graph in nearly linear time with respect to the number of edges. Extensive numerical experiments on model and real networks show that our approximation algorithm is both efficient and accurate.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms
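
The definition in the abstract of entry 5 can be made concrete with a small sketch: for a fixed start node, solve for the mean hitting time to every other node and average those times under the stationary distribution. This is a brute-force illustration for tiny graphs and is not the nearly-linear-time estimator the paper proposes. It assumes an undirected, connected, simple graph given as an adjacency-list dictionary; the names `solve_linear` and `kemeny_constant` are hypothetical.

```python
from typing import Dict, List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def kemeny_constant(adj: Dict[int, List[int]], start: int) -> float:
    """Kemeny constant via hitting times from `start` (result is independent of start).

    Assumes an undirected, connected graph without self-loops or multi-edges.
    """
    nodes = sorted(adj)
    total_degree = sum(len(adj[v]) for v in nodes)
    kemeny = 0.0
    for target in nodes:
        if target == start:
            continue  # hitting time from a node to itself is taken as 0
        others = [v for v in nodes if v != target]
        index = {v: k for k, v in enumerate(others)}
        # Hitting times h satisfy (I - P') h = 1, where P' is the transition
        # matrix restricted to the non-target nodes.
        a = [[0.0] * len(others) for _ in others]
        for v in others:
            a[index[v]][index[v]] = 1.0
            for w in adj[v]:
                if w != target:
                    a[index[v]][index[w]] -= 1.0 / len(adj[v])
        h = solve_linear(a, [1.0] * len(others))
        # Stationary probability of a node is its degree over the total degree.
        pi_target = len(adj[target]) / total_degree
        kemeny += pi_target * h[index[start]]
    return kemeny


if __name__ == "__main__":
    complete5 = {i: [j for j in range(5) if j != i] for i in range(5)}
    print(kemeny_constant(complete5, start=0))
```

For the complete graph on N nodes every hitting time equals N - 1, so the value is (N - 1)^2 / N, which grows linearly with N as the abstract states.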

6. Metric Learning with Equidistant and Equidistributed Triplet-based Loss for Product Image Search.

【Pages】:57-65

【Authors】: Furong Xu ; Wei Zhang ; Yuan Cheng ; Wei Chu

【Abstract】: Product image search in E-commerce systems is a challenging task, because of a huge number of product classes, low intra-class similarity and high inter-class similarity. Deep metric learning, based on paired distances independent of the number of classes, aims to minimize intra-class variances and inter-class similarity in feature embedding space. Most existing approaches strictly restrict the distance between samples with fixed values to distinguish different classes of samples. However, the distance of paired samples has various magnitudes during different training stages. Therefore, it is difficult to directly restrict absolute distances with fixed values. In this paper, we propose a novel Equidistant and Equidistributed Triplet-based (EET) loss function to adjust the distance between samples with relative distance constraints. By optimizing the loss function, the algorithm progressively maximizes intra-class similarity and inter-class variances. Specifically, 1) the equidistant loss pulls the matched samples closer by adaptively constraining two samples of the same class to be equally distant from another one of a different class in each triplet, 2) the equidistributed loss pushes the mismatched samples farther away by guiding different classes to be uniformly distributed while keeping intra-class structure compact in embedding space. Extensive experimental results on product search benchmarks verify the improved performance of our method. We also achieve improvements on other retrieval datasets, which show superior generalization capacity of our method in image search.

【Keywords】:

7. "What Apps Did You Use?": Understanding the Long-term Evolution of Mobile App Usage.

【Pages】:66-76

【Authors】: Tong Li ; Mingyang Zhang ; Hancheng Cao ; Yong Li ; Sasu Tarkoma ; Pan Hui

【Abstract】: The prevalence of smartphones has promoted the popularity of mobile apps in recent years. Although significant effort has been made to understand mobile app usage, existing studies are based primarily on short-term datasets with limited time span, e.g., a few months. Therefore, many basic facts about the long-term evolution of mobile app usage are unknown. In this paper, we study how mobile app usage evolves over a long-term period. We first introduce an app usage collection platform named carat, from which we have gathered app usage records of 1,465 users from 2012 to 2017. We then conduct the first study on the long-term evolution processes on a macro-level, i.e., app-category, and micro-level, i.e., individual app. We discover that, on both levels, there is a growth stage enabled by the introduction of new technologies. Then there is a plateau stage caused by high correlations between app categories and a pareto effect in individual app usage, respectively. Additionally, the evolution of individual app usage undergoes an elimination stage due to fierce intra-category competition. Nevertheless, the diverseness of app-category and individual app usage exhibit opposing trends: app-category usage assimilates while individual app usage diversifies. Our study provides useful implications for app developers, market intermediaries, and service providers.

【Keywords】:

8. OutfitNet: Fashion Outfit Recommendation with Attention-Based Multiple Instance Learning.

【Pages】:77-87

【Authors】: Yusan Lin ; Maryam Moosaei ; Hao Yang

【Abstract】: Recommending fashion outfits to users presents several challenges. First of all, an outfit consists of multiple fashion items, and each user emphasizes different parts of an outfit when considering whether they like it or not. Secondly, a user’s liking for a fashion outfit considers not only the aesthetics of each item but also the compatibility among them. Lastly, fashion outfit data is often sparse in terms of the relationship between users and fashion outfits. Not to mention, we can only obtain what the users like, but not what they dislike.

【Keywords】:

9. Towards Fine-grained Flow Forecasting: A Graph Attention Approach for Bike Sharing Systems.

【Pages】:88-98

【Authors】: Suining He ; Kang G. Shin

【Abstract】: As a healthy, efficient and green alternative to motorized urban travel, bike sharing has been increasingly popular, leading to wide deployment and use of bikes instead of cars. Accurate bike-flow prediction at the individual station level is essential for bike sharing service. Due to the spatial and temporal complexities of traffic networks and the lack of data-driven design for bike stations, existing methods cannot predict the fine-grained bike flows to/from each station.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

10. Reinforced Negative Sampling over Knowledge Graph for Recommendation.

【Pages】:99-109

【Authors】: Xiang Wang ; Yaokun Xu ; Xiangnan He ; Yixin Cao ; Meng Wang ; Tat-Seng Chua

【Abstract】: Properly handling missing data is a fundamental challenge in recommendation. Most present works perform negative sampling from unobserved data to supply the training of recommender models with negative signals. Nevertheless, existing negative sampling strategies, either static or adaptive ones, are insufficient to yield high-quality negative samples — both informative to model training and reflective of user real needs.

【Keywords】: Information systems; Information retrieval; Retrieval tasks and goals; Recommender systems

11. Crowd Teaching with Imperfect Labels.

【Pages】:110-121

【Authors】: Yao Zhou ; Arun Reddy Nelakurthi ; Ross Maciejewski ; Wei Fan ; Jingrui He

【Abstract】: The need for annotated labels to train machine learning models led to a surge in crowdsourcing - collecting labels from non-experts. Instead of annotating from scratch, given an imperfect labeled set, how can we leverage the label information obtained from amateur crowd workers to improve the data quality? Furthermore, is there a way to teach the amateur crowd workers using this imperfect labeled set in order to improve their labeling performance? In this paper, we aim to answer both questions via a novel interactive teaching framework, which uses visual explanations to simultaneously teach and gauge the confidence level of the crowd workers.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by classification

12. Directional and Explainable Serendipity Recommendation.

【Pages】:122-132

【Authors】: Xueqi Li ; Wenjun Jiang ; Weiguang Chen ; Jie Wu ; Guojun Wang ; Kenli Li

【Abstract】: Serendipity recommendation has attracted more and more attention in recent years; it is committed to providing recommendations which could not only cater to users’ demands but also broaden their horizons. However, existing approaches usually measure user-item relevance with a scalar instead of a vector, ignoring user preference direction, which increases the risk of unrelated recommendations. In addition, reasonable explanations increase users’ trust and acceptance, but there is no work to provide explanations for serendipitous recommendations. To address these limitations, we propose a Directional and Explainable Serendipity Recommendation method named DESR. Specifically, we extract users’ long-term preferences with an unsupervised method based on GMM (Gaussian Mixture Model) and capture their short-term demands with the capsule network at first. Then, we propose the serendipity vector to combine long-term preferences with short-term demands and generate directionally serendipitous recommendations with it. Finally, a back-routing scheme is exploited to offer explanations. Extensive experiments on real-world datasets show that DESR could effectively improve the serendipity and explainability, and give impetus to the diversity, compared with existing serendipity-based methods.

【Keywords】:

13. Dynamic Flow Distribution Prediction for Urban Dockless E-Scooter Sharing Reconfiguration.

【Pages】:133-143

【Authors】: Suining He ; Kang G. Shin

【Abstract】: Thanks to recent progress in mobile payment, IoT, electric motors, batteries and location-based services, Dockless E-scooter Sharing (DES) has become a popular means of last-mile commute for a growing number of (smart) cities. As e-scooters are getting deployed dynamically and flexibly across city regions that expand and/or shrink, with subsequent social, commercial and environmental evaluation, accurate prediction of the distribution of e-scooters given reconfigured regions becomes essential for the city planners and service providers.

【Keywords】:

14. Graph Attention Topic Modeling Network.

【Pages】:144-154

【Authors】: Liang Yang ; Fan Wu ; Junhua Gu ; Chuan Wang ; Xiaochun Cao ; Di Jin ; Yuanfang Guo

【Abstract】: Existing topic modeling approaches possess several issues, including the overfitting issue of Probabilistic Latent Semantic Indexing (pLSI), the failure of capturing the rich topical correlations among topics in Latent Dirichlet Allocation (LDA), and high inference complexity. In this paper, we provide a new method to overcome the overfitting issue of pLSI by using the amortized inference with word embedding as input, instead of the Dirichlet prior in LDA. For generative topic model, the large number of free latent variables is the root of overfitting. To reduce the number of parameters, the amortized inference replaces the inference of latent variable with a function which possesses the shared (amortized) learnable parameters. The number of the shared parameters is fixed and independent of the scale of the corpus. To overcome the limited application of amortized inference to independent and identically distributed (i.i.d) data, a novel graph neural network, Graph Attention TOpic Network (GATON), is proposed to model the topic structure of non-i.i.d documents according to the following two observations. First, pLSI can be interpreted as stochastic block model (SBM) on a specific bi-partite graph. Second, graph attention network (GAT) can be explained as the semi-amortized inference of SBM, which relaxes the i.i.d data assumption of vanilla amortized inference. GATON provides a novel scheme, i.e. graph convolution operation based scheme, to integrate word similarity and word co-occurrence structure. Specifically, the bag-of-words document representation is modeled as a bi-partite graph topology. Meanwhile, word embedding, which captures the word similarity, is modeled as attribute of the word node and the term frequency vector is adopted as the attribute of the document node. Based on the weighted (attention) graph convolution operation, the word co-occurrence structure and word similarity patterns are seamlessly integrated for topic identification. Extensive experiments demonstrate that the effectiveness of GATON on topic identification not only benefits the document classification, but also significantly refines the input word embedding.

【Keywords】:

15. Measurements, Analyses, and Insights on the Entire Ethereum Blockchain Network.

【Pages】:155-166

【Authors】: Xi Tong Lee ; Arijit Khan ; Sourav Sen Gupta ; Yu Hann Ong ; Xuan Liu

【Abstract】: Blockchains are increasingly becoming popular due to the prevalence of cryptocurrencies and decentralized applications. Ethereum is a distributed public blockchain network that focuses on running code (smart contracts) for decentralized applications. More simply, it is a platform for sharing information in a global state that cannot be manipulated or changed. Ethereum blockchain introduces a novel ecosystem of human users and autonomous agents (smart contracts). In this network, we are interested in all possible interactions: user-to-user, user-to-contract, contract-to-user, and contract-to-contract. This requires us to construct interaction networks from the entire Ethereum blockchain data, where vertices are accounts (users, contracts) and arcs denote interactions. Our analyses on the networks reveal new insights by combining information from the four networks. We perform an in-depth study of these networks based on several graph properties consisting of both local and global properties, discuss their similarities and differences with social networks and the Web, draw interesting conclusions, and highlight important, future research directions.

【Keywords】:

16. The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing.

【Pages】:167-178

【Authors】: David Zeber ; Sarah Bird ; Camila Oliveira ; Walter Rudametkin ; Ilana Segall ; Fredrik Wollsén ; Martin Lopatka

【Abstract】: Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don’t require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments, and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We quantify baseline variation of simultaneous crawls, then isolate the effects of time, cloud IP address vs. residential, and operating system. This provides a foundation to assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users on loading pages from the same domains.

【Keywords】:

17. Client Insourcing: Bringing Ops In-House for Seamless Re-engineering of Full-Stack JavaScript Applications.

【Pages】:179-189

【Authors】: Kijin An ; Eli Tilevich

【Abstract】: Modern web applications are distributed across a browser-based client and a cloud-based server. Distribution provides access to remote resources, accessed over the web and shared by clients. Much of the complexity of inspecting and evolving web applications lies in their distributed nature. Also, the majority of mature program analysis and transformation tools works only with centralized software. Inspired by business process re-engineering, in which remote operations can be insourced back in house to restructure and outsource anew, we bring an analogous approach to the re-engineering of web applications. Our target domain is full-stack JavaScript applications that implement both the client and server code in this language. Our approach is enabled by Client Insourcing, a novel automatic refactoring that creates a semantically equivalent centralized version of a distributed application. This centralized version is then inspected, modified, and redistributed to meet new requirements. After describing the design and implementation of Client Insourcing, we demonstrate its utility and value in addressing changes in security, reliability, and performance requirements. By reducing the complexity of the non-trivial program inspection and evolution tasks performed to meet these requirements, our approach can become a helpful aid in the re-engineering of web applications in this domain.

【Keywords】:

18. Privacy-preserving AI Services Through Data Decentralization.

【Pages】:190-200

【Authors】: Christian Meurisch ; Bekir Bayrak ; Max Mühlhäuser

【Abstract】: User services increasingly base their actions on AI models, e.g., to offer personalized and proactive support. However, the underlying AI algorithms require a continuous stream of personal data—leading to privacy issues, as users typically have to share this data out of their territory. Current privacy-preserving concepts are either not applicable to such AI-based services or to the disadvantage of any party. This paper presents PrivAI, a new decentralized and privacy-by-design platform for overcoming the need for sharing user data to benefit from personalized AI services. In short, PrivAI complements existing approaches to personal data stores, but strictly enforces the confinement of raw user data. PrivAI further addresses the resulting challenges by (1) dividing AI algorithms into cloud-based general model training, subsequent local personalization, and community-based sharing of model updates for new users; by (2) loading confidential AI models into a trusted execution environment, and thus, protecting provider’s intellectual property (IP). Our experiments show the feasibility and effectiveness of PrivAI with comparable performance as currently-practiced approaches.

【Keywords】:

19. ASER: A Large-scale Eventuality Knowledge Graph.

【Pages】:201-211

【Authors】: Hongming Zhang ; Xin Liu ; Haojie Pan ; Yangqiu Song ; Cane Wing-Ki Leung

【Abstract】: Understanding human language requires complex world knowledge. However, existing large-scale knowledge graphs mainly focus on knowledge about entities while ignoring knowledge about activities, states, or events, which are used to describe how entities or things act in the real world. To fill this gap, we develop ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more than 11-billion-token unstructured textual data. ASER contains 15 relation types belonging to five categories, 194-million unique eventualities, and 64-million unique edges among them. Both intrinsic and extrinsic evaluations demonstrate the quality and effectiveness of ASER.

【Keywords】: Computing methodologies; Artificial intelligence; Knowledge representation and reasoning; Natural language processing

20. Nowhere to Hide: Cross-modal Identity Leakage between Biometrics and Devices.

【Pages】:212-223

【Authors】: Chris Xiaoxuan Lu ; Yang Li ; Yuanbo Xiangli ; Zhengxiong Li

【Abstract】: Along with the benefits of Internet of Things (IoT) come potential privacy risks, since billions of the connected devices are granted permission to track information about their users and communicate it to other parties over the Internet. Of particular interest to the adversary is the user identity which constantly plays an important role in launching attacks. While the exposure of a certain type of physical biometrics or device identity is extensively studied, the compound effect of leakage from both sides remains unknown in multi-modal sensing environments. In this work, we explore the feasibility of the compound identity leakage across cyber-physical spaces and unveil that co-located smart device IDs (e.g., smartphone MAC addresses) and physical biometrics (e.g., facial/vocal samples) are side channels to each other. It is demonstrated that our method is robust to various observation noise in the wild and an attacker can comprehensively profile victims in multi-dimension with nearly zero analysis effort. Two real-world experiments on different biometrics and device IDs show that the presented approach can compromise more than 70% of device IDs and harvests multiple biometric clusters with purity at the same time.

【Keywords】:

21. Facebook Ads Monitor: An Independent Auditing System for Political Ads on Facebook.

【Pages】:224-234

【Authors】: Márcio Silva ; Lucas Santos de Oliveira ; Athanasios Andreou ; Pedro Olmo Stancioli Vaz de Melo ; Oana Goga ; Fabrício Benevenuto

【Abstract】: The 2016 United States presidential election was marked by the abuse of targeted advertising on Facebook. Concerned with the risk of the same kind of abuse happening in the 2018 Brazilian elections, we designed and deployed an independent auditing system to monitor political ads on Facebook in Brazil. To do that we first adapted a browser plugin to gather ads from the timeline of volunteers using Facebook. We managed to convince more than 2000 volunteers to help our project and install our tool. Then, we use a Convolutional Neural Network (CNN) to detect political Facebook ads using word embeddings. To evaluate our approach, we manually label a data collection of 10k ads as political or non-political and then we provide an in-depth evaluation of the proposed approach for identifying political ads by comparing it with classic supervised machine learning methods. Finally, we deployed a real system that shows the ads identified as related to politics. We noticed that not all political ads we detected were present in the Facebook Ad Library for political ads. Our results emphasize the importance of enforcement mechanisms for declaring political ads and the need for independent auditing platforms.

【Keywords】:

【Pages】:235-245

【Authors】: Yuxuan Shi ; Gong Cheng ; Evgeny Kharlamov

【Abstract】: Keyword search is a prominent approach to querying Web data. For graph-structured data, a widely accepted semantics for keywords is based on group Steiner trees. For this NP-hard problem, existing algorithms with provable quality guarantees have prohibitive run time on large graphs. In this paper, we propose practical approximation algorithms with a guaranteed quality of computed answers and very low run time. Our algorithms rely on Hub Labeling (HL), a structure that labels each vertex in a graph with a list of vertices reachable from it, which we use to compute distances and shortest paths. We devise two HLs: a conventional static HL that uses a new heuristic to improve pruned landmark labeling, and a novel dynamic HL that inverts and aggregates query-relevant static labels to more efficiently process vertex sets. Our approach allows computing a reasonably good approximation of answers to keyword queries in milliseconds on million-scale knowledge graphs.

【Keywords】:
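
The distance computation that the abstract above attributes to Hub Labeling can be sketched generically: each vertex stores a label mapping hub vertices to distances, and a query takes the minimum combined distance over hubs that both labels share. This shows only the standard 2-hop query on an undirected graph. The paper's static and dynamic labelings, its pruning heuristic, and its group Steiner tree algorithms are not reproduced. The labels below are hand-built for a three-vertex path and the name `hl_distance` is hypothetical.

```python
import math
from typing import Dict, Hashable

# A label maps each hub reachable from a vertex to the distance to that hub.
Label = Dict[Hashable, float]


def hl_distance(labels: Dict[Hashable, Label], u: Hashable, v: Hashable) -> float:
    """Return min over common hubs h of d(u, h) + d(h, v), or inf if none exists."""
    label_u, label_v = labels[u], labels[v]
    # Iterate over the smaller label and probe the larger one.
    if len(label_u) > len(label_v):
        label_u, label_v = label_v, label_u
    best = math.inf
    for hub, dist_to_hub in label_u.items():
        if hub in label_v:
            best = min(best, dist_to_hub + label_v[hub])
    return best


if __name__ == "__main__":
    # Hand-built labels for the path a - b - c with unit edge weights,
    # using b as the shared hub (an illustrative assumption).
    labels = {
        "a": {"a": 0, "b": 1},
        "b": {"b": 0},
        "c": {"c": 0, "b": 1},
    }
    print(hl_distance(labels, "a", "c"))
```

The query cost depends only on label sizes rather than on graph size, which is why label-based lookups are a plausible fit for the millisecond-scale query times the abstract reports.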

23. AutoMAP: Diagnose Your Microservice-based Web Applications Automatically.

【Pages】:246-258

【Authors】: Meng Ma ; Jingmin Xu ; Yuan Wang ; Pengfei Chen ; Zonghua Zhang ; Ping Wang

【Abstract】: The high complexity and dynamics of the microservice architecture make its application diagnosis extremely challenging. Static troubleshooting approaches may fail to obtain a reliable model that applies to frequently changing situations. Even if we know the calling dependency of services, we lack a more dynamic diagnosis mechanism due to the existence of indirect fault propagation. Besides, algorithms based on a single metric usually fail to identify the root cause of an anomaly, as a single type of metric is not enough to characterize the anomalies that occur in diverse services. In view of this, we design a novel tool, named AutoMAP, which enables dynamic generation of service correlations and automated diagnosis leveraging multiple types of metrics. In AutoMAP, we propose the concept of anomaly behavior graph to describe the correlations between services associated with different types of metrics. Two binary operations, as well as a similarity function on behavior graphs, are defined to help AutoMAP choose the appropriate diagnosis metric in any particular scenario. Following the behavior graph, we design a heuristic investigation algorithm by using forward, self, and backward random walk, with the objective of identifying the root cause services. To demonstrate the strengths of AutoMAP, we develop a prototype and evaluate it in both a simulated environment and a real-world enterprise cloud system. Experimental results clearly indicate that AutoMAP achieves over 90% precision, which significantly outperforms other selected baseline methods. AutoMAP can be quickly deployed in a variety of microservice-based systems without any system knowledge. It also supports introduction of various expert knowledge to improve accuracy.

【Keywords】:

24. Graph Representation Learning via Graphical Mutual Information Maximization.

Paper Link】 【Pages】:259-270

【Authors】: Zhen Peng ; Wenbing Huang ; Minnan Luo ; Qinghua Zheng ; Yu Rong ; Tingyang Xu ; Junzhou Huang

【Abstract】: The richness in the content of various information networks such as social networks and communication networks provides the unprecedented potential for learning high-quality expressive representations without external supervision. This paper investigates how to preserve and extract the abundant information from graph-structured data into embedding space in an unsupervised manner. To this end, we propose a novel concept, Graphical Mutual Information (GMI), to measure the correlation between input graphs and high-level hidden representations. GMI generalizes the idea of conventional mutual information computations from vector space to the graph domain where measuring mutual information from two aspects of node features and topological structure is indispensable. GMI exhibits several benefits: First, it is invariant to the isomorphic transformation of input graphs—an inevitable constraint in many existing graph representation learning algorithms; Besides, it can be efficiently estimated and maximized by current mutual information estimation methods such as MINE; Finally, our theoretical analysis confirms its correctness and rationality. With the aid of GMI, we develop an unsupervised learning model trained by maximizing GMI between the input and output of a graph neural encoder. Considerable experiments on transductive as well as inductive node classification and link prediction demonstrate that our method outperforms state-of-the-art unsupervised counterparts, and even sometimes exceeds the performance of supervised ones.

【Keywords】: Information systems; Information systems applications; Data mining

25. Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web.

Paper Link】 【Pages】:271-280

【Authors】: Syed Suleman Ahmad ; Muhammad Daniyal Dar ; Muhammad Fareed Zaffar ; Narseo Vallina-Rodriguez ; Rishab Nithyanand

【Abstract】: Data generated by web crawlers has formed the basis for much of our current understanding of the Internet. However, not all crawlers are created equal and crawlers generally find themselves trading off between computational overhead, developer effort, data accuracy, and completeness. Therefore, the choice of crawler has a critical impact on the data generated and knowledge inferred from it. In this paper, we conduct a systematic study of the trade-offs presented by different crawlers and the impact that these can have on various types of measurement studies. We make the following contributions: First, we conduct a survey of all research published since 2015 in the premier security and Internet measurement venues to identify and verify the repeatability of crawling methodologies deployed for different problem domains and publication venues. Next, we conduct a qualitative evaluation of a subset of all crawling tools identified in our survey. This evaluation allows us to draw conclusions about the suitability of each tool for specific types of data gathering. Finally, we present a methodology and a measurement framework to empirically highlight the differences between crawlers and how the choice of crawler can impact our understanding of the web.

【Keywords】:

26. Generating Multi-hop Reasoning Questions to Improve Machine Reading Comprehension.

Paper Link】 【Pages】:281-291

【Authors】: Jianxing Yu ; Xiaojun Quan ; Qinliang Su ; Jian Yin

【Abstract】: This paper focuses on the topic of multi-hop question generation, which aims to generate questions that require reasoning over multiple sentences and relations to derive answers. In particular, we first build an entity graph to integrate various entities scattered over text based on their contextual relations. We then heuristically extract the sub-graph by the evidential relations and type, so as to obtain the reasoning chain and textual related contents for each question. Guided by the chain, we propose a holistic generator-evaluator network to form the questions, where such guidance helps to ensure the rationality of generated questions which need multi-hop deduction to correspond to the answers. The generator is a sequence-to-sequence model, designed with several techniques to make the questions syntactically and semantically valid. The evaluator optimizes the generator network by employing a hybrid mechanism combining supervised and reinforcement learning. Experimental results on the HotpotQA data set demonstrate the effectiveness of our approach, where the generated samples can be used as pseudo training data to alleviate the data shortage problem for neural networks and help train state-of-the-art models for multi-hop machine comprehension.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

27. Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation.

Paper Link】 【Pages】:292-302

【Authors】: Mengyue Yang ; Qingyang Li ; Zhiwei Qin ; Jieping Ye

【Abstract】: Contextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems. When it comes to real-world scenarios such as recommendation systems and online advertising, however, it is essential to consider the resource consumption of exploration. In practice, there is typically a non-zero cost associated with executing a recommendation (arm) in the environment, and hence, the policy should be learned with a fixed exploration cost constraint. It is challenging to learn a global optimal policy directly, since it is an NP-hard problem and significantly complicates the exploration and exploitation trade-off of bandit algorithms. Existing approaches focus on solving the problem by adopting a greedy policy which estimates the expected rewards and costs from historical observations and greedily selects arms based on each arm’s expected reward/cost ratio until the exploration resource is exhausted. However, existing methods are hard to extend to an infinite time horizon, since the learning process will be terminated when there is no more resource. In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint. HATCH adopts an adaptive method to allocate the exploration resource based on the remaining resource/time and the estimation of reward distribution among different user contexts. In addition, we make full use of contextual feature information to find the best personalized recommendation. Finally, in order to provide a theoretical guarantee, we present a regret bound analysis and prove that HATCH achieves a regret bound as low as . The experimental results demonstrate the effectiveness and efficiency of the proposed method on both synthetic data sets and real-world applications.

【Keywords】: Computing methodologies; Machine learning
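
The greedy baseline that the abstract above attributes to existing approaches can be illustrated with a minimal sketch. This is not HATCH itself, and the names `greedy_budgeted_pulls`, `est_reward`, and `est_cost` are hypothetical. For simplicity the estimates are held fixed, whereas a real bandit would update them after every pull.

```python
def greedy_budgeted_pulls(est_reward, est_cost, budget):
    """Pick the arm with the best estimated reward/cost ratio until
    no arm is affordable with the remaining budget.

    est_reward, est_cost: dicts mapping arm name -> fixed estimate
    (an assumption of this sketch; real estimates change over time).
    """
    pulls = []
    remaining = budget
    while True:
        # Only arms with a positive cost that still fits the budget qualify.
        affordable = [a for a in est_reward if 0 < est_cost[a] <= remaining]
        if not affordable:
            break  # resource exhausted: learning stops here
        best = max(affordable, key=lambda a: est_reward[a] / est_cost[a])
        pulls.append(best)
        remaining -= est_cost[best]
    return pulls, remaining


if __name__ == "__main__":
    print(greedy_budgeted_pulls({"a": 1.0, "b": 3.0}, {"a": 1.0, "b": 2.0}, 5.0))
```

The loop ends as soon as the budget runs out, which is the limitation the abstract points to when it says such methods are hard to extend to an infinite time horizon.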

28. Future Data Helps Training: Modeling Future Contexts for Session-based Recommendation.

Paper Link】 【Pages】:303-313

【Authors】: Fajie Yuan ; Xiangnan He ; Haochuan Jiang ; Guibing Guo ; Jian Xiong ; Zhezhao Xu ; Yilin Xiong

【Abstract】: Session-based recommender systems have attracted much attention recently. To capture the sequential dependencies, existing methods resort either to data augmentation techniques or left-to-right style autoregressive training. Since these methods are aimed to model the sequential nature of user behaviors, they ignore the future data of a target interaction when constructing the prediction model for it. However, we argue that the future interactions after a target interaction, which are also available during training, provide valuable signal on user preference and can be used to enhance the recommendation quality.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

29. The Fast and The Frugal: Tail Latency Aware Provisioning for Coping with Load Variations.

Paper Link】 【Pages】:314-326

【Authors】: Adithya Kumar ; Iyswarya Narayanan ; Timothy Zhu ; Anand Sivasubramaniam

【Abstract】: Small and medium sized enterprises use the cloud for running online, user-facing, tail latency sensitive applications with well-defined fixed monthly budgets. For these applications, adequate system capacity must be provisioned to extract maximal performance despite the challenges of uncertainties in load and request-sizes. In this paper, we address the problem of capacity provisioning under fixed budget constraints with the goal of minimizing tail latency.

【Keywords】:

30. Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide.

Paper Link】 【Pages】:327-338

【Authors】: Daniele Rama ; Yelena Mejova ; Michele Tizzoni ; Kyriaki Kalimeri ; Ingmar Weber

【Abstract】: In the global move toward urbanization, making sure the people remaining in rural areas are not left behind in terms of development and policy considerations is a priority for governments worldwide. However, it is increasingly challenging to track important statistics concerning this sparse, geographically dispersed population, resulting in a lack of reliable, up-to-date data. In this study, we examine the usefulness of the Facebook Advertising platform, which offers a digital “census” of over two billion of its users, in measuring potential rural-urban inequalities. We focus on Italy, a country where about 30% of the population lives in rural areas. First, we show that the population statistics that Facebook produces suffer from instability across time and incomplete coverage of sparsely populated municipalities. To overcome this limitation, we propose an alternative methodology for estimating Facebook Ads audiences that nearly triples the coverage of the rural municipalities from 19% to 55% and makes fine-grained sub-population analysis feasible. Using official national census data, we evaluate our approach and confirm known significant urban-rural divides in terms of educational attainment and income. Extending the analysis to Facebook-specific user “interests” and behaviors, we provide further insights on the divide, for instance, finding that rural areas show a higher interest in gambling. Notably, we find that the most predictive features of income in rural areas differ from those for urban centres, suggesting researchers need to consider a broader range of attributes when examining rural wellbeing. The findings of this study illustrate the necessity of improving existing tools and methodologies to include under-represented populations in digital demographic studies – the failure to do so could result in misleading observations, conclusions, and most importantly, policies.

【Keywords】:

31. Efficient Maximal Balanced Clique Enumeration in Signed Networks.

Paper Link】 【Pages】:339-349

【Authors】: Zi Chen ; Long Yuan ; Xuemin Lin ; Lu Qin ; Jianye Yang

【Abstract】: Clique is one of the most fundamental models for cohesive subgraph mining in network analysis. Existing clique models mainly focus on unsigned networks. In the real world, however, many applications are modeled as signed networks with positive and negative edges. As signed networks hold their own properties different from unsigned networks, the existing clique model is inapplicable to signed networks. Motivated by this, we propose the balanced clique model that considers the most fundamental and dominant theory, structural balance theory, for signed networks, and study the maximal balanced clique enumeration problem which computes all the maximal balanced cliques in a given signed network. We show that the maximal balanced clique enumeration problem is NP-Hard. A straightforward solution for the maximal balanced clique enumeration problem is to treat the signed network as two unsigned networks and leverage the off-the-shelf techniques for unsigned networks. However, such a solution is inefficient for large signed networks. To address this problem, in this paper, we first propose a new maximal balanced clique enumeration algorithm by exploiting the unique properties of signed networks. Based on the newly proposed algorithm, we devise two optimization strategies to further improve the efficiency of the enumeration. We conduct extensive experiments on large real and synthetic datasets. The experimental results demonstrate the efficiency, effectiveness and scalability of our proposed algorithms.

【Keywords】: Information systems; Information systems applications; Data mining; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms
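
To make the balanced clique notion above concrete, here is a minimal sketch of a membership check. It assumes the usual reading of structural balance theory: the vertices form a clique and split into two sides, with positive edges inside a side and negative edges across sides. The abstract does not spell out the paper's exact definition (for example any size constraints), and the function name and the `sign` encoding are hypothetical.

```python
from itertools import combinations


def is_balanced_clique(nodes, sign):
    """Return True if `nodes` form a clique that splits into two sides
    with positive edges inside a side and negative edges across sides.

    sign: dict mapping frozenset({u, v}) -> +1 or -1; a missing pair
    means there is no edge (an encoding assumed for this sketch).
    """
    nodes = list(nodes)
    if not nodes:
        return True
    # Clique condition: every pair of vertices must be connected.
    for u, v in combinations(nodes, 2):
        if frozenset((u, v)) not in sign:
            return False
    # Place each vertex relative to an anchor: positive neighbours of the
    # anchor share its side, negative neighbours go to the other side.
    anchor = nodes[0]
    side = {anchor: 0}
    for v in nodes[1:]:
        side[v] = 0 if sign[frozenset((anchor, v))] > 0 else 1
    # Every edge must agree with that two-sided split.
    for u, v in combinations(nodes, 2):
        same_side = side[u] == side[v]
        positive = sign[frozenset((u, v))] > 0
        if same_side != positive:
            return False
    return True
```

This only verifies a given vertex set. Enumerating all maximal balanced cliques efficiently is the problem the paper addresses and is not attempted here.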

32. Text-to-SQL Generation for Question Answering on Electronic Medical Records.

Paper Link】 【Pages】:350-361

【Authors】: Ping Wang ; Tian Shi ; Chandan K. Reddy

【Abstract】: Electronic medical records (EMR) contain comprehensive patient information and are typically stored in a relational database with multiple tables. Effective and efficient patient information retrieval from EMR data is a challenging task for medical experts. Question-to-SQL generation methods tackle this problem by first predicting the SQL query for a given question about a database, and then executing the query on the database. However, most of the existing approaches have not been adapted to the healthcare domain due to a lack of healthcare Question-to-SQL datasets for learning models specific to this domain. In addition, the wide use of abbreviated terminology and possible typos in questions introduce additional challenges for accurately generating the corresponding SQL queries. In this paper, we tackle these challenges by developing a deep learning based TRanslate-Edit Model for Question-to-SQL (TREQS) generation, which adapts the widely used sequence-to-sequence model to directly generate the SQL query for a given question, and further performs the required edits using an attentive-copying mechanism and task-specific look-up tables. Based on the widely used publicly available electronic medical database, we create a new large-scale Question-SQL pair dataset, named MIMICSQL, in order to perform the Question-to-SQL generation task in the healthcare domain. An extensive set of experiments is conducted to evaluate the performance of our proposed model on MIMICSQL. Both quantitative and qualitative experimental results indicate the flexibility and efficiency of our proposed method in predicting condition values and its robustness to random questions with abbreviations and typos.

【Keywords】:

33. Searching for polarization in signed graphs: a local spectral approach.

Paper Link】 【Pages】:362-372

【Authors】: Han Xiao ; Bruno Ordozgoiti ; Aristides Gionis

【Abstract】: Signed graphs have been used to model interactions in social networks, which can be either positive (friendly) or negative (antagonistic). The model has been used to study polarization and other related phenomena in social networks, which can be harmful to the process of democratic deliberation in our society. An interesting and challenging task in this application domain is to detect polarized communities in signed graphs. A number of different methods have been proposed for this task. However, existing approaches aim at finding globally optimal solutions. Instead, in this paper we are interested in finding polarized communities that are related to a small set of seed nodes provided as input. Seed nodes may consist of two sets, which constitute the two sides of a polarized structure.

【Keywords】:

34. Multi-Objective Ranking Optimization for Product Search Using Stochastic Label Aggregation.

Paper Link】 【Pages】:373-383

【Authors】: David Carmel ; Elad Haramaty ; Arnon Lazerson ; Liane Lewin-Eytan

【Abstract】: Learning a ranking model in product search involves satisfying many requirements such as maximizing the relevance of retrieved products with respect to the user query, as well as maximizing the purchase likelihood of these products. Multi-Objective Ranking Optimization (MORO) is the task of learning a ranking model from training examples while optimizing multiple objectives simultaneously. Label aggregation is a popular solution approach for multi-objective optimization, which reduces the problem into a single objective optimization problem, by aggregating the multiple labels of the training examples, related to the different objectives, to a single label. In this work we explore several label aggregation methods for MORO in product search. We propose a novel stochastic label aggregation method which randomly selects a label per training example according to a given distribution over the labels. We provide a theoretical proof showing that stochastic label aggregation is superior to alternative aggregation approaches, in the sense that any optimal solution of the MORO problem can be generated by a proper parameter setting of the stochastic aggregation process. We experiment on three different datasets: two from the voice product search domain, and one publicly available dataset from the Web product search domain. We demonstrate empirically over these three datasets that MORO with stochastic label aggregation provides a family of ranking models that fully dominates the set of MORO models built using deterministic label aggregation.

【Keywords】:
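
The stochastic label aggregation described in the abstract above (one label drawn per training example according to a distribution over the labels) can be sketched as follows. The function names, the objective names, and the weighted-sum form of the deterministic variant are assumptions made for illustration; the abstract does not specify them.

```python
import random


def stochastic_label(labels, weights, rng):
    """Draw one objective at random (according to `weights`) and return
    that objective's label for the training example.

    labels:  dict objective -> label value for this example
    weights: dict objective -> sampling probability (need not be normalized)
    rng:     a random.Random instance, passed in for reproducibility
    """
    objectives = list(labels)
    chosen = rng.choices(objectives, weights=[weights[k] for k in objectives], k=1)[0]
    return labels[chosen]


def deterministic_label(labels, weights):
    """Weighted-sum aggregation, shown for contrast (one possible
    deterministic scheme, assumed here)."""
    return sum(weights[k] * labels[k] for k in labels)


if __name__ == "__main__":
    example = {"relevance": 1.0, "purchase": 0.0}
    mix = {"relevance": 0.5, "purchase": 0.5}
    rng = random.Random(0)
    print([stochastic_label(example, mix, rng) for _ in range(5)])
    print(deterministic_label(example, mix))
```

With the stochastic variant every training example keeps one of its original labels, while the weighted sum produces a blended value; the distribution `weights` plays the role of the tunable parameter mentioned in the abstract.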

35. Open Knowledge Enrichment for Long-tail Entities.

Paper Link】 【Pages】:384-394

【Authors】: Ermei Cao ; Difeng Wang ; JiaCheng Huang ; Wei Hu

【Abstract】: Knowledge bases (KBs) have gradually become a valuable asset for many AI applications. While many current KBs are quite large, they are widely acknowledged as incomplete, especially lacking facts of long-tail entities, e.g., less famous persons. Existing approaches enrich KBs mainly on completing missing links or filling missing values. However, they only tackle a part of the enrichment problem and lack specific considerations regarding long-tail entities. In this paper, we propose a full-fledged approach to knowledge enrichment, which predicts missing properties and infers true facts of long-tail entities from the open Web. Prior knowledge from popular entities is leveraged to improve every enrichment step. Our experiments on the synthetic and real-world datasets and comparison with related work demonstrate the feasibility and superiority of the approach.

【Keywords】:

36. Dirty Clicks: A Study of the Usability and Security Implications of Click-related Behaviors on the Web.

Paper Link】 【Pages】:395-406

【Authors】: Iskander Sánchez-Rola ; Davide Balzarotti ; Christopher Kruegel ; Giovanni Vigna ; Igor Santos

【Abstract】: Web pages have evolved into very complex dynamic applications, which are often very opaque and difficult for non-experts to understand. At the same time, security researchers push for more transparent web applications, which can help users in taking important security-related decisions about which information to disclose, which link to visit, and which online service to trust.

【Keywords】: Social and professional topics; Computing / technology policy; Computer crime

37. Adversarial Bandits Policy for Crawling Commercial Web Content.

Paper Link】 【Pages】:407-417

【Authors】: Shuguang Han ; Michael Bendersky ; Przemek Gajda ; Sergey Novikov ; Marc Najork ; Bernhard Brodowsky ; Alexandrin Popescul

【Abstract】: The rapid growth of commercial web content has driven the development of shopping search services to help users find product offers. Due to the dynamic nature of commercial content, an effective recrawl policy is a key component in a shopping search service; it ensures that users have access to the up-to-date product details. Most of the existing strategies either relied on simple heuristics, or overlooked the resource budgets. To address this, Azar et al. [5] recently proposed an optimization strategy LambdaCrawl aiming to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well future content change rate can be estimated. By adopting the state-of-the-art deep learning models for change rate prediction, we obtain a substantial increase of content freshness over the common LambdaCrawl implementation with change rate estimated from the past history. Moreover, we demonstrate that while LambdaCrawl is a significant advancement upon existing recrawl strategies, it can be further improved upon by a unified multi-strategy recrawl policy. To this end, we adopt the K-armed adversarial bandits algorithm that can provably optimize the overall freshness by combining multiple strategies. Empirical results over a large-scale production dataset confirm its superiority to LambdaCrawl, especially under tight resource budgets.

【Keywords】:

38. Generating Clarifying Questions for Information Retrieval.

Paper Link】 【Pages】:418-428

【Authors】: Hamed Zamani ; Susan T. Dumais ; Nick Craswell ; Paul N. Bennett ; Gord Lueck

【Abstract】: Search queries are often short, and the underlying user intent may be ambiguous. This makes it challenging for search engines to predict possible intents, only one of which may pertain to the current user. To address this issue, search engines often diversify the result list and present documents relevant to multiple intents of the query. An alternative approach is to ask the user a question to clarify her information need. Asking clarifying questions is particularly important for scenarios with “limited bandwidth” interfaces, such as speech-only and small-screen devices. In addition, our user studies and large-scale online experiments show that asking clarifying questions is also useful in web search. Although some recent studies have pointed out the importance of asking clarifying questions, generating them for open-domain search tasks remains unstudied and is the focus of this paper. Lack of training data even within major search engines for this task makes it challenging. To mitigate this issue, we first identify a taxonomy of clarification for open-domain search queries by analyzing large-scale query reformulation data sampled from Bing search logs. This taxonomy leads us to a set of question templates and a simple yet effective slot filling algorithm. We further use this model as a source of weak supervision to automatically generate clarifying questions for training. Furthermore, we propose supervised and reinforcement learning models for generating clarifying questions learned from weak supervision data. We also investigate methods for generating candidate answers for each clarifying question, so users can select from a set of pre-defined answers. Human evaluation of the clarifying questions and candidate answers for hundreds of search queries demonstrates the effectiveness of the proposed solutions.

【Keywords】: Information systems; Information retrieval

39. MetaNER: Named Entity Recognition with Meta-Learning.

Paper Link】 【Pages】:429-440

【Authors】: Jing Li ; Shuo Shang ; Ling Shao

【Abstract】: Recent neural architectures in named entity recognition (NER) have yielded state-of-the-art performance on single domain data such as newswires. However, they still suffer from (i) requiring massive amounts of training data to avoid overfitting; (ii) huge performance degradation when there is a domain shift in the data distribution between training and testing. In this paper, we investigate the problem of domain adaptation for NER under homogeneous and heterogeneous settings. We propose MetaNER, a novel meta-learning approach for domain adaptation in NER. Specifically, MetaNER incorporates meta-learning and adversarial training strategies to encourage robust, general and transferable representations for sequence labeling. The key advantage of MetaNER is that it is capable of adapting to new unseen domains with a small amount of annotated data from those domains. We extensively evaluate MetaNER on multiple datasets under homogeneous and heterogeneous settings. The experimental results show that MetaNER achieves state-of-the-art performance against eight baselines. Impressively, MetaNER surpasses the in-domain performance using only 16.17% and 34.76% of target domain data on average for homogeneous and heterogeneous settings, respectively.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning; Machine learning approaches; Neural networks

40. HTML: Hierarchical Transformer-based Multi-task Learning for Volatility Prediction.

Paper Link】 【Pages】:441-451

【Authors】: Linyi Yang ; Tin Lok James Ng ; Barry Smyth ; Ruihai Dong

【Abstract】: The volatility forecasting task refers to predicting the amount of variability in the price of a financial asset over a certain period. It is an important mechanism for evaluating the risk associated with an asset and, as such, is of significant theoretical and practical importance in financial analysis. While classical approaches have framed this task as a time-series prediction one – using historical pricing as a guide to future risk forecasting – recent advances in natural language processing have seen researchers turn to complementary sources of data, such as analyst reports, social media, and even the audio data from earnings calls. This paper proposes a novel hierarchical, transformer, multi-task architecture designed to harness the text and audio data from quarterly earnings conference calls to predict future price volatility in the short and long term. This includes a comprehensive comparison to a variety of baselines, which demonstrates very significant improvements in prediction accuracy, in the range 17% - 49% compared to the current state-of-the-art. In addition, we describe the results of an ablation study to evaluate the relative contributions of each component of our approach and the relative contributions of text and audio data with respect to prediction accuracy.

【Keywords】:

41. Beyond Rank-1: Discovering Rich Community Structure in Multi-Aspect Graphs.

Paper Link】 【Pages】:452-462

【Authors】: Ekta Gujral ; Ravdeep Pasricha ; Evangelos E. Papalexakis

【Abstract】: How are communities in real multi-aspect or multi-view graphs structured? How can we effectively and concisely summarize and explore those communities in a high-dimensional, multi-aspect graph without losing important information? State-of-the-art studies have focused on patterns in single graphs, identifying structures in a single snapshot of a large network or in time-evolving graphs and stitching them over time.

【Keywords】: Information systems; Information systems applications; Data mining

42. Off-policy Learning in Two-stage Recommender Systems.

Paper Link】 【Pages】:463-473

【Authors】: Jiaqi Ma ; Zhe Zhao ; Xinyang Yi ; Ji Yang ; Minmin Chen ; Jiaxi Tang ; Lichan Hong ; Ed H. Chi

【Abstract】: Many real-world recommender systems need to be highly scalable: matching millions of items with billions of users, with milliseconds latency. The scalability requirement has led to widely used two-stage recommender systems, consisting of efficient candidate generation model(s) in the first stage and a more powerful ranking model in the second stage.

【Keywords】:

43. Selective Weak Supervision for Neural Information Retrieval.

Paper Link】 【Pages】:474-485

【Authors】: Kaitao Zhang ; Chenyan Xiong ; Zhenghao Liu ; Zhiyuan Liu

【Abstract】: This paper democratizes neural information retrieval to scenarios where large scale relevance training signals are not available. We revisit the classic IR intuition that anchor-document relations approximate query-document relevance and propose a reinforcement weak supervision selection method, ReInfoSelect, which learns to select anchor-document pairs that best weakly supervise the neural ranker (action), using the ranking performance on a handful of relevance labels as the reward. Iteratively, for a batch of anchor-document pairs, ReInfoSelect back propagates the gradients through the neural ranker, gathers its NDCG reward, and optimizes the data selection network using policy gradients, until the neural ranker’s performance peaks on target relevance metrics (convergence). In our experiments on three TREC benchmarks, neural rankers trained by ReInfoSelect, with only publicly available anchor data, significantly outperform feature-based learning to rank methods and match the effectiveness of neural rankers trained with private commercial search logs. Our analyses show that ReInfoSelect effectively selects weak supervision signals based on the stage of the neural ranker training, and intuitively picks anchor-document pairs similar to query-document pairs.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

44. TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network.

Paper Link】 【Pages】:486-497

【Authors】: Jiaming Shen ; Zhihong Shen ; Chenyan Xiong ; Chi Wang ; Kuansan Wang ; Jiawei Han

【Abstract】: Taxonomies consist of machine-interpretable semantics and provide valuable knowledge for many web applications. For example, online retailers (e.g., Amazon and eBay) use taxonomies for product recommendation, and web search engines (e.g., Google and Bing) leverage taxonomies to enhance query understanding. Enormous efforts have been made on constructing taxonomies either manually or semi-automatically. However, with the fast-growing volume of web content, existing taxonomies will become outdated and fail to capture emerging knowledge. Therefore, in many applications, dynamic expansions of an existing taxonomy are in great demand. In this paper, we study how to expand an existing taxonomy by adding a set of new concepts. We propose a novel self-supervised framework, named TaxoExpan, which automatically generates a set of ⟨query concept, anchor concept⟩ pairs from the existing taxonomy as training data. Using such self-supervision data, TaxoExpan learns a model to predict whether a query concept is the direct hyponym of an anchor concept. We develop two innovative techniques in TaxoExpan: (1) a position-enhanced graph neural network that encodes the local structure of an anchor concept in the existing taxonomy, and (2) a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data. Extensive experiments on three large-scale datasets from different domains demonstrate both the effectiveness and the efficiency of TaxoExpan for taxonomy expansion.

【Keywords】:

45. A Generalized and Fast-converging Non-negative Latent Factor Model for Predicting User Preferences in Recommender Systems.

Paper Link】 【Pages】:498-507

【Authors】: Ye Yuan ; Xin Luo ; Mingsheng Shang ; Di Wu

【Abstract】: Recommender systems (RSs) commonly describe their user-item preferences with a high-dimensional and sparse (HiDS) matrix filled with non-negative data. A non-negative latent factor (NLF) model relying on a single latent factor-dependent, non-negative and multiplicative update (SLF-NMU) algorithm is frequently adopted to process such an HiDS matrix. However, an NLF model mostly adopts Euclidean distance for its objective function, which is naturally a special case of α-β-divergence. Moreover, it frequently suffers from slow convergence. To address these issues, this study proposes a generalized and fast-converging non-negative latent factor (GFNLF) model. Its main idea is two-fold: a) adopting α-β-divergence for its objective function, thereby enhancing its representation ability for HiDS data; b) deducing its momentum-incorporated non-negative multiplicative update (MNMU) algorithm, thereby achieving its fast convergence. Empirical studies on two HiDS matrices emerging from real RSs demonstrate that with carefully-tuned hyperparameters, a GFNLF model outperforms state-of-the-art models in both computational efficiency and prediction accuracy for missing data of an HiDS matrix.

【Keywords】:

46. Deep Adversarial Completion for Sparse Heterogeneous Information Network Embedding.

Paper Link】 【Pages】:508-518

【Authors】: Kai Zhao ; Ting Bai ; Bin Wu ; Bai Wang ; Youjie Zhang ; Yuanyu Yang ; Jian-Yun Nie

【Abstract】: A heterogeneous information network (HIN) contains multiple types of entities and relations. Most existing HIN embedding methods learn the semantic information based on the heterogeneous structures between different entities, which are implicitly assumed to be complete. However, in the real world, it is common that some relations are partially observed due to privacy or other reasons, resulting in a sparse network, in which the structure may be incomplete, and the “unseen” links may also be positive due to the missing relations in data collection. To address this problem, we propose a novel and principled approach: a Multi-View Adversarial Completion Model (MV-ACM). Each relation space is characterized in a single viewpoint, enabling us to use the topological structural information in each view. Based on the multi-view architecture, an adversarial learning process is utilized to learn the reciprocity (i.e., complementary information) between different relations: In the generator, MV-ACM generates the complementary views by computing the similarity of the semantic representation of the same node in different views; while in the discriminator, MV-ACM discriminates whether the view is complementary by the topological structural similarity. Then we update the node’s semantic representation by aggregating neighborhood information from the syncretic views. We conduct systematic experiments on six real-world networks from varied domains: AMiner, PPI, YouTube, Twitter, Amazon and Alibaba. Empirical results show that MV-ACM significantly outperforms the state-of-the-art approaches for both link prediction and node classification tasks.

【Keywords】: Information systems; Information systems applications; Data mining

47. Learning the Structure of Auto-Encoding Recommenders.

Paper Link】 【Pages】:519-529

【Authors】: Farhan Khawar ; Leonard K. M. Poon ; Nevin L. Zhang

【Abstract】: Autoencoder recommenders have recently shown state-of-the-art performance in the recommendation task due to their ability to model non-linear item relationships effectively. However, existing autoencoder recommenders use fully-connected neural network layers and do not employ structure learning. This can lead to inefficient training, especially when the data is sparse as commonly found in collaborative filtering. These issues result in lower generalization ability and reduced performance. In this paper, we introduce structure learning for autoencoder recommenders by taking advantage of the inherent item groups present in the collaborative filtering domain. Due to the nature of items in general, we know that certain items are more related to each other than to other items. Based on this, we propose a method that first learns groups of related items and then uses this information to determine the connectivity structure of an auto-encoding neural network. This results in a network that is sparsely connected. This sparse structure can be viewed as a prior that guides the network training. Empirically we demonstrate that the proposed structure learning enables the autoencoder to converge to a local optimum with a much smaller spectral norm and generalization error bound than the fully-connected network. The resultant sparse network considerably outperforms state-of-the-art methods like Mult-vae/Mult-dae on multiple benchmarked datasets even when the same number of parameters and flops are used. It also has a better cold-start performance.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

48. StageNet: Stage-Aware Neural Networks for Health Risk Prediction.

Paper Link】 【Pages】:530-540

【Authors】: Junyi Gao ; Cao Xiao ; Yasha Wang ; Wen Tang ; Lucas M. Glass ; Jimeng Sun

【Abstract】: Deep learning has demonstrated success in health risk prediction, especially for patients with chronic and progressing conditions. Most existing works focus on learning disease patterns from longitudinal patient data, but pay little attention to the disease progression stage itself. To fill the gap, we propose a Stage-aware neural Network (StageNet) model to extract disease stage information from patient data and integrate it into risk prediction. StageNet is enabled by (1) a stage-aware long short-term memory (LSTM) module that extracts health stage variations in an unsupervised manner; (2) a stage-adaptive convolutional module that incorporates stage-related progression patterns into risk prediction. We evaluate StageNet on two real-world datasets and show that StageNet outperforms state-of-the-art models in both the risk prediction task and the patient subtyping task. Compared to the best baseline model, StageNet achieves up to 12% higher AUPRC for the risk prediction task on two real-world patient datasets. StageNet also achieves over 58% higher Calinski-Harabasz score (a cluster quality metric) for the patient subtyping task.

【Keywords】: Applied computing; Life and medical sciences; Health informatics

49. Clinical Report Auto-completion.

Paper Link】 【Pages】:541-550

【Authors】: Siddharth Biswal ; Cao Xiao ; Lucas Glass ; M. Brandon Westover ; Jimeng Sun

【Abstract】: Generating clinical reports from raw recordings such as X-rays and electroencephalogram (EEG) is an essential and routine task for doctors. However, it is often time-consuming to write accurate and detailed reports. Most existing methods try to generate the whole report from the raw input with limited success because 1) generated reports often contain errors that need manual review and correction, 2) they do not save time when doctors want to write additional information into the report, and 3) the generated reports are not customized based on individual doctors’ preferences. We propose CLinicAl Report Auto-completion (CLARA), an interactive method that generates reports in a sentence by sentence fashion based on doctors’ anchor words and partially completed sentences. CLARA searches for the most relevant sentences from existing reports as the template for the current report. The retrieved sentences are sequentially modified by combining with the input feature representations to create the final report. In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 for EEG reports for sentence-level generation, which is up to 35% improvement over the best baseline. Also, in our qualitative evaluation, CLARA is shown to produce reports which have a significantly higher level of approval by doctors in a user study (3.74 out of 5 for CLARA vs 2.52 out of 5 for the baseline).

【Keywords】:

50. Deep Global and Local Generative Model for Recommendation.

Paper Link】 【Pages】:551-561

【Authors】: Huafeng Liu ; Liping Jing ; Jingxuan Wen ; Zhicheng Wu ; Xiaoyi Sun ; Jiaqi Wang ; Lin Xiao ; Jian Yu

【Abstract】: Deep generative models, especially variational auto-encoders (VAEs), have been successfully employed by more and more recommendation systems. The reason is that they combine the flexibility of probabilistic generative models with the powerful non-linear feature representation ability of deep neural networks. The existing VAE-based recommendation models are usually proposed under a global assumption by incorporating simple priors, e.g., a single Gaussian, to regularize the latent variables. This strategy, however, is ineffective when the user is simultaneously interested in different kinds of items, i.e., the user’s preference may be highly diverse. In this paper, we therefore propose a Deep Global and Local Generative Model for recommendation (DGLGM) to consider both local and global structure among users under the Wasserstein auto-encoder framework. Besides keeping the global structure like the existing models, DGLGM adopts a non-parametric Mixture Gaussian distribution with several components to capture the diversity of the users’ preferences. Each component corresponds to one local structure and its optimal size can be determined via the automatic relevance determination technique. These two parts can be seamlessly integrated and enhance each other. The proposed DGLGM can be efficiently inferred by minimizing its penalized upper bound with the aid of a local variational optimization technique. Meanwhile, we theoretically analyze its generalization error bounds to guarantee its performance in sparse feedback data with diversity. By comparing with the state-of-the-art methods, the experimental results demonstrate that DGLGM consistently benefits the recommendation system in the top-N recommendation task.

【Keywords】:

51. Comparing the Effects of DNS, DoT, and DoH on Web Performance.

Paper Link】 【Pages】:562-572

【Authors】: Austin Hounsel ; Kevin Borgolte ; Paul Schmitt ; Jordan Holland ; Nick Feamster

【Abstract】: Nearly every service on the Internet relies on the Domain Name System (DNS), which translates a human-readable name to an IP address before two endpoints can communicate. Today, DNS traffic is unencrypted, leaving users vulnerable to eavesdropping and tampering. Past work has demonstrated that DNS queries can reveal a user’s browsing history and even what smart devices they are using at home. In response to these privacy concerns, two new protocols have been proposed: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT). Instead of sending DNS queries and responses in the clear, DoH and DoT establish encrypted connections between users and resolvers. By doing so, these protocols provide privacy and security guarantees that traditional DNS (Do53) lacks.

【Keywords】:

52. Flowless: Extracting Densest Subgraphs Without Flow Computations.

Paper Link】 【Pages】:573-583

【Authors】: Digvijay Boob ; Yu Gao ; Richard Peng ; Saurabh Sawlani ; Charalampos E. Tsourakakis ; Di Wang ; Junxing Wang

【Abstract】: The problem of finding dense components of a graph is a major primitive in graph mining and data analysis. The densest subgraph problem (DSP), which asks to find a subgraph with maximum average degree, forms a basic primitive in dense subgraph discovery with applications ranging from community detection to unsupervised discovery of biological network modules [16]. The DSP is exactly solvable in polynomial time using maximum flows [14, 17, 22]. Because maximum flows are computationally expensive, Charikar’s greedy approximation algorithm is usually preferred in practice for its linear time and linear space complexity [3, 8]. It constitutes a key algorithmic idea in scalable solutions for large-scale dynamic graphs [5, 7]. However, its output density can be a factor 2 off the optimal solution.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms
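
The greedy approximation mentioned in the abstract is commonly described as "peeling": repeatedly delete a minimum-degree node and remember the densest intermediate subgraph. The following is a minimal sketch of that idea, not the Flowless method itself. The function name `greedy_peeling` is hypothetical, density is taken as edges divided by nodes (half the average degree), and a binary heap is used for brevity, so this version runs in O(m log n) rather than the linear time achievable with bucketed degrees. Node labels are assumed to be mutually comparable (for example, all integers).

```python
import heapq
from collections import defaultdict


def greedy_peeling(edges):
    """Return (best_density, best_nodes) found by peeling minimum-degree nodes.

    Density of a node set S is |E(S)| / |S|, i.e. half its average degree.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    alive = set(adj)
    degree = {u: len(adj[u]) for u in adj}
    num_edges = sum(degree.values()) // 2
    heap = [(d, u) for u, d in degree.items()]
    heapq.heapify(heap)

    best_density = num_edges / len(alive) if alive else 0.0
    removal_order = []
    best_prefix = 0  # how many removals had happened when the best density was seen

    while heap:
        d, u = heapq.heappop(heap)
        if u not in alive or d != degree[u]:
            continue  # stale heap entry left over from an earlier degree update
        alive.remove(u)
        removal_order.append(u)
        num_edges -= degree[u]  # degree[u] counts only neighbours still alive
        for v in adj[u]:
            if v in alive:
                degree[v] -= 1
                heapq.heappush(heap, (degree[v], v))
        if alive:
            density = num_edges / len(alive)
            if density > best_density:
                best_density = density
                best_prefix = len(removal_order)

    best_nodes = set(adj) - set(removal_order[:best_prefix])
    return best_density, best_nodes
```

For example, on a 4-clique with one extra pendant node attached, the sketch peels the pendant node first and reports the clique as the densest set seen.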

53. CellRep: Usage Representativeness Modeling and Correction Based on Multiple City-Scale Cellular Networks.

Paper Link】 【Pages】:584-595

【Authors】: Zhihan Fang ; Guang Wang ; Shuai Wang ; Chaoji Zuo ; Fan Zhang ; Desheng Zhang

【Abstract】: Understanding representativeness in cellular web logs at city scale is essential for web applications. Most of the existing work on cellular web analyses or applications is built upon data from a single network in a city, which may not be representative of the overall usage patterns since multiple cellular networks coexist in most cities in the world. In this paper, we conduct the first comprehensive investigation of multiple cellular networks in a city with a 100% user penetration rate. We study web usage pattern (e.g., internet access services) correlation and difference between diverse cellular networks in terms of spatial and temporal dimensions to quantify the representativeness of web usage from a single network in usage patterns of all users in the same city. Moreover, relying on three external datasets, we study the correlation between the representativeness and contextual factors (e.g., Point-of-Interest, population, and mobility) to explain the potential causalities for the representativeness difference. We found that contextual diversity is a key reason for representativeness difference, and representativeness has a significant impact on the performance of real-world applications. Based on the analysis results, we further design a correction model to address the bias of single cellphone networks and improve representativeness by 45.8%.

【Keywords】:

54. Why Do Competitive Markets Converge to First-Price Auctions?

Paper Link】 【Pages】:596-605

【Authors】: Renato Paes Leme ; Balasubramanian Sivan ; Yifeng Teng

【Abstract】: We consider a setting in which bidders participate in multiple auctions run by different sellers, and optimize their bids for the aggregate auction. We analyze this setting by formulating a game between sellers, where a seller’s strategy is to pick an auction to run. Our analysis aims to shed light on the recent change in the Display Ads market landscape: here, ad exchanges (sellers) were mostly running second-price auctions earlier and over time they switched to variants of the first-price auction, culminating in Google’s Ad Exchange moving to a first-price auction in 2019. Our model and results offer an explanation for why the first-price auction occurs as a natural equilibrium in such competitive markets.

【Keywords】: Applied computing; Law, social and behavioral sciences; Economics
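
For readers unfamiliar with the two auction formats contrasted in the abstract, the sketch below shows only how the winner's payment differs between them for a single item. It does not model the game between sellers that the paper analyzes. The function name `run_auction` and the bidder names are hypothetical.

```python
def run_auction(bids, rule):
    """Return (winner, payment) for a single-item sealed-bid auction.

    bids: dict mapping bidder name to a non-negative bid.
    rule: "first"  -> the winner pays their own bid;
          "second" -> the winner pays the highest competing bid
                      (0.0 if there is no competitor).
    """
    if not bids:
        raise ValueError("need at least one bid")
    # Highest bid first; ties keep the dict's insertion order.
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top_bid = ranked[0]
    if rule == "first":
        return winner, top_bid
    if rule == "second":
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        return winner, runner_up
    raise ValueError(f"unknown rule: {rule!r}")
```

With bids of 5, 3 and 4, the same bidder wins under both rules but pays 5 under the first-price rule and 4 under the second-price rule.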

55. Frozen Binomials on the Web: Word Ordering and Language Conventions in Online Text.

Paper Link】 【Pages】:606-616

【Authors】: Katherine Van Koevering ; Austin R. Benson ; Jon M. Kleinberg

【Abstract】: There is inherent information captured in the order in which we write words in a list. The ordering of binomials (lists of two words separated by ‘and’ or ‘or’) has been studied for more than a century. These binomials are common across many areas of speech, in both formal and informal text. In the last century, numerous explanations have been given to describe what order people use for these binomials, from differences in semantics to differences in phonology. These rules describe primarily ‘frozen’ binomials that exist in exactly one ordering and have lacked large-scale trials to determine efficacy.

【Keywords】:

56. Snippext: Semi-supervised Opinion Mining with Augmented Data.

Paper Link】 【Pages】:617-628

【Authors】: Zhengjie Miao ; Yuliang Li ; Xiaolan Wang ; Wang-Chiew Tan

【Abstract】: Online services are interested in solutions to opinion mining, which is the problem of extracting aspects, opinions, and sentiments from text. One method to mine opinions is to leverage the recent success of pre-trained language models which can be fine-tuned to obtain high-quality extractions from reviews. However, fine-tuning language models still requires a non-trivial amount of training data.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning; Machine learning approaches; Neural networks

57. paper2repo: GitHub Repository Recommendation for Academic Papers.

Paper Link】 【Pages】:629-639

【Authors】: Huajie Shao ; Dachun Sun ; Jiahao Wu ; Zecheng Zhang ; Aston Zhang ; Shuochao Yao ; Shengzhong Liu ; Tianshi Wang ; Chao Zhang ; Tarek F. Abdelzaher

【Abstract】: GitHub has become a popular social application platform, where a large number of users post their open source projects. In particular, an increasing number of researchers release repositories of source code related to their research papers in order to attract more people to follow their work. Motivated by this trend, we describe a novel item-item cross-platform recommender system, paper2repo, that recommends relevant repositories on GitHub that match a given paper in an academic search system such as Microsoft Academic. The key challenge is to identify the similarity between an input paper and its related repositories across the two platforms, without the benefit of human labeling. Towards that end, paper2repo integrates text encoding and constrained graph convolutional networks (GCN) to automatically learn and map the embeddings of papers and repositories into the same space, where proximity offers the basis for recommendation. To make our method more practical in real life systems, labels used for model training are computed automatically from features of user actions on GitHub. In machine learning, such automatic labeling is often called distant supervision. To the authors’ knowledge, this is the first distant-supervised cross-platform (paper to repository) matching system. We evaluate the performance of paper2repo on real-world data sets collected from GitHub and Microsoft Academic. Results demonstrate that it outperforms other state of the art recommendation methods.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

58. High Quality Candidate Generation and Sequential Graph Attention Network for Entity Linking.

Paper Link】 【Pages】:640-650

【Authors】: Zheng Fang ; Yanan Cao ; Ren Li ; Zhenyu Zhang ; Yanbing Liu ; Shi Wang

【Abstract】: Entity Linking (EL) is a task for mapping mentions in text to corresponding entities in a knowledge base (KB). This task usually includes candidate generation (CG) and entity disambiguation (ED) stages. Recent EL systems based on neural network models have achieved good performance, but they still face two challenges: (i) Previous studies evaluate their models without considering the differences between candidate entities. In fact, the quality (gold recall in particular) of candidate sets has an effect on the EL results. So, how to improve the quality of candidates needs more attention. (ii) In order to utilize the topical coherence among the referred entities, many graph and sequence models are proposed for collective ED. However, graph-based models treat all candidate entities equally, which may introduce much noise. In contrast, sequence models can only observe previous referred entities, ignoring the relevance between the current mention and its subsequent entities. To address the first problem, we propose a multi-strategy based CG method to generate high recall candidate sets. For the second problem, we design a Sequential Graph Attention Network (SeqGAT) which combines the advantages of graph and sequence methods. In our model, mentions are dealt with in a sequence manner. Given the current mention, SeqGAT dynamically encodes both its previous referred entities and subsequent ones, and assigns different importance to these entities. In this way, it not only makes full use of the topical consistency, but also reduces noise interference. We conduct experiments on different types of datasets and compare our method with previous EL systems on the open evaluation platform. The comparison results show that our model achieves significant improvements over the state-of-the-art methods.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing
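
The abstract singles out gold recall as the key quality measure for candidate sets. A minimal sketch of how that measure is usually computed follows, assuming it means the fraction of mentions whose correct (gold) entity appears among the generated candidates. The function name `gold_recall` and the toy mention and entity identifiers are hypothetical.

```python
def gold_recall(candidates, gold):
    """Fraction of mentions whose gold entity appears in its candidate set.

    candidates: dict mapping mention id -> collection of candidate entity ids.
    gold:       dict mapping mention id -> the correct entity id.
    """
    if not gold:
        raise ValueError("need at least one annotated mention")
    # A mention with no generated candidates counts as a miss.
    hits = sum(1 for mention, entity in gold.items()
               if entity in candidates.get(mention, ()))
    return hits / len(gold)
```

Any mention that misses here cannot be linked correctly by the disambiguation stage, which is why the candidate generator's recall caps end-to-end accuracy.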

59. Adaptive Probabilistic Word Embedding.

Paper Link】 【Pages】:651-661

【Authors】: Shuangyin Li ; Yu Zhang ; Rong Pan ; Kaixiang Mo

【Abstract】: Word embeddings have been widely used and proven to be effective in many natural language processing and text modeling tasks. It is obvious that one ambiguous word could have very different semantics in various contexts, which is called polysemy. Most existing works aim at generating only one single embedding for each word while a few works build a limited number of embeddings to present different meanings for each word. However, it is hard to determine the exact number of senses for each word as the word meaning is dependent on contexts. To address this problem, we propose a novel Adaptive Probabilistic Word Embedding (APWE) model, where the word polysemy is defined over a latent interpretable semantic space. Specifically, at first each word is represented by an embedding in the latent semantic space and then based on the proposed APWE model, the word embedding can be adaptively adjusted and updated based on different contexts to obtain the tailored word embedding. Empirical comparisons with state-of-the-art models demonstrate the superiority of the proposed APWE model.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning

60. Conversational Contextual Bandit: Algorithm and Application.

Paper Link】 【Pages】:662-672

【Authors】: Xiaoying Zhang ; Hong Xie ; Hang Li ; John C. S. Lui

【Abstract】: Contextual bandit algorithms provide principled online learning solutions to balance the exploitation-exploration trade-off in various applications such as recommender systems. However, the learning speed of the traditional contextual bandit algorithms is often slow due to the need for extensive exploration. This poses a critical issue in applications like recommender systems, since users may need to provide feedback on many items they are not interested in. To accelerate the learning speed, we generalize contextual bandit to conversational contextual bandit. Conversational contextual bandit leverages not only behavioral feedback on arms (e.g., articles in news recommendation), but also occasional conversational feedback on key-terms from the user. Here, a key-term can relate to a subset of arms, for example, a category of articles in news recommendation. We then design the Conversational UCB algorithm (ConUCB) to address two challenges in conversational contextual bandit: (1) which key-terms to select to conduct conversation, (2) how to leverage conversational feedback to accelerate the speed of bandit learning. We theoretically prove that ConUCB can achieve a smaller regret upper bound than the traditional contextual bandit algorithm LinUCB, which implies a faster learning speed. Experiments on synthetic data, as well as real datasets from Yelp and Toutiao, demonstrate the efficacy of the ConUCB algorithm.

【Keywords】: Computing methodologies; Machine learning
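
To make the baseline in the abstract concrete, here is a minimal sketch of LinUCB in its shared-parameter form: a ridge-regression estimate of the reward plus a confidence bonus. It is not ConUCB and does not model key-term feedback. The class name, the `alpha` exploration weight and the toy arms in the usage note are assumptions for illustration. The inverse of the design matrix is maintained with the Sherman-Morrison rank-one update so that only the standard library is needed.

```python
import math


class LinUCB:
    """LinUCB with a single shared parameter vector."""

    def __init__(self, dim, alpha=1.0):
        self.dim = dim
        self.alpha = alpha  # exploration weight (an assumed default)
        # Inverse of A = I + sum of x x^T over observed arm features.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward * x

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(self.dim)) for row in self.a_inv]

    def score(self, x):
        """Upper confidence bound for an arm with feature vector x."""
        a_inv_x = self._matvec(x)
        theta = self._matvec(self.b)  # ridge estimate theta = A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(xi * v for xi, v in zip(x, a_inv_x)), 0.0))
        return mean + self.alpha * width

    def select(self, arms):
        """Index of the arm with the highest upper confidence bound."""
        return max(range(len(arms)), key=lambda i: self.score(arms[i]))

    def update(self, x, reward):
        """Sherman-Morrison update of A^{-1}, then accumulate b."""
        a_inv_x = self._matvec(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(self.dim):
            for j in range(self.dim):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(self.dim):
            self.b[i] += reward * x[i]
```

A typical loop calls `select` on the current arm features, observes a reward for the chosen arm, and passes that arm's features and reward to `update`. The exploration the abstract calls slow shows up here as the confidence bonus: every arm direction must be tried often enough for its bonus to shrink.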

61. Adversarial Attacks on Graph Neural Networks via Node Injections: A Hierarchical Reinforcement Learning Approach.

Paper Link】 【Pages】:673-683

【Authors】: Yiwei Sun ; Suhang Wang ; Xianfeng Tang ; Tsung-Yu Hsieh ; Vasant G. Honavar

【Abstract】: Graph Neural Networks (GNNs) offer a powerful approach to node classification in complex networks across many domains including social media, E-commerce, and FinTech. However, recent studies show that GNNs are vulnerable to attacks aimed at adversely impacting their node classification performance. Existing studies of adversarial attacks on GNNs focus primarily on manipulating the connectivity between existing nodes, a task that requires greater effort on the part of the attacker in real-world applications. In contrast, it is much more expedient on the part of the attacker to inject adversarial nodes, e.g., fake profiles with forged links, into existing graphs so as to reduce the performance of the GNN in classifying existing nodes.

【Keywords】: Computing methodologies; Machine learning

62. Efficient Implicit Unsupervised Text Hashing using Adversarial Autoencoder.

Paper Link】 【Pages】:684-694

【Authors】: Khoa D. Doan ; Chandan K. Reddy

【Abstract】: Searching for documents with semantically similar content is a fundamental problem in the information retrieval domain with various challenges, primarily, in terms of efficiency and effectiveness. Despite the promise of modeling structured dependencies in documents, several existing text hashing methods lack an efficient mechanism to incorporate such vital information. Additionally, the desired characteristics of an ideal hash function, such as robustness to noise, low quantization error and bit balance/uncorrelation, are not effectively learned with existing methods. This is because of the requirement to either tune additional hyper-parameters or optimize these heuristically and explicitly constructed cost functions. In this paper, we propose a Denoising Adversarial Binary Autoencoder (DABA) model which presents a novel representation learning framework that captures structured representation of text documents in the learned hash function. Also, adversarial training provides an alternative direction to implicitly learn a hash function that captures all the desired characteristics of an ideal hash function. Essentially, DABA adopts a novel single-optimization adversarial training procedure that minimizes the Wasserstein distance in its primal domain to regularize the encoder’s output of either a recurrent neural network or a convolutional autoencoder. We empirically demonstrate the effectiveness of our proposed method in capturing the intrinsic semantic manifold of the related documents. The proposed method outperforms the current state-of-the-art shallow and deep unsupervised hashing methods for the document retrieval task on several prominent document collections.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

63. LightRec: A Memory and Search-Efficient Recommender System.

Paper Link】 【Pages】:695-705

【Authors】: Defu Lian ; Haoyu Wang ; Zheng Liu ; Jianxun Lian ; Enhong Chen ; Xing Xie

【Abstract】: Deep recommender systems have achieved remarkable improvements in recent years. Despite their superior ranking precision, running efficiency and memory consumption turn out to be severe bottlenecks in practice. To overcome both limitations, we propose LightRec, a lightweight recommender system which enjoys fast online inference and economical memory consumption. The backbone of LightRec is a total of B codebooks, each of which is composed of W latent vectors, known as codewords. On top of such a structure, LightRec represents an item as an additive composition of B codewords, which are optimally selected from each of the codebooks. To effectively learn the codebooks from data, we devise an end-to-end learning workflow, where challenges on the inherent differentiability and diversity are conquered by the proposed techniques. In addition, to further improve the representation quality, several distillation strategies are employed, which better preserve user-item relevance scores and relative ranking orders. LightRec is extensively evaluated with four real-world datasets, which gives rise to two empirical findings: 1) compared with state-of-the-art lightweight baselines, LightRec achieves over 11% relative improvements in terms of recall performance; 2) compared to conventional recommendation algorithms, LightRec incurs negligible accuracy degradation while delivering more than 27x speedup in top-k recommendation.

【Keywords】:
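
The codebook structure described in the abstract can be illustrated with a small sketch: each item stores only B integer codes, and its vector is rebuilt by summing the selected codeword from each codebook. How the codes are selected and how the codebooks are learned is not shown here. The function names `compose_item` and `relevance` and the toy codebooks in the usage note are hypothetical.

```python
def compose_item(codebooks, codes):
    """Sum one codeword per codebook to obtain an item vector.

    codebooks: list of B codebooks, each a list of W vectors of equal length.
    codes:     list of B indices; codes[b] picks a codeword from codebooks[b].
    """
    if len(codebooks) != len(codes):
        raise ValueError("need exactly one code per codebook")
    dim = len(codebooks[0][0])
    item = [0.0] * dim
    for book, idx in zip(codebooks, codes):
        for d, value in enumerate(book[idx]):
            item[d] += value
    return item


def relevance(user, item):
    """Inner-product relevance score between a user vector and an item vector."""
    return sum(u * v for u, v in zip(user, item))
```

With B codebooks of W codewords each, an item needs only B small integers instead of a full floating-point vector, which is where the memory saving comes from. Because the inner product distributes over the sum, the user's score against every codeword can also be precomputed once and looked up per item.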

64. Clustering in graphs and hypergraphs with categorical edge labels.

Paper Link】 【Pages】:706-717

【Authors】: Ilya Amburg ; Nate Veldt ; Austin R. Benson

【Abstract】: Modern graph or network datasets often contain rich structure that goes beyond simple pairwise connections between nodes. This calls for complex representations that can capture, for instance, edges of different types as well as so-called “higher-order interactions” that involve more than two nodes at a time. However, we have fewer rigorous methods that can provide insight from such representations. Here, we develop a computational framework for the problem of clustering hypergraphs with categorical edge labels (or different interaction types), where clusters correspond to groups of nodes that frequently participate in the same type of interaction.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory

65. Few-Sample and Adversarial Representation Learning for Continual Stream Mining.

Paper Link】 【Pages】:718-728

【Authors】: Zhuoyi Wang ; Yigong Wang ; Yu Lin ; Evan Delord ; Latifur Khan

【Abstract】: Deep Neural Networks (DNNs) have primarily been demonstrated to be useful for closed-world classification problems where the number of categories is fixed. However, DNNs notoriously fail when tasked with label prediction in a non-stationary data stream scenario, in which unknown or novel classes (categories not in the training set) continuously emerge. For example, new topics continually emerge in social media or e-commerce. To solve this challenge, a DNN should not only be able to detect the novel class effectively but also incrementally learn new concepts from limited samples over time. Literature that addresses both problems simultaneously is limited. In this paper, we focus on improving the generalization of the model on the novel classes, and making the model continually learn from only a few samples from the novel categories. Different from existing approaches that rely on abundant labeled instances to re-train/update the model, we propose a new approach based on Few Sample and Adversarial Representation Learning (FSAR). The key novelty is that we introduce the adversarial confusion term into both the representation learning and few-sample learning process, which reduces the over-confidence of the model on the seen classes and further enhances the generalization of the model to detect and learn new categories with only a few samples. We train FSAR in two stages: first, FSAR learns an intra-class compacted and inter-class separated feature embedding to detect the novel classes; next, we collect a few labeled samples belonging to the new categories and utilize episode-training to exploit the intrinsic features for few-sample learning. We evaluated FSAR on different datasets, using extensive experimental results from various simulated stream benchmarks to show that FSAR effectively outperforms current state-of-the-art approaches.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by classification; Machine learning approaches; Neural networks

66. Field-aware Calibration: A Simple and Empirically Strong Method for Reliable Probabilistic Predictions.

Paper Link】 【Pages】:729-739

【Authors】: Feiyang Pan ; Xiang Ao ; Pingzhong Tang ; Min Lu ; Dapeng Liu ; Lei Xiao ; Qing He

【Abstract】: It is often observed that the probabilistic predictions given by a machine learning model can disagree with averaged actual outcomes on specific subsets of data, which is also known as the issue of miscalibration. It is responsible for the unreliability of practical machine learning systems. For example, in online advertising, an ad can receive a click-through rate prediction of 0.1 over some population of users where its actual click rate is 0.15. In such cases, the probabilistic predictions have to be fixed before the system can be deployed.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks
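
The click-through example in the abstract (predicted 0.1, observed 0.15 on some population) amounts to comparing the average prediction with the average outcome inside each subset. A minimal sketch of that measurement follows. It only detects miscalibration per field value and is not the paper's calibration method. The function name `field_calibration` and the record layout are assumptions.

```python
from collections import defaultdict


def field_calibration(records):
    """Return {field_value: (mean_prediction, mean_outcome)}.

    records: iterable of (field_value, predicted_probability, outcome),
             where outcome is 0 or 1 (for example, no click or click).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum of preds, sum of outcomes, count]
    for field, pred, outcome in records:
        acc = sums[field]
        acc[0] += pred
        acc[1] += outcome
        acc[2] += 1
    return {field: (p / n, y / n) for field, (p, y, n) in sums.items()}
```

A field value is well calibrated when the two numbers in its tuple are close. A model can look calibrated over the whole dataset while individual field values are off in opposite directions, which is why the comparison is made per subset.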

67. Mining Implicit Entity Preference from User-Item Interaction Data for Knowledge Graph Completion via Adversarial Learning.

Paper Link】 【Pages】:740-751

【Authors】: Gaole He ; Junyi Li ; Wayne Xin Zhao ; Peiju Liu ; Ji-Rong Wen

【Abstract】: The task of Knowledge Graph Completion (KGC) aims to automatically infer the missing fact information in a Knowledge Graph (KG). In this paper, we take a new perspective that aims to leverage rich user-item interaction data (user interaction data for short) for improving the KGC task. Our work is inspired by the observation that many KG entities correspond to online items in application systems. However, the two kinds of data sources have very different intrinsic characteristics, and a simple fusion strategy is likely to hurt the original performance.

【Keywords】:

68. Multiple Knowledge Syncretic Transformer for Natural Dialogue Generation.

Paper Link】 【Pages】:752-762

【Authors】: Xiangyu Zhao ; Longbiao Wang ; Ruifang He ; Ting Yang ; Jinxin Chang ; Ruifang Wang

【Abstract】: Knowledge is essential for intelligent conversation systems to generate informative responses. This knowledge comprises a wide range of diverse modalities such as knowledge graphs (KGs), grounding documents and conversation topics. However, limited abilities in understanding language and utilizing different types of knowledge still challenge existing approaches. Some researchers try to enhance models’ language comprehension ability by employing pre-trained language models, but they neglect the importance of external knowledge in specific tasks. In this paper, we propose a novel universal transformer-based architecture for dialogue systems, the Multiple Knowledge Syncretic Transformer (MKST), which fuses multi-knowledge in open-domain conversation. Firstly, the model is pre-trained on a large-scale corpus to learn commonsense knowledge. Then during fine-tuning, we divide the knowledge into two specific categories that are handled in different ways by our model. While the encoder is responsible for encoding dialogue contexts with multifarious knowledge together, the decoder with a knowledge-aware mechanism attentively reads the fusion of multi-knowledge to promote better generation. This is the first attempt to fuse multi-knowledge in one conversation model. The experimental results demonstrate that our model achieves significant improvement over state-of-the-art baselines on knowledge-driven dialogue generation tasks. Meanwhile, our new benchmark could facilitate further study in this research area.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing

69. JSCleaner: De-Cluttering Mobile Webpages Through JavaScript Cleanup.

Paper Link】 【Pages】:763-773

【Authors】: Moumena Chaqfeh ; Yasir Zaki ; Jacinta Hu ; Lakshmi Subramanian

【Abstract】: A significant fraction of the World Wide Web suffers from the excessive usage of JavaScript (JS). Based on an analysis of popular webpages, we observed that a considerable number of JS elements utilized by these pages are not essential for their visual and functional features. In this paper, we propose JSCleaner, a JavaScript de-cluttering engine that aims at simplifying webpages without compromising their content or functionality. JSCleaner relies on a rule-based classification algorithm that classifies JS into three main categories: non-critical, replaceable, and critical. JSCleaner removes non-critical JS from a webpage, translates replaceable JS elements with their HTML outcomes, and preserves critical JS. Our quantitative evaluation of 500 popular webpages shows that JSCleaner achieves around 30% reduction in page load times coupled with a 50% reduction in the number of requests and the page size. In addition, our qualitative user study of 103 evaluators shows that JSCleaner preserves 95% of the page content similarity, while maintaining nearly 88% of the page functionality (the remaining 12% did not have a major impact on the user browsing experience).

【Keywords】:
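
The three-way classification described in the JSCleaner abstract above can be pictured with a minimal sketch. The abstract does not list the actual rules, so the `ScriptInfo` fields, the keyword hints, and the rule order below are all hypothetical placeholders chosen for illustration; only the three category names and their associated actions come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class ScriptInfo:
    """Hypothetical summary of one JS element (not JSCleaner's real features)."""
    url: str
    handles_user_events: bool  # e.g. registers click or submit handlers
    writes_static_html: bool   # only injects fixed markup into the page


# Illustrative keyword hints only; a real rule set would be far richer.
AD_OR_TRACKER_HINTS = ("analytics", "tracker", "ads")

# Category -> action, as stated in the abstract.
ACTIONS = {
    "non-critical": "remove",
    "replaceable": "replace with HTML outcome",
    "critical": "preserve",
}


def classify(script):
    """Assign one of the three categories using assumed, simplified rules."""
    if script.handles_user_events:
        return "critical"
    if script.writes_static_html:
        return "replaceable"
    if any(hint in script.url for hint in AD_OR_TRACKER_HINTS):
        return "non-critical"
    return "critical"  # conservative default: keep anything unrecognized


def plan(scripts):
    """Return (url, action) pairs for a list of scripts."""
    return [(s.url, ACTIONS[classify(s)]) for s in scripts]
```

Defaulting unknown scripts to "critical" is a design choice of this sketch: it trades less clean-up for a lower risk of breaking page functionality.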

70. Conquering Cross-source Failure for News Credibility: Learning Generalizable Representations beyond Content Embedding.

Paper Link】 【Pages】:774-784

【Authors】: Yen-Hao Huang ; Ting-Wei Liu ; Ssu-Rui Lee ; Fernando Henrique Calderon Alvarado ; Yi-Shin Chen

【Abstract】: False information on the Internet has caused severe damage to society. Researchers have proposed methods to determine the credibility of news and have obtained good results. As different media sources (publishers) have different content generators (writers) and may focus on different topics or aspects, the word/topic distribution for each media source is divergent from others. We expose a challenge in the generalizability of existing content-based methods to perform consistently when applied to news from media sources absent from the training set, namely the cross-source failure. A cross-source setting can cause a decrease beyond in accuracy for current methods; content-sensitive features are considered one of the major causes of cross-source failure for a content-based approach. To overcome this challenge, we propose a syntactic network for news credibility (SYNC), which focuses on function words and syntactic structure to learn generalizable representations for news credibility and further reinforce the cross-source robustness for different media. Experiments with cross-validation on 194 real-world media sources showed that the proposed method could learn the generalizable features and outperformed the state-of-the-art methods on unseen media sources. Extensive analysis of the embedding feature representation shows a strength of the proposed method compared to current content embedding feature approaches. We envision that SYNC is more robust in real-life applications on account of its good generalizability.

【Keywords】:

71. Financial Defaulter Detection on Online Credit Payment via Multi-view Attributed Heterogeneous Information Network.

Paper Link】 【Pages】:785-795

【Authors】: Qiwei Zhong ; Yang Liu ; Xiang Ao ; Binbin Hu ; Jinghua Feng ; Jiayu Tang ; Qing He

【Abstract】: Default user detection is one of the backbones of credit risk forecasting and management. Given a set of corresponding features, e.g., patterns extracted from trading behaviors, it aims to predict the polarity indicating whether a user will fail to make required payments in the future. Recent efforts attempted to incorporate attributed heterogeneous information networks (AHIN) for extracting complex interactive features of users and achieved remarkable success in discovering specific default users such as fraud, cash-out users, etc. In this paper, we consider default users, a more general concept in credit risk, and propose a multi-view attributed heterogeneous information network based approach coined MAHINDER to remedy the special challenges. First, multiple views of user behaviors are adopted to learn personal profiles due to the endogenous aspect of financial default. Second, local behavioral patterns are specifically modeled since financial default is adversarial and accumulated. On real datasets containing 1.38 million users on the Alibaba platform, we investigate the effectiveness of MAHINDER, and the experimental results show that the proposed approach is able to improve AUC by over 2.8% and Recall@Precision=0.1 by over 13.1% compared with the state-of-the-art methods. Meanwhile, MAHINDER has as good interpretability as tree-based methods like GBDT, which buoys the deployment in online platforms.

【Keywords】: Information systems; Information systems applications; Data mining

72. Fast Generating A Large Number of Gumbel-Max Variables.

Paper Link】 【Pages】:796-807

【Authors】: Yiyan Qi ; Pinghui Wang ; Yuanming Zhang ; Junzhou Zhao ; Guangjian Tian ; Xiaohong Guan

【Abstract】: The well-known Gumbel-Max Trick for sampling elements from a categorical distribution (or more generally a nonnegative vector) and its variants have been widely used in areas such as machine learning and information retrieval. To sample a random element i (or a Gumbel-Max variable i) in proportion to its positive weight v_i, the Gumbel-Max Trick first computes a Gumbel random variable g_i for each positive weight element i, and then samples the element i with the largest value of g_i + ln v_i. Recently, applications including similarity estimation and graph embedding require generating k independent Gumbel-Max variables from high dimensional vectors. However, it is computationally expensive for a large k (e.g., hundreds or even thousands) when using the traditional Gumbel-Max Trick. To solve this problem, we propose a novel algorithm, FastGM, that reduces the time complexity from O(k n^+) to O(k ln k + n^+), where n^+ is the number of positive elements in the vector of interest. Instead of computing k independent Gumbel random variables directly, we find that there exists a technique to generate these variables in descending order. Using this technique, our method FastGM computes variables g_i + ln v_i for all positive elements i in descending order. As a result, FastGM significantly reduces the computation time because we can stop the procedure of computing Gumbel random variables for many elements, especially for those with small weights. Experiments on a variety of real-world datasets show that FastGM is orders of magnitude faster than state-of-the-art methods without sacrificing accuracy or incurring additional expenses.

【Keywords】:
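
The traditional Gumbel-Max Trick that the abstract above uses as its baseline can be sketched as follows. This is not FastGM itself: it is the straightforward procedure, which draws one Gumbel variable per positive element for each of the k samples. The function names and the use of Python's `random.Random` are choices made for this sketch.

```python
import math
import random


def gumbel_max_sample(weights, rng):
    """Return an index i drawn with probability proportional to weights[i].

    Only positive weights take part. For each one, draw a standard Gumbel
    variable g_i = -ln(-ln(u)), u ~ Uniform(0, 1), and keep the index that
    maximizes g_i + ln(v_i).
    """
    best_index, best_score = None, -math.inf
    for i, v in enumerate(weights):
        if v <= 0:
            continue
        u = rng.random()
        while u == 0.0:  # rng.random() lies in [0, 1); avoid log(0)
            u = rng.random()
        score = -math.log(-math.log(u)) + math.log(v)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def gumbel_max_variables(weights, k, rng):
    """Generate k independent Gumbel-Max variables the traditional way."""
    return [gumbel_max_sample(weights, rng) for _ in range(k)]
```

Each call to `gumbel_max_sample` touches every positive element, so k samples cost on the order of k times the number of positive elements, which is the cost the abstract says FastGM reduces.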

73. Don't Count Me Out: On the Relevance of IP Address in the Tracking Ecosystem.

Paper Link】 【Pages】:808-815

【Authors】: Vikas Mishra ; Pierre Laperdrix ; Antoine Vastel ; Walter Rudametkin ; Romain Rouvoy ; Martin Lopatka

【Abstract】: Targeted online advertising has become an inextricable part of the way Web content and applications are monetized. At the beginning, online advertising consisted of simple ad-banners broadly shown to website visitors. Over time, it evolved into a complex ecosystem that tracks and collects a wealth of data to learn user habits and show targeted and personalized ads. To protect users against tracking, several countermeasures have been proposed, ranging from browser extensions that leverage filter lists, to features natively integrated into popular browsers like Firefox and Brave to combat more modern techniques like browser fingerprinting. Nevertheless, few browsers offer protections against IP address-based tracking techniques. Notably, the most popular browsers, Chrome, Firefox, Safari and Edge do not offer any.

【Keywords】:

74. Go See a Specialist? Predicting Cybercrime Sales on Online Anonymous Markets from Vendor and Product Characteristics.

Paper Link】 【Pages】:816-826

【Authors】: Rolf van Wegberg ; Fieke Miedema ; Ugur Akyazi ; Arman Noroozian ; Bram Klievink ; Michel van Eeten

【Abstract】: Many cybercriminal entrepreneurs lack the skills and techniques to provision certain parts of their business model, leading them to outsource these parts to specialized criminal vendors. Online anonymous markets, from Silk Road to AlphaBay, have been used to search for these products and contract with their criminal vendors. While one listing of a product generates high sales numbers, another identical listing fails to sell. In this paper, we investigate which factors determine the performance of cybercrime products.

【Keywords】:

75. Adversarial Multimodal Representation Learning for Click-Through Rate Prediction.

Paper Link】 【Pages】:827-836

【Authors】: Xiang Li ; Chao Wang ; Jiwei Tan ; Xiaoyi Zeng ; Dan Ou ; Bo Zheng

【Abstract】: For better user experience and business effectiveness, Click-Through Rate (CTR) prediction has been one of the most important tasks in E-commerce. Although extensive CTR prediction models have been proposed, learning good representations of items from multimodal features is still less investigated, considering that an item in E-commerce usually contains multiple heterogeneous modalities. Previous works either concatenate the multiple modality features, which is equivalent to giving a fixed importance weight to each modality, or learn dynamic weights of different modalities for different items through techniques like the attention mechanism. However, a problem is that there usually exists common redundant information across multiple modalities. The dynamic weights of different modalities computed by using the redundant information may not correctly reflect the different importance of each modality. To address this, we explore the complementarity and redundancy of modalities by considering modality-specific and modality-invariant features differently. We propose a novel Multimodal Adversarial Representation Network (MARN) for the CTR prediction task. A multimodal attention network first calculates the weights of multiple modalities for each item according to its modality-specific features. Then a multimodal adversarial network learns modality-invariant representations where a double-discriminators strategy is introduced. Finally, we achieve the multimodal item representations by combining both modality-specific and modality-invariant representations. We conduct extensive experiments on both public and industrial datasets, and the proposed method consistently achieves remarkable improvements over the state-of-the-art methods. Moreover, the approach has been deployed in an operational E-commerce system and online A/B testing further demonstrates the effectiveness.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks
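
The "dynamic weights" that the abstract above contrasts with fixed-weight concatenation can be illustrated with a generic softmax-weighted fusion. This is a sketch of plain attention-style weighting, not of MARN: the per-modality scores are taken as given inputs here, whereas the abstract says MARN derives its weights from modality-specific features, and nothing adversarial is modeled.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(modality_vectors, scores):
    """Weighted sum of equally sized modality vectors.

    modality_vectors: one feature vector per modality (e.g. image, text).
    scores: one hypothetical relevance score per modality for this item.
    """
    weights = softmax(scores)
    dim = len(modality_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, modality_vectors))
        for d in range(dim)
    ]
```

With equal scores every modality gets the same weight, which mirrors the fixed-importance case; item-specific scores shift the weights per item.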

76. Dual Learning for Explainable Recommendation: Towards Unifying User Preference Prediction and Review Generation.

Paper Link】 【Pages】:837-847

【Authors】: Peijie Sun ; Le Wu ; Kun Zhang ; Yanjie Fu ; Richang Hong ; Meng Wang

【Abstract】: In many recommender systems, users express item opinions through two kinds of behaviors: giving preferences and writing detailed reviews. As both kinds of behaviors reflect users’ assessment of items, review enhanced recommender systems leverage these two kinds of user behaviors to boost recommendation performance. On the one hand, researchers proposed to better model the user and item embeddings with additional review information for enhancing preference prediction accuracy. On the other hand, some recent works focused on automatically generating item reviews for recommendation explanations with related user and item embeddings. We argue that, while the task of preference prediction with the accuracy goal is well recognized in the community, the task of generating reviews for explainable recommendation is also important to gain user trust and increase conversion rate. Some preliminary attempts have considered jointly modeling these two tasks, with the user and item embeddings shared. These studies empirically showed that these two tasks are correlated, and that jointly modeling them would benefit the performance of both tasks.

【Keywords】:

77. The Chameleon Attack: Manipulating Content Display in Online Social Media.

Paper Link】 【Pages】:848-859

【Authors】: Aviad Elyashar ; Sagi Uziel ; Abigail Paradise ; Rami Puzis

【Abstract】: Online social networks (OSNs) are ubiquitous, attracting millions of users all over the world. As a popular communication medium, OSNs are exploited in a variety of cyber-attacks. In this article, we discuss the chameleon attack technique, a new type of OSN-based trickery where malicious posts and profiles change the way they are displayed to OSN users to conceal themselves before the attack or avoid detection. Using this technique, adversaries can, for example, avoid censorship by concealing true content when it is about to be inspected; acquire social capital to promote new content while piggybacking on a trending one; cause embarrassment and serious reputation damage by tricking a victim into liking, retweeting, or commenting on a message that they would not normally engage with, without any indication of the trickery within the OSN. An experiment performed with closed Facebook groups of sports fans shows that (1) chameleon pages can pass by the moderation filters by changing the way their posts are displayed and (2) moderators do not distinguish between regular and chameleon pages. We list the OSN weaknesses that facilitate the chameleon attack and propose a set of mitigation guidelines.

【Keywords】: Social and professional topics; Computing / technology policy; Computer crime

78. A Generic Solver Combining Unsupervised Learning and Representation Learning for Breaking Text-Based Captchas.

Paper Link】 【Pages】:860-871

【Authors】: Sheng Tian ; Tao Xiong

【Abstract】: Although there are many alternative captcha schemes available, text-based captchas are still one of the most popular security mechanisms to maintain Internet security and prevent malicious attacks, due to user preferences and ease of design. Over the past decade, different methods of breaking captchas have been proposed, which helps captchas keep evolving and become more robust. However, these previous works generally require heavy expert involvement and gradually become ineffective with the introduction of new security features. This paper proposes a generic solver combining unsupervised learning and representation learning to automatically remove the noisy background of captchas and solve text-based captchas. We introduce a new training scheme for constructing mini-batches, which contain a large number of unlabeled hard examples, to improve the efficiency of representation learning. Unlike existing deep learning algorithms, our method requires significantly fewer labeled samples and surpasses the recognition performance of a fully-supervised model with the same network architecture. Moreover, extensive experiments show that the proposed method outperforms the state of the art by delivering a higher accuracy on various captcha schemes. We provide further discussions of potential applications of the proposed unified framework. We hope that our work can inspire the community to enhance the security of text-based captchas.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

79. Dynamic Composition for Conversational Domain Exploration.

Paper Link】 【Pages】:872-883

【Authors】: Idan Szpektor ; Deborah Cohen ; Gal Elidan ; Michael Fink ; Avinatan Hassidim ; Orgad Keller ; Sayali Kulkarni ; Eran Ofek ; Sagie Pudinsky ; Asaf Revach ; Shimi Salant ; Yossi Matias

【Abstract】: We study conversational domain exploration (CODEX), where the user’s goal is to enrich her knowledge of a given domain by conversing with an informative bot. Such conversations should be well grounded in high-quality domain knowledge as well as engaging and open-ended. A CODEX bot should be proactive and introduce relevant information even if not directly asked for by the user. The bot should also appropriately pivot the conversation to undiscovered regions of the domain. To address these dialogue characteristics, we introduce a novel approach termed dynamic composition that decouples candidate content generation from the flexible composition of bot responses. This allows the bot to control the source, correctness and quality of the offered content, while achieving flexibility via a dialogue manager that selects the most appropriate contents in a compositional manner. We implemented a CODEX bot based on dynamic composition and integrated it into the Google Assistant. As an example domain, the bot conversed about the NBA basketball league in a seamless experience, such that users were not aware whether they were conversing with the vanilla system or the one augmented with our CODEX bot. Results are positive and offer insights into what makes for a good conversation. To the best of our knowledge, this is the first real user experiment of open-ended dialogues as part of a commercial assistant system.

【Keywords】:

80. Deconstructing Google's Web Light Service.

Paper Link】 【Pages】:884-893

【Authors】: Ammar Tahir ; Muhammad Tahir Munir ; Shaiq Munir Malik ; Zafar Ayyub Qazi ; Ihsan Ayyub Qazi

【Abstract】: Web Light is a transcoding service introduced by Google to show lighter and faster webpages to users searching on slow mobile clients. The service detects slow clients (e.g., users on 2G) and tries to convert webpages on the fly into a version optimized for these clients. Web Light claims to significantly reduce page load times, save user data, and substantially increase traffic to such webpages. However, there are several concerns around this service, including its effectiveness in preserving relevant content on a page, showing third-party advertisements, and improving user performance, as well as privacy concerns for users and publishers.

【Keywords】:

81. A First Look at Commercial 5G Performance on Smartphones.

Paper Link】 【Pages】:894-905

【Authors】: Arvind Narayanan ; Eman Ramadan ; Jason Carpenter ; Qingxu Liu ; Yu Liu ; Feng Qian ; Zhi-Li Zhang

【Abstract】: We conduct, to our knowledge, a first measurement study of commercial 5G performance on smartphones by closely examining 5G networks of three carriers (two mmWave carriers, one mid-band carrier) in three U.S. cities. We conduct extensive field tests on 5G performance in diverse urban environments. We systematically analyze the handoff mechanisms in 5G and their impact on network performance. We explore the feasibility of using location and possibly other environmental information to predict the network performance. We also study the app performance (web browsing and HTTP download) over 5G. Our study consumes more than 15 TB of cellular data. Conducted when 5G just made its debut, it provides a “baseline” for studying how 5G performance evolves, and identifies key research directions on improving 5G users’ experience in a cross-layer manner. We have released the data collected from our study (referred to as 5Gophers) at https://fivegophers.umn.edu/www20.

【Keywords】: Networks; Network types; Mobile networks

82. Next Point-of-Interest Recommendation on Resource-Constrained Mobile Devices.

Paper Link】 【Pages】:906-916

【Authors】: Qinyong Wang ; Hongzhi Yin ; Tong Chen ; Zi Huang ; Hao Wang ; Yanchang Zhao ; Nguyen Quoc Viet Hung

【Abstract】: In the modern tourism industry, next point-of-interest (POI) recommendation is an important mobile service as it effectively helps hesitating travelers decide the next POI to visit. Currently, most next POI recommender systems are built upon a cloud-based paradigm, where the recommendation models are trained and deployed on powerful cloud servers. When a recommendation request is made by a user via a mobile device, the current contextual information is uploaded to the cloud servers to help the well-trained models generate personalized recommendation results. However, in reality, this paradigm heavily relies on high-quality network connectivity, and is subject to a high energy footprint in operation and increasing privacy concerns among the public. To bypass these defects, we propose a novel Light Location Recommender System (LLRec) to perform next POI recommendation locally on resource-constrained mobile devices. To make LLRec fully compatible with the limited computing resources and memory space, we leverage FastGRNN, a lightweight but effective gated Recurrent Neural Network (RNN), as its main building block, and significantly compress the model size by adopting the tensor-train composition in the embedding layer. As a compact model, LLRec maintains its robustness via an innovative teacher-student training framework, where a powerful teacher model is trained on the cloud to learn essential knowledge from available contextual data, and the simplified student model LLRec is trained under the guidance of the teacher model. The final LLRec is downloaded and deployed on users’ mobile devices to generate accurate recommendations solely utilizing users’ local data. As a result, LLRec significantly reduces the dependency on cloud servers, thus allowing for next POI recommendation in a stable, cost-effective and secure way. Extensive experiments on two large-scale recommendation datasets further demonstrate the superiority of our proposed solution.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

83. Adversarial Attack on Community Detection by Hiding Individuals.

Paper Link】 【Pages】:917-927

【Authors】: Jia Li ; Honglei Zhang ; Zhichao Han ; Yu Rong ; Hong Cheng ; Junzhou Huang

【Abstract】: It has been demonstrated that adversarial graphs, i.e., graphs with imperceptible perturbations added, can cause deep graph models to fail on node/graph classification tasks. In this paper, we extend adversarial graphs to the problem of community detection, which is much more difficult. We focus on black-box attacks and aim to hide targeted individuals from the detection of deep graph community detection models, which has many applications in real-world scenarios, for example, protecting personal privacy in social networks and understanding camouflage patterns in transaction networks. We propose an iterative learning framework that takes turns to update two modules: one working as the constrained graph generator and the other as the surrogate community detection model. We also find that the adversarial graphs generated by our method can be transferred to other learning-based community detection models.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

84. Modeling Users' Behavior Sequences with Hierarchical Explainable Network for Cross-domain Fraud Detection.

Paper Link】 【Pages】:928-938

【Authors】: Yongchun Zhu ; Dongbo Xi ; Bowen Song ; Fuzhen Zhuang ; Shuai Chen ; Xi Gu ; Qing He

【Abstract】: With the explosive growth of the e-commerce industry, detecting online transaction fraud in real-world applications has become increasingly important to the development of e-commerce platforms. The sequential behavior history of users provides useful information in differentiating fraudulent payments from regular ones. Recently, some approaches have been proposed to solve this sequence-based fraud detection problem. However, these methods usually suffer from two problems: the prediction results are difficult to explain and the exploitation of the internal information of behaviors is insufficient. To tackle the above two problems, we propose a Hierarchical Explainable Network (HEN) to model users’ behavior sequences, which could not only improve the performance of fraud detection but also make the inference process interpretable.

【Keywords】:

85. Valve: Securing Function Workflows on Serverless Computing Platforms.

Paper Link】 【Pages】:939-950

【Authors】: Pubali Datta ; Prabuddha Kumar ; Tristan Morris ; Michael Grace ; Amir Rahmati ; Adam Bates

【Abstract】: Serverless Computing has quickly emerged as a dominant cloud computing paradigm, allowing developers to rapidly prototype event-driven applications using a composition of small functions that each perform a single logical task. However, many such application workflows are based in part on publicly available functions developed by third parties, creating the potential for functions to behave in unexpected, or even malicious, ways. At present, developers are not in total control of where and how their data is flowing, creating significant security and privacy risks in growth markets that have embraced serverless (e.g., IoT).

【Keywords】:

86. Experimental Evidence Extraction System in Data Science with Hybrid Table Features and Ensemble Learning.

Paper Link】 【Pages】:951-961

【Authors】: Wenhao Yu ; Wei Peng ; Yu Shu ; Qingkai Zeng ; Meng Jiang

【Abstract】: Data Science has been one of the most popular fields in higher education and research activities. It takes tons of time to read the experimental sections of thousands of papers and figure out the performance of the data science techniques. In this work, we build an experimental evidence extraction system to automate the integration of tables (in the paper PDFs) into a database of experimental results. First, it crops the tables and recognizes the templates. Second, it classifies the column names and row names into “method”, “dataset”, or “evaluation metric”, and then unifies all the table cells into (method, dataset, metric, score)-quadruples. We propose hybrid features including structural and semantic table features as well as an ensemble learning approach for column/row name classification and table unification. SQL statements can be used to answer questions such as whether a method is the state-of-the-art or whether the reported numbers are conflicting.

【Keywords】: Information systems; Information systems applications; Data mining
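
The closing claim of the abstract above, that SQL can answer questions over (method, dataset, metric, score)-quadruples, can be sketched with Python's built-in `sqlite3` module. The table name `results`, the helper names, and the assumption that a higher score is better are choices made for this sketch, not details taken from the paper.

```python
import sqlite3


def build_results_db(quadruples):
    """Load (method, dataset, metric, score) tuples into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (method TEXT, dataset TEXT, metric TEXT, score REAL)"
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", quadruples)
    return conn


def best_method(conn, dataset, metric):
    """Method with the highest reported score (assumes higher is better)."""
    row = conn.execute(
        "SELECT method FROM results WHERE dataset = ? AND metric = ? "
        "ORDER BY score DESC LIMIT 1",
        (dataset, metric),
    ).fetchone()
    return row[0] if row else None


def conflicting_reports(conn):
    """(method, dataset, metric) triples reported with more than one score."""
    return conn.execute(
        "SELECT method, dataset, metric FROM results "
        "GROUP BY method, dataset, metric "
        "HAVING COUNT(DISTINCT score) > 1"
    ).fetchall()
```

A metric where lower is better (for example an error rate) would need `ORDER BY score ASC` instead, so a fuller schema would probably record the direction of each metric.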

87. Identifying Referential Intention with Heterogeneous Contexts.

Paper Link】 【Pages】:962-972

【Authors】: Wenhao Yu ; Mengxia Yu ; Tong Zhao ; Meng Jiang

【Abstract】: Citing, quoting, and forwarding & commenting behaviors are widely seen in academia, news media, and social media. Existing behavior modeling approaches focused on mining content and describing preferences of authors, speakers, and users. However, behavioral intention plays an important role in generating content on the platforms. In this work, we propose to identify the referential intention which motivates the action of using the referred (e.g., cited, quoted, and retweeted) source and content to support their claims. We adopt a theory in sociology to develop a schema of four types of intentions. The challenge lies in the heterogeneity of observed contextual information surrounding the referential behavior, such as referred content (e.g., a cited paper), local context (e.g., the sentence citing the paper), neighboring context (e.g., the former and latter sentences), and network context (e.g., the academic network of authors, affiliations, and keywords). We propose a new neural framework with Interactive Hierarchical Attention (IHA) to identify the intention of referential behavior by properly aggregating the heterogeneous contexts. Experiments demonstrate that the proposed method can effectively identify the type of intention of citing behaviors (on academic data) and retweeting behaviors (on Twitter). Moreover, learning the heterogeneous contexts collectively can improve the performance. This work opens a door for understanding content generation from a fundamental perspective of behavior sciences.

【Keywords】:

88. PG2S+: Stack Distance Construction Using Popularity, Gap and Machine Learning.

Paper Link】 【Pages】:973-983

【Authors】: Jiangwei Zhang ; Y. C. Tay

【Abstract】: Stack distance characterizes temporal locality of workloads and has played a vital role in cache analysis since the 1970s. However, exact stack distance calculation is too costly, and impractical for online use. Hence, much work was done to optimize the exact computation, or approximate it through sampling or modeling.

【Keywords】: General and reference; Cross-computing tools and techniques; Performance
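
The exact stack distance computation that the abstract above calls too costly can be sketched in its simplest form. This is the textbook LRU-stack procedure, not the PG2S+ method. The convention assumed here is that the stack distance of an access is the number of distinct other items referenced since the previous access to the same item, with `None` standing for a first access (infinite distance).

```python
def stack_distances(trace):
    """Exact LRU stack distance for every access in a trace.

    The list `stack` keeps items in recency order, most recent last.
    Each access scans the stack, so the cost grows with the number of
    distinct items, which is why exact computation is expensive online.
    """
    stack = []
    distances = []
    for item in trace:
        if item in stack:
            pos = stack.index(item)
            # Items after `pos` were touched more recently than `item`.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)  # first access: infinite distance
        stack.append(item)
    return distances


def lru_hits(trace, cache_size):
    """Count hits of an LRU cache holding `cache_size` items."""
    return sum(
        1 for d in stack_distances(trace) if d is not None and d < cache_size
    )
```

Under this convention an access hits in an LRU cache of size C exactly when its stack distance is below C, so one pass over the trace yields hit counts for every cache size.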

89. SMART-KG: Hybrid Shipping for SPARQL Querying on the Web.

Paper Link】 【Pages】:984-994

【Authors】: Amr Azzam ; Javier D. Fernández ; Maribel Acosta ; Martin Beno ; Axel Polleres

【Abstract】: While Linked Data (LD) provides standards for publishing (RDF) and (SPARQL) querying Knowledge Graphs (KGs) on the Web, serving, accessing and processing such open, decentralized KGs is often practically impossible, as query timeouts on publicly available SPARQL endpoints show. Alternative solutions such as Triple Pattern Fragments (TPF) attempt to tackle the problem of availability by pushing query processing workload to the client side, but suffer from unnecessary transfer of irrelevant data on complex queries with large intermediate results. In this paper we present smart-KG, a novel approach to share the load between servers and clients, while significantly reducing data transfer volume, by combining TPF with shipping compressed KG partitions. Our evaluations show that smart-KG outperforms state-of-the-art client-side solutions and increases server-side availability towards more cost-effective and balanced hosting of open and decentralized KGs.

【Keywords】: Information systems; Data management systems; Database management system engines; Database query processing; Theory of computation; Theory and algorithms for application domains; Database theory; Database query processing and optimization (theory)

90. Learning from Cross-Modal Behavior Dynamics with Graph-Regularized Neural Contextual Bandit.

Paper Link】 【Pages】:995-1005

【Authors】: Xian Wu ; Suleyman Cetintas ; Deguang Kong ; Miao Lu ; Jian Yang ; Nitesh V. Chawla

【Abstract】: Contextual multi-armed bandit algorithms have received significant attention in modeling users’ preferences for online personalized recommender systems in a timely manner. While significant progress has been made along this direction, a few major challenges have not been well addressed yet: (i) a vast majority of the literature is based on linear models that cannot capture complex non-linear inter-dependencies of user-item interactions; (ii) existing literature mainly ignores the latent relations among users and non-recommended items, and hence may not properly reflect users’ preferences in the real world; (iii) current solutions are mainly based on historical data and are prone to cold-start problems for new users who have no interaction history.

【Keywords】:

91. Read Between the Lines: An Empirical Measurement of Sensitive Applications of Voice Personal Assistant Systems.

Paper Link】 【Pages】:1006-1017

【Authors】: Faysal Hossain Shezan ; Hang Hu ; Jiamin Wang ; Gang Wang ; Yuan Tian

【Abstract】: Voice Personal Assistant (VPA) systems such as Amazon Alexa and Google Home have been used by tens of millions of households. Recent work demonstrated proof-of-concept attacks against their voice interface to invoke unintended applications or operations. However, there is still a lack of empirical understanding of what types of third-party applications VPA systems support, and what consequences these attacks may cause. In this paper, we perform an empirical analysis of the third-party applications of Amazon Alexa and Google Home to systematically assess the attack surfaces. A key methodology is to characterize a given application by classifying the sensitive voice commands it accepts. We develop a natural language processing tool that classifies a given voice command along two dimensions: (1) whether the voice command is designed to inject an action or retrieve information; (2) whether the command is sensitive or nonsensitive. The tool combines a deep neural network and a keyword-based model, and uses Active Learning to reduce the manual labeling effort. The sensitivity classification is based on a user study (N=404) where we measure the perceived sensitivity of voice commands. A ground-truth evaluation shows that our tool achieves over 95% accuracy for both types of classifications. We apply this tool to analyze 77,957 Amazon Alexa applications and 4,813 Google Home applications (198,199 voice commands from Amazon Alexa, 13,644 voice commands from Google Home) over two years (2018-2019). In total, we identify 19,263 sensitive “action injection” commands and 5,352 sensitive “information retrieval” commands. These commands are from 4,596 applications (5.55% of all applications), most of which belong to the “smart home” category. While the percentage of sensitive applications is small, we show the percentage is increasing over time from 2018 to 2019.

【Keywords】:

92. A Kernel of Truth: Determining Rumor Veracity on Twitter by Diffusion Pattern Alone.

Paper Link】 【Pages】:1018-1028

【Authors】: Nir Rosenfeld ; Aron Szanto ; David C. Parkes

【Abstract】: Recent work in the domain of misinformation detection has leveraged rich signals in the text and user identities associated with content on social media. But text can be strategically manipulated and accounts reopened under different aliases, suggesting that these approaches are inherently brittle. In this work, we investigate an alternative modality that is naturally robust: the pattern in which information propagates. Can the veracity of an unverified rumor spreading online be discerned solely on the basis of its pattern of diffusion through the social network?

【Keywords】: Computing methodologies; Machine learning

93. Patient-Trial Matching with Deep Embedding and Entailment Prediction.

Paper Link】 【Pages】:1029-1037

【Authors】: Xingyao Zhang ; Cao Xiao ; Lucas Glass ; Jimeng Sun

【Abstract】: Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The core problem of patient-trial matching is to find qualified patients for a trial, where patient information is stored in electronic health records (EHR) while trial eligibility criteria (EC) are described in text documents available on the web. How to represent longitudinal patient EHR? How to extract complex logical rules from EC? Most existing works rely on manual rule-based extraction, which is time consuming and inflexible for complex inference. To address these challenges, we propose a cross-modal inference learning model that jointly encodes enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference. The model applies a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to encode clinical trial information into sentence embeddings, and uses a hierarchical embedding model to represent longitudinal patient EHR. In addition, the model is augmented by a numerical information embedding and entailment module to reason over numerical information in both EC and EHR. These encoders are trained jointly to optimize the patient-trial matching score. We evaluated the model on the patient-trial matching task using real-world datasets. It outperformed the best baseline by up to 12.4% in average F1.

【Keywords】:

94. Weakly Supervised Attention for Hashtag Recommendation using Graph Data.

Paper Link】 【Pages】:1038-1048

【Authors】: Amin Javari ; Zhankui He ; Zijie Huang ; Jeetu Raj ; Kevin Chen-Chuan Chang

【Abstract】: Personalized hashtag recommendation for users could substantially promote user engagement in microblogging websites; users can discover microblogs aligned with their interests. However, user profiling on microblogging websites is challenging because most users tend not to generate content. Our core idea is to build a graph-based profile of users and incorporate it into hashtag recommendation. Indeed, user’s followee/follower links implicitly indicate their interests. Considering that microblogging networks are scale-free networks, to maintain the efficiency and effectiveness of the model, rather than analyzing the entire network, we model users based on their links towards hub nodes. That is, hashtags and hub nodes are projected into a shared latent space. To predict the relevance of a user to a hashtag, a projection of the user is built by aggregating the embeddings of her hub neighbors guided by an attention model and then compared with the hashtag. Classically, attention models can be trained in an end to end manner. However, due to the high complexity of our problem, we propose a novel weak supervision model for the attention component, which significantly improves the effectiveness of the model. We performed extensive experiments on two datasets collected from Twitter and Weibo, and the results confirm that our method substantially outperforms the baselines.

【Keywords】:

95. Real-Time Clustering for Large Sparse Online Visitor Data.

Paper Link】 【Pages】:1049-1059

【Authors】: Gromit Yeuk-Yin Chan ; Fan Du ; Ryan A. Rossi ; Anup B. Rao ; Eunyee Koh ; Cláudio T. Silva ; Juliana Freire

【Abstract】: Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behaviors. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to provide real-time responses to user parameter changes, which current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incurs a quadratic computation overhead and is infeasible for even moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation compared to a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20× speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory footprint. Finally, we present an interface to explore customer segments from millions of online visitor records in real time.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Unsupervised learning; Cluster analysis
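The abstract above relies on MinHash and LSH to avoid computing all pairwise distances. The following is a minimal, generic sketch of those two building blocks, not the paper's sparse density peaks algorithm or its Spark implementation. The function names, the linear hash family, and the assumption that behaviors are encoded as non-negative integer column ids are all illustrative choices.

```python
import random

PRIME = (1 << 61) - 1  # large prime modulus for the illustrative hash family


def make_hash_params(num_hashes, seed=0):
    # Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(item_ids, params):
    # Keep the minimum hash value per function over a non-empty set of ids.
    return [min((a * x + b) % PRIME for x in item_ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions estimates Jaccard similarity.
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    # Exact value, for comparison on small inputs.
    return len(set_a & set_b) / len(set_a | set_b)


def lsh_band_keys(signature, num_bands):
    # Split the signature into bands; two rows that share any band key
    # become candidate neighbors and are the only pairs compared further.
    rows = len(signature) // num_bands
    return [(i, tuple(signature[i * rows:(i + 1) * rows]))
            for i in range(num_bands)]
```

With more hash functions the estimate tightens at the cost of longer signatures; the band count trades recall of similar pairs against the number of candidate comparisons.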

96. Smaller, Faster & Lighter KNN Graph Constructions.

Paper Link】 【Pages】:1060-1070

【Authors】: Rachid Guerraoui ; Anne-Marie Kermarrec ; Olivier Ruas ; François Taïani

【Abstract】: We propose GoldFinger, a new compact and fast-to-compute binary representation of datasets to approximate Jaccard’s index. We illustrate the effectiveness of GoldFinger on the emblematic big data problem of K-Nearest-Neighbor (KNN) graph construction and show that GoldFinger can drastically accelerate a large range of existing KNN algorithms with little to no overhead. As a side effect, we also show that the compact representation of the data protects users’ privacy for free by providing k-anonymity and l-diversity. Our extensive evaluation of the resulting approach on several realistic datasets shows that our approach delivers speedups of up to 78.9% compared to the use of raw data while only incurring a negligible to moderate loss in terms of KNN quality. To convey the practical value of such a scheme, we apply it to item recommendation and show that the loss in recommendation quality is negligible.

【Keywords】:
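The abstract above describes a compact binary representation used to approximate Jaccard's index. As a rough illustration of that general idea, the sketch below hashes each item to one bit of a fixed-width fingerprint and estimates similarity from bit counts. This is an assumed, simplified construction for illustration only; the fingerprint width, the use of SHA-256, and the function names are not taken from the paper.

```python
import hashlib


def fingerprint(items, num_bits=1024):
    # Map every item to a single bit position and set that bit.
    # The fingerprint is stored as a Python int used as a bit vector.
    bits = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % num_bits
        bits |= 1 << position
    return bits


def jaccard_from_fingerprints(fp_a, fp_b):
    # Approximate |A ∩ B| / |A ∪ B| by popcount(AND) / popcount(OR).
    union_bits = bin(fp_a | fp_b).count("1")
    if union_bits == 0:
        return 0.0
    return bin(fp_a & fp_b).count("1") / union_bits
```

Because several items can collide on the same bit, the estimate is approximate and the original items cannot be read back from the fingerprint, which is the intuition behind the privacy side effect the abstract mentions.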

97. Automatic Boolean Query Formulation for Systematic Review Literature Search.

Paper Link】 【Pages】:1071-1081

【Authors】: Harrisen Scells ; Guido Zuccon ; Bevan Koopman ; Justin Clark

【Abstract】: Formulating Boolean queries for systematic review literature search is a challenging task. Commonly, queries are formulated by information specialists using the protocol specified in the review and interactions with the research team. Information specialists have in-depth experience on how to formulate queries in this domain, but may not have in-depth knowledge about the reviews’ topics. Query formulation requires a significant amount of time and effort, and is performed interactively; specialists repeatedly formulate queries, attempt to validate their results, and reformulate specific Boolean clauses. In this paper, we investigate the possibility of automatically formulating a Boolean query from the systematic review protocol. We propose a novel five-step approach to automatic query formulation, specific to Boolean queries in this domain, which approximates the process by which information specialists formulate queries. In this process, we use syntax parsing to derive the logical structure of high-level concepts in a query, automatically extract and map concepts to entities in order to perform entity expansion, and finally apply post-processing operations (such as stemming and search filters).

【Keywords】: Information systems; Information retrieval; Information retrieval query processing

98. Traffic Flow Prediction via Spatial Temporal Graph Neural Network.

Paper Link】 【Pages】:1082-1092

【Authors】: Xiaoyang Wang ; Yao Ma ; Yiqi Wang ; Wei Jin ; Xin Wang ; Jiliang Tang ; Caiyan Jia ; Jian Yu

【Abstract】: Traffic flow analysis, prediction and management are keystones for building smart cities in the new era. With the help of deep neural networks and big traffic data, we can better understand the latent patterns hidden in the complex transportation networks. The dynamic of the traffic flow on one road not only depends on the sequential patterns in the temporal dimension but also relies on other roads in the spatial dimension. Although there are existing works on predicting the future traffic flow, the majority of them have certain limitations on modeling spatial and temporal dependencies. In this paper, we propose a novel spatial temporal graph neural network for traffic flow prediction, which can comprehensively capture spatial and temporal patterns. In particular, the framework offers a learnable positional attention mechanism to effectively aggregate information from adjacent roads. Meanwhile, it provides a sequential component to model the traffic flow dynamics which can exploit both local and global temporal dependencies. Experimental results on various real traffic datasets demonstrate the effectiveness of the proposed framework.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

99. Personalized Ranking with Importance Sampling.

Paper Link】 【Pages】:1093-1103

【Authors】: Defu Lian ; Qi Liu ; Enhong Chen

【Abstract】: As the task of predicting a personalized ranking on a set of items, item recommendation has become an important way to address information overload. Optimizing ranking loss aligns better with the ultimate goal of item recommendation, so many ranking-based methods were proposed for item recommendation, such as collaborative filtering with Bayesian Personalized Ranking (BPR) loss, and Weighted Approximate-Rank Pairwise (WARP) loss. However, the ranking-based methods cannot consistently beat regression-based models with the gravity regularizer. The key challenge in ranking-based optimization is the difficulty of fully using the limited number of negative samples, particularly when they are not very informative. To this end, we propose a new ranking loss based on importance sampling so that more informative negative samples can be better used. We then design a series of negative samplers, from simple to complex, that produce increasingly informative negative samples. With these samplers, the loss function is easy to use and can be optimized by popular solvers. The proposed algorithms are evaluated with five real-world datasets of varying size and difficulty. The results show that they consistently outperform the state-of-the-art item recommendation algorithms, and the relative improvements with respect to [email protected] are more than 19.2% on average. Moreover, the loss function is verified to make better use of negative samples and to require fewer negative samples when they are more informative.

【Keywords】: Information systems; Information retrieval; Retrieval tasks and goals; Document filtering; Information extraction

100. Generalizing Tensor Decomposition for N-ary Relational Knowledge Bases.

Paper Link】 【Pages】:1104-1114

【Authors】: Yu Liu ; Quanming Yao ; Yong Li

【Abstract】: With the rapid development of knowledge bases (KBs), the link prediction task, which completes KBs with missing facts, has been broadly studied, especially for binary relational KBs (a.k.a. knowledge graphs), with powerful tensor decomposition related methods. However, the ubiquitous n-ary relational KBs with higher-arity relational facts have received less attention, and the existing translation-based and neural-network-based approaches for them have weak expressiveness and high complexity in modeling various relations. Tensor decomposition has not been considered for n-ary relational KBs, while directly extending tensor decomposition related methods of binary relational KBs to the n-ary case does not yield satisfactory results due to exponential model complexity and their strong assumptions on binary relations. To generalize tensor decomposition for n-ary relational KBs, in this work, we propose GETD, a generalized model based on Tucker decomposition and Tensor Ring decomposition. The existing negative sampling technique is also generalized to the n-ary case for GETD. In addition, we theoretically prove that GETD is fully expressive, i.e., able to completely represent any KB. Extensive evaluations on two representative n-ary relational KB datasets demonstrate the superior performance of GETD, significantly improving over the state-of-the-art methods by more than 15%. Moreover, GETD further obtains state-of-the-art results on the benchmark binary relational KB datasets.

【Keywords】:
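The abstract above builds on Tucker decomposition, where a fact is scored by contracting a core tensor with one embedding per mode (the relation and each entity). The sketch below shows only that generic contraction for a core of arbitrary order, using nested lists; it is not GETD itself and omits the Tensor Ring factorization of the core. The function name and the nested-list layout are assumptions made for illustration.

```python
def tucker_score(core, vectors):
    # core: nested lists with one nesting level per vector in `vectors`.
    # vectors: embeddings ordered as [relation, entity_1, ..., entity_n].
    # Returns sum over all index tuples of core[i0][i1]... * v0[i0] * v1[i1] * ...
    if not vectors:
        return core  # all modes contracted; `core` is now a scalar
    first, rest = vectors[0], vectors[1:]
    return sum(weight * tucker_score(sub_core, rest)
               for weight, sub_core in zip(first, core))


# Binary fact: a 3rd-order core contracted with (relation, subject, object).
core = [[[0, 0], [0, 1]],
        [[0, 0], [0, 0]]]
score = tucker_score(core, [[2, 3], [5, 7], [11, 13]])
```

A fact with n entities needs a core of order n + 1, so the number of core parameters grows exponentially with arity; this is the complexity problem the abstract attributes to direct extensions of binary-relation methods.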

101. What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization.

Paper Link】 【Pages】:1115-1126

【Authors】: Caleb Belth ; Xinyi Zheng ; Jilles Vreeken ; Danai Koutra

【Abstract】: Knowledge graphs (KGs) store highly heterogeneous information about the world in the structure of a graph, and are useful for tasks such as question answering and reasoning. However, they often contain errors and are missing information. Vibrant research in KG refinement has worked to resolve these issues, tailoring techniques to either detect specific types of errors or complete a KG.

【Keywords】:

102. Intention Modeling from Ordered and Unordered Facets for Sequential Recommendation.

Paper Link】 【Pages】:1127-1137

【Authors】: Xueliang Guo ; Chongyang Shi ; Chuanming Liu

【Abstract】: Recently, sequential recommendation has attracted substantial attention from researchers due to its status as an essential service for e-commerce. Accurately understanding user intention is an important factor in improving the performance of recommendation systems. However, user intention is highly time-dependent and flexible, so it is very challenging to learn the latent dynamic intention of users for sequential recommendation. To this end, in this paper, we propose a novel intention modeling from ordered and unordered facets (IMfOU) for sequential recommendation. Specifically, the global and local item embedding (GLIE) we propose can comprehensively capture the sequential context information in the sequences and highlight the important features that users care about. We further design ordered preference drift learning (OPDL) and unordered purchase motivation learning (UPML) to capture the user’s preference drift process and purchase motivation, respectively. By combining the users’ dynamic preference and current motivation, the model considers not only sequential dependencies between items but also flexible dependencies, and models the user purchase intention more accurately from ordered and unordered facets. Evaluation results on three real-world datasets demonstrate that our proposed approach achieves better performance than the state-of-the-art sequential recommendation methods, improving AUC by an average of 2.26%.

【Keywords】:

103. Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog.

Paper Link】 【Pages】:1138-1148

【Authors】: Shen Gao ; Xiuying Chen ; Chang Liu ; Li Liu ; Dongyan Zhao ; Rui Yan

【Abstract】: Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting a sticker response by matching text labels of stickers with previous utterances. However, due to their large quantities, it is impractical to require text labels for all the stickers. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on multi-turn dialog context history without any external labels. Two main challenges are confronted in this task. One is to learn the semantic meaning of stickers without corresponding text labels. The other is to jointly model the candidate sticker with the multi-turn dialog context. To tackle these challenges, we propose a sticker response selector (SRS) model. Specifically, SRS first employs a convolution-based sticker image encoder and a self-attention-based multi-turn dialog encoder to obtain the representations of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependency between all interaction results with a fusion network to output the final matching score. To evaluate our proposed method, we collect a large-scale real-world dialog dataset with stickers from one of the most popular online chatting platforms. Extensive experiments conducted on this dataset show that our model achieves state-of-the-art performance on all commonly used metrics. Experiments also verify the effectiveness of each component of SRS. To facilitate further research in the sticker selection field, we release this dataset of 340K multi-turn dialog and sticker pairs.

【Keywords】:

104. Dynamic Graph Convolutional Networks for Entity Linking.

Paper Link】 【Pages】:1149-1159

【Authors】: Junshuang Wu ; Richong Zhang ; Yongyi Mao ; Hongyu Guo ; Masoumeh Soflaei ; Jinpeng Huai

【Abstract】: Entity linking, which maps named entity mentions in a document into the proper entities in a given knowledge graph, has been shown to be able to significantly benefit from modeling the entity relatedness through Graph Convolutional Networks (GCN). Nevertheless, existing GCN entity linking models fail to take into account the fact that the structured graph for a set of entities not only depends on the contextual information of the given document but also adaptively changes on different aggregation layers of the GCN, resulting in insufficiency in terms of capturing the structural information among entities. In this paper, we propose a dynamic GCN architecture to effectively cope with this challenge. The graph structure in our model is dynamically computed and modified during training. Through aggregating knowledge from dynamically linked nodes, our GCN model can collectively identify the entity mappings between the document and the knowledge graph, and efficiently capture the topical coherence among various entity mentions in the entire document. Empirical studies on benchmark entity linking data sets confirm the superior performance of our proposed strategy and the benefits of the dynamic graph structure.

【Keywords】:

105. Leading Conversational Search by Suggesting Useful Questions.

Paper Link】 【Pages】:1160-1170

【Authors】: Corbin Rosset ; Chenyan Xiong ; Xia Song ; Daniel Campos ; Nick Craswell ; Saurabh Tiwary ; Paul N. Bennett

【Abstract】: This paper studies a new scenario in conversational search, conversational question suggestion, which leads search engine users to more engaging experiences by suggesting interesting, informative, and useful follow-up questions. We first establish a novel evaluation metric, usefulness, which goes beyond relevance and measures whether the suggestions provide valuable information for the next step of a user’s journey, and construct a public benchmark for useful question suggestion. Then we develop two suggestion systems, a BERT based ranker and a GPT-2 based generator, both trained with novel weak supervision signals that convey past users’ search behaviors in search sessions. The weak supervision signals help ground the suggestions to users’ information-seeking trajectories: we identify more coherent and informative sessions using encodings, and then weakly supervise our models to imitate how users transition to the next state of search. Our offline experiments demonstrate the crucial role our “next-turn” inductive training plays in improving usefulness over a strong online system. Our online A/B test in Bing shows that our more useful question suggestions receive 8% more user clicks than the previous system.

【Keywords】:

106. De-Kodi: Understanding the Kodi Ecosystem.

Paper Link】 【Pages】:1171-1181

【Authors】: Marc Anthony Warrior ; Yunming Xiao ; Matteo Varvello ; Aleksandar Kuzmanovic

【Abstract】: Free and open source media centers are currently experiencing a boom in popularity for the convenience and flexibility they offer users seeking to remotely consume digital content. This newfound fame is matched by increasing notoriety—for their potential to serve as hubs for illegal content—and a presumably ever-increasing network footprint. It is fair to say that a complex ecosystem has developed around Kodi, composed of millions of users, thousands of “add-ons”—Kodi extensions from 3rd-party developers—and content providers. Motivated by these observations, this paper conducts the first analysis of the Kodi ecosystem. Our approach is to build “crawling” software around Kodi which can automatically install an addon, explore its menu, and locate (video) content. This is challenging for many reasons. First, Kodi largely relies on visual information and user input which intrinsically complicates automation. Second, no central aggregators for Kodi addons exist. Third, the potential sheer size of this ecosystem requires a highly scalable crawling solution. We address these challenges with de-Kodi, a full fledged crawling system capable of discovering and crawling large cross-sections of Kodi’s decentralized ecosystem. With de-Kodi, we discovered and tested over 9,000 distinct Kodi addons. Our results demonstrate de-Kodi, which we make available to the general public, to be an essential asset in studying one of the largest multimedia platforms in the world. Our work further serves as the first ever transparent and repeatable analysis of the Kodi ecosystem at large.

【Keywords】:

107. Attention Please: Your Attention Check Questions in Survey Studies Can Be Automatically Answered.

Paper Link】 【Pages】:1182-1193

【Authors】: Weiping Pei ; Arthur Mayer ; Kaylynn Tu ; Chuan Yue

【Abstract】: Attention check questions have become commonly used in online surveys published on popular crowdsourcing platforms as a key mechanism to filter out inattentive respondents and improve data quality. However, little research considers the vulnerabilities of this important quality control mechanism that can allow attackers including irresponsible and malicious respondents to automatically answer attention check questions for efficiently achieving their goals. In this paper, we perform the first study to investigate such vulnerabilities, and demonstrate that attackers can leverage deep learning techniques to pass attention check questions automatically. We propose AC-EasyPass, an attack framework with a concrete model, that combines convolutional neural network and weighted feature reconstruction to easily pass attention check questions. We construct the first attention check question dataset that consists of both original and augmented questions, and demonstrate the effectiveness of AC-EasyPass. We explore two simple defense methods, adding adversarial sentences and adding typos, for survey designers to mitigate the risks posed by AC-EasyPass; however, these methods are fragile due to their limitations from both technical and usability perspectives, underlining the challenging nature of defense. We hope our work will raise sufficient attention of the research community towards developing more robust attention check mechanisms. More broadly, our work intends to prompt the research community to seriously consider the emerging risks posed by the malicious use of machine learning techniques to the quality, validity, and trustworthiness of crowdsourcing and social computing.

【Keywords】:

108. FairRec: Two-Sided Fairness for Personalized Recommendations in Two-Sided Platforms.

Paper Link】 【Pages】:1194-1204

【Authors】: Gourab K. Patro ; Arpita Biswas ; Niloy Ganguly ; Krishna P. Gummadi ; Abhijnan Chakraborty

【Abstract】: We investigate the problem of fair recommendation in the context of two-sided online platforms, comprising customers on one side and producers on the other. Traditionally, recommendation services in these platforms have focused on maximizing customer satisfaction by tailoring the results according to the personalized preferences of individual customers. However, our investigation reveals that such customer-centric design may lead to unfair distribution of exposure among the producers, which may adversely impact their well-being. On the other hand, a producer-centric design might become unfair to the customers. Thus, we consider fairness issues that span both customers and producers. Our approach involves a novel mapping of the fair recommendation problem to a constrained version of the problem of fairly allocating indivisible goods. Our proposed FairRec algorithm guarantees at least Maximin Share (MMS) of exposure for most of the producers and Envy-Free up to One Good (EF1) fairness for every customer. Extensive evaluations over multiple real-world datasets show the effectiveness of FairRec in ensuring two-sided fairness while incurring a marginal loss in the overall recommendation quality.

【Keywords】:
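The abstract above maps fair recommendation to fair allocation of indivisible goods and cites the Envy-Free up to One Good (EF1) guarantee. As background for that notion, the sketch below shows plain round-robin allocation, a standard fair-division procedure that yields EF1 for additive, non-negative valuations, together with an EF1 check. It is not the FairRec algorithm and does not handle the producer-side MMS exposure guarantee; all names and the example valuations are illustrative assumptions.

```python
def round_robin(valuations, num_items):
    # valuations[i][g]: additive, non-negative value of agent i for item g.
    # Agents take turns picking their most valued remaining item
    # (ties are broken toward the lower item index).
    num_agents = len(valuations)
    remaining = set(range(num_items))
    bundles = [[] for _ in range(num_agents)]
    turn = 0
    while remaining:
        agent = turn % num_agents
        best = max(remaining,
                   key=lambda item: (valuations[agent][item], -item))
        bundles[agent].append(best)
        remaining.remove(best)
        turn += 1
    return bundles


def is_ef1(valuations, bundles):
    # EF1: agent i does not envy agent j once i's most valued item
    # is removed from j's bundle.
    for i, vals in enumerate(valuations):
        own_value = sum(vals[g] for g in bundles[i])
        for j, other in enumerate(bundles):
            if i == j or not other:
                continue
            other_value = sum(vals[g] for g in other)
            if own_value < other_value - max(vals[g] for g in other):
                return False
    return True


valuations = [[5, 4, 3, 2], [1, 9, 2, 8], [3, 3, 3, 3]]
bundles = round_robin(valuations, 4)
```

In a recommendation setting, the agents would correspond to customers and the values to relevance scores, which is one way to read the customer-side guarantee stated in the abstract.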

109. Complex Factoid Question Answering with a Free-Text Knowledge Graph.

Paper Link】 【Pages】:1205-1216

【Authors】: Chen Zhao ; Chenyan Xiong ; Xin Qian ; Jordan L. Boyd-Graber

【Abstract】: We introduce delft, a factoid question answering system which combines the nuance and depth of knowledge graph question answering approaches with the broader coverage of free-text. delft builds a free-text knowledge graph from Wikipedia, with entities as nodes and sentences in which entities co-occur as edges. For each question, delft finds the subgraph linking question entity nodes to candidates using text sentences as edges, creating a dense and high coverage semantic graph. A novel graph neural network reasons over the free-text graph—combining evidence on the nodes via information along edge sentences—to select a final answer. Experiments on three question answering datasets show delft can answer entity-rich questions better than machine reading based models, bert-based answer ranking and memory networks. delft’s advantage comes from both the high coverage of its free-text knowledge graph—more than double that of dbpedia relations—and the novel graph neural network which reasons on the rich but noisy free-text evidence.

【Keywords】:

110. Leveraging Sentiment Distributions to Distinguish Figurative From Literal Health Reports on Twitter.

Paper Link】 【Pages】:1217-1227

【Authors】: Rhys Biddle ; Aditya Joshi ; Shaowu Liu ; Cécile Paris ; Guandong Xu

【Abstract】: Harnessing data from social media to monitor health events is a promising avenue for public health surveillance. A key step is the detection of reports of a disease (referred to as ‘health mention classification’) amongst tweets that mention disease words. Prior work shows that figurative usage of disease words may prove to be challenging for health mention classification. Since the experience of a disease is associated with a negative sentiment, we present a method that utilises sentiment information to improve health mention classification. Specifically, our classifier for health mention classification combines pre-trained contextual word representations with sentiment distributions of words in the tweet. For our experiments, we extend a benchmark dataset of tweets for health mention classification, adding over 14k manually annotated tweets across diseases. We additionally annotate each tweet with a label that indicates whether the disease words are used in a figurative sense. Our classifier outperforms current SOTA approaches in detecting both health-related and figurative tweets that mention disease words. We also show that disease words in tweets are used figuratively more often than in a health-related context, which makes such tweets challenging for classifiers targeting health-related tweets.

【Keywords】:

111. Reputation Agent: Prompting Fair Reviews in Gig Markets.

Paper Link】 【Pages】:1228-1240

【Authors】: Carlos Toxtli ; Angela Richmond-Fuller ; Saiph Savage

【Abstract】: Our study presents a new tool, Reputation Agent, to promote fairer reviews from requesters (employers or customers) on gig markets. Unfair reviews, created when requesters consider factors outside of a worker’s control, are known to plague gig workers and can result in lost job opportunities and even termination from the marketplace. Our tool leverages machine learning to implement an intelligent interface that: (1) uses deep learning to automatically detect when an individual has included unfair factors into her review (factors outside the worker’s control per the policies of the market); and (2) prompts the individual to reconsider her review if she has incorporated unfair factors. To study the effectiveness of Reputation Agent, we conducted a controlled experiment over different gig markets. Our experiment illustrates that across markets, Reputation Agent, in contrast with traditional approaches, motivates requesters to review gig workers’ performance more fairly. We discuss how tools that bring more transparency to employers about the policies of a gig market can help build empathy thus resulting in reasoned discussions around potential injustices towards workers generated by these interfaces. Our vision is that with tools that promote truth and transparency we can bring fairer treatment to gig workers.

【Keywords】:

112. Becoming the Super Turker: Increasing Wages via a Strategy from High Earning Workers.

Paper Link】 【Pages】:1241-1252

【Authors】: Saiph Savage ; Chun-Wei Chiang ; Susumu Saito ; Carlos Toxtli ; Jeffrey P. Bigham

【Abstract】: Crowd markets have traditionally limited workers by not providing transparency information concerning which tasks pay fairly or which requesters are unreliable. Researchers believe that a key reason why crowd workers earn low wages is due to this lack of transparency. As a result, tools have been developed to provide more transparency within crowd markets to help workers. However, while most workers use these tools, they still earn less than minimum wage. We argue that the missing element is guidance on how to use transparency information. In this paper, we explore how novice workers can improve their earnings by following the transparency criteria of Super Turkers, i.e., crowd workers who earn higher salaries on Amazon Mechanical Turk (MTurk). We believe that Super Turkers have developed effective processes for using transparency information. Therefore, by having novices follow a Super Turker criteria (one that is simple and popular among Super Turkers), we can help novices increase their wages. For this purpose, we: (i) conducted a survey and data analysis to computationally identify a simple yet common criteria that Super Turkers use for handling transparency tools; (ii) deployed a two-week field experiment with novices who followed this Super Turker criteria to find better work on MTurk. Novices in our study viewed over 25,000 tasks by 1,394 requesters. We found that novices who utilized this Super Turkers’ criteria earned better wages than other novices. Our results highlight that tool development to support crowd workers should be paired with educational opportunities that teach workers how to effectively use the tools and their related metrics (e.g., transparency values). We finish with design recommendations for empowering crowd workers to earn higher salaries.

【Keywords】:

113. GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation.

Paper Link】 【Pages】:1253-1263

【Authors】: Nikhil Goyal ; Harsh Vardhan Jain ; Sayan Ranu

【Abstract】: Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being significantly better in quality across a comprehensive set of 11 different metrics. Our code is released at: https://github.com/idea-iitd/graphgen.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks; Information systems; Information systems applications; Data mining

114. A Category-Aware Deep Model for Successive POI Recommendation on Sparse Check-in Data.

Paper Link】 【Pages】:1264-1274

【Authors】: Fuqiang Yu ; Lizhen Cui ; Wei Guo ; Xudong Lu ; Qingzhong Li ; Hua Lu

【Abstract】: As considerable amounts of POI check-in data have been accumulated, successive point-of-interest (POI) recommendation is increasingly popular. Existing successive POI recommendation methods only predict where a user will go next, ignoring when this behavior will occur. In this work, we focus on predicting POIs that will be visited by users in the next 24 hours. As check-in data is very sparse, it is challenging to accurately capture user preferences in temporal patterns. To this end, we propose a category-aware deep model, CatDM, that incorporates POI category and geographical influence to reduce the search space and overcome data sparsity. We design two deep encoders based on LSTM to model the time series data. The first encoder captures user preferences in POI categories, whereas the second exploits user preferences in POIs. Considering clock influence in the second encoder, we divide each user’s check-in history into several different time windows and develop a personalized attention mechanism for each window to help CatDM exploit temporal patterns. Moreover, to sort the candidate set, we consider four specific dependencies: user-POI, user-category, POI-time and POI-user current preferences. Extensive experiments are conducted on two large real datasets. The experimental results demonstrate that our CatDM outperforms the state-of-the-art models for successive POI recommendation on sparse check-in data.

【Keywords】: Information systems; Information systems applications; Data mining

115. Beyond the Front Page: Measuring Third Party Dynamics in the Field.

Paper Link】 【Pages】:1275-1286

【Authors】: Tobias Urban ; Martin Degeling ; Thorsten Holz ; Norbert Pohlmann

【Abstract】: In the modern Web, service providers often rely heavily on third parties to run their services. For example, they make use of ad networks to finance their services, externally hosted libraries to develop features quickly, and analytics providers to gain insights into visitor behavior.

【Keywords】:

116. Friend or Faux: Graph-Based Early Detection of Fake Accounts on Social Networks.

Paper Link】 【Pages】:1287-1297

【Authors】: Adam Breuer ; Roee Eilat ; Udi Weinsberg

【Abstract】: In this paper, we study the problem of early detection of fake user accounts on social networks based solely on their network connectivity with other users. Removing such accounts is a core task for maintaining the integrity of social networks, and early detection helps to reduce the harm that such accounts inflict. However, new fake accounts are notoriously difficult to detect via graph-based algorithms, as their small number of connections are unlikely to reflect a significant structural difference from those of new real accounts. We present the SybilEdge algorithm, which determines whether a new user is a fake account (‘sybil’) by aggregating over (I) her choices of friend request targets and (II) these targets’ respective responses. SybilEdge performs this aggregation giving more weight to a user’s choices of targets to the extent that these targets are preferred by other fakes versus real users, and also to the extent that these targets respond differently to fakes versus real users. We show that SybilEdge rapidly detects new fake users at scale on the Facebook network and outperforms state-of-the-art algorithms. We also show that SybilEdge is robust to label noise in the training data, to different prevalences of fake accounts in the network, and to several different ways fakes can select targets for their friend requests. To our knowledge, this is the first time a graph-based algorithm has been shown to achieve high performance (AUC > 0.9) on new users who have only sent a small number of friend requests.

【Keywords】:

117. Novel Entity Discovery from Web Tables.

Paper Link】 【Pages】:1298-1308

【Authors】: Shuo Zhang ; Edgar Meij ; Krisztian Balog ; Ridho Reinanda

【Abstract】: When working with any sort of knowledge base (KB) one has to make sure it is as complete and also as up-to-date as possible. Both tasks are non-trivial as they require recall-oriented efforts to determine which entities and relationships are missing from the KB. As such they require a significant amount of labor. Tables on the Web, on the other hand, are abundant and have the distinct potential to assist with these tasks. In particular, we can leverage the content in such tables to discover new entities, properties, and relationships. Because web tables typically only contain raw textual content, we first need to determine which cells refer to which known entities, a task we dub table-to-KB matching. This first task aims to infer table semantics by linking table cells and heading columns to elements of a KB. We propose a feature-based method and on two public test collections we demonstrate substantial improvements over the state-of-the-art in terms of precision whilst also improving recall. The second task builds upon these linked entities and properties to not only identify novel ones in the same table but also to bootstrap their type and additional relationships. We refer to this process as novel entity discovery and, to the best of our knowledge, it is the first endeavor on mining the unlinked cells in web tables. Our method identifies not only out-of-KB (“novel”) information but also novel aliases for in-KB (“known”) entities. When evaluated using three purpose-built test collections, we find that our proposed approaches obtain a marked improvement in terms of precision over our baselines whilst keeping recall stable.

【Keywords】: Information systems; Information retrieval

118. Abstractive Snippet Generation.

Paper Link】 【Pages】:1309-1319

【Authors】: Wei-Fan Chen ; Shahbaz Syed ; Benno Stein ; Matthias Hagen ; Martin Potthast

【Abstract】: An abstractive snippet is an originally created piece of text to summarize a web page on a search engine results page. Compared to the conventional extractive snippets, which are generated by extracting phrases and sentences verbatim from a web page, abstractive snippets circumvent copyright issues; even more interesting is the fact that they open the door for personalization. Abstractive snippets have been evaluated as equally powerful in terms of user acceptance and expressiveness—but the key question remains: Can abstractive snippets be automatically generated with sufficient quality?

【Keywords】:

Paper Link】 【Pages】:1320-1331

【Authors】: Benjamin Eriksson ; Andrei Sabelfeld

【Abstract】: Undesired navigation in browsers powers a significant class of attacks on web applications. In a move to mitigate risks associated with undesired navigation, the security community has proposed a standard that gives control to web pages to restrict navigation. The standard draft introduces a new navigate-to directive of the Content Security Policy (CSP). The directive is currently being implemented by mainstream browsers. This paper is a first evaluation of navigate-to, focusing on security, performance, and automatization of navigation policies. We present new vulnerabilities introduced by the directive into the web ecosystem, opening up for attacks such as probing to detect if users are logged in to other websites or have active shopping carts, bypassing third-party cookie blocking, exfiltrating secrets, as well as leaking browsing history. Unfortunately, the directive triggers vulnerabilities even in websites that do not use the directive in their policies. We identify both specification- and implementation-level vulnerabilities and propose countermeasures to mitigate both. To aid developers in configuring navigation policies, we develop and implement AutoNav, an automated black-box mechanism to infer navigation policies. AutoNav leverages the benefits of origin-wide policies in order to improve security without degrading performance. We evaluate the viability of navigate-to and AutoNav by an empirical study on Alexa’s top 10,000 websites.

【Keywords】:

120. Learning Contextualized Document Representations for Healthcare Answer Retrieval.

Paper Link】 【Pages】:1332-1343

【Authors】: Sebastian Arnold ; Betty van Aken ; Paul Grundmann ; Felix A. Gers ; Alexander Löser

【Abstract】: We present Contextual Discourse Vectors (CDV), a distributed document representation for efficient answer retrieval from long healthcare documents. Our approach is based on structured query tuples of entities and aspects from free text and medical taxonomies. Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse. We use our continuous representations to resolve queries with short latency using approximate nearest neighbor search on sentence level. We apply the CDV model for retrieving coherent answer passages from nine English public health resources from the Web, addressing both patients and medical professionals. Because there is no end-to-end training data available for all application scenarios, we train our model with self-supervised data from Wikipedia. We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking and is able to adapt to heterogeneous domains without additional fine-tuning.

【Keywords】:

121. Broccoli: Sprinkling Lightweight Vocabulary Learning into Everyday Information Diets.

Paper Link】 【Pages】:1344-1354

【Authors】: Roland Aydin ; Lars Klein ; Arnaud Miribel ; Robert West

【Abstract】: The learning of a new language remains to this date a cognitive task that requires considerable diligence and willpower, recent advances and tools notwithstanding. In this paper, we propose Broccoli, a new paradigm aimed at reducing the required effort by seamlessly embedding vocabulary learning into users’ everyday information diets. This is achieved by inconspicuously switching chosen words encountered by the user for their translation in the target language. Thus, by seeing words in context, the user can assimilate new vocabulary without much conscious effort. We validate our approach in a careful user study, finding that the efficacy of the lightweight Broccoli approach is competitive with traditional, memorization-based vocabulary learning. The low cognitive overhead is manifested in a pronounced decrease in learners’ usage of mnemonic learning strategies, as compared to traditional learning. Finally, we establish that language patterns in typical information diets are compatible with spaced-repetition strategies, thus enabling an efficient use of the Broccoli paradigm. Overall, our work establishes the feasibility of a novel and powerful “install-and-forget” approach for embedded language acquisition.

【Keywords】:

122. What is the Human Mobility in a New City: Transfer Mobility Knowledge Across Cities.

Paper Link】 【Pages】:1355-1365

【Authors】: Tianfu He ; Jie Bao ; Ruiyuan Li ; Sijie Ruan ; Yanhua Li ; Li Song ; Hui He ; Yu Zheng

【Abstract】: With the advances of web-of-things, human mobility data, e.g., GPS trajectories of vehicles, sharing bikes, and mobile devices, reflects people’s travel patterns and preferences, which are especially crucial for urban applications such as urban planning and business location selection. However, collecting a large set of human mobility data is not easy because of privacy and commercial concerns, as well as the high cost of deploying sensors and the long time needed to collect the data, especially in newly developed cities. Realizing this, in this paper, based on the intuition that human mobility is driven by the mobility intentions reflected by the origin and destination (or OD) features, as well as the preference in selecting the path between them, we investigate the problem of generating mobility data for a new target city by transferring knowledge from mobility data and multi-source data of the source cities. Our framework contains three main stages: 1) mobility intention transfer, which learns a latent unified mobility intention distribution across the source cities, and transfers the model of the distribution to the target city; 2) OD generation, which generates the OD pairs in the target city based on the transferred mobility intention model, and 3) path generation, which generates the paths for each OD pair, based on a utility model learned from the real trajectory data in the source cities. Also, a demo of our trajectory generator is publicly available online for two city regions. Extensive experimental results over four regions in China validate the effectiveness of the proposed solution. Besides, an on-field case study is presented in a newly developed region, i.e., Xiongan, China. With the generated trajectories in the new city, many trajectory mining techniques can be applied.

【Keywords】: Information systems; Information systems applications; Data mining; Spatial-temporal systems

Paper Link】 【Pages】:1366-1377

【Authors】: Brit Youngmann ; Elad Yom-Tov ; Ran Gilad-Bachrach ; Danny Karmon

【Abstract】: Search advertising is one of the most commonly used methods of advertising. Past work has shown that search advertising can be employed to improve health by eliciting positive behavioral change. However, writing effective advertisements requires expertise and (possibly expensive) experimentation, both of which may not be available to public health authorities wishing to elicit such behavioral changes, especially when dealing with public health crises such as epidemic outbreaks.

【Keywords】:

124. Finding large balanced subgraphs in signed networks.

Paper Link】 【Pages】:1378-1388

【Authors】: Bruno Ordozgoiti ; Antonis Matakos ; Aristides Gionis

【Abstract】: Signed networks are graphs whose edges are labelled with either a positive or a negative sign, and can be used to capture nuances in interactions that are missed by their unsigned counterparts. The concept of balance in signed graph theory determines whether a network can be partitioned into two perfectly opposing subsets, and is therefore useful for modelling phenomena such as the existence of polarized communities in social networks. While determining whether a graph is balanced is easy, finding a large balanced subgraph is hard. The few heuristics available in the literature for this purpose are either ineffective or non-scalable. In this paper we propose an efficient algorithm for finding large balanced subgraphs in signed networks. The algorithm relies on signed spectral theory and a novel bound for perturbations of the graph Laplacian. In a wide variety of experiments on real-world data we show that our algorithm can find balanced subgraphs much larger than those detected by existing methods, and in addition, it is faster. We test its scalability on graphs of up to 34 million edges.

【Keywords】: Information systems; Information systems applications; Data mining; Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms
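
The abstract above notes that deciding whether a signed graph is balanced is easy. As a minimal illustration of that easy direction only (not the paper's algorithm for finding large balanced subgraphs), the sketch below tries to split the nodes into two sides so that positive edges stay within a side and negative edges cross sides. The function name `is_balanced` and the `(u, v, sign)` edge format are assumptions made for this example.

```python
from collections import deque


def is_balanced(num_nodes, signed_edges):
    """Return True if the signed graph is balanced.

    signed_edges is a list of (u, v, sign) with sign = +1 or -1
    (a hypothetical input format chosen for this sketch).
    """
    adjacency = [[] for _ in range(num_nodes)]
    for u, v, sign in signed_edges:
        adjacency[u].append((v, sign))
        adjacency[v].append((u, sign))

    side = [None] * num_nodes  # side[i] is 0 or 1 once node i is assigned
    for start in range(num_nodes):
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, sign in adjacency[u]:
                # Positive edge: same side. Negative edge: opposite side.
                expected = side[u] if sign > 0 else 1 - side[u]
                if side[v] is None:
                    side[v] = expected
                    queue.append(v)
                elif side[v] != expected:
                    return False  # a cycle with an odd number of negative edges
    return True
```

This breadth-first two-colouring runs in time linear in the number of nodes and edges, which is why the balance test itself is cheap while the search for a large balanced subgraph is the hard part.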

125. Modeling Heterogeneous Statistical Patterns in High-dimensional Data by Adversarial Distributions: An Unsupervised Generative Framework.

Paper Link】 【Pages】:1389-1399

【Authors】: Han Zhang ; Wenhao Zheng ; Charley Chen ; Kevin Gao ; Yao Hu ; Ling Huang ; Wei Xu

【Abstract】: Since label collection is prohibitive and time-consuming, unsupervised methods are preferred in applications such as fraud detection. Meanwhile, such applications usually require modeling the intrinsic clusters in high-dimensional data, which usually displays heterogeneous statistical patterns as the patterns of different clusters may appear in different dimensions. Existing methods propose to model the data clusters on selected dimensions, yet globally omitting any dimension may damage the pattern of certain clusters. To address the above issues, we propose a novel unsupervised generative framework called FIRD, which utilizes adversarial distributions to fit and disentangle the heterogeneous statistical patterns. When applied to discrete spaces, FIRD effectively distinguishes the synchronized fraudsters from normal users. Besides, FIRD also provides superior performance on anomaly detection datasets compared with SOTA anomaly detection methods (over 5% average AUC improvement). The significant experimental results on various datasets verify that the proposed method can better model the heterogeneous statistical patterns in high-dimensional data and benefit downstream applications.

【Keywords】: Computing methodologies; Machine learning; Information systems; Information systems applications; Data mining

126. Structural Deep Clustering Network.

Paper Link】 【Pages】:1400-1410

【Authors】: Deyu Bo ; Xiao Wang ; Chuan Shi ; Meiqi Zhu ; Emiao Lu ; Peng Cui

【Abstract】: Clustering is a fundamental task in data analysis. Recently, deep clustering, which derives inspiration primarily from deep learning approaches, has achieved state-of-the-art performance and attracted considerable attention. Current deep clustering methods usually boost the clustering results by means of the powerful representation ability of deep learning, e.g., autoencoder, suggesting that learning an effective representation for clustering is a crucial requirement. The strength of deep clustering methods is to extract the useful representations from the data itself, rather than the structure of data, which receives scarce attention in representation learning. Motivated by the great success of Graph Convolutional Network (GCN) in encoding the graph structure, we propose a Structural Deep Clustering Network (SDCN) to integrate the structural information into deep clustering. Specifically, we design a delivery operator to transfer the representations learned by autoencoder to the corresponding GCN layer, and a dual self-supervised mechanism to unify these two different deep neural architectures and guide the update of the whole model. In this way, the multiple structures of data, from low-order to high-order, are naturally combined with the multiple representations learned by autoencoder. Furthermore, we theoretically analyze the delivery operator, i.e., with the delivery operator, GCN improves the autoencoder-specific representation as a high-order graph regularization constraint and autoencoder helps alleviate the over-smoothing problem in GCN. Through comprehensive experiments, we demonstrate that our proposed model can consistently perform better than the state-of-the-art techniques.

【Keywords】: Computing methodologies; Machine learning; Learning paradigms; Unsupervised learning; Cluster analysis; Machine learning approaches; Neural networks

127. Traveling the token world: A graph analysis of Ethereum ERC20 token ecosystem.

Paper Link】 【Pages】:1411-1421

【Authors】: Weili Chen ; Tuo Zhang ; Zhiguang Chen ; Zibin Zheng ; Yutong Lu

【Abstract】: The birth of Bitcoin ushered in the era of cryptocurrency, which has now become a financial market that has attracted extensive attention worldwide. The phenomenon of startups launching Initial Coin Offerings (ICOs) to raise capital led to thousands of tokens being distributed on blockchains. Many studies have analyzed this phenomenon from an economic perspective. However, little is known about the characteristics of participants in the ecosystem. To fill this gap, and considering that over 80% of ICOs are launched based on ERC20 tokens on Ethereum, in this paper we conduct a systematic investigation of the whole Ethereum ERC20 token ecosystem to characterize the token creator, holder, and transfer activity. By downloading the whole blockchain and parsing the transaction records and event logs, we construct three graphs, namely token creator graph, token holder graph, and token transfer graph. We obtain many observations and findings by analyzing these graphs. Besides, we propose an algorithm to discover potential relationships between tokens and other accounts. The reported case shows that our algorithm can effectively reveal entities and the complex relationship between various accounts in the token ecosystem.

【Keywords】:

128. Towards IP-based Geolocation via Fine-grained and Stable Webcam Landmarks.

Paper Link】 【Pages】:1422-1432

【Authors】: Zhihao Wang ; Qiang Li ; Jinke Song ; Haining Wang ; Limin Sun

【Abstract】: IP-based geolocation is essential for various location-aware Internet applications, such as online advertisement, content delivery, and online fraud prevention. Achieving accurate geolocation relies heavily on the number of high-quality (i.e., fine-grained and stable over time) landmarks. However, previous efforts to gather landmarks have been impeded by the limited number of visible landmarks on the Internet and the manual time cost. In this paper, we leverage the availability of numerous online webcams that are used to monitor physical surroundings as a rich source of promising high-quality landmarks for serving IP-based geolocation. In particular, we present a new framework called GeoCAM, which is designed to automatically generate qualified landmarks from online webcams, providing IP-based geolocation services with high accuracy and wide coverage. GeoCAM periodically monitors websites that are hosting live webcams and uses natural language processing techniques to extract the IP addresses and latitude/longitude of webcams for generating landmarks at large scale. We develop a prototype of GeoCAM and conduct real-world experiments for validating its efficacy. Our results show that GeoCAM can detect 282,902 live webcams hosted in webpages with 94.2% precision and 90.4% recall, and then generate 16,863 stable and fine-grained landmarks, which are two orders of magnitude more than the landmarks used in prior works. Thus, by correlating a large scale of landmarks, GeoCAM is able to provide a geolocation service with high accuracy and wide coverage.

【Keywords】:

129. Keeping out the Masses: Understanding the Popularity and Implications of Internet Paywalls.

Paper Link】 【Pages】:1433-1444

【Authors】: Panagiotis Papadopoulos ; Peter Snyder ; Dimitrios Athanasakis ; Benjamin Livshits

【Abstract】: Funding the production of quality online content is a pressing problem for content producers. The most common funding method, online advertising, is rife with well-known performance and privacy harms, and an intractable subject-agent conflict: many users do not want to see advertisements, depriving the site of needed funding.

【Keywords】:

130. Discovering Mathematical Objects of Interest - A Study of Mathematical Notations.

Paper Link】 【Pages】:1445-1456

【Authors】: André Greiner-Petter ; Moritz Schubotz ; Fabian Müller ; Corinna Breitinger ; Howard S. Cohl ; Akiko Aizawa ; Bela Gipp

【Abstract】: Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today’s systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use-cases, for example, to assist semantic extraction systems, to improve scientific search engines, and to facilitate specialized math recommendation systems.

【Keywords】: Information systems; Information retrieval

131. Unsupervised Domain Adaptive Graph Convolutional Networks.

Paper Link】 【Pages】:1457-1467

【Authors】: Man Wu ; Shirui Pan ; Chuan Zhou ; Xiaojun Chang ; Xingquan Zhu

【Abstract】: Graph convolutional networks (GCNs) have achieved impressive success in many graph related analytics tasks. However, most GCNs only work in a single domain (graph) incapable of transferring knowledge from/to other domains (graphs), due to the challenges in both graph representation learning and domain adaptation over graph structures. In this paper, we present a novel approach, unsupervised domain adaptive graph convolutional networks (UDA-GCN), for domain adaptation learning for graphs. To enable effective graph representation learning, we first develop a dual graph convolutional network component, which jointly exploits local and global consistency for feature aggregation. An attention mechanism is further used to produce a unified representation for each node in different graphs. To facilitate knowledge transfer between graphs, we propose a domain adaptive learning module to optimize three different loss functions, namely source classifier loss, domain classifier loss, and target classifier loss as a whole, thus our model can differentiate class labels in the source domain, samples from different domains, the class labels from the target domain, respectively. Experimental results on real-world datasets in the node classification task validate the performance of our method, compared to state-of-the-art graph neural network algorithms.

【Keywords】:

132. Query-Efficient Correlation Clustering.

Paper Link】 【Pages】:1468-1478

【Authors】: David García-Soriano ; Konstantin Kutzkov ; Francesco Bonchi ; Charalampos E. Tsourakakis

【Abstract】: Correlation clustering is arguably the most natural formulation of clustering. Given n objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters.

【Keywords】:
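
The objective described in the abstract above can be made concrete with a small sketch. It assumes the simplest setting, in which the pairwise similarity measure is binary (each pair is either similar or dissimilar), and counts the pairs on which a clustering disagrees with it. The function name `disagreements` and the input format are assumptions for this illustration and are not taken from the paper; the sketch also does not model the query-efficiency aspect named in the title.

```python
from itertools import combinations


def disagreements(labels, similar_pairs):
    """Count pairs on which a clustering disagrees with a binary similarity.

    labels[i] is the cluster id of object i; similar_pairs is a set of
    (i, j) tuples with i < j marking the similar pairs (hypothetical format).
    """
    cost = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = labels[i] == labels[j]
        is_similar = (i, j) in similar_pairs
        # A disagreement is a similar pair that is split across clusters,
        # or a dissimilar pair that is placed in the same cluster.
        if same_cluster != is_similar:
            cost += 1
    return cost
```

Under this assumption, a clustering is better the fewer disagreements it has; for three objects where only objects 0 and 1 are similar, the clustering `[0, 0, 1]` has zero disagreements, while placing all three in one cluster has two.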

133. Stop tracking me Bro! Differential Tracking of User Demographics on Hyper-Partisan Websites.

Paper Link】 【Pages】:1479-1490

【Authors】: Pushkal Agarwal ; Sagar Joglekar ; Panagiotis Papadopoulos ; Nishanth Sastry ; Nicolas Kourtellis

【Abstract】: Websites with hyper-partisan, left or right-leaning focus offer content that is typically biased towards the expectations of their target audience. Such content often polarizes users, who are repeatedly primed to specific (extreme) content, usually reflecting hard party lines on political and socio-economic topics. Though this polarization has been extensively studied with respect to content, it is still unknown how it associates with the online tracking experienced by browsing users, especially when they exhibit certain demographic characteristics. For example, it is unclear how such websites enable the ad-ecosystem to track users based on their gender or age.

【Keywords】:

134. How Do We Create a Fantabulous Password?

Paper Link】 【Pages】:1491-1501

【Authors】: Simon S. Woo

【Abstract】: Although pronounceability can improve password memorability, most existing password generation approaches have not properly integrated the pronounceability of passwords in their designs. In this work, we demonstrate several shortfalls of current pronounceable password generation approaches, and then propose, ProSemPass, a new method of generating passwords that are pronounceable and semantically meaningful. In our approach, users supply initial input words and our system improves the pronounceability and meaning of the user-provided words by automatically creating a portmanteau. To measure the strength of our approach, we use attacker models, where attackers have complete knowledge of our password generation algorithms. We measure strength in guess numbers and compare those with other existing password generation approaches. Using a large-scale IRB-approved user study with 1,563 Amazon MTurkers over 9 different conditions, our approach achieves a 30% higher recall than those from current pronounceable password approaches, and is stronger than the offline guessing attack limit.

【Keywords】: Security and privacy; Security services; Authentication

135. What Changed Your Mind: The Roles of Dynamic Topics and Discourse in Argumentation Process.

Paper Link】 【Pages】:1502-1513

【Authors】: Jichuan Zeng ; Jing Li ; Yulan He ; Cuiyun Gao ; Michael R. Lyu ; Irwin King

【Abstract】: In a world full of uncertainty, debates and argumentation contribute to the progress of science and society. Despite the increasing attention to characterizing human arguments, most progress made so far focuses on the debate outcome, largely ignoring the dynamic patterns in argumentation processes. This paper presents a study that automatically analyzes the key factors in argument persuasiveness, beyond simply predicting who will persuade whom. Specifically, we propose a novel neural model that is able to dynamically track the changes of latent topics and discourse in argumentative conversations, allowing the investigation of their roles in influencing the outcomes of persuasion. Extensive experiments have been conducted on argumentative conversations from both social media and the supreme court. The results show that our model outperforms state-of-the-art models in identifying persuasive arguments via explicitly exploring dynamic factors of topic and discourse. We further analyze the effects of topics and discourse on persuasiveness, and find that they are both useful: topics provide concrete evidence while superior discourse styles may bias participants, especially in social media arguments. In addition, we draw some findings from our empirical results, which will help people better engage in future persuasive conversations.

【Keywords】:

136. Ten Social Dimensions of Conversations and Relationships.

Paper Link】 【Pages】:1514-1525

【Authors】: Minje Choi ; Luca Maria Aiello ; Krisztián Zsolt Varga ; Daniele Quercia

【Abstract】: Decades of social science research identified ten fundamental dimensions that provide the conceptual building blocks to describe the nature of human relationships. Yet, it is not clear to what extent these concepts are expressed in everyday language and what role they have in shaping observable dynamics of social interactions. After annotating conversational text through crowdsourcing, we trained NLP tools to detect the presence of these types of interaction from conversations, and applied them to 160M messages written by geo-referenced Reddit users, 290k emails from the Enron corpus and 300k lines of dialogue from movie scripts. We show that social dimensions can be predicted purely from conversations with an AUC up to 0.98, and that the combination of the predicted dimensions suggests both the types of relationships people entertain (conflict vs. support) and the types of real-world communities (wealthy vs. deprived) they shape.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing

137. Social Interactions or Business Transactions? What customer reviews disclose about Airbnb marketplace.

Paper Link】 【Pages】:1526-1536

【Authors】: Giovanni Quattrone ; Antonino Nocera ; Licia Capra ; Daniele Quercia

【Abstract】: Airbnb is one of the most successful examples of sharing economy marketplaces. With rapid and global market penetration, understanding its attractiveness and evolving growth opportunities is key to plan business decision making. There is an ongoing debate, for example, about whether Airbnb is a hospitality service that fosters social exchanges between hosts and guests, as the sharing economy manifesto originally stated, or whether it is (or is evolving into being) a purely business transaction platform, the way hotels have traditionally operated. To answer these questions, we propose a novel market analysis approach that exploits customers’ reviews. Key to the approach is a method that combines thematic analysis and machine learning to inductively develop a custom dictionary for guests’ reviews. Based on this dictionary, we then use quantitative linguistic analysis on a corpus of 3.2 million reviews collected in 6 different cities, and illustrate how to answer a variety of market research questions, at fine levels of temporal, thematic, user and spatial granularity, such as (i) how the business vs social dichotomy is evolving over the years, (ii) what exact words within such top-level categories are evolving, (iii) whether such trends vary across different user segments and (iv) in different neighbourhoods.

【Keywords】:

138. Correcting Knowledge Base Assertions.

Paper Link】 【Pages】:1537-1547

【Authors】: Jiaoyan Chen ; Xi Chen ; Ian Horrocks ; Erik B. Myklebust ; Ernesto Jiménez-Ruiz

【Abstract】: The usefulness and usability of knowledge bases (KBs) is often limited by quality issues. One common issue is the presence of erroneous assertions, often caused by lexical or semantic confusion. We study the problem of correcting such assertions, and present a general correction framework which combines lexical matching, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using DBpedia and an enterprise medical KB.

【Keywords】:

139. The POLAR Framework: Polar Opposites Enable Interpretability of Pre-Trained Word Embeddings.

Paper Link】 【Pages】:1548-1558

【Authors】: Binny Mathew ; Sandipan Sikdar ; Florian Lemmerich ; Markus Strohmaier

【Abstract】: We introduce ‘POLAR’ — a framework that adds interpretability to pre-trained word embeddings via the adoption of semantic differentials. Semantic differentials are a psychometric construct for measuring the semantics of a word by analysing its position on a scale between two polar opposites (e.g., cold – hot, soft – hard). The core idea of our approach is to transform existing, pre-trained word embeddings via semantic differentials to a new “polar” space with interpretable dimensions defined by such polar opposites. Our framework also allows for selecting the most discriminative dimensions from a set of polar dimensions provided by an oracle, i.e., an external source. We demonstrate the effectiveness of our framework by deploying it to various downstream tasks, in which our interpretable word embeddings achieve a performance that is comparable to the original word embeddings. We also show that the interpretable dimensions selected by our framework align with human judgement. Together, these results demonstrate that interpretability can be added to word embeddings without compromising performance. Our work is relevant for researchers and engineers interested in interpreting pre-trained word embeddings.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing
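
The core idea in the POLAR abstract (re-expressing a word vector along axes defined by polar opposites) can be illustrated with a minimal sketch. It assumes a simplified variant in which each interpretable dimension is the projection of the word vector onto the unit direction between two antonym vectors; the paper's actual transformation may differ. The 3-dimensional toy embeddings and the function names `polar_direction` and `to_polar_space` are made up for illustration.

```python
import math


def polar_direction(pos_vec, neg_vec):
    """Unit vector pointing from the negative pole to the positive pole."""
    diff = [p - n for p, n in zip(pos_vec, neg_vec)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def to_polar_space(word_vec, directions):
    """Project a word vector onto each polar direction (one score per axis)."""
    return [sum(w * d for w, d in zip(word_vec, direction))
            for direction in directions]


# Toy embeddings (hypothetical values, not taken from any real model).
toy = {
    "cold": [-1.0, 0.0, 0.0],
    "hot": [1.0, 0.0, 0.0],
    "soft": [0.0, -1.0, 0.0],
    "hard": [0.0, 1.0, 0.0],
    "ice": [-0.8, 0.6, 0.1],
}

# Axis 0 runs cold -> hot, axis 1 runs soft -> hard.
directions = [
    polar_direction(toy["hot"], toy["cold"]),
    polar_direction(toy["hard"], toy["soft"]),
]
ice_polar = to_polar_space(toy["ice"], directions)
```

In this toy setup, "ice" receives a negative score on the cold-hot axis and a positive score on the soft-hard axis, so each coordinate can be read directly as a position between two named opposites.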

140. Designing Fairly Fair Classifiers Via Economic Fairness Notions.

Paper Link】 【Pages】:1559-1569

【Authors】: Safwan Hossain ; Andjela Mladenovic ; Nisarg Shah

【Abstract】: The past decade has witnessed a rapid growth of research on fairness in machine learning. In contrast, fairness has been formally studied for almost a century in microeconomics in the context of resource allocation, during which many general-purpose notions of fairness have been proposed. This paper explores the applicability of two such notions, envy-freeness and equitability, in machine learning. We propose novel relaxations of these fairness notions which apply to groups rather than individuals, and are compelling in a broad range of settings. Our approach provides a unifying framework by incorporating several recently proposed fairness definitions as special cases. We provide generalization bounds for our approach, and theoretically and experimentally evaluate the tradeoff between loss minimization and our fairness guarantees.

【Keywords】:

141. Stable Model Semantics for Recursive SHACL.

Paper Link】 【Pages】:1570-1580

【Authors】: Medina Andresel ; Julien Corman ; Magdalena Ortiz ; Juan L. Reutter ; Ognjen Savkovic ; Mantas Simkus

【Abstract】: SHACL (SHape Constraint Language) is a W3C recommendation for validating graph-based data against a set of constraints (called shapes). Importantly, SHACL allows defining recursive shapes, i.e. a shape may refer to itself, directly or indirectly. The recommendation left open the semantics of recursive shapes, but proposals have emerged recently to extend the official semantics to support recursion. These proposals are based on the principle of possibility (or non-contradiction): a graph is considered valid against a schema if one can assign shapes to nodes in such a way that all constraints are satisfied. This semantics is not constructive, as it does not provide guidelines about how to obtain such an assignment, and it may lead to unfounded assignments, where the only reason to assign a shape to a node is that it allows validating the graph.

【Keywords】: Computing methodologies; Artificial intelligence; Knowledge representation and reasoning; Logic programming and answer set programming; Theory of computation; Logic; Constraint and logic programming

142. Task-Oriented Genetic Activation for Large-Scale Complex Heterogeneous Graph Embedding.

Paper Link】 【Pages】:1581-1591

【Authors】: Zhuoren Jiang ; Zheng Gao ; Jinjiong Lan ; Hongxia Yang ; Yao Lu ; Xiaozhong Liu

【Abstract】: The recent success of deep graph embedding has advanced the methods used to characterize graph information. However, in real-world applications, such methods still struggle with the challenges of heterogeneity, scalability, and multiplexity. To address these challenges, in this study, we propose a novel solution, Genetic hEterogeneous gRaph eMbedding (GERM), which enables flexible and efficient task-driven vertex embedding in a complex heterogeneous graph. Unlike prior efforts in this line of work, we employ a task-oriented genetic activation strategy to efficiently generate the “Edge Type Activated Vector” (ETAV) over the edge types in the graph. The generated ETAV can not only reduce the incompatible noise and navigate the heterogeneous graph random walk at the graph-schema level, but also activate an optimized subgraph for efficient representation learning. By revealing the correlation between the graph structure and task information, the model interpretability can be enhanced as well. Meanwhile, an activated heterogeneous skip-gram framework is proposed to encapsulate both topological and task-specific information of a given heterogeneous graph. Through extensive experiments on both scholarly and e-commerce datasets, we demonstrate the efficacy and scalability of the proposed methods via various search/recommendation tasks. Compared with baselines, GERM significantly reduces the running time and removes expert intervention without sacrificing performance (and in some cases modestly improves it).

【Keywords】:

143. Factoring Fact-Checks: Structured Information Extraction from Fact-Checking Articles.

Paper Link】 【Pages】:1592-1603

【Authors】: Shan Jiang ; Simon Baumgartner ; Abe Ittycheriah ; Cong Yu

【Abstract】: Fact-checking, which investigates claims made in public to arrive at a verdict supported by evidence and logical reasoning, has long been a significant form of journalism to combat misinformation in the news ecosystem. Most of the fact-checks share common structured information (called factors) such as claim, claimant, and verdict. In recent years, the emergence of ClaimReview as the standard schema for annotating those factors within fact-checking articles has led to wide adoption of fact-checking features by online platforms (e.g., Google, Bing). However, annotating fact-checks is a tedious process for fact-checkers and distracts them from their core job of investigating claims. As a result, less than half of the fact-checkers worldwide have adopted ClaimReview as of mid-2019. In this paper, we propose the task of factoring fact-checks for automatically extracting structured information from fact-checking articles. Exploring a public dataset of fact-checks, we empirically show that factoring fact-checks is a challenging task, especially for fact-checkers that are under-represented in the existing dataset. We then formulate the task as a sequence tagging problem and fine-tune the pre-trained BERT models with a modification made from our observations to approach the problem. Through extensive experiments, we demonstrate the performance of our models for well-known fact-checkers and promising initial results for under-represented fact-checkers.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing

144. Keywords Generation Improves E-Commerce Session-based Recommendation.

Paper Link】 【Pages】:1604-1614

【Authors】: Yuanxing Liu ; Zhaochun Ren ; Wei-Nan Zhang ; Wanxiang Che ; Ting Liu ; Dawei Yin

【Abstract】: By exploring fine-grained user behaviors, session-based recommendation predicts a user’s next action from short-term behavior sessions. Most previous work learns about a user’s implicit behavior by merely taking the last click action as the supervision signal. However, in e-commerce scenarios, large-scale products with elusive click behaviors make such a task challenging because of the low inclusiveness problem, i.e., many relevant products that satisfy the user’s shopping intention are neglected by recommenders. Since similar products with different IDs may share the same intention, we argue that the textual information (e.g., keywords of product titles) from sessions can be used as additional supervision signals to tackle the above problem by learning more shared intention within similar products. Therefore, to improve the performance of e-commerce session-based recommendation, we explicitly infer the user’s intention by generating keywords entirely from the click sequence in the current session.

【Keywords】:

145. TRAP: Two-level Regularized Autoencoder-based Embedding for Power-law Distributed Data.

Paper Link】 【Pages】:1615-1624

【Authors】: Dongmin Park ; Hwanjun Song ; Minseok Kim ; Jae-Gil Lee

【Abstract】: Recently, autoencoder (AE)-based embedding approaches have achieved state-of-the-art performance in many tasks, especially in top-k recommendation with user embedding or node classification with node embedding. However, we find that many real-world data follow the power-law distribution with respect to the data object sparsity. When learning AE-based embeddings of these data, dense inputs move away from sparse inputs in an embedding space even when they are highly correlated. This phenomenon, which we call polarization, obviously distorts the embedding. In this paper, we propose TRAP that leverages two-level regularizers to effectively alleviate the polarization problem. The macroscopic regularizer generally prevents dense input objects from being distant from other sparse input objects, and the microscopic regularizer individually attracts each object to correlated neighbor objects rather than uncorrelated ones. Importantly, TRAP is a meta-algorithm that can be easily coupled with existing AE-based embedding methods with a simple modification. In extensive experiments on two representative embedding tasks using six real-world datasets, TRAP boosted the performance of the state-of-the-art algorithms by up to 31.53% and 94.99% respectively.

【Keywords】: Information systems; Information systems applications; Data mining

146. An Intent-Based Automation Framework for Securing Dynamic Consumer IoT Infrastructures.

Paper Link】 【Pages】:1625-1636

【Authors】: Vasudevan Nagendra ; Arani Bhattacharya ; Vinod Yegneswaran ; Amir Rahmati ; Samir Ranjan Das

【Abstract】: Consumer IoT networks are characterized by heterogeneous devices with diverse functionality and programming interfaces. This lack of homogeneity makes the integration and secure management of IoT infrastructures a daunting task for users and administrators. In this paper, we introduce VISCR, a Vendor-Independent policy Specification and Conflict Resolution engine that enables intent-based conflict-free policy specification and enforcement in IoT environments. VISCR converts the topology of the IoT infrastructure into a tree-based abstraction and translates existing policies from heterogeneous vendor-specific programming languages, such as Groovy-based SmartThings, OpenHAB, IFTTT-based templates, and MUD-based profiles, into a vendor-independent graph-based specification. These are then used to automatically detect rogue policies, policy conflicts, and automation bugs. We evaluated VISCR using a dataset of 907 IoT apps, programmed using heterogeneous automation specifications, in a simulated smart-building IoT infrastructure. In our experiments, VISCR exposed 342 of the 907 IoT apps as exhibiting one or more violations, while also running 14.2x faster than the state-of-the-art tool (Soteria). VISCR detected 100% of violations reported by Soteria while also detecting new types of violations in 266 additional apps.

【Keywords】:

147. Inferring Passengers' Interactive Choices on Public Transits via MA-AL: Multi-Agent Apprenticeship Learning.

Paper Link】 【Pages】:1637-1647

【Authors】: Mingzhou Yang ; Yanhua Li ; Xun Zhou ; Hui Lu ; Zhihong Tian ; Jun Luo

【Abstract】: Public transport, such as subway lines and buses, offers affordable ride-sharing services and reduces road network traffic. Extracting passengers’ preferences from their public transit choices is important to city planners but technically non-trivial. When traveling by public transit, passengers make sequences of transit choices, and their rewards are usually influenced by other passengers’ choices. This process can be modeled as a Markov Game (MG). In this paper, we make the first effort to model travelers’ preferences in making transit choices using MGs. Based on the discovery that passengers usually do not change their policies, we propose novel algorithms to extract reward functions from the observed deterministic equilibrium joint policy of all agents in a general-sum MG to infer travelers’ preferences. First, we assume we have access to the entire joint policy. We characterize the set of all reward functions for which the given joint policy is a Nash equilibrium policy. In order to remove the degeneracy of the solution, we then attempt to pick reward functions so as to maximize the sum of the deviation between the observed policy and the sub-optimal policy of each agent. This results in an efficiently solvable linear programming algorithm for the multi-agent inverse reinforcement learning (MA-IRL) problem. Then, we deal with the case where we have access to the equilibrium joint policy through a set of actual trajectories. We propose an iterative algorithm inspired by single-agent apprenticeship learning algorithms and the cyclic coordinate descent approach. We evaluate the proposed algorithms on both a simple Grid Game and a unique real-world dataset (from Shenzhen, China). Results show that when we have access to the full policy, our algorithm can efficiently recover most of the reward structure, especially the interaction of agents. In the case where we only have access to a set of sampled expert trajectories, our algorithm can provide an explanation of the expert trajectories. Measured with respect to the experts’ unknown reward function, the performance of the policy output by our algorithm is close to that of the expert policy.

【Keywords】:

148. Personalized Employee Training Course Recommendation with Career Development Awareness.

Paper Link】 【Pages】:1648-1659

【Authors】: Chao Wang ; Hengshu Zhu ; Chen Zhu ; Xi Zhang ; Enhong Chen ; Hui Xiong

【Abstract】: As a major component of strategic talent management, learning and development (L&D) aims at improving the individual and organization performances through planning tailored training for employees to increase and improve their skills and knowledge. While many companies have developed the learning management systems (LMSs) for facilitating the online training of employees, a long-standing important issue is how to achieve personalized training recommendations with the consideration of their needs for future career development. To this end, in this paper, we propose an explainable personalized online course recommender system for enhancing employee training and development. A unique perspective of our system is to jointly model both the employees’ current competencies and their career development preferences in an explainable way. Specifically, the recommender system is based on a novel end-to-end hierarchical framework, namely Demand-aware Collaborative Bayesian Variational Network (DCBVN). In DCBVN, we first extract the latent interpretable representations of the employees’ competencies from their skill profiles with autoencoding variational inference based topic modeling. Then, we develop an effective demand recognition mechanism for learning the personal demands of career development for employees. In particular, all the above processes are integrated into a unified Bayesian inference view for obtaining both accurate and explainable recommendations. Finally, extensive experimental results on real-world data clearly demonstrate the effectiveness and the interpretability of DCBVN, as well as its robustness on sparse and cold-start scenarios.

【Keywords】:

149. Efficient Neural Interaction Function Search for Collaborative Filtering.

Paper Link】 【Pages】:1660-1670

【Authors】: Quanming Yao ; Xiangning Chen ; James T. Kwok ; Yong Li ; Cho-Jui Hsieh

【Abstract】: In collaborative filtering (CF), interaction functions (IFCs) play the important role of capturing interactions among items and users. The most popular IFC is the inner product, which has been successfully used in low-rank matrix factorization. However, interactions in real-world applications can be highly complex. Thus, other operations (such as plus and concatenation), which may potentially offer better performance, have been proposed. Nevertheless, it is still hard for existing IFCs to have consistently good performance across different application scenarios. Motivated by the recent success of automated machine learning (AutoML), we propose in this paper the search for simple neural interaction functions (SIF) in CF. By examining and generalizing existing CF approaches, an expressive SIF search space is designed and represented as a structured multi-layer perceptron. We propose a one-shot search algorithm that simultaneously updates both the architecture and learning parameters. Experimental results demonstrate that the proposed method can be much more efficient than popular AutoML approaches, can obtain much better prediction performance than state-of-the-art CF approaches, and can discover distinct IFCs for different data sets and tasks.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

150. Early Detection of User Exits from Clickstream Data: A Markov Modulated Marked Point Process Model.

Paper Link】 【Pages】:1671-1681

【Authors】: Tobias Hatt ; Stefan Feuerriegel

【Abstract】: Most users leave e-commerce websites with no purchase. Hence, it is important for website owners to detect users at risk of exiting and intervene early (e.g., adapting website content or offering price promotions). Prior approaches make widespread use of clickstream data; however, state-of-the-art algorithms only model the sequence of web pages visited and not the time spent on them.

【Keywords】: Mathematics of computing; Probability and statistics; Probabilistic representations; Markov networks; Stochastic processes; Markov processes; Theory of computation; Theory and algorithms for application domains; Machine learning theory; Markov decision processes

151. Filter List Generation for Underserved Regions.

Paper Link】 【Pages】:1682-1692

【Authors】: Alexander Sjösten ; Peter Snyder ; Antonio Pastor ; Panagiotis Papadopoulos ; Benjamin Livshits

【Abstract】: Filter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label resources related to undesirable web resources (e.g. ads, trackers, paywall libraries), so that they can be blocked by browsers and extensions.

【Keywords】:

152. Improving Learning Outcomes with Gaze Tracking and Automatic Question Generation.

Paper Link】 【Pages】:1693-1703

【Authors】: Rohail Syed ; Kevyn Collins-Thompson ; Paul N. Bennett ; Mengqiu Teng ; Shane Williams ; Wendy W. Tay ; Shamsi T. Iqbal

【Abstract】: As AI technology advances, it offers promising opportunities to improve educational outcomes when integrated with an overall learning experience. We investigate forward-looking interactive reading experiences that leverage both automatic question generation and analysis of attention signals, such as gaze tracking, to improve short- and long-term learning outcomes. We aim to expand the known pedagogical benefits of adjunct questions to more general reading scenarios, by investigating the benefits of adjunct questions generated after participants attend to passages in an article, based on their gaze behavior. We also compare the effectiveness of manually-written questions with those produced by Automatic Question Generation (AQG). We further investigate gaze and reading patterns indicative of low vs. high learning in both short- and long-term scenarios (one-week followup). We show AQG-generated adjunct questions have promise as a way to scale to a wide variety of reading material where the cost of manually curating questions may be prohibitive.

【Keywords】:

153. REST: Robust and Efficient Neural Networks for Sleep Monitoring in the Wild.

Paper Link】 【Pages】:1704-1714

【Authors】: Rahul Duggal ; Scott Freitas ; Cao Xiao ; Duen Horng Chau ; Jimeng Sun

【Abstract】: In recent years, significant attention has been devoted towards integrating deep learning technologies in the healthcare domain. However, to safely and practically deploy deep learning models for home health monitoring, two significant challenges must be addressed: the models should be (1) robust against noise; and (2) compact and energy-efficient. We propose REST, a new method that simultaneously tackles both issues via (1) adversarial training and controlling the Lipschitz constant of the neural network through spectral regularization while (2) enabling neural network compression through sparsity regularization. We demonstrate that REST produces highly robust and efficient models that substantially outperform the original full-sized models in the presence of noise. For the sleep staging task over single-channel electroencephalogram (EEG), the REST model achieves a macro-F1 score of 0.67 vs. 0.39 achieved by a state-of-the-art model in the presence of Gaussian noise while obtaining 19× parameter reduction and 15× MFLOPS reduction on two large, real-world EEG datasets. By deploying these models to an Android application on a smartphone, we quantitatively observe that REST allows models to achieve up to 17× energy reduction and 9× faster inference. We open source the code repository with this paper: https://github.com/duggalrahul/REST.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

154. MadDroid: Characterizing and Detecting Devious Ad Contents for Android Apps.

Paper Link】 【Pages】:1715-1726

【Authors】: Tianming Liu ; Haoyu Wang ; Li Li ; Xiapu Luo ; Feng Dong ; Yao Guo ; Liu Wang ; Tegawendé F. Bissyandé ; Jacques Klein

【Abstract】: Advertisement drives the economy of the mobile app ecosystem. As a key component in the mobile ad business model, mobile ad content has been overlooked by the research community, which poses a number of threats, e.g., propagating malware and undesirable contents. To understand the practice of these devious ad behaviors, we perform a large-scale study on the app contents harvested through automated app testing. In this work, we first provide a comprehensive categorization of devious ad contents, including five kinds of behaviors belonging to two categories: ad loading content and ad clicking content. Then, we propose MadDroid, a framework for automated detection of devious ad contents. MadDroid leverages an automated app testing framework with a sophisticated ad view exploration strategy for effectively collecting ad-related network traffic and subsequently extracting ad contents. We then integrate dedicated approaches into the framework to identify devious ad contents. We have applied MadDroid to 40,000 Android apps and found that roughly 6% of apps deliver devious ad contents, e.g., distributing malicious apps that cannot be downloaded via traditional app markets. Experiment results indicate that devious ad contents are prevalent, suggesting that our community should invest more effort into the detection and mitigation of devious ads towards building a trustworthy mobile advertising ecosystem.

【Keywords】:

155. Mobile App Squatting.

Paper Link】 【Pages】:1727-1738

【Authors】: Yangyu Hu ; Haoyu Wang ; Ren He ; Li Li ; Gareth Tyson ; Ignacio Castro ; Yao Guo ; Lei Wu ; Guoai Xu

【Abstract】: Domain squatting, the adversarial tactic where attackers register domain names that mimic popular ones, has been observed for decades. However, there has been growing anecdotal evidence that this style of attack has spread to other domains. In this paper, we explore the presence of squatting attacks in the mobile app ecosystem. In “App Squatting”, attackers release apps with identifiers (e.g., app name or package name) that are confusingly similar to those of popular apps or well-known Internet brands. This paper presents the first in-depth measurement study of app squatting showing its prevalence and implications. We first identify 11 common deformation approaches of app squatters and propose “AppCrazy”, a tool for automatically generating variations of app identifiers. We have applied AppCrazy to the top-500 most popular apps in Google Play, generating 224,322 deformation keywords which we then use to test for app squatters on popular markets. Through this, we confirm the scale of the problem, identifying 10,553 squatting apps (an average of over 20 squatting apps for each legitimate one). Our investigation reveals that more than 51% of the squatting apps are malicious, with some being extremely popular (up to 10 million downloads). Meanwhile, we also find that mobile app markets have not been successful in identifying and eliminating squatting apps. Our findings demonstrate the urgency to identify and prevent app squatting abuses. To this end, we have publicly released all the identified squatting apps, as well as our tool AppCrazy.

【Keywords】:
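
The kind of identifier deformation that the abstract attributes to app squatters can be illustrated with a minimal sketch. The abstract does not list the 11 deformation approaches, so the three rules below (character omission, character duplication, adjacent swap) are assumptions chosen for illustration, and the function name `deformations` is hypothetical; this is not the AppCrazy tool itself.

```python
def deformations(name):
    """Return a set of simple look-alike variants of an app identifier.

    The three rules are illustrative assumptions, not the paper's list.
    """
    variants = set()
    for i in range(len(name)):
        # Omission: drop the character at position i.
        variants.add(name[:i] + name[i + 1:])
        # Duplication: repeat the character at position i.
        variants.add(name[:i] + name[i] + name[i:])
        # Adjacent swap: exchange characters i and i + 1.
        if i + 1 < len(name):
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    # Swapping two identical neighbours reproduces the original name.
    variants.discard(name)
    return variants


candidates = deformations("abc")
```

A market scan could then compare observed app names against such a candidate set for each popular identifier, which is the defensive use the abstract describes.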

156. Sub-linear RACE Sketches for Approximate Kernel Density Estimation on Streaming Data.

Paper Link】 【Pages】:1739-1749

【Authors】: Benjamin Coleman ; Anshumali Shrivastava

【Abstract】: Kernel density estimation is a simple and effective method that lies at the heart of many important machine learning applications. Unfortunately, kernel methods scale poorly for large, high dimensional datasets. Approximate kernel density estimation has a prohibitively high memory and computation cost, especially in the streaming setting. Recent sampling algorithms for high dimensional densities can reduce the computation cost but cannot operate online, while streaming algorithms cannot handle high dimensional datasets due to the curse of dimensionality. We propose RACE, an efficient sketching algorithm for kernel density estimation on high-dimensional streaming data. RACE compresses a set of N high dimensional vectors into tiny arrays of integer counters. These arrays are sufficient to estimate the kernel density for a large class of kernels. Our one-pass sketch is simple to implement and comes with strong theoretical guarantees. We evaluate our method on real-world high-dimensional datasets and show that our sketch achieves 10x better compression compared to existing methods.

【Keywords】:
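
The "arrays of integer counters" described in the RACE abstract can be illustrated with a minimal one-pass sketch. It assumes that each array row is indexed by a locality-sensitive hash built from signed random projections, so that a query's averaged counter value grows with the number of stored vectors at a small angle to it. The class name `RaceSketch`, its parameters, and the choice of hash family are assumptions for illustration, not the authors' implementation.

```python
import random


class RaceSketch:
    """Rows of integer counters indexed by a signed-random-projection hash."""

    def __init__(self, rows, bits, dim, seed=0):
        rng = random.Random(seed)
        # One independent set of `bits` random hyperplanes per row.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
            for _ in range(rows)
        ]
        self.counts = [[0] * (2 ** bits) for _ in range(rows)]

    def _bucket(self, row, x):
        """Concatenate the sign bits of x against this row's hyperplanes."""
        index = 0
        for plane in self.planes[row]:
            bit = 1 if sum(p * v for p, v in zip(plane, x)) >= 0 else 0
            index = (index << 1) | bit
        return index

    def add(self, x):
        """Streaming update: increment one counter per row."""
        for row in range(len(self.counts)):
            self.counts[row][self._bucket(row, x)] += 1

    def query(self, q):
        """Average, over rows, of the counter that q hashes to."""
        total = sum(
            self.counts[row][self._bucket(row, q)]
            for row in range(len(self.counts))
        )
        return total / len(self.counts)


sketch = RaceSketch(rows=20, bits=4, dim=3)
for _ in range(5):
    sketch.add([1.0, 2.0, 3.0])

near = sketch.query([1.0, 2.0, 3.0])     # same direction as the stored data
far = sketch.query([-1.0, -2.0, -3.0])   # opposite direction
```

Memory depends only on `rows` and `bits`, not on the number of inserted vectors, which is the property that makes this style of sketch suitable for streams. Dividing the query result by the number of inserted vectors gives a normalized density-style estimate.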

157. LOVBench: Ontology Ranking Benchmark.

Paper Link】 【Pages】:1750-1760

【Authors】: Niklas Kolbe ; Pierre-Yves Vandenbussche ; Sylvain Kubler ; Yves Le Traon

【Abstract】: Ontology search and ranking are key building blocks to establish and reuse shared conceptualizations of domain knowledge on the Web. However, the effectiveness of proposed ontology ranking models is difficult to compare since these are often evaluated on diverse datasets that are limited by their static nature and scale. In this paper, we first introduce the LOVBench dataset as a benchmark for ontology term ranking. With inferred relevance judgments for more than 7000 queries, LOVBench is large enough to perform a comparison study using learning to rank (LTR) with complex ontology ranking models. Instead of relying on relevance judgments from a few experts, we consider implicit feedback from many actual users collected from the Linked Open Vocabularies (LOV) platform. Our approach further enables continuous updates of the benchmark, capturing the evolution of ontologies’ relevance in an ever-changing data community. Second, we compare the performance of several feature configurations from the literature using LOVBench in LTR settings and discuss the results in the context of the observed real-world user behavior. Our experimental results show that feature configurations which (i) are well-suited to the user behavior, (ii) cover all feature types, and (iii) consider decomposition of features can significantly improve the ranking performance.

【Keywords】:

158. Adaptive Low-level Storage of Very Large Knowledge Graphs.

Paper Link】 【Pages】:1761-1772

【Authors】: Jacopo Urbani ; Ceriel J. H. Jacobs

【Abstract】: The increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structure. We propose Trident, a novel storage architecture for very large KGs on centralized systems. Trident uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the topology of the graph to reduce the memory footprint. In contrast to single architectures designed for single tasks, our approach offers an interface with few low-level and general-purpose primitives that can be used to implement tasks like SPARQL query answering, reasoning, or graph analytics. Our experiments show that Trident can handle graphs with 10^11 edges using inexpensive hardware, delivering competitive performance on multiple workloads.

【Keywords】:

159. Generating Representative Headlines for News Stories.

Paper Link】 【Pages】:1773-1784

【Authors】: Xiaotao Gu ; Yuning Mao ; Jiawei Han ; Jialu Liu ; You Wu ; Cong Yu ; Daniel Finnie ; Hongkun Yu ; Jiaqi Zhai ; Nicholas Zukoski

【Abstract】: Millions of news articles are published online every day, which can be overwhelming for readers to follow. Grouping articles that are reporting the same event into news stories is a common way of assisting readers in their news consumption. However, it remains a challenging research problem to efficiently and effectively generate a representative headline for each story. Automatic summarization of a document set has been studied for decades, while few studies have focused on generating representative headlines for a set of articles. Unlike summaries, which aim to capture most information with least redundancy, headlines aim to capture information jointly shared by the story articles in short length and exclude information specific to each individual article.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing

160. Adversarial Cooperative Imitation Learning for Dynamic Treatment Regimes.

Paper Link】 【Pages】:1785-1795

【Authors】: Lu Wang ; Wenchao Yu ; Xiaofeng He ; Wei Cheng ; Martin Renqiang Ren ; Wei Wang ; Bo Zong ; Haifeng Chen ; Hongyuan Zha

【Abstract】: Recent developments in discovering dynamic treatment regimes (DTRs) have heightened the importance of deep reinforcement learning (DRL), which is used to recover the doctor’s treatment policies. However, existing DRL-based methods have the following limitations: 1) supervised methods based on behavior cloning suffer from compounding errors; 2) the self-defined reward signals in reinforcement learning models are either too sparse or need clinical guidance; 3) only positive trajectories (e.g. survived patients) are considered in current imitation learning models, while negative trajectories (e.g. deceased patients), which are examples of what not to do and could help the learned policy avoid repeating mistakes, are largely ignored. To address these limitations, in this paper, we propose the adversarial cooperative imitation learning model, ACIL, to deduce optimal dynamic treatment regimes that mimic the positive trajectories while differing from the negative trajectories. Specifically, two discriminators are used to help achieve this goal: an adversarial discriminator is designed to minimize the discrepancies between the trajectories generated from the policy and the positive trajectories, and a cooperative discriminator is used to distinguish the negative trajectories from the positive and generated trajectories. The reward signals from the discriminators are utilized to refine the policy for dynamic treatment regimes. Experiments on publicly available real-world medical data demonstrate that ACIL improves the likelihood of patient survival and provides better dynamic treatment regimes by exploiting information from both positive and negative trajectories.

【Keywords】: Computing methodologies; Machine learning

161. A Data-Driven Metric of Incentive Compatibility.

Paper Link】 【Pages】:1796-1806

【Authors】: Yuan Deng ; Sébastien Lahaie ; Vahab S. Mirrokni ; Song Zuo

【Abstract】: An incentive-compatible auction incentivizes buyers to truthfully reveal their private valuations. However, many ad auction mechanisms deployed in practice are not incentive-compatible, such as first-price auctions (for display advertising) and the generalized second-price auction (for search advertising). We introduce a new metric to quantify incentive compatibility in both static and dynamic environments. Our metric is data-driven and can be computed directly through black-box auction simulations without relying on reference mechanisms or complex optimizations. We provide interpretable characterizations of our metric and prove that it is monotone in auction parameters for several mechanisms used in practice, such as soft floors and dynamic reserve prices. We empirically evaluate our metric on ad auction data from a major ad exchange and a major search engine to demonstrate its broad applicability in practice.

【Keywords】: Applied computing; Law, social and behavioral sciences; Economics

162. Modeling and Aggregation of Complex Annotations via Annotation Distances.

Paper Link】 【Pages】:1807-1818

【Authors】: Alexander Braylan ; Matthew Lease

【Abstract】: Modeling annotators and their labels is valuable for ensuring collected data quality. Though many models have been proposed for binary or categorical labels, prior methods do not generalize to complex annotations (e.g., open-ended text, multivariate, or structured responses) without devising new models for each specific task. To obviate the need for task-specific modeling, we propose to model distances between labels, rather than the labels themselves. Our models are largely agnostic to the distance function; we leave it to the requesters to specify an appropriate distance function for their given annotation task. We propose three models of annotation quality, including a Bayesian hierarchical extension of multidimensional scaling which can be trained in an unsupervised or semi-supervised manner. Results show the generality and effectiveness of our models across diverse complex annotation tasks: sequence labeling, translation, syntactic parsing, and ranking.

【Keywords】:
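
As a rough illustration of working with distances between labels rather than with the labels themselves, the sketch below picks, for a single item, the annotation with the smallest mean distance to the other annotations. This is a simple baseline chosen for illustration and is not one of the three models the abstract refers to; the function names and the token-set Jaccard distance are assumptions, and a requester would substitute a distance suited to the task.

```python
from typing import Callable, Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Hypothetical distance for text annotations: 1 - Jaccard similarity of token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_by_mean_distance(
    annotations: Sequence[str],
    distance: Callable[[str, str], float],
) -> int:
    """Return the index of the annotation closest, on average, to all the others."""
    if not annotations:
        raise ValueError("need at least one annotation")
    if len(annotations) == 1:
        return 0

    def mean_distance(i: int) -> float:
        others = [distance(annotations[i], annotations[j])
                  for j in range(len(annotations)) if j != i]
        return sum(others) / len(others)

    # min() keeps the first index on ties.
    return min(range(len(annotations)), key=mean_distance)


if __name__ == "__main__":
    labels = ["the cat sat", "the cat sat down", "the cat"]
    print(labels[select_by_mean_distance(labels, jaccard_distance)])
```

Only the distance function is task-specific here; the selection logic never inspects the annotations directly.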

163. #Outage: Detecting Power and Communication Outages from Social Networks.

Paper Link】 【Pages】:1819-1829

【Authors】: Udit Paul ; Alexander Ermakov ; Michael Nekrasov ; Vivek Adarsh ; Elizabeth M. Belding

【Abstract】: Natural disasters are increasing worldwide at an alarming rate. To aid relief operations during and after a disaster, humanitarian organizations rely on various types of situational information such as missing, trapped or injured people and damaged infrastructure in an area. Timely identification of infrastructure and utility damage is critical to properly plan and execute search and rescue operations. However, in the wake of natural disasters, real-time identification of this information becomes challenging. In this research, we investigate the use of tweets posted on the Twitter social media platform to detect power and communication outages during natural disasters. We first curate a data set of 18,097 tweets based on domain-specific keywords obtained using Latent Dirichlet Allocation. We annotate the gathered data set to separate the tweets into different types of outage-related events: power outage, communication outage, and combined power and communication outage. We analyze the tweets to identify information such as popular words, length of words and hashtags as well as sentiments that are associated with tweets in these outage-related categories. Furthermore, we apply machine learning algorithms to classify these tweets into their respective categories. Our results show that simple classifiers such as the boosting algorithm are able to separate outage-related tweets from unrelated tweets with close to 100% F1-score. Additionally, we observe that the transfer learning model, BERT, is able to classify different categories of outage-related tweets with close to 90% accuracy in less than 90 seconds of training and testing time, demonstrating that tweets can be mined in real-time to assist first responders during natural disasters.

【Keywords】:

164. LOREM: Language-consistent Open Relation Extraction from Unstructured Text.

Paper Link】 【Pages】:1830-1838

【Authors】: Tom Harting ; Sepideh Mesbah ; Christoph Lofi

【Abstract】: We introduce a Language-consistent multi-lingual Open Relation Extraction Model (LOREM) for finding relation tuples of any type between entities in unstructured texts. LOREM does not rely on language-specific knowledge or external NLP tools such as translators or PoS-taggers, and exploits information and structures that are consistent over different languages. This allows our model to be easily extended with only limited training efforts to new languages, but also provides a boost to performance for a given single language. An extensive evaluation performed on 5 languages shows that LOREM outperforms state-of-the-art mono-lingual and cross-lingual open relation extractors. Moreover, experiments on languages with no or only little training data indicate that LOREM generalizes to other languages than the languages that it is trained on.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing

165. DyCRS: Dynamic Interpretable Postoperative Complication Risk Scoring.

Paper Link】 【Pages】:1839-1850

【Authors】: Wen Wang ; Han Zhao ; Honglei Zhuang ; Nirav Shah ; Rema Padman

【Abstract】: Early identification of patients at risk for postoperative complications can facilitate timely workups and treatments and improve health outcomes. Currently, a widely-used online surgical risk calculator developed by the American College of Surgeons (ACS) uses patients’ static features, e.g., gender and age, to assess the risk of postoperative complications. However, the most crucial signals that reflect the actual postoperative physical conditions of patients are usually real-time dynamic signals, including the vital signs of patients (e.g., heart rate, blood pressure) collected from postoperative monitoring. In this paper, we develop a dynamic postoperative complication risk scoring framework (DyCRS) to detect “at-risk” patients in real time based on postoperative sequential vital signs and static features. DyCRS is based on adaptations of the Hidden Markov Model (HMM) that captures hidden states as well as observable states to generate a real-time, probabilistic, complication risk score. Evaluating our model using electronic health record (EHR) data on elective colectomy surgery from a major health system, we show that DyCRS significantly outperforms the state-of-the-art ACS calculator and real-time predictors with 50.16% area under precision-recall curve (AUCPRC) gain on average in terms of detection effectiveness. In terms of earliness, DyCRS can predict 15hrs55mins earlier on average than the clinician’s diagnosis with a recall of 60% and a precision of 55%. Furthermore, DyCRS can extract interpretable patient stages, which are consistent with previous medical postoperative complication studies. We believe that our contributions demonstrate significant promise for developing a more accurate, robust and interpretable postoperative complication risk scoring system, which can benefit more than 50 million annual surgeries in the US by substantially lowering adverse events and healthcare costs.

【Keywords】: Applied computing; Life and medical sciences; Health care information systems; Computing methodologies; Machine learning

166. OpenCrowd: A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation.

Paper Link】 【Pages】:1851-1862

【Authors】: Ines Arous ; Jie Yang ; Mourad Khayati ; Philippe Cudré-Mauroux

【Abstract】: Finding social influencers is a fundamental task in many online applications ranging from brand marketing to opinion mining. Existing methods heavily rely on the availability of expert labels, whose collection is usually a laborious process even for domain experts. Using open-ended questions, crowdsourcing provides a cost-effective way to find a large number of social influencers in a short time. Individual crowd workers, however, only possess fragmented knowledge that is often of low quality.

【Keywords】: Information systems; Information systems applications; Data mining

167. Correcting for Selection Bias in Learning-to-rank Systems.

Paper Link】 【Pages】:1863-1873

【Authors】: Zohreh Ovaisi ; Ragib Ahsan ; Yifan Zhang ; Kathryn Vasilaky ; Elena Zheleva

【Abstract】: Click data collected by modern recommendation systems are an important source of observational data that can be utilized to train learning-to-rank (LTR) systems. However, these data suffer from a number of biases that can result in poor performance for LTR systems. Recent methods for bias correction in such systems mostly focus on position bias, the fact that higher ranked results (e.g., top search engine results) are more likely to be clicked even if they are not the most relevant results given a user’s query. Less attention has been paid to correcting for selection bias, which occurs because clicked documents are reflective of what documents have been shown to the user in the first place. Here, we propose new counterfactual approaches which adapt Heckman’s two-stage method and account for selection and position bias in LTR systems. Our empirical evaluation shows that our proposed methods are much more robust to noise and have better accuracy compared to existing unbiased LTR algorithms, especially when there is moderate to no position bias.

【Keywords】:

168. The Pod People: Understanding Manipulation of Social Media Popularity via Reciprocity Abuse.

Paper Link】 【Pages】:1874-1884

【Authors】: Janith Weerasinghe ; Bailey Flanigan ; Aviel J. Stein ; Damon McCoy ; Rachel Greenstadt

【Abstract】: Online Social Network (OSN) users’ demand to increase their account popularity has driven the creation of an underground ecosystem that provides services or techniques to help users manipulate content curation algorithms. One method of subversion that has recently emerged occurs when users form groups, called pods, to facilitate reciprocity abuse, where each member reciprocally interacts with content posted by other members of the group. We collect 1.8 million Instagram posts that were posted in pods hosted on Telegram. We first summarize the properties of these pods and how they are used, uncovering that they are easily discoverable by Google search and have a low barrier to entry. We then create two machine learning models for detecting Instagram posts that have gained interaction through two different kinds of pods, achieving 0.91 and 0.94 AUC, respectively. Finally, we find that pods are effective tools for increasing users’ Instagram popularity: we estimate that pod utilization leads to a significantly increased level of likely organic comment interaction on users’ subsequent posts.

【Keywords】:

169. Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction.

Paper Link】 【Pages】:1885-1896

【Authors】: Paolo Rosso ; Dingqi Yang ; Philippe Cudré-Mauroux

【Abstract】: Knowledge Graph (KG) embeddings are a powerful tool for predicting missing links in KGs. Existing techniques typically represent a KG as a set of triplets, where each triplet (h, r, t) links two entities h and t through a relation r, and learn entity/relation embeddings from such triplets while preserving such a structure. However, this triplet representation oversimplifies the complex nature of the data stored in the KG, in particular for hyper-relational facts, where each fact contains not only a base triplet (h, r, t), but also the associated key-value pairs (k, v). Even though a few recent techniques tried to learn from such data by transforming a hyper-relational fact into an n-ary representation (i.e., a set of key-value pairs only without triplets), they result in suboptimal models as they are unaware of the triplet structure, which serves as the fundamental data structure in modern KGs and preserves the essential information for link prediction. To address this issue, we propose HINGE, a hyper-relational KG embedding model, which directly learns from hyper-relational facts in a KG. HINGE captures not only the primary structural information of the KG encoded in the triplets, but also the correlation between each triplet and its associated key-value pairs. Our extensive evaluation shows the superiority of HINGE on various link prediction tasks over KGs. In particular, HINGE consistently outperforms not only the KG embedding methods learning from triplets only (by 0.81-41.45% depending on the link prediction tasks and settings), but also the methods learning from hyper-relational facts using the n-ary representation (by 13.2-84.1%).

【Keywords】:

170. Context-Aware Document Term Weighting for Ad-Hoc Search.

Paper Link】 【Pages】:1897-1907

【Authors】: Zhuyun Dai ; Jamie Callan

【Abstract】: Bag-of-words document representations play a fundamental role in modern search engines, but their power is limited by the shallow frequency-based term weighting scheme. This paper proposes HDCT, a context-aware document term weighting framework for document indexing and retrieval. It first estimates the semantic importance of a term in the context of each passage. These fine-grained term weights are then aggregated into a document-level bag-of-words representation, which can be stored into a standard inverted index for efficient retrieval. This paper also proposes two approaches that enable training HDCT without relevance labels. Experiments show that an index using HDCT weights significantly improved the retrieval accuracy compared to typical term-frequency and state-of-the-art embedding-based indexes.

【Keywords】: Information systems; Information retrieval; Retrieval models and ranking
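
The aggregation step described in the abstract, where passage-level term weights are combined into one document-level bag-of-words, can be sketched roughly as follows. The function name, the uniform passage importance, and the scaling of weights to integer pseudo term frequencies are assumptions made for illustration; the abstract does not specify these details, and the passage-level weights are taken as given rather than predicted by a model.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def aggregate_passage_weights(
    passage_weights: List[Dict[str, float]],
    passage_importance: Optional[List[float]] = None,
    scale: int = 100,
) -> Dict[str, int]:
    """Combine per-passage term weights into integer document-level term weights.

    passage_weights: one {term: weight} dict per passage (weights assumed in [0, 1]).
    passage_importance: hypothetical per-passage mixing weights; uniform if omitted.
    scale: assumed factor that turns real weights into tf-like integers for an index.
    """
    if not passage_weights:
        return {}
    if passage_importance is None:
        passage_importance = [1.0 / len(passage_weights)] * len(passage_weights)
    if len(passage_importance) != len(passage_weights):
        raise ValueError("one importance value is required per passage")

    doc_weights: Dict[str, float] = defaultdict(float)
    for weights, importance in zip(passage_weights, passage_importance):
        for term, weight in weights.items():
            doc_weights[term] += importance * weight

    # Drop terms whose scaled weight rounds to zero so they are not indexed.
    scaled = {term: round(scale * w) for term, w in doc_weights.items()}
    return {term: tf for term, tf in scaled.items() if tf > 0}


if __name__ == "__main__":
    passages = [{"apple": 0.8, "pie": 0.4}, {"apple": 0.2, "recipe": 0.6}]
    print(aggregate_passage_weights(passages))
```

Because the output is an ordinary term-to-integer map, it can stand in for raw term frequencies in a standard inverted index without changing the retrieval code.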

171. NetTaxo: Automated Topic Taxonomy Construction from Text-Rich Network.

Paper Link】 【Pages】:1908-1919

【Authors】: Jingbo Shang ; Xinyang Zhang ; Liyuan Liu ; Sha Li ; Jiawei Han

【Abstract】: The automated construction of topic taxonomies can benefit numerous applications, including web search, recommendation, and knowledge discovery. One of the major advantages of automatic taxonomy construction is the ability to capture corpus-specific information and adapt to different scenarios. To better reflect the characteristics of a corpus, we take the meta-data of documents into consideration and view the corpus as a text-rich network. In this paper, we propose NetTaxo, a novel automatic topic taxonomy construction framework, which goes beyond the existing paradigm and allows text data to collaborate with network structure. Specifically, we learn term embeddings from both text and network as contexts. Network motifs are adopted to capture appropriate network contexts. We conduct an instance-level selection for motifs, which further refines term embedding according to the granularity and semantics of each taxonomy node. Clustering is then applied to obtain sub-topics under a taxonomy node. Extensive experiments on two real-world datasets demonstrate the superiority of our method over the state-of-the-art, and further verify the effectiveness and importance of instance-level motif selection.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning; Information systems; Information systems applications; Data mining

172. Do podcasts and music compete with one another? Understanding users' audio streaming habits.

Paper Link】 【Pages】:1920-1931

【Authors】: Ang Li ; Alice Wang ; Zahra Nazari ; Praveen Chandar ; Benjamin A. Carterette

【Abstract】: Over the past decade, podcasts have been one of the fastest growing online streaming media. Many online audio streaming platforms such as Pandora, Spotify, etc. that traditionally focused on music content have started to incorporate services related to podcasts. Although incorporating new media types such as podcasts has created tremendous opportunities for these streaming platforms to expand their content offering, it also introduces new challenges. Since the functional use of podcasts and music may largely overlap for many people, the two types of content may compete with one another for the finite amount of time that users may allocate for audio streaming. As a result, incorporating podcast listening may influence and change the way users have originally consumed music. Adopting quasi-experimental techniques, the current study assesses the causal influence of adding a new class of content on user listening behavior by using large scale observational data collected from a widely used audio streaming platform. Our results demonstrate that podcast and music consumption compete slightly but do not replace one another – users open another time window to listen to podcasts. In addition, users who have added podcasts to their music listening demonstrate significantly different consumption habits for podcasts vs. music in terms of the streaming time, duration and frequency. Taking all the differences as input features to a machine learning model, we demonstrate that a podcast listening session is predictable at the start of a new listening session. Our study provides a novel contribution for online audio streaming and consumption services to understand their potential consumers and to best support their current users with an improved recommendation system.

【Keywords】:

173. Near-Perfect Recovery in the One-Dimensional Latent Space Model.

Paper Link】 【Pages】:1932-1942

【Authors】: Yu Chen ; Sampath Kannan ; Sanjeev Khanna

【Abstract】: Suppose a graph G is stochastically created by uniformly sampling vertices along a line segment and connecting each pair of vertices with a probability that is a known decreasing function of their distance. We ask if it is possible to reconstruct the actual positions of the vertices in G by only observing the generated unlabeled graph. We study this question for two natural edge probability functions — one where the probability of an edge decays exponentially with the distance and another where this probability decays only linearly. We initiate our study with the weaker goal of recovering only the order in which vertices appear on the line segment. For a segment of length n and a precision parameter δ, we show that for both exponential and linear decay edge probability functions, there is an efficient algorithm that correctly recovers (up to reflection symmetry) the order of all vertices that are at least δ apart, using only samples (vertices). Building on this result, we then show that vertices (samples) are sufficient to additionally recover the location of each vertex on the line to within a precision of δ. We complement this result with an lower bound on samples needed for reconstructing positions (even by a computationally unbounded algorithm), showing that the task of recovering positions is information-theoretically harder than recovering the order. We give experimental results showing that our algorithm recovers the positions of almost all points with high accuracy.

【Keywords】:
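
The generative process in the abstract (vertices sampled uniformly on a line segment, each pair connected with a probability that decreases with distance) can be sketched as below. The function names and the concrete decay functions, including their constants, are assumptions for illustration only; the abstract states that the probability decays exponentially or linearly but does not give the exact forms.

```python
import math
import random
from typing import Callable, List, Optional, Tuple


def exponential_decay(distance: float) -> float:
    """Assumed form of an exponentially decaying edge probability."""
    return math.exp(-distance)


def linear_decay(distance: float, radius: float = 1.0) -> float:
    """Assumed form of a linearly decaying edge probability, clipped at zero."""
    return max(0.0, 1.0 - distance / radius)


def sample_latent_line_graph(
    num_vertices: int,
    length: float,
    edge_prob: Callable[[float], float],
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[Tuple[int, int]]]:
    """Sample latent positions on [0, length] and the resulting random edges."""
    if rng is None:
        rng = random.Random()
    positions = [rng.uniform(0.0, length) for _ in range(num_vertices)]
    edges = []
    for i in range(num_vertices):
        for j in range(i + 1, num_vertices):
            # Connect the pair independently with a distance-dependent probability.
            if rng.random() < edge_prob(abs(positions[i] - positions[j])):
                edges.append((i, j))
    return positions, edges


if __name__ == "__main__":
    positions, edges = sample_latent_line_graph(50, 10.0, exponential_decay,
                                                random.Random(0))
    print(len(edges), "edges among", len(positions), "vertices")
```

A recovery algorithm would be given only `edges` with vertex labels permuted; `positions` is the hidden ground truth against which a reconstruction would be scored.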

174. Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text.

Paper Link】 【Pages】:1943-1954

【Authors】: Vinayshekhar Bannihatti Kumar ; Roger Iyengar ; Namita Nisal ; Yuanyuan Feng ; Hana Habib ; Peter Story ; Sushain Cherivirala ; Margaret Hagan ; Lorrie Faith Cranor ; Shomir Wilson ; Florian Schaub ; Norman M. Sadeh

【Abstract】: Website privacy policies sometimes provide users the option to opt-out of certain collections and uses of their personal data. Unfortunately, many privacy policies bury these instructions deep in their text, and few web users have the time or skill necessary to discover them. We describe a method for the automated detection of opt-out choices in privacy policy text and their presentation to users through a web browser extension. We describe the creation of two corpora of opt-out choices, which enable the training of classifiers to identify opt-outs in privacy policies. Our overall approach for extracting and classifying opt-out choices combines heuristics to identify commonly found opt-out hyperlinks with supervised machine learning to automatically identify less conspicuous instances. Our approach achieves a precision of 0.93 and a recall of 0.9. We introduce Opt-Out Easy, a web browser extension designed to present available opt-out choices to users as they browse the web. We evaluate the usability of our browser extension with a user study. We also present results of a large-scale analysis of opt-outs found in the text of thousands of the most popular websites.

【Keywords】: Security and privacy; Human and societal aspects of security and privacy

175. eDarkFind: Unsupervised Multi-view Learning for Sybil Account Detection.

Paper Link】 【Pages】:1955-1965

【Authors】: Ramnath Kumar ; Shweta Yadav ; Raminta Daniulaityte ; Francois R. Lamy ; Krishnaprasad Thirunarayan ; Usha Lokala ; Amit P. Sheth

【Abstract】: Darknet crypto markets are online marketplaces using crypto currencies (e.g., Bitcoin, Monero) and advanced encryption techniques to offer anonymity to vendors and consumers trading in illegal goods or services. The exact volume of substances advertised and sold through these crypto markets is difficult to assess, at least partially, because vendors tend to maintain multiple accounts (or Sybil accounts) within and across different crypto markets. Linking these different accounts will allow us to accurately evaluate the volume of substances advertised across the different crypto markets by each vendor. In this paper, we present a multi-view unsupervised framework (eDarkFind) that helps model vendor characteristics and facilitates Sybil account detection. We employ a multi-view learning paradigm to generalize and improve the performance by exploiting the diverse views from multiple rich sources such as BERT, stylometric, and location representation. Our model is further tailored to take advantage of domain-specific knowledge such as the Drug Abuse Ontology to take into consideration the substance information. We performed extensive experiments and demonstrated that the multiple views obtained from diverse sources can be effective in linking Sybil accounts. Our proposed eDarkFind model achieves an accuracy of 98% on three real-world datasets, which shows the generality of the approach.

【Keywords】:

176. Provably and Efficiently Approximating Near-cliques using the Turán Shadow: PEANUTS.

Paper Link】 【Pages】:1966-1976

【Authors】: Shweta Jain ; C. Seshadhri

【Abstract】: Clique and near-clique counts are important graph properties with applications in graph generation, graph modeling, graph analytics, and community detection, among others. They are the archetypal examples of dense subgraphs. While there are several different definitions of near-cliques, most of them share the attribute that they are cliques that are missing a small number of edges. Clique counting is itself considered a challenging problem. Counting near-cliques is significantly harder, more so since the search space for near-cliques is orders of magnitude larger than that of cliques.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms

177. Differentially Private Stream Processing for the Semantic Web.

Paper Link】 【Pages】:1977-1987

【Authors】: Daniele Dell'Aglio ; Abraham Bernstein

【Abstract】: Data often contains sensitive information, which poses a major obstacle to publishing it. Some suggest obfuscating the data or releasing only some data statistics. These approaches have, however, been shown to provide insufficient safeguards against de-anonymisation. Recently, differential privacy (DP), an approach that injects noise into the query answers to provide statistical privacy guarantees, has emerged as a solution to release sensitive data. This study investigates how to continuously release privacy-preserving histograms (or distributions) from online streams of sensitive data by combining DP and semantic web technologies. We focus on distributions, as they are the basis for many analytic applications. Specifically, we propose SihlQL, a query language that processes RDF streams in a privacy-preserving fashion. SihlQL builds on top of SPARQL and the w-event DP framework. We show how some peculiarities of w-event privacy constrain the expressiveness of SihlQL queries. Addressing these constraints, we propose an extension of w-event privacy that provides answers to a larger class of queries while preserving their privacy. To evaluate SihlQL, we implemented a prototype engine that compiles queries to Apache Flink topologies and studied its privacy properties using real-world data from an IPTV provider and an online e-commerce web site.

【Keywords】:
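
The basic idea of injecting noise into query answers can be illustrated with the standard Laplace mechanism applied to a single histogram release. This is a minimal sketch of that textbook mechanism only: it does not implement SihlQL, the w-event framework, or budget allocation across a stream, and the function names and the sensitivity of 1 (each individual contributes to at most one bin) are assumptions.

```python
import random
from typing import Dict, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(
    counts: Dict[str, int],
    epsilon: float,
    sensitivity: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Release one noisy histogram; smaller epsilon means more noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    # Noise is added to every bin independently, including bins with a zero count.
    return {key: count + laplace_noise(scale, rng) for key, count in counts.items()}


if __name__ == "__main__":
    true_counts = {"news": 120, "sports": 45, "movies": 0}
    print(private_histogram(true_counts, epsilon=0.5))
```

Releasing such a histogram repeatedly over a stream consumes privacy budget at every release, which is the problem that stream-oriented frameworks such as the one named in the abstract are designed to manage.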

178. Learning to Hash with Graph Neural Networks for Recommender Systems.

Paper Link】 【Pages】:1988-1998

【Authors】: Qiaoyu Tan ; Ninghao Liu ; Xing Zhao ; Hongxia Yang ; Jingren Zhou ; Xia Hu

【Abstract】: Recommender systems in industry generally include two stages: recall and ranking. Recall refers to efficiently identifying hundreds of candidate items that a user may be interested in from a large item corpus, while ranking aims to output a precise ranking list using complex ranking models. Recently, graph representation learning has attracted much attention in supporting high quality candidate search at scale. Despite its effectiveness in learning embedding vectors for objects in the user-item interaction network, the computational costs to infer users’ preferences in continuous embedding space are tremendous. In this work, we investigate the problem of hashing with graph neural networks (GNNs) for high quality retrieval, and propose a simple yet effective discrete representation learning framework to jointly learn continuous and discrete codes. Specifically, a deep hashing model with GNNs (HashGNN) is presented, which consists of two components, a GNN encoder for learning node representations, and a hash layer for encoding representations to hash codes. The whole architecture is trained end-to-end by jointly optimizing two losses, i.e., reconstruction loss from reconstructing observed links, and ranking loss from preserving the relative ordering of hash codes. A novel discrete optimization strategy based on the straight-through estimator (STE) with guidance is proposed. The principal idea is to avoid gradient magnification in back-propagation of STE with continuous embedding guidance, in which we begin by learning an easier network that mimics the continuous embedding and let it evolve during training until it finally goes back to STE. Comprehensive experiments over three publicly available datasets and one real-world Alibaba dataset demonstrate that our model not only achieves performance comparable to its continuous counterpart but also runs multiple times faster during inference.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks
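
For readers unfamiliar with the straight-through estimator mentioned in the abstract, the sketch below shows the plain version on a toy problem: the forward pass binarizes an embedding with the sign function, and the backward pass copies the gradient through as if the sign were the identity. It deliberately omits the guidance scheme the abstract describes, and the squared-error loss, learning rate, and function names are assumptions for illustration.

```python
from typing import List


def hash_forward(embedding: List[float]) -> List[float]:
    """Forward pass: binarize each coordinate to +1 or -1."""
    return [1.0 if x >= 0 else -1.0 for x in embedding]


def hash_backward_ste(grad_codes: List[float]) -> List[float]:
    """Backward pass under STE: pass the gradient through unchanged.

    The true derivative of sign() is zero almost everywhere, so it carries
    no learning signal; STE replaces it with the derivative of the identity.
    """
    return list(grad_codes)


def ste_step(embedding: List[float], target_codes: List[float],
             lr: float = 0.1) -> List[float]:
    """One gradient step on the toy loss 0.5 * sum((code - target)^2)."""
    codes = hash_forward(embedding)
    grad_codes = [c - t for c, t in zip(codes, target_codes)]
    grad_embedding = hash_backward_ste(grad_codes)
    return [x - lr * g for x, g in zip(embedding, grad_embedding)]


if __name__ == "__main__":
    embedding, target = [0.3, -0.2], [-1.0, -1.0]
    for _ in range(2):
        embedding = ste_step(embedding, target)
    print(hash_forward(embedding))
```

In this toy run the first coordinate keeps receiving a constant-size gradient until its sign flips, while the already-correct second coordinate receives none.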

179. Edge formation in Social Networks to Nurture Content Creators.

Paper Link】 【Pages】:1999-2008

【Authors】: Chun Lo ; Emilie de Longueau ; Ankan Saha ; Shaunak Chatterjee

【Abstract】: Social networks act as major content marketplaces where creators and consumers come together to share and consume various kinds of content. Content ranking applications (e.g., newsfeed, moments, notifications) and edge recommendation products (e.g., connect to members, follow celebrities or groups or hashtags) on such platforms aim at improving the consumer experience. In this work, we focus on the creator experience and specifically on improving edge recommendations to better serve creators in such ecosystems.

【Keywords】:

180. Open Intent Extraction from Natural Language Interactions.

Paper Link】 【Pages】:2009-2020

【Authors】: Nikhita Vedula ; Nedim Lipka ; Pranav Maneriker ; Srinivasan Parthasarathy

【Abstract】: Accurately discovering user intents from their written or spoken language plays a critical role in natural language understanding and automated dialog response. Most existing research models this as a classification task with a single intent label per utterance, grouping user utterances into a single intent type from a set of categories known beforehand. Going beyond this formulation, we define and investigate a new problem of open intent discovery. It involves discovering one or more generic intent types from text utterances, that may not have been encountered during training. We propose a novel domain-agnostic approach, OPINE, which formulates the problem as a sequence tagging task under an open-world setting. It employs a CRF on top of a bidirectional LSTM to extract intents in a consistent format, subject to constraints among intent tag labels. We apply a multi-head self-attention mechanism to effectively learn dependencies between distant words. We further use adversarial training to improve performance and robustly adapt our model across varying domains. Finally, we curate and plan to release an open intent annotated dataset of 25K real-life utterances spanning diverse domains. Extensive experiments show that our approach outperforms state-of-the-art baselines by 5-15% F1 score points. We also demonstrate the efficacy of OPINE in recognizing multiple, diverse domain intents with limited (can also be zero) training examples per unique domain.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing

181. Efficient Algorithms towards Network Intervention.

Paper Link】 【Pages】:2021-2031

【Authors】: Hui-Ju Hung ; Wang-Chien Lee ; De-Nian Yang ; Chih-Ya Shen ; Zhen Lei ; Sy-Miin Chow

【Abstract】: Research suggests that social relationships have substantial impacts on individuals’ health outcomes. Network intervention, through careful planning, can assist a network of users to build healthy relationships. However, most previous work is not designed to assist such planning by carefully examining and improving multiple network characteristics. In this paper, we propose and evaluate algorithms that facilitate network intervention planning through simultaneous optimization of network degree, closeness, betweenness, and local clustering coefficient, under scenarios involving Network Intervention with Limited Degradation - for Single target (NILD-S) and Network Intervention with Limited Degradation - for Multiple targets (NILD-M). We prove that NILD-S and NILD-M are NP-hard and cannot be approximated within any ratio in polynomial time unless P=NP. We propose the Candidate Re-selection with Preserved Dependency (CRPD) algorithm for NILD-S, and the Objective-aware Intervention edge Selection and Adjustment (OISA) algorithm for NILD-M. Various pruning strategies are designed to boost the efficiency of the proposed algorithms. Extensive experiments on various real social networks collected from public schools and Web and an empirical study are conducted to show that CRPD and OISA outperform the baselines in both efficiency and effectiveness.

【Keywords】:

182. Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus.

Paper Link】 【Pages】:2032-2043

【Authors】: Bang Liu ; Haojie Wei ; Di Niu ; Haolan Chen ; Yancheng He

【Abstract】: The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question-answering and machine reading comprehension tasks, and helps a chatbot to keep the conversation flowing with a human. Existing question generation models are ineffective at generating a large amount of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which aims at automatically generating high-quality and diverse question-answer pairs from an unlabeled text corpus at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples from the text multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions, leveraging the extracted assistive information; and iii) a neural quality controller, which removes low-quality generated data based on text entailment. We compare our question generation models with existing approaches and resort to voluntary human evaluation to assess the quality of the generated question-answer pairs. The evaluation results suggest that our system dramatically outperforms state-of-the-art neural question generation models in terms of generation quality, while remaining scalable. With models trained on a relatively small amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning; Machine learning approaches; Neural networks

183. Expanding Taxonomies with Implicit Edge Semantics.

Paper Link】 【Pages】:2044-2054

【Authors】: Emaad Manzoor ; Rui Li ; Dhananjay Shrouty ; Jure Leskovec

【Abstract】: Curated taxonomies enhance the performance of machine-learning systems via high-quality structured knowledge. However, manually curating a large and rapidly-evolving taxonomy is infeasible. In this work, we propose Arborist, an approach to automatically expand textual taxonomies by predicting the parents of new taxonomy nodes. Unlike previous work, Arborist handles the more challenging scenario of taxonomies with heterogeneous edge semantics that are unobserved. Arborist learns latent representations of the edge semantics along with embeddings of the taxonomy nodes to measure taxonomic relatedness between node pairs. Arborist is then trained by optimizing a large-margin ranking loss with a dynamic margin function. We propose a principled formulation of the margin function, which theoretically guarantees that Arborist minimizes an upper-bound on the shortest-path distance between the predicted parents and actual parents in the taxonomy. Via extensive evaluation on a curated taxonomy at Pinterest and several public datasets, we demonstrate that Arborist outperforms the state-of-the-art, achieving up to 59% in mean reciprocal rank and 83% in recall at 15. We also explore the ability of Arborist to infer nodes’ taxonomic-roles, without explicit supervision on this task.
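The two metrics quoted in this abstract, mean reciprocal rank and recall at 15, can be illustrated with a minimal sketch. The function names and the toy taxonomy data below are hypothetical, and the paper's exact evaluation protocol may differ (for example in how nodes with several true parents are scored); this only shows the standard definitions.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant item in `ranked`, or 0.0 if none appears."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear among the top-k ranked candidates."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical toy data: each query node has a ranked list of predicted
# parents and a set of true parents in the taxonomy.
queries = [
    (["fruit", "food", "plant"], {"food"}),        # first hit at rank 2
    (["vehicle", "machine", "tool"], {"vehicle"}),  # first hit at rank 1
]

mrr = mean([reciprocal_rank(ranked, true) for ranked, true in queries])
recall_at_1 = mean([recall_at_k(ranked, true, 1) for ranked, true in queries])
print(f"MRR = {mrr:.2f}, recall@1 = {recall_at_1:.2f}")
```

On the toy data the mean reciprocal rank averages 1/2 and 1, and recall at 1 counts only the second query as a hit.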

【Keywords】: Computing methodologies; Machine learning

184. Herding a Deluge of Good Samaritans: How GitHub Projects Respond to Increased Attention.

Paper Link】 【Pages】:2055-2065

【Authors】: Danaja Maldeniya ; Ceren Budak ; Lionel P. Robert Jr. ; Daniel M. Romero

【Abstract】: Collaborative crowdsourcing is a well-established model of work, especially in the case of open source software development. The structure and operation of these virtual and loosely-knit teams differ from traditional organizations. As such, little is known about how their behavior may change in response to an increase in external attention. To understand these dynamics, we analyze millions of actions of thousands of contributors in over 1100 open source software projects that topped the GitHub Trending Projects page and thus experienced a large increase in attention, in comparison to a control group of projects identified through propensity score matching. In carrying out our research, we use the lens of organizational change, which considers the challenges teams face during rapid growth and how they adapt their work routines, organizational structure, and management style. We show that trending results in an explosive growth in the effective team size. However, most newcomers make only shallow and transient contributions. In response, the original team transitions towards administrative roles, responding to requests and reviewing work done by newcomers. Projects evolve towards a more distributed coordination model with newcomers becoming more central, albeit in limited ways. Additionally, teams become more modular with subgroups specializing in different aspects of the project. We discuss broader implications for collaborative crowdsourcing teams that face attention shocks.

【Keywords】:

185. Don't Let Me Be Misunderstood: Comparing Intentions and Perceptions in Online Discussions.

Paper Link】 【Pages】:2066-2077

【Authors】: Jonathan P. Chang ; Justin Cheng ; Cristian Danescu-Niculescu-Mizil

【Abstract】: Discourse involves two perspectives: a person’s intention in making an utterance and others’ perception of that utterance. The misalignment between these perspectives can lead to undesirable outcomes, such as misunderstandings, low productivity and even overt strife. In this work, we present a computational framework for exploring and comparing both perspectives in online public discussions.

【Keywords】:

186. Discovering Strategic Behaviors for Collaborative Content-Production in Social Networks.

Paper Link】 【Pages】:2078-2088

【Authors】: Yuxin Xiao ; Adit Krishnan ; Hari Sundaram

【Abstract】: Some social networks provide explicit mechanisms to allocate social rewards such as reputation based on users’ actions, while the mechanism is more opaque in other networks. Nonetheless, there are always individuals who obtain greater rewards and reputation than their peers. An intuitive yet important question to ask is whether these successful users employ strategic behaviors to become influential. It might appear that the influencers “have gamed the system.” However, it remains difficult to conclude the rationality of their actions due to factors like the combinatorial strategy space, inability to determine payoffs, and resource limitations faced by individuals. The challenging nature of this question has drawn attention from both the theory and data mining communities. Therefore, in this paper, we are motivated to investigate if resource-limited individuals discover strategic behaviors associated with high payoffs when producing collaborative/interactive content in social networks. We propose a novel framework of Dynamic Dual Attention Networks (DDAN) which models individuals’ content production strategies through a generative process, under the influence of social interactions involved in the process. Extensive experimental results illustrate the model’s effectiveness in user behavior modeling. We make three strong empirical findings: (1) Different strategies give rise to different social payoffs; (2) The best performing individuals exhibit stability in their preference over the discovered strategies, which indicates the emergence of strategic behavior; and (3) The stability of a user’s preference is correlated with high payoffs.

【Keywords】: Information systems; Information systems applications; Data mining

187. Seeding Network Influence in Biased Networks and the Benefits of Diversity.

Paper Link】 【Pages】:2089-2098

【Authors】: Ana-Andreea Stoica ; Jessy Xinyi Han ; Augustin Chaintreau

【Abstract】: The problem of social influence maximization is widely applicable in designing viral campaigns, news dissemination, or medical aid. State-of-the-art algorithms often select “early adopters” that are most central in a network, unfortunately mirroring or exacerbating historical biases and leaving under-represented communities out of the loop. Through a theoretical model of biased networks, we characterize the intricate relationship between diversity and efficiency, which sometimes may be at odds but may also reinforce each other. Most importantly, we find a mathematically proven analytical condition under which more equitable choices of early adopters lead simultaneously to fairer outcomes and larger outreach. Analysis of data on the DBLP network confirms that our condition is often met in real networks. We design and test a set of algorithms leveraging the network structure to optimize the diffusion of a message while avoiding disparate impact among participants based on their demographics, such as gender or race.

【Keywords】:

188. Liquidity in Credit Networks with Constrained Agents.

Paper Link】 【Pages】:2099-2108

【Authors】: Geoffrey Ramseyer ; Ashish Goel ; David Mazières

【Abstract】: In order to scale transaction rates for deployment across the global web, many cryptocurrencies have deployed so-called “Layer-2” networks of private payment channels. An idealized payment network behaves like a Credit Network, a model for transactions across a network of bilateral trust relationships. Credit Networks capture many aspects of traditional currencies as well as new virtual currencies and payment mechanisms. In the traditional credit network model, if an agent defaults, every other node that trusted it is vulnerable to loss. In a cryptocurrency context, trust is manufactured by capital deposits, and thus there arises a natural tradeoff between network liquidity (i.e. the fraction of transactions that succeed) and the cost of capital deposits.

【Keywords】:

189. Dark Matter: Uncovering the DarkComet RAT Ecosystem.

Paper Link】 【Pages】:2109-2120

【Authors】: Brown Farinholt ; Mohammad Rezaeirad ; Damon McCoy ; Kirill Levchenko

【Abstract】: Remote Access Trojans (RATs) are a persistent class of malware that give an attacker direct, interactive access to a victim’s personal computer, allowing the attacker to steal private data, spy on the victim in real time using the camera and microphone, and verbally harass the victim through the speaker. To date, the users and victims of this pernicious form of malware have been challenging to observe in the wild due to the unobtrusive nature of infections. In this work, we report the results of a longitudinal study of the DarkComet RAT ecosystem. Using a known method for collecting victim log databases from DarkComet controllers, we present novel techniques for tracking RAT controllers across hostname changes and improve on established techniques for filtering spurious victim records caused by scanners and sandboxed malware executions. We downloaded 6,620 DarkComet databases from 1,029 unique controllers spanning over 5 years of operation. Our analysis shows that there have been at least 57,805 victims of DarkComet over this period, with 69 new victims infected every day, many of whose keystrokes have been captured, actions recorded, and webcams monitored during this time. Our methodologies for more precisely identifying campaigns and victims could potentially be useful for improving the efficiency and efficacy of victim cleanup efforts and prioritization of law enforcement investigations.

【Keywords】: Social and professional topics; Computing / technology policy; Computer crime

190. Discriminative Topic Mining via Category-Name Guided Text Embedding.

Paper Link】 【Pages】:2121-2132

【Authors】: Yu Meng ; Jiaxin Huang ; Guangyuan Wang ; Zihan Wang ; Chao Zhang ; Yu Zhang ; Jiawei Han

【Abstract】: Mining a set of meaningful and distinctive topics automatically from massive text corpora has broad applications. Existing topic models, however, typically work in a purely unsupervised way, which often generates topics that do not fit users’ particular needs and yields suboptimal performance on downstream tasks. We propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine discriminative topics from text corpora. This new task not only helps a user understand clearly and distinctively the topics he/she is most interested in, but also directly benefits keyword-driven classification tasks. We develop CatE, a novel category-name guided text embedding method for discriminative topic mining, which effectively leverages minimal user guidance to learn a discriminative embedding space and discover category representative terms in an iterative manner. We conduct a comprehensive set of experiments to show that CatE mines a high-quality set of topics guided by category names only, and benefits a variety of downstream applications including weakly-supervised classification and lexical entailment direction identification.

【Keywords】: Computing methodologies; Machine learning

191. Designing for Trust: A Behavioral Framework for Sharing Economy Platforms.

Paper Link】 【Pages】:2133-2143

【Authors】: Natã M. Barbosa ; Emily Sun ; Judd Antin ; Paolo Parigi

【Abstract】: Trust is a fundamental prerequisite in the growth and sustainability of sharing economy platforms. Many such platforms rely on actions that require trust to take place, such as entering a stranger’s car or sleeping at a stranger’s place. For this reason, understanding, measuring, and tracking trust can be of great benefit to such platforms, enabling them to identify trust behaviors, both online and offline, and identify groups which may benefit from trust-building interventions. In this work, we present the design and evaluation of a behavioral framework to measure a user’s propensity to trust others on Airbnb. We conducted an online experiment with 4,499 Airbnb users in the form of an investment game in order to capture users’ propensity to trust other users on Airbnb. Then, we used the experimental data to generate both explanatory and predictive models of trust propensity. Our contribution is a framework that can be used to measure trust propensity in sharing economy platforms via online and offline signals. We discuss which affordances need to be in place so that sharing economy platforms can get signals of trust, in addition to how such a framework can be used to inform design around trust in the short and long term.

【Keywords】:

192. A Generic Edge-Empowered Graph Convolutional Network via Node-Edge Mutual Enhancement.

Paper Link】 【Pages】:2144-2154

【Authors】: Pengyang Wang ; Jiaping Gui ; Zhengzhang Chen ; Junghwan Rhee ; Haifeng Chen ; Yanjie Fu

【Abstract】: Graph Convolutional Networks (GCNs) have been shown to be a powerful tool for analyzing graph-structured data. Most previous GCN methods focus on learning a good node representation by aggregating the representations of neighboring nodes, while largely ignoring the edge information. Although a few recent methods have been proposed to integrate edge attributes into GCNs to initialize edge embeddings, these methods do not work when edge attributes are (partially) unavailable. Can we develop a generic edge-empowered framework to exploit node-edge enhancement, regardless of the availability of edge attributes? In this paper, we propose a novel framework EE-GCN that achieves node-edge enhancement. In particular, the framework EE-GCN includes three key components: (i) Initialization: this step is to initialize the embeddings of both nodes and edges. Unlike node embedding initialization, we propose a line graph-based method to initialize the embedding of edges regardless of edge attributes. (ii) Feature space alignment: we propose a translation-based mapping method to align edge embedding with node embedding space, and the objective function is penalized by a translation loss when the two spaces are not aligned. (iii) Node-edge mutually enhanced updating: node embedding is updated by aggregating embedding of neighboring nodes and associated edges, while edge embedding is updated by the embedding of associated nodes and itself. Through the above improvements, our framework provides a generic strategy for all of the spatial-based GCNs to allow edges to participate in embedding computation and exploit node-edge mutual enhancement. Finally, we present extensive experimental results to validate the improved performance of our method in terms of node classification, link prediction, and graph classification.

【Keywords】: Information systems; Information systems applications; Data mining

193. Algorithmic Effects on the Diversity of Consumption on Spotify.

Paper Link】 【Pages】:2155-2165

【Authors】: Ashton Anderson ; Lucas Maystre ; Ian Anderson ; Rishabh Mehrotra ; Mounia Lalmas

【Abstract】: On many online platforms, users can engage with millions of pieces of content, which they discover either organically or through algorithmically-generated recommendations. While the short-term benefits of recommender systems are well-known, their long-term impacts are less well understood. In this work, we study the user experience on Spotify, a popular music streaming service, through the lens of diversity—the coherence of the set of songs a user listens to. We use a high-fidelity embedding of millions of songs based on listening behavior on Spotify to quantify how musically diverse every user is, and find that high consumption diversity is strongly associated with important long-term user metrics, such as conversion and retention. However, we also find that algorithmically-driven listening through recommendations is associated with reduced consumption diversity. Furthermore, we observe that when users become more diverse in their listening over time, they do so by shifting away from algorithmic consumption and increasing their organic consumption. Finally, we deploy a randomized experiment and show that algorithmic recommendations are more effective for users with lower diversity. Our work illuminates a central tension in online platforms: how do we recommend content that users are likely to enjoy in the short term while simultaneously ensuring they can remain diverse in their consumption in the long term?

【Keywords】:

194. NERO: A Neural Rule Grounding Framework for Label-Efficient Relation Extraction.

Paper Link】 【Pages】:2166-2176

【Authors】: Wenxuan Zhou ; Hongtao Lin ; Bill Yuchen Lin ; Ziqi Wang ; Junyi Du ; Leonardo Neves ; Xiang Ren

【Abstract】: Deep neural models for relation extraction tend to be less reliable when perfectly labeled data is limited, despite their success in label-sufficient scenarios. Instead of seeking more instance-level labels from human annotators, here we propose to annotate frequent surface patterns to form labeling rules. These rules can be automatically mined from large text corpora and generalized via a soft rule matching mechanism. Prior works use labeling rules in an exact matching fashion, which inherently limits the coverage of sentence matching and results in the low-recall issue. In this paper, we present a neural approach to ground rules for RE, named Nero, which jointly learns a relation extraction module and a soft matching module. One can employ any neural relation extraction model as the instantiation for the RE module. The soft matching module learns to match rules with semantically similar sentences such that raw corpora can be automatically labeled and leveraged by the RE module (with much better coverage) as augmented supervision, in addition to the exactly matched sentences. Extensive experiments and analysis on two public and widely-used datasets demonstrate the effectiveness of the proposed Nero framework, compared with both rule-based and semi-supervised methods. Through user studies, we find that the time efficiency for a human to annotate rules and sentences is similar (0.30 vs. 0.35 min per label). In particular, Nero’s performance using 270 rules is comparable to the models trained using 3,000 labeled sentences, yielding a 9.5x speedup. Moreover, Nero can predict for unseen relations at test time and provide interpretable predictions. We release our code to the community for future research.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning

195. Clustering and Constructing User Coresets to Accelerate Large-scale Top-K Recommender Systems.

Paper Link】 【Pages】:2177-2187

【Authors】: Jyun-Yu Jiang ; Patrick H. Chen ; Cho-Jui Hsieh ; Wei Wang

【Abstract】: Top-K recommender systems aim to generate few but satisfactory personalized recommendations for various practical applications, such as item recommendation for e-commerce and link prediction for social networks. However, the numbers of users and items can be enormous, thereby leading to myriad potential recommendations as well as the bottleneck in evaluating and ranking all possibilities. Existing Maximum Inner Product Search (MIPS) based methods treat the item ranking problem for each user independently and the relationship between users has not been explored. In this paper, we propose a novel model for clustering and navigating for top-K recommenders (CANTOR) to expedite the computation of top-K recommendations based on latent factor models. A clustering-based framework is first presented to leverage user relationships to partition users into affinity groups, each of which contains users with similar preferences. CANTOR then derives a coreset of representative vectors for each affinity group by constructing a set cover with a theoretically guaranteed difference to user latent vectors. Using these representative vectors in the coreset, approximate nearest neighbor search is then applied to obtain a small set of candidate items for each affinity group to be used when computing recommendations for each user in the affinity group. This approach can significantly reduce the computation without compromising the quality of the recommendations. Extensive experiments are conducted on six publicly available large-scale real-world datasets for item recommendation and personalized link prediction. The experimental results demonstrate that CANTOR significantly speeds up matrix factorization models with high precision. For instance, CANTOR can achieve 355.1x speedup for inferring recommendations in a million-user network with 99.5% [email protected] to the original system while the state-of-the-art method can only obtain 93.7x speedup with 99.0% [email protected]

【Keywords】:

196. Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion.

Paper Link】 【Pages】:2188-2198

【Authors】: Jiaxin Huang ; Yiqing Xie ; Yu Meng ; Jiaming Shen ; Yunyi Zhang ; Jiawei Han

【Abstract】: Given a small set of seed entities (e.g., “USA”, “Russia”), corpus-based set expansion is to induce an extensive set of entities which share the same semantic class (Country in this example) from a given corpus. Set expansion benefits a wide range of downstream applications in knowledge discovery, such as web search, taxonomy construction, and query suggestion. Existing corpus-based set expansion algorithms typically bootstrap the given seeds by incorporating lexical patterns and distributional similarity. However, because no negative sets are provided explicitly, these methods suffer from semantic drift caused by expanding the seed set freely without guidance. We propose a new framework, Set-CoExpan, that automatically generates auxiliary sets as negative sets that are closely related to the user’s target set, and then performs multiple-set co-expansion that extracts discriminative features by comparing the target set with the auxiliary sets, to form multiple cohesive sets that are distinctive from one another, thus resolving the semantic drift issue. In this paper we demonstrate that by generating auxiliary sets, we can guide the expansion process of the target set to avoid touching those ambiguous areas around the border with auxiliary sets, and we show that Set-CoExpan outperforms strong baseline methods significantly.

【Keywords】:

197. Déjà vu: A Contextualized Temporal Attention Mechanism for Sequential Recommendation.

Paper Link】 【Pages】:2199-2209

【Authors】: Jibang Wu ; Renqin Cai ; Hongning Wang

【Abstract】: Predicting users’ preferences based on their sequential behaviors in history is challenging and crucial for modern recommender systems. Most existing sequential recommendation algorithms focus on transitional structure among the sequential actions, but largely ignore the temporal and context information, when modeling the influence of a historical event to current prediction.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

198. An End-to-end Topic-Enhanced Self-Attention Network for Social Emotion Classification.

Paper Link】 【Pages】:2210-2219

【Authors】: Chang Wang ; Bang Wang

【Abstract】: Social emotion classification is to predict the distribution of different emotions evoked by an article among its readers. Prior studies have shown that document semantic and topical features can help improve classification performance. However, how to effectively extract and jointly exploit such features has not been well researched. In this paper, we propose an end-to-end topic-enhanced self-attention network (TESAN) that jointly encodes document semantics and extracts document topics. In particular, TESAN first constructs a neural topic model to learn topical information and generates a topic embedding for a document. We then propose a topic-enhanced self-attention mechanism to encode semantic and topical information into a document vector. Finally, a fusion gate is used to compose the document representation for emotion classification by integrating the document vector and the topic embedding. The entire TESAN is trained in an end-to-end manner. Experimental results on three public datasets reveal that TESAN outperforms the state-of-the-art schemes in terms of higher classification accuracy and higher average Pearson correlation coefficient. Furthermore, TESAN is computationally efficient and can generate more coherent topics.
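As a rough illustration of the "average Pearson correlation coefficient" mentioned in this abstract, the sketch below computes Pearson's r between a predicted and a reference emotion distribution for each document and averages over documents. The function names and the toy distributions are hypothetical, and averaging per document is an assumption; the paper may aggregate differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


def average_pearson(predicted, reference):
    """Mean of per-document Pearson correlations (assumed aggregation)."""
    scores = [pearson(p, r) for p, r in zip(predicted, reference)]
    return sum(scores) / len(scores)


# Hypothetical emotion distributions over three emotions for two documents.
predicted = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
reference = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
print(f"average Pearson r = {average_pearson(predicted, reference):.3f}")
```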

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

199. When Recommender Systems Meet Fleet Management: Practical Study in Online Driver Repositioning System.

Paper Link】 【Pages】:2220-2229

【Authors】: Zhe Xu ; Chang Men ; Peng Li ; Bicheng Jin ; Ge Li ; Yue Yang ; Chunyang Liu ; Ben Wang ; Xiaohu Qie

【Abstract】: E-hailing platforms have become an important component of public transportation in recent years. The supply (online drivers) and demand (passenger requests) are intrinsically imbalanced because of the pattern of human behavior, especially at times and locations such as peak hours and train stations. Hence, how to balance supply and demand is one of the key problems to satisfy passengers and drivers and increase social welfare. As an intuitive and effective approach to address this problem, driver repositioning has been employed by some real-world e-hailing platforms. In this paper, we describe a novel framework for a driver repositioning system, which meets various requirements in practical situations, including robust driver experience satisfaction and multi-driver collaboration. We introduce an effective and user-friendly driver interaction design called “driver repositioning task”. A novel modularized algorithm is developed to generate the repositioning tasks in real time. To our knowledge, this is the first industry-level application of driver repositioning. We evaluate the proposed method in real-world experiments, achieving a 2% improvement in driver income. Our framework has been fully deployed in the online system of DiDi Chuxing and serves millions of drivers on a daily basis.

【Keywords】:

200. Domain Adaptive Multi-Modality Neural Attention Network for Financial Forecasting.

Paper Link】 【Pages】:2230-2240

【Authors】: Dawei Zhou ; Lecheng Zheng ; Yada Zhu ; Jianbo Li ; Jingrui He

【Abstract】: Financial time series analysis plays a central role in optimizing investment decisions and hedging market risks. This is a challenging task as the problems are always accompanied by dual-level (i.e., data-level and task-level) heterogeneity. For instance, in stock price forecasting, a successful portfolio with bounded risks usually consists of a large number of stocks from diverse domains (e.g., utility, information technology, healthcare, etc.), and forecasting stocks in each domain can be treated as one task; within a portfolio, each stock is characterized by temporal data collected from multiple modalities (e.g., finance, weather, and news), which corresponds to the data-level heterogeneity. Furthermore, the finance industry follows highly regulated processes, which require prediction models to be interpretable, and the output results to meet compliance. Therefore, a natural research question is how to build a model that can achieve satisfactory performance on such multi-modality multi-task learning problems, while being able to provide comprehensive explanations for the end users.

【Keywords】: Information systems; Information systems applications; Data mining; Mathematics of computing; Probability and statistics; Statistical paradigms; Time series analysis

201. Collective Multi-type Entity Alignment Between Knowledge Graphs.

Paper Link】 【Pages】:2241-2252

【Authors】: Qi Zhu ; Hao Wei ; Bunyamin Sisman ; Da Zheng ; Christos Faloutsos ; Xin Luna Dong ; Jiawei Han

【Abstract】: A knowledge graph (e.g. Freebase, YAGO) is a multi-relational graph representing rich factual information among entities of various types. Entity alignment is the key step towards knowledge graph integration from multiple sources. It aims to identify entities across different knowledge graphs that refer to the same real world entity. However, current entity alignment systems overlook the sparsity of different knowledge graphs and cannot align multi-type entities with a single model. In this paper, we present a Collective Graph neural network for Multi-type entity Alignment, called CG-MuAlign. Different from previous work, CG-MuAlign jointly aligns multiple types of entities, collectively leverages the neighborhood information and generalizes to unlabeled entity types. Specifically, we propose a novel collective aggregation function tailored for this task, that (1) relieves the incompleteness of knowledge graphs via both cross-graph and self attentions, (2) scales up efficiently with a mini-batch training paradigm and an effective neighborhood sampling strategy. We conduct experiments on real world knowledge graphs with millions of entities and observe performance superior to existing methods. In addition, the running time of our approach is much less than the current state-of-the-art deep learning methods.

【Keywords】:

202. Characterizing Search-Engine Traffic to Internet Research Agency Web Properties.

Paper Link】 【Pages】:2253-2263

【Authors】: Alexander Spangher ; Gireeja Ranade ; Besmira Nushi ; Adam Fourney ; Eric Horvitz

【Abstract】: The Russia-based Internet Research Agency (IRA) carried out a broad information campaign in the U.S. before and after the 2016 presidential election. The organization created an expansive set of internet properties: web domains, Facebook pages, and Twitter bots, which received traffic via purchased Facebook ads, tweets, and search engines indexing their domains. In this paper, we focus on IRA activities that received exposure through search engines, by joining data from Facebook and Twitter with logs from the Internet Explorer 11 and Edge browsers and the Bing.com search engine.

【Keywords】:

203. Understanding Electricity-Theft Behavior via Multi-Source Data.

Paper Link】 【Pages】:2264-2274

【Authors】: Wenjie Hu ; Yang Yang ; Jianbo Wang ; Xuanwen Huang ; Ziqiang Cheng

【Abstract】: Electricity theft, the behavior that involves users conducting illegal operations on electrical meters to avoid individual electricity bills, is a common phenomenon in developing countries. Considering its harmfulness to both power grids and the public, several mechanized methods have been developed to automatically recognize electricity-theft behaviors. However, these methods, which mainly assess users’ electricity usage records, can be insufficient due to the diversity of theft tactics and the irregularity of user behaviors.

【Keywords】:

204. Understanding the Performance Costs and Benefits of Privacy-focused Browser Extensions.

Paper Link】 【Pages】:2275-2286

【Authors】: Kevin Borgolte ; Nick Feamster

【Abstract】: Advertisements and behavioral tracking have become an invasive nuisance on the Internet in recent years. Indeed, privacy advocates and expert users consider the invasion significant enough to warrant the use of ad blockers and anti-tracking browser extensions. At the same time, one of the largest advertisement companies in the world, Google, is developing the most popular browser, Google Chrome. This conflict of interest, that is developing a browser (a user agent) and being financially motivated to track users’ online behavior, possibly violating their privacy expectations, while claiming to be a “user agent,” did not remain unnoticed. As a matter of fact, Google recently sparked an outrage when proposing changes to how Chrome extensions can inspect and modify requests to “improve extension performance and privacy,” which would render existing privacy-focused extensions inoperable.

【Keywords】:

205. Estimate the Implicit Likelihoods of GANs with Application to Anomaly Detection.

Paper Link】 【Pages】:2287-2297

【Authors】: Shaogang Ren ; Dingcheng Li ; Zhixin Zhou ; Ping Li

【Abstract】: The thriving of deep models and generative models provides approaches to model high dimensional distributions. Generative adversarial networks (GANs) can approximate data distributions and generate data samples from the learned data manifolds as well. In this paper, we propose an approach to estimate the implicit likelihoods of GAN models. A stable inverse function of the generator can be learned with the help of a variance network of the generator. The local variance of the sample distribution can be approximated by the normalized distance in the latent space. Simulation studies and likelihood testing on real-world data sets validate the proposed algorithm, which outperforms several baseline methods in these tasks. The proposed method has been further applied to anomaly detection. Experiments show that the method can achieve state-of-the-art anomaly detection performance on real-world data sets.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

206. RLPer: A Reinforcement Learning Model for Personalized Search.

Paper Link】 【Pages】:2298-2308

【Authors】: Jing Yao ; Zhicheng Dou ; Jun Xu ; Ji-Rong Wen

【Abstract】: Personalized search improves generic ranking models by taking user interests into consideration and returning more accurate search results to individual users. In recent years, machine learning and deep learning techniques have been successfully applied in personalized search. Most existing personalization models simply regard the search history as a static set of user behaviours and learn fixed ranking strategies based on the recorded data. Though improvements have been observed, these methods ignore the dynamic nature of the search process: search is a sequence of interactions between the search engine and the user. During the search process, the user's interests may change dynamically. It would be more helpful if a personalized search model could track the whole interaction process and update its ranking strategy continuously. In this paper, we propose a reinforcement learning based personalization model, referred to as RLPer, to track the sequential interactions between the users and the search engine with a hierarchical Markov Decision Process (MDP). In RLPer, the search engine interacts with the user to update the underlying ranking model continuously with real-time feedback. We also design a feedback-aware personalized ranking component to capture the user’s feedback, which affects the user interest profile for the next query. Experimental results on the publicly available AOL search log verify that our proposed model can significantly outperform state-of-the-art personalized search models.

【Keywords】: Information systems; Information retrieval; Information retrieval query processing

207. Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning.

Paper Link】 【Pages】:2309-2319

【Authors】: Wei Ye ; Rui Xie ; Jinglei Zhang ; Tianxiang Hu ; Xiaoyin Wang ; Shikun Zhang

【Abstract】: Code summarization generates a brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined these two tasks to improve their performance. However, researchers have not yet been able to effectively leverage the intrinsic connection between the two tasks, as they train these tasks in a separate or pipeline manner, which means their performance cannot be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and results show that our model can significantly improve the results of the code retrieval task over state-of-the-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

208. Hierarchically Structured Transformer Networks for Fine-Grained Spatial Event Forecasting.

Paper Link】 【Pages】:2320-2330

【Authors】: Xian Wu ; Chao Huang ; Chuxu Zhang ; Nitesh V. Chawla

【Abstract】: Spatial event forecasting is challenging and crucial for urban sensing scenarios, and it benefits a wide spectrum of spatial-temporal mining applications, ranging from traffic management and public safety to environmental policy making. Although significant progress has been made on the spatial-temporal prediction problem, most existing deep-learning-based methods assume a coarse-grained spatial setting, and their success largely relies on data sufficiency. In many real-world applications, predicting events at a fine-grained spatial resolution plays a critical role in providing high discernibility of spatial-temporal data distributions. However, in such cases, applying existing methods will result in weak performance, since they may not capture high-quality spatial-temporal representations when training triple instances are highly imbalanced across locations and time.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks; Information systems; Information systems applications; Data mining

209. MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding.

Paper Link】 【Pages】:2331-2341

【Authors】: Xinyu Fu ; Jiani Zhang ; Ziqiao Meng ; Irwin King

【Abstract】: A large number of real-world graphs or networks are inherently heterogeneous, involving a diversity of node types and relation types. Heterogeneous graph embedding is to embed rich structural and semantic information of a heterogeneous graph into low-dimensional node representations. Existing models usually define multiple metapaths in a heterogeneous graph to capture the composite relations and guide neighbor selection. However, these models either omit node content features, discard intermediate nodes along the metapath, or only consider one metapath. To address these three limitations, we propose a new model named Metapath Aggregated Graph Neural Network (MAGNN) to boost the final performance. Specifically, MAGNN employs three major components, i.e., the node content transformation to encapsulate input node attributes, the intra-metapath aggregation to incorporate intermediate semantic nodes, and the inter-metapath aggregation to combine messages from multiple metapaths. Extensive experiments on three real-world heterogeneous graph datasets for node classification, node clustering, and link prediction show that MAGNN achieves more accurate prediction results than state-of-the-art baselines.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

210. Mining Points-of-Interest for Explaining Urban Phenomena: A Scalable Variational Inference Approach.

Paper Link】 【Pages】:2342-2353

【Authors】: Christof Naumzik ; Patrick Zoechbauer ; Stefan Feuerriegel

【Abstract】: Points-of-interest (POIs; i.e., restaurants, bars, landmarks, and other entities) are common in web-mined data: they greatly explain the spatial distributions of urban phenomena. The conventional modeling approach relies upon feature engineering, yet it ignores the spatial structure among POIs. In order to overcome this shortcoming, the present paper proposes a novel spatial model for explaining spatial distributions based on web-mined POIs. Our key contributions are: (1) We present a rigorous yet highly interpretable formalization in order to model the influence of POIs on a given outcome variable. Specifically, we accommodate the spatial distributions of both the outcome and POIs. In our case, this is modeled by the sum of latent Gaussian processes. (2) In contrast to previous literature, our model infers the influence of POIs without feature engineering; instead, we model the influence of POIs via distance-weighted kernel functions with fully learnable parameterizations. (3) We propose a scalable learning algorithm based on sparse variational approximation. For this purpose, we derive a tailored evidence lower bound (ELBO) and, for appropriate likelihoods, we even show that an analytical expression can be obtained. This allows fast and accurate computation of the ELBO. Finally, the value of our approach for web mining is demonstrated in two real-world case studies. Our findings provide substantial improvements over state-of-the-art baselines with regard to both predictive and, in particular, explanatory performance. Altogether, this yields a novel spatial model for leveraging web-mined POIs. Within the context of location-based social networks, it promises an extensive range of new insights and use cases.

【Keywords】:

211. Large-Scale Talent Flow Embedding for Company Competitive Analysis.

Paper Link】 【Pages】:2354-2364

【Authors】: Le Zhang ; Tong Xu ; Hengshu Zhu ; Chuan Qin ; Qingxin Meng ; Hui Xiong ; Enhong Chen

【Abstract】: Recent years have witnessed growing interest in investigating the competition among companies. Existing studies for company competitive analysis generally rely on subjective survey data and inferential analysis. Instead, in this paper, we aim to develop a new paradigm for studying the competition among companies through the analysis of talent flows. The rationale behind this is that the competition among companies usually leads to talent movement. Along this line, we first build a Talent Flow Network based on the large-scale job transition records of talents, and formulate the concept of “competitiveness” for companies with consideration of their bi-directional talent flows in the network. Then, we propose a Talent Flow Embedding (TFE) model to learn the bi-directional talent attractions of each company, which can be leveraged for measuring the pairwise competitive relationships between companies. Specifically, we employ a random-walk based model in the original and transpose networks, respectively, to learn representations of companies by preserving their competitiveness. Furthermore, we design a multi-task strategy to refine the learning results from a fine-grained perspective, which can jointly embed multiple talent flow networks by assuming that the features of a company remain stable but play different roles in the networks of different job positions. Finally, extensive experiments on a large-scale real-world dataset clearly validate the effectiveness of our TFE model in terms of company competitive analysis and reveal some interesting rules of competition based on the derived insights on talent flows.

【Keywords】: Information systems; Information systems applications; Data mining
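
As a rough illustration of the "original and transpose networks" idea in the TFE abstract above, the following sketch builds a small directed talent-flow graph, reverses its edges, and samples a weighted random walk on each. The company names, transition counts, and helper names (`transpose`, `random_walk`) are hypothetical, and this is not the authors' implementation; it only shows why walking the transposed graph follows inflows rather than outflows.

```python
import random
from collections import defaultdict

def transpose(graph):
    """Reverse every edge of a weighted directed graph given as {src: {dst: weight}}."""
    reversed_graph = defaultdict(dict)
    for src, targets in graph.items():
        reversed_graph.setdefault(src, {})  # keep nodes that end up with no out-edges
        for dst, weight in targets.items():
            reversed_graph[dst][src] = weight
    return dict(reversed_graph)

def random_walk(graph, start, length, rng):
    """Sample a weighted random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1], {})
        if not neighbors:  # dead end: this node has no outgoing edges
            break
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

# Hypothetical job-transition counts: flows[a][b] = people who moved from a to b.
flows = {"A": {"B": 5, "C": 1}, "B": {"C": 2}, "C": {}}
inflows = transpose(flows)

rng = random.Random(0)  # seeded only to make the sketch repeatable
outgoing_walk = random_walk(flows, "A", 3, rng)    # follows where talent goes
incoming_walk = random_walk(inflows, "C", 3, rng)  # follows where talent comes from
```

Walk sequences sampled this way could then be fed to any skip-gram style embedding learner; that step is omitted here.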

212. Quantifying Engagement with Citations on Wikipedia.

Paper Link】 【Pages】:2365-2376

【Authors】: Tiziano Piccardi ; Miriam Redi ; Giovanni Colavizza ; Robert West

【Abstract】: Wikipedia is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia was not conceived as a source of original information, but as a gateway to secondary sources: according to Wikipedia’s guidelines, facts must be backed up by reliable sources that reflect the full spectrum of views on the topic. Although citations lie at the heart of Wikipedia, little is known about how users interact with them. To close this gap, we built client-side instrumentation for logging all interactions with links leading from English Wikipedia articles to cited references during one month, and conducted the first analysis of readers’ interactions with citations. We find that overall engagement with citations is low: about one in 300 page views results in a reference click (0.29% overall; 0.56% on desktop; 0.13% on mobile). Matched observational studies of the factors associated with reference clicking reveal that clicks occur more frequently on shorter pages and on pages of lower quality, suggesting that references are consulted more commonly when Wikipedia itself does not contain the information sought by the user. Moreover, we observe that recent content, open access sources, and references about life events (births, deaths, marriages, etc.) are particularly popular. Taken together, our findings deepen our understanding of Wikipedia’s role in a global information economy where reliability is ever less certain, and source attribution ever more vital.

【Keywords】:
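
The click-through percentages quoted in the abstract above can be converted into "one in N page views" form with a short calculation. This is only a worked check of the reported figures; the helper name `one_in_n` is an assumption made for illustration.

```python
def one_in_n(rate_percent):
    """Convert a percentage rate into 'one in N' form, rounded to the nearest integer."""
    return round(100.0 / rate_percent)

# Reference click rates reported in the abstract, in percent of page views.
rates = {"overall": 0.29, "desktop": 0.56, "mobile": 0.13}
one_in = {name: one_in_n(rate) for name, rate in rates.items()}
```

Under these figures the overall rate works out to roughly one click per 345 page views, which matches the rounded "about one in 300" wording, and desktop readers click references about four times as often as mobile readers.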

213. Condition Aware and Revise Transformer for Question Answering.

Paper Link】 【Pages】:2377-2387

【Authors】: Xinyan Zhao ; Feng Xiao ; Haoming Zhong ; Jun Yao ; Huanhuan Chen

【Abstract】: The study of question answering has received increasing attention in recent years. This work focuses on providing an answer that is compatible with both user intent and conditioning information corresponding to the question, such as delivery status and stock information in e-commerce. However, these conditions may be wrong or incomplete in real-world applications. Although existing question answering systems have considered external information, such as categorical attributes and triples in a knowledge base, they all assume that the external information is correct and complete. To alleviate the effect of defective condition values, this paper proposes the condition aware and revise Transformer (CAR-Transformer). CAR-Transformer (1) revises each condition value based on the whole conversation and the original condition values, and (2) encodes the revised conditions and utilizes the condition embeddings to select an answer. Experimental results on a real-world customer service dataset demonstrate that the CAR-Transformer can still select an appropriate reply when the conditions corresponding to the question contain wrong or missing values, and that it substantially outperforms baseline models on automatic and human evaluations. The proposed CAR-Transformer can be extended to other NLP tasks which need to consider conditioning information.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

214. In Opinion Holders' Shoes: Modeling Cumulative Influence for View Change in Online Argumentation.

Paper Link】 【Pages】:2388-2399

【Authors】: Zhen Guo ; Zhe Zhang ; Munindar P. Singh

【Abstract】: Understanding how people change their views during multiparty argumentative discussions is important in applications that involve human communication, e.g., in social media and education. Existing research focuses on lexical features of individual comments, dynamics of discussions, or the personalities of participants but deemphasizes the cumulative influence of the interplay of comments by different participants on a participant’s mindset. We address the task of predicting the points where a user’s view changes given an entire discussion, thereby tackling the confusion due to multiple plausible alternatives when considering the entirety of a discussion.

【Keywords】:

215. Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation.

Paper Link】 【Pages】:2400-2410

【Authors】: Chong Chen ; Min Zhang ; Weizhi Ma ; Yiqun Liu ; Shaoping Ma

【Abstract】: To provide more accurate recommendation, it is a trending topic to go beyond modeling user-item interactions and take context features into account. Factorization Machines (FM) with negative sampling is a popular solution for context-aware recommendation. However, it is not robust, as sampling may lose important information and usually leads to non-optimal performance in practice. Several recent efforts have enhanced FM with deep learning architectures for modelling high-order feature interactions. However, they either focus only on the rating prediction task or typically adopt the negative sampling strategy for optimizing the ranking performance. Due to the dramatic fluctuation of sampling, it is reasonable to argue that these sampling-based FM methods are still suboptimal for context-aware recommendation.

【Keywords】: Information systems; Information retrieval; Retrieval tasks and goals; Recommender systems

216. Domain-Guided Task Decomposition with Self-Training for Detecting Personal Events in Social Media.

Paper Link】 【Pages】:2411-2420

【Authors】: Payam Karisani ; Joyce C. Ho ; Eugene Agichtein

【Abstract】: Mining social media content for tasks such as detecting personal experiences or events suffers from lexical sparsity, insufficient training data, and inventive lexicons. To reduce the burden of creating extensive labeled data and improve classification performance, we propose to perform these tasks in two steps: 1. Decomposing the task into domain-specific sub-tasks by identifying key concepts, thus utilizing human domain understanding; and 2. Combining the results of learners for each key concept using co-training to reduce the requirements for labeled training data. We empirically show the effectiveness and generality of our approach, Co-Decomp, using three representative social media mining tasks, namely Personal Health Mention detection, Crisis Report detection, and Adverse Drug Reaction monitoring. The experiments show that our model is able to outperform the state-of-the-art text classification models, including those using the recently introduced BERT model, when small amounts of training data are available.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning

217. Leveraging Passage-level Cumulative Gain for Document Ranking.

Paper Link】 【Pages】:2421-2431

【Authors】: Zhijing Wu ; Jiaxin Mao ; Yiqun Liu ; Jingtao Zhan ; Yukun Zheng ; Min Zhang ; Shaoping Ma

【Abstract】: Document ranking is one of the most studied but challenging problems in information retrieval (IR) research. A number of existing document ranking models capture relevance signals at the whole document level. Recently, more and more research has begun to address this problem from fine-grained document modeling. Several works leveraged fine-grained passage-level relevance signals in ranking models. However, most of these works focus on context-independent passage-level relevance signals and ignore the context information, which may lead to inaccurate estimation of passage-level relevance. In this paper, we investigate how information gain accumulates with passages when users sequentially read a document. We propose the context-aware Passage-level Cumulative Gain (PCG), which aggregates relevance scores of passages and avoids the need to formally split a document into independent passages. Next, we incorporate the patterns of PCG into a BERT-based sequential model called Passage-level Cumulative Gain Model (PCGM) to predict the PCG sequence. Finally, we apply PCGM to the document ranking task. Experimental results on two public ad hoc retrieval benchmark datasets show that PCGM outperforms most existing ranking models and also indicate the effectiveness of PCG signals. We believe that this work contributes to improving ranking performance and providing more explainability for document ranking.

【Keywords】: Information systems; Information retrieval; Retrieval models and ranking

218. Towards Hybrid Human-AI Workflows for Unknown Unknown Detection.

Paper Link】 【Pages】:2432-2442

【Authors】: Anthony Z. Liu ; Santiago Guerra ; Isaac Fung ; Gabriel Matute ; Ece Kamar ; Walter S. Lasecki

【Abstract】: Predictive models are susceptible to errors called unknown unknowns, in which the model assigns incorrect labels to instances with high confidence. These commonly arise when training data does not represent variations of a class encountered at model deployment. Prior work showed that crowd workers can identify instances of unknown unknowns, but asking the crowd to identify a sufficient number of individual instances can be costly [2]. Instead, this paper presents an approach that leverages people’s ability to find patterns to retrain classifiers more effectively with fewer examples. We ask crowd workers to suggest and verify patterns in unknown unknowns. We then use these patterns to train an expansion classifier to identify additional examples from existing data that the primary classifier has encountered (and potentially misclassified) in the past. Our experiments show that our approach outperforms existing unknown unknown detection methods at improving classifier performance. This work is the first to leverage crowds to identify error patterns in large datasets to improve ML training.

【Keywords】: Computing methodologies; Machine learning

219. Examining Protest as An Intervention to Reduce Online Prejudice: A Case Study of Prejudice Against Immigrants.

Paper Link】 【Pages】:2443-2454

【Authors】: Kai Wei ; Yu-Ru Lin ; Muheng Yan

【Abstract】: There has been a growing concern about online users using social media to incite prejudice and hatred against other individuals or groups. While there has been research in developing automated techniques to identify online prejudice acts and hate speech, how to effectively counter online prejudice remains a societal challenge. Social protests, on the other hand, have been frequently used as an intervention for countering prejudice. However, research to date has not examined the relationship between protests and online prejudice. Using large-scale panel data collected from Twitter, we examine the changes in users’ tweeting behaviors relating to prejudice against immigrants following recent protests in the U.S. on immigration related topics. This is the first empirical study examining the effect of protests on reducing online prejudice. Our results show that there were both negative and positive changes in the measured prejudice after a protest, suggesting protest might have a mixed effect on reducing prejudice. We further identify users who are likely to change (or resist change) after a protest. This work contributes to the understanding of online prejudice and its intervention effect. The findings of this research have implications for designing targeted intervention.

【Keywords】:

Session: Future of the Web Track 1

220. Interpretable Complex Question Answering.

Paper Link】 【Pages】:2455-2457

【Authors】: Soumen Chakrabarti

【Abstract】: We will review cross-community co-evolution of question answering (QA) with the advent of large-scale knowledge graphs (KGs), continuous representations of text and graphs, and deep sequence analysis. Early QA systems were information retrieval (IR) systems enhanced to extract named entity spans from high-scoring passages. Starting with WordNet, a series of structured curations of language and world knowledge, called KGs, enabled further improvements. A corpus is unstructured and messy to exploit for QA. If a question can be answered using the KG alone, it is attractive to ‘interpret’ the free-form question into a structured query, which is then executed on the structured KG. This process is called KGQA. Answers can be high-quality and explainable if the KG has an answer, but manual curation results in low coverage. KGs were soon found useful to harness corpus information. Named entity mention spans could be tagged with fine-grained types (e.g., scientist), or even specific entities (e.g., Einstein). The QA system can learn to decompose a query into functional parts, e.g., “which scientist” and “played the violin”. With increasing success of such systems, ambition grew to address multi-hop or multi-clause queries, e.g., “the father of the director of La La Land teaches at which university?” or “who directed an award-winning movie and is the son of a Princeton University professor?” Questions limited to simple path traversals in KGs have been encoded to a vector representation, which a decoder then uses to guide the KG traversal. Recently, the corpus counterpart of such strategies has also been proposed. However, for general multi-clause queries that do not necessarily translate to paths, and seek to bind multiple variables to satisfy multiple clauses, or involve logic, comparison, aggregation and other arithmetic, neural programmer-interpreter systems have seen some success.
Our key focus will be on identifying situations where manual introduction of structural bias is essential for accuracy, as against cases where sufficient data can get around distant or no supervision.

【Keywords】:

Session: Short Paper 97

221. Practical Data Poisoning Attack against Next-Item Recommendation.

Paper Link】 【Pages】:2458-2464

【Authors】: Hengtong Zhang ; Yaliang Li ; Bolin Ding ; Jing Gao

【Abstract】: Online recommendation systems make use of a variety of information sources to provide users with items they are potentially interested in. However, due to the openness of the online platform, recommendation systems are vulnerable to data poisoning attacks. Existing attack approaches are either based on simple heuristic rules or designed against specific recommendation approaches. The former often suffer from unsatisfactory performance, while the latter require strong knowledge of the target system. In this paper, we focus on a general next-item recommendation setting and propose a practical poisoning attack approach named LOKI against blackbox recommendation systems. The proposed LOKI utilizes a reinforcement learning algorithm to train the attack agent, which can be used to generate user behavior samples for data poisoning. In real-world recommendation systems, the cost of retraining recommendation models is high, and the interaction frequency between users and a recommendation system is restricted. Given these real-world restrictions, we propose to let the agent interact with a recommender simulator instead of the target recommendation system and leverage the transferability of the generated adversarial samples to poison the target system. We also propose to use the influence function to efficiently estimate the influence of injected samples on the recommendation results, without re-training the models within the simulator. Extensive experiments on two datasets against four representative recommendation models show that the proposed LOKI achieves better attacking performance than existing methods.

【Keywords】: Computing methodologies; Machine learning

222. Efficient Online Multi-Task Learning via Adaptive Kernel Selection.

Paper Link】 【Pages】:2465-2471

【Authors】: Peng Yang ; Ping Li

【Abstract】: Conventional multi-task models restrict the task structure to be linearly related, which may not be suitable when data is linearly nonseparable. To remedy this issue, we propose a kernel algorithm for online multi-task classification, as the large approximation space provided by reproducing kernel Hilbert spaces often contains an accurate function. Specifically, it maintains a local-global Gaussian distribution over each task model that guides the direction and scale of parameter updates. Nonetheless, optimizing over this space is computationally expensive. Moreover, most multi-task learning methods require access to all training instances, a luxury that is unavailable in the large-scale streaming learning scenario. To overcome this issue, we propose a randomized kernel sampling technique across multiple tasks. Instead of requiring all inputs’ labels, the proposed algorithm determines whether to query a label by considering the confidence of the related tasks in the label prediction. Theoretically, the algorithm trained on actively sampled labels can achieve a result comparable to one learned on all labels. Empirically, the proposed algorithm is able to achieve promising learning efficacy, while reducing the computational complexity and labeling cost simultaneously.

【Keywords】:
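
The idea of querying a label only when the prediction is not confident can be illustrated with a classic margin-based selective sampling rule, where the query probability is `b / (b + |margin|)`. This is a swapped-in, single-task textbook rule and not the multi-task kernel algorithm from the paper above; the smoothing parameter `b` and both function names are assumptions for illustration.

```python
import random

def query_probability(margin, b=1.0):
    """Probability of requesting a label: 1 at zero margin, shrinking as |margin| grows."""
    return b / (b + abs(margin))

def should_query(margin, rng, b=1.0):
    """Flip a biased coin to decide whether to ask for the true label."""
    return rng.random() < query_probability(margin, b)

rng = random.Random(42)  # seeded only to make the sketch repeatable
margins = [0.0, 0.1, 2.0, 10.0]  # hypothetical prediction margins for four inputs
decisions = [should_query(m, rng) for m in margins]
```

A larger `b` makes the learner query more often, which trades labeling cost for accuracy.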

223. Few-Shot Learning for New User Recommendation in Location-based Social Networks.

Paper Link】 【Pages】:2472-2478

【Authors】: Ruirui Li ; Xian Wu ; Xiusi Chen ; Wei Wang

【Abstract】: The proliferation of GPS-enabled devices establishes the prosperity of location-based social networks, which results in a tremendous amount of user check-ins. These check-ins bring in preeminent opportunities to understand users’ preferences and facilitate matching between users and businesses. However, the user check-ins are extremely sparse due to the huge user and business bases, which makes matching a daunting task. In this work, we investigate the recommendation problem in the context of identifying potential new customers for businesses in LBSNs. In particular, we focus on investigating the geographical influence, composed of geographical convenience and geographical dependency. In addition, we leverage metric-learning-based few-shot learning to fully utilize the user check-ins and facilitate the matching between users and businesses. To evaluate our proposed method, we conduct a series of experiments to extensively compare with 13 baselines using two real-world datasets. The results demonstrate that the proposed method outperforms all these baselines by a significant margin.

【Keywords】:

224. Ad Hoc Table Retrieval using Intrinsic and Extrinsic Similarities.

Paper Link】 【Pages】:2479-2485

【Authors】: Roee Shraga ; Haggai Roitman ; Guy Feigenblat ; Mustafa Canim

【Abstract】: Given a keyword query, the ad hoc table retrieval task aims at retrieving a ranked list of the top-k most relevant tables in a given table corpus. Previous works have primarily focused on designing table-centric lexical and semantic features, which could be utilized for learning-to-rank (LTR) tables. In this work, we make a novel use of intrinsic (passage-based) and extrinsic (manifold-based) table similarities for enhanced retrieval. Using the WikiTables benchmark, we study the merits of utilizing such similarities for this task. To this end, we combine both similarity types via a simple yet effective cascade re-ranking approach. Overall, our proposed approach results in significantly better table retrieval quality, which even transcends that of strong semantically-rich baselines.

【Keywords】: Information systems; Information retrieval
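
A generic two-stage cascade re-ranker, of the kind the abstract above refers to, can be sketched as follows. The abstract does not specify the stages, so the scorers, the table identifiers, and the cut-off `k` here are hypothetical; the sketch only shows the cascade pattern of ranking everything with one scorer and reordering the top-k with another.

```python
def cascade_rerank(candidates, first_stage_score, second_stage_score, k):
    """Rank with a first scorer, then reorder only the top-k with a second scorer."""
    ranked = sorted(candidates, key=first_stage_score, reverse=True)
    head, tail = ranked[:k], ranked[k:]
    head = sorted(head, key=second_stage_score, reverse=True)
    return head + tail

# Hypothetical scores for four candidate tables.
base = {"t1": 0.9, "t2": 0.8, "t3": 0.7, "t4": 0.1}
similarity = {"t1": 0.2, "t2": 0.6, "t3": 0.9, "t4": 1.0}

result = cascade_rerank(list(base), base.get, similarity.get, k=3)
```

Note that `t4` has the highest second-stage score but stays last, because the second stage only sees candidates that survived the first-stage cut-off.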

225. Leveraging Context for Neural Question Generation in Open-domain Dialogue Systems.

Paper Link】 【Pages】:2486-2492

【Authors】: Yanxiang Ling ; Fei Cai ; Honghui Chen ; Maarten de Rijke

【Abstract】: Question generation in open-domain dialogue systems is a challenging but less-explored task. It aims to enhance the interactivity and persistence of human-machine interactions. Previous work mainly focuses on question generation in the setting of single-turn dialogues, or investigates it as a data augmentation method for machine comprehension. We propose a Context-augmented Neural Question Generation (CNQG) model that leverages the conversational context to generate questions for promoting interactivity and persistence of multi-turn dialogues. More specifically, we formulate the task of question generation as a two-stage process. First, we employ an encoder-decoder framework to predict a question pattern, which denotes a set of representative interrogatives, and identify the potential topics from the conversational context by employing point-wise mutual information. Then, we generate the question by decoding the concatenation of the current dialogue utterance, the pattern, and the topics with an attention mechanism. To the best of our knowledge, ours is the first work on question generation in multi-turn open-domain dialogue systems. Our experimental results on two publicly available multi-turn conversation datasets show that CNQG outperforms the state-of-the-art baselines in terms of BLEU-1, BLEU-2, Distinct-1 and Distinct-2. In addition, we find that CNQG allows one to efficiently distill useful features from long contexts, and maintain robust effectiveness even for short contexts.

【Keywords】:
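
The abstract above names point-wise mutual information (PMI) as the tool for picking topics out of the conversational context, without giving the estimation details. As a minimal sketch only, the following assumes PMI is estimated from how often a context word and a question word co-occur in the same (context, question) training pair; the function name `pmi_table` and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product


def pmi_table(pairs):
    """Estimate PMI(c, q) = log( p(c, q) / (p(c) * p(q)) ).

    pairs: list of (context_tokens, question_tokens) tuples.
    Probabilities are document-level: a word counts once per pair.
    This estimation scheme is an assumption made for illustration.
    """
    n = len(pairs)
    ctx_count, qst_count, joint_count = Counter(), Counter(), Counter()
    for ctx, qst in pairs:
        ctx_set, qst_set = set(ctx), set(qst)
        ctx_count.update(ctx_set)
        qst_count.update(qst_set)
        joint_count.update(product(ctx_set, qst_set))

    table = {}
    for (c, q), joint in joint_count.items():
        p_joint = joint / n
        p_c = ctx_count[c] / n
        p_q = qst_count[q] / n
        table[(c, q)] = math.log(p_joint / (p_c * p_q))
    return table


# Toy data: four (context, question) pairs of single tokens.
toy_pairs = [
    (["cat"], ["pet"]),
    (["cat"], ["pet"]),
    (["car"], ["road"]),
    (["car"], ["pet"]),
]
toy_pmi = pmi_table(toy_pairs)
```

In this toy data, ("car", "road") gets a positive score because the two words co-occur more often than chance would predict, while ("car", "pet") gets a negative one. A topic selector could then keep the highest-scoring question-side words for a given context.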

226. Higher-Order Label Homogeneity and Spreading in Graphs.

Paper Link】 【Pages】:2493-2499

【Authors】: Dhivya Eswaran ; Srijan Kumar ; Christos Faloutsos

【Abstract】: Do higher-order network structures aid graph semi-supervised learning? Given a graph and a few labeled vertices, labeling the remaining vertices is a high-impact problem with applications in several tasks, such as recommender systems, fraud detection and protein identification. However, traditional methods rely on edges for spreading labels, which is limiting because not all edges are equal. Vertices with stronger connections participate in higher-order structures in graphs, which calls for methods that can leverage these structures in the semi-supervised learning tasks.

【Keywords】: Computing methodologies; Machine learning
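
The abstract above argues that label spreading should weight edges by the higher-order structures they take part in, but it does not say which structures or how they are combined. The sketch below is one possible reading, under loudly stated assumptions: triangles are the only higher-order structure, an edge's weight mixes a constant 1 with its triangle count through a hypothetical parameter `alpha`, and seed labels stay clamped. It is not the authors' algorithm.

```python
def triangle_weights(adj):
    """adj: dict node -> set of neighbours (undirected graph).

    Returns dict (u, v) -> number of triangles containing edge u-v,
    i.e. the number of common neighbours of u and v.
    """
    return {(u, v): len(adj[u] & adj[v]) for u in adj for v in adj[u]}


def spread_labels(adj, seeds, num_classes, alpha=0.5, iters=50):
    """Iterative label spreading with triangle-aware edge weights.

    seeds: dict node -> class index for the labeled vertices.
    alpha = 1.0 gives plain edge-based spreading; smaller values give
    more weight to edges that sit inside triangles (an assumption).
    """
    tri = triangle_weights(adj)
    weight = {e: alpha * 1.0 + (1.0 - alpha) * t for e, t in tri.items()}

    # Seeds start one-hot; every other vertex starts uniform.
    belief = {u: [1.0 / num_classes] * num_classes for u in adj}
    for u, c in seeds.items():
        belief[u] = [1.0 if k == c else 0.0 for k in range(num_classes)]

    for _ in range(iters):
        new_belief = {}
        for u in adj:
            total = sum(weight[(u, v)] for v in adj[u])
            if u in seeds or total == 0:
                new_belief[u] = belief[u]  # clamp seeds, skip isolated nodes
                continue
            new_belief[u] = [
                sum(weight[(u, v)] * belief[v][k] for v in adj[u]) / total
                for k in range(num_classes)
            ]
        belief = new_belief

    return {u: max(range(num_classes), key=lambda k: belief[u][k]) for u in adj}


# Vertex "x" sits in a triangle with "a" (class 0) and "b" (unlabeled),
# and has two pendant neighbours "c" and "d" (both class 1).
toy_adj = {
    "x": {"a", "b", "c", "d"},
    "a": {"x", "b"},
    "b": {"x", "a"},
    "c": {"x"},
    "d": {"x"},
}
toy_seeds = {"a": 0, "c": 1, "d": 1}
edge_only = spread_labels(toy_adj, toy_seeds, 2, alpha=1.0)
with_triangles = spread_labels(toy_adj, toy_seeds, 2, alpha=0.5)
```

With edges alone, "x" follows the majority of its labeled neighbours and lands in class 1. Once triangle participation is counted, the tightly connected neighbour "a" dominates and "x" lands in class 0.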

227. Enhanced-RCNN: An Efficient Method for Learning Sentence Similarity.

Paper Link】 【Pages】:2500-2506

【Authors】: Shuang Peng ; Hengbin Cui ; Niantao Xie ; Sujian Li ; Jiaxing Zhang ; Xiaolong Li

【Abstract】: Learning sentence similarity is a fundamental research topic and has been explored using various deep learning methods recently. In this paper, we further propose an enhanced recurrent convolutional neural network (Enhanced-RCNN) model for learning sentence similarity. Compared to the state-of-the-art BERT model, the architecture of our proposed model is far less complex. Experimental results show that our similarity learning method outperforms the baselines and achieves competitive performance on two real-world paraphrase identification datasets.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning; Machine learning approaches; Neural networks

228. MetaSelector: Meta-Learning for Recommendation with User-Level Adaptive Model Selection.

Paper Link】 【Pages】:2507-2513

【Authors】: Mi Luo ; Fei Chen ; Pengxiang Cheng ; Zhenhua Dong ; Xiuqiang He ; Jiashi Feng ; Zhenguo Li

【Abstract】: Recommender systems often face heterogeneous datasets containing highly personalized historical data of users, where no single model could give the best recommendation for every user. We observe this ubiquitous phenomenon on both public and private datasets and address the model selection problem in pursuit of optimizing the quality of recommendation for each user. We propose a meta-learning framework to facilitate user-level adaptive model selection in recommender systems. In this framework, a collection of recommenders is trained with data from all users, on top of which a model selector is trained via meta-learning to select the best single model for each user with the user-specific historical data. We conduct extensive experiments on two public datasets and a real-world production dataset, demonstrating that our proposed framework achieves improvements over single model baselines and sample-level model selector in terms of AUC and LogLoss. In particular, the improvements may lead to huge profit gain when deployed in online recommender systems.

【Keywords】:

229. TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis.

Paper Link】 【Pages】:2514-2520

【Authors】: Zilong Wang ; Zhaohong Wan ; Xiaojun Wan

【Abstract】: Multimodal sentiment analysis is an important research area that predicts a speaker’s sentiment tendency through features extracted from textual, visual and acoustic modalities. The central challenge is the fusion method of the multimodal information. A variety of fusion methods have been proposed, but few of them adopt end-to-end translation models to mine the subtle correlation between modalities. Inspired by the recent success of Transformer in the area of machine translation, we propose a new fusion method, TransModality, to address the task of multimodal sentiment analysis. We assume that translation between modalities contributes to a better joint representation of a speaker’s utterance. With Transformer, the learned features embody the information both from the source modality and the target modality. We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, IEMOCAP. The experiments show that our proposed method achieves state-of-the-art performance.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing

230. Recommending Themes for Ad Creative Design via Visual-Linguistic Representations.

Paper Link】 【Pages】:2521-2527

【Authors】: Yichao Zhou ; Shaunak Mishra ; Manisha Verma ; Narayan Bhamidipati ; Wei Wang

【Abstract】: There is a perennial need in the online advertising industry to refresh ad creatives, i.e., images and text used for enticing online users towards a brand. Such refreshes are required to reduce the likelihood of ad fatigue among online users, and to incorporate insights from other successful campaigns in related product categories. Given a brand, to come up with themes for a new ad is a painstaking and time consuming process for creative strategists. Strategists typically draw inspiration from the images and text used for past ad campaigns, as well as world knowledge on the brands. To automatically infer ad themes via such multimodal sources of information in past ad campaigns, we propose a theme (keyphrase) recommender system for ad creative strategists. The theme recommender is based on aggregating results from a visual question answering (VQA) task, which ingests the following: (i) ad images, (ii) text associated with the ads as well as Wikipedia pages on the brands in the ads, and (iii) questions around the ad. We leverage transformer based cross-modality encoders to train visual-linguistic representations for our VQA task. We study two formulations for the VQA task along the lines of classification and ranking; via experiments on a public dataset, we show that cross-modal representations lead to significantly better classification accuracy and ranking precision-recall metrics. Cross-modal representations show better performance compared to separate image and text representations. In addition, the use of multimodal information shows a significant lift over using only textual or visual information.

【Keywords】:

231. Attentive Sequential Models of Latent Intent for Next Item Recommendation.

Paper Link】 【Pages】:2528-2534

【Authors】: Md. Mehrab Tanjim ; Congzhe Su ; Ethan Benjamin ; Diane Hu ; Liangjie Hong ; Julian J. McAuley

【Abstract】: Users exhibit different intents across e-commerce services (e.g. discovering items, purchasing gifts, etc.) which drive them to interact with a wide variety of items in multiple ways (e.g. click, add-to-cart, add-to-favorites, purchase). To give better recommendations, it is important to capture user intent, in addition to considering their historic interactions. However, these intents are by definition latent, as we observe only a user’s interactions, and not their underlying intent. To discover such latent intents, and use them effectively for recommendation, in this paper we propose an Attentive Sequential model of Latent Intent (ASLI in short). Our model first learns item similarities from users’ interaction histories via a self-attention layer, then uses a Temporal Convolutional Network layer to obtain a latent representation of the user’s intent from her actions on a particular category. We use this representation to guide an attentive model to predict the next item. Results from our experiments show that our model can capture the dynamics of user behavior and preferences, leading to state-of-the-art performance across datasets from two major e-commerce platforms, namely Etsy and Alibaba.

【Keywords】:

232. Latent Linear Critiquing for Conversational Recommender Systems.

Paper Link】 【Pages】:2535-2541

【Authors】: Kai Luo ; Scott Sanner ; Ga Wu ; Hanze Li ; Hojin Yang

【Abstract】: Critiquing is a method for conversational recommendation that iteratively adapts recommendations in response to user preference feedback. In this setting, a user is iteratively provided with an item recommendation and attribute description for that item; a user may either accept the recommendation, or critique the attributes in the item description to generate a new recommendation. Historical critiquing methods were largely based on explicit constraint- and utility-based methods for modifying recommendations w.r.t. critiqued item attributes. In this paper, we revisit the critiquing approach in the era of recommendation methods based on latent embeddings with subjective item descriptions (i.e., keyphrases from user reviews). Two critical research problems arise: (1) how to co-embed keyphrase critiques with user preference embeddings to update recommendations, and (2) how to modulate the strength of multi-step critiquing feedback, where critiques are not necessarily independent, nor of equal importance. To address (1), we build on an existing state-of-the-art linear embedding recommendation algorithm to align review-based keyphrase attributes with user preference embeddings. To address (2), we exploit the linear structure of the embeddings and recommendation prediction to formulate a linear program (LP) based optimization problem to determine optimal weights for incorporating critique feedback. We evaluate the proposed framework on two recommendation datasets containing user reviews with simulated users. Empirical results compared to a standard approach of averaging critique feedback show that our approach reduces the number of interactions required to find a satisfactory item and increases the overall success rate.

【Keywords】:

233. A Multimodal Variational Encoder-Decoder Framework for Micro-video Popularity Prediction.

Paper Link】 【Pages】:2542-2548

【Authors】: Jiayi Xie ; Yaochen Zhu ; Zhibin Zhang ; Jian Peng ; Jing Yi ; Yaosi Hu ; Hongyi Liu ; Zhenzhong Chen

【Abstract】: Predicting the popularity of a micro-video is a challenging task, due to a number of factors impacting the distribution such as the diversity of the video content and user interests, complex online interactions, etc. In this paper, we propose a multimodal variational encoder-decoder (MMVED) framework that considers the uncertain factors as the randomness for the mapping from the multimodal features to the popularity. Specifically, the MMVED first encodes features from multiple modalities in the observation space into latent representations and learns their probability distributions based on variational inference, where only relevant features in the input modalities can be extracted into the latent representations. Then, the modality-specific hidden representations are fused through Bayesian reasoning such that the complementary information from all modalities is well utilized. Finally, a temporal decoder implemented as a recurrent neural network is designed to predict the popularity sequence of a certain micro-video. Experiments conducted on a real-world dataset demonstrate the effectiveness of our proposed model in the micro-video popularity prediction task.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

234. Graph-Query Suggestions for Knowledge Graph Exploration.

Paper Link】 【Pages】:2549-2555

【Authors】: Matteo Lissandrini ; Davide Mottin ; Themis Palpanas ; Yannis Velegrakis

【Abstract】: We consider the task of exploratory search through graph queries on knowledge graphs. We propose to assist the user by expanding the query with intuitive suggestions to provide a more informative (full) query that can retrieve more detailed and relevant answers. To achieve this result, we propose a model that can bridge graph search paradigms with well-established techniques for information-retrieval. Our approach does not require any additional knowledge from the user and builds on principled language modelling approaches. We empirically show the effectiveness and efficiency of our approach on a large knowledge graph and how our suggestions are able to help build more complete and informative queries.

【Keywords】: Information systems; Information retrieval; Information retrieval query processing

235. Visual Concept Naming: Discovering Well-Recognized Textual Expressions of Visual Concepts.

Paper Link】 【Pages】:2556-2562

【Authors】: Masayasu Muraoka ; Tetsuya Nasukawa ; Rudy Raymond ; Bishwaranjan Bhattacharjee

【Abstract】: We propose a task called Visual Concept Naming to associate visual concepts with the corresponding textual expressions, i.e., names of visual concepts found in real-world multimodal data. To tackle the task, we create a dataset consisting of 3.4 million tweets in total in three languages. We also propose a method for extracting candidate names of visual concepts and validating them by exploiting Web-based knowledge obtained through image search. To demonstrate the capability of our method, we conduct an experiment with the dataset we create and evaluate names obtained by our method through crowdsourcing, where we establish an evaluation method to verify the names. The experimental results indicate that the proposed method can identify a wide variety of names of visual concepts. The names we obtained also show interesting insights regarding languages and countries where the languages are used.

【Keywords】:

236. Hierarchical Visual-aware Minimax Ranking Based on Co-purchase Data for Personalized Recommendation.

Paper Link】 【Pages】:2563-2569

【Authors】: Xiaoya Chong ; Qing Li ; Howard Leung ; Qianhui Men ; Xianjin Chao

【Abstract】: Personalized recommendation aims at ranking a set of items according to the learnt preferences of the user. Existing methods optimize the ranking function by considering an item that the user has not bought yet as a negative item and assuming that the user prefers the positive item that he has bought to the negative item. The strategy is to exclude irrelevant items from the dataset to narrow down the set of potential positive items to improve ranking accuracy. It conflicts with the goal of recommendation from the seller’s point of view, which aims to enlarge that set for each user. In this paper, we diminish this limitation by proposing a novel learning method called Hierarchical Visual-aware Minimax Ranking (H-VMMR), in which a new concept of predictive sampling is proposed to sample items in a close relationship with the positive items (e.g., substitutes, complements). We set up the problem by maximizing the preference discrepancy between positive and negative items, as well as minimizing the gap between positive and predictive items based on visual features. We also build a hierarchical learning model based on co-purchase data to solve the data sparsity problem. Our method is able to enlarge the set of potential positive items as well as true negative items during ranking. The experimental results show that our H-VMMR outperforms the state-of-the-art learning methods.

【Keywords】: Information systems; Information retrieval; Retrieval tasks and goals; Recommender systems

237. A Cue Adaptive Decoder for Controllable Neural Response Generation.

Paper Link】 【Pages】:2570-2576

【Authors】: Weichao Wang ; Shi Feng ; Wei Gao ; Daling Wang ; Yifei Zhang

【Abstract】: In open-domain dialogue systems, dialogue cues such as emotion, persona, and emoji can be incorporated into conversation models for strengthening the semantic relevance of generated responses. Existing neural response generation models either incorporate dialogue cue into decoder’s initial state or embed the cue indiscriminately into the state of every generated word, which may cause the gradients of the embedded cue to vanish or disturb the semantic relevance of generated words during back propagation. In this paper, we propose a Cue Adaptive Decoder (CueAD) that aims to dynamically determine the involvement of a cue at each generation step in the decoding. For this purpose, we extend the Gated Recurrent Unit (GRU) network with an adaptive cue representation for facilitating cue incorporation, in which an adaptive gating unit is utilized to decide when to incorporate cue information so that the cue can provide useful clues for enhancing the semantic relevance of the generated words. Experimental results show that CueAD outperforms state-of-the-art baselines with large margins.

【Keywords】:

238. Unsupervised Dual-Cascade Learning with Pseudo-Feedback Distillation for Query-Focused Extractive Summarization.

Paper Link】 【Pages】:2577-2584

【Authors】: Haggai Roitman ; Guy Feigenblat ; Doron Cohen ; Odellia Boni ; David Konopnicki

【Abstract】: We propose Dual-CES – a novel unsupervised, query-focused, multi-document extractive summarizer. Dual-CES builds on top of the Cross Entropy Summarizer (CES) and is designed to better handle the tradeoff between saliency and focus in summarization. To this end, Dual-CES employs a two-step dual-cascade optimization approach with saliency-based pseudo-feedback distillation. Overall, Dual-CES significantly outperforms all other state-of-the-art unsupervised alternatives. Dual-CES is even shown to be able to outperform strong supervised summarizers.

【Keywords】:

239. Extracting Knowledge from Web Text with Monte Carlo Tree Search.

Paper Link】 【Pages】:2585-2591

【Authors】: Guiliang Liu ; Xu Li ; Jiakang Wang ; Mingming Sun ; Ping Li

【Abstract】: Extracting knowledge from general web text requires building a domain-independent extractor that scales to the entire web corpus. This task is known as Open Information Extraction (OIE). This paper proposes to apply Monte-Carlo Tree Search (MCTS) to accomplish OIE. To achieve this goal, we define a Markov Decision Process for OIE and build a simulator to learn the reward signals, which provides a complete reinforcement learning framework for MCTS. Using this framework, MCTS explores candidate words (and symbols) under the guidance of a pre-trained Sequence-to-Sequence (Seq2Seq) predictor and generates abundant exploration samples during training. We apply the exploration samples to update the reward simulator and the predictor, based on which we implement another MCTS to search for the optimal predictions during inference. Empirical evaluation demonstrates that the MCTS inference substantially improves the accuracy of prediction (more than 10%) and achieves a leading performance over other state-of-the-art comparison models.

【Keywords】: Computing methodologies; Machine learning

240. IART: Intent-aware Response Ranking with Transformers in Information-seeking Conversation Systems.

Paper Link】 【Pages】:2592-2598

【Authors】: Liu Yang ; Minghui Qiu ; Chen Qu ; Cen Chen ; Jiafeng Guo ; Yongfeng Zhang ; W. Bruce Croft ; Haiqing Chen

【Abstract】: Personal assistant systems, such as Apple Siri, Google Assistant, Amazon Alexa, and Microsoft Cortana, are becoming ever more widely used. Understanding user intent such as clarification questions, potential answers and user feedback in information-seeking conversations is critical for retrieving good responses. In this paper, we analyze user intent patterns in information-seeking conversations and propose an intent-aware neural response ranking model “IART”, which refers to “Intent-Aware Ranking with Transformers”. IART is built on top of the integration of user intent modeling and language representation learning with the Transformer architecture, which relies entirely on a self-attention mechanism instead of recurrent nets [35]. It incorporates intent-aware utterance attention to derive an importance weighting scheme of utterances in conversation context with the aim of better conversation history understanding. We conduct extensive experiments with three information-seeking conversation data sets including both standard benchmarks and commercial data. Our proposed model outperforms all baseline methods with respect to a variety of metrics. We also perform case studies and analysis of learned user intent and its impact on response ranking in information-seeking conversations to provide interpretation of results.

【Keywords】:

241. ResQueue: A Smarter Datacenter Flow Scheduler.

Paper Link】 【Pages】:2599-2605

【Authors】: Hamed Rezaei ; Balajee Vamanan

【Abstract】: Datacenters host a mix of applications: foreground applications perform distributed lookups in order to service user queries and background applications perform batch processing tasks such as data reorganization, backup, and replication. While background flows produce the most load, foreground applications produce the largest number of flows. Because packets from both types of applications compete at switches for network bandwidth, the performance of applications is sensitive to scheduling mechanisms. Existing schedulers use flow size to distinguish critical flows from non-critical flows. However, recent studies on datacenter workloads reveal that most flows are small (e.g., most flows consist of only a handful of packets). In light of recent findings, we make the key observation that because most flows are small, flow size is not sufficient to distinguish critical flows from non-critical flows and therefore existing flow schedulers do not achieve the desired prioritization. In this paper, we introduce ResQueue, which uses a combination of flow size and packet history to calculate the priority of each flow. Our evaluation shows that ResQueue improves tail flow completion times of short flows by up to 60% over the state-of-the-art flow scheduling mechanisms.

【Keywords】:

242. PARS: Peers-aware Recommender System.

Paper Link】 【Pages】:2606-2612

【Authors】: Huiqiang Mao ; Yanzhi Li ; Chenliang Li ; Di Chen ; Xiaoqing Wang ; Yuming Deng

【Abstract】: The presence or absence of one item in a recommendation list will affect the demand for other items because customers are often willing to switch to other items if their most preferred items are not available. The cross-item influence, called “peers effect”, has been largely ignored in the literature. In this paper, we develop a peers-aware recommender system, named PARS. We apply a ranking-based choice model to capture the cross-item influence and solve the resultant MaxMin problem with a decomposition algorithm. The MaxMin model solves for the recommendation decision while simultaneously estimating users’ preferences towards the items, which yields high-quality recommendations robust to input data variation. Experimental results illustrate that PARS outperforms a few frequently used methods in practice. An online evaluation with a flash sales scenario at Taobao also shows that PARS delivers significant improvements in terms of both conversion rates and user value.

【Keywords】:
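
The abstract above relies on a ranking-based choice model to capture how demand shifts when an item is absent from the list. As a rough illustration of that general model family (not of PARS itself, whose MaxMin formulation is not described here), the sketch below assumes a population of customer types, each with a probability and a strict preference ranking, where a customer takes the first ranked item that is offered and otherwise leaves. The function name `choice_shares` and the toy population are hypothetical.

```python
def choice_shares(customer_types, offered):
    """Demand share of each offered item under a ranking-based choice model.

    customer_types: list of (probability, ranking) pairs, where ranking
        lists items from most to least preferred.
    offered: set of items shown in the recommendation list.
    The key None collects customers who find nothing acceptable.
    """
    shares = {item: 0.0 for item in offered}
    shares[None] = 0.0
    for prob, ranking in customer_types:
        # A customer picks the highest-ranked item that is actually offered.
        pick = next((item for item in ranking if item in offered), None)
        shares[pick] += prob
    return shares


# Hypothetical population: three customer types.
toy_types = [
    (0.5, ["A", "B"]),  # prefers A, would settle for B
    (0.3, ["B", "C"]),
    (0.2, ["C"]),
]
full_list = choice_shares(toy_types, {"A", "B", "C"})
without_a = choice_shares(toy_types, {"B", "C"})
```

Dropping item A from the list moves its entire share to B in this toy population, which is the kind of cross-item influence the abstract calls the peers effect.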

243. Fast Computation of Explanations for Inconsistency in Large-Scale Knowledge Graphs.

Paper Link】 【Pages】:2613-2619

【Authors】: Trung-Kien Tran ; Mohamed H. Gad-Elrab ; Daria Stepanova ; Evgeny Kharlamov ; Jannik Strötgen

【Abstract】: Knowledge graphs (KGs) are essential resources for many applications including Web search and question answering. As KGs are often automatically constructed, they may contain incorrect facts. Detecting them is a crucial, yet extremely expensive task. Prominent solutions detect and explain inconsistency in KGs with respect to accompanying ontologies that describe the KG domain of interest. Compared to machine learning methods they are more reliable and human-interpretable but scale poorly on large KGs. In this paper, we present a novel approach to dramatically speed up the process of detecting and explaining inconsistency in large KGs by exploiting KG abstractions that capture prominent data patterns. Though much smaller, KG abstractions preserve inconsistencies and their explanations. Our experiments with large KGs (e.g., DBpedia and Yago) demonstrate the feasibility of our approach and show that it significantly outperforms the popular baseline.

【Keywords】: Computing methodologies; Artificial intelligence; Knowledge representation and reasoning

244. Review-guided Helpful Answer Identification in E-commerce.

Paper Link】 【Pages】:2620-2626

【Authors】: Wenxuan Zhang ; Wai Lam ; Yang Deng ; Jing Ma

【Abstract】: Product-specific community question answering platforms can greatly help address the concerns of potential customers. However, the user-provided answers on such platforms often vary widely in quality. Helpfulness votes from the community can indicate the overall quality of the answer, but they are often missing. Accurately predicting the helpfulness of an answer to a given question and thus identifying helpful answers is becoming a pressing need. Since the helpfulness of an answer depends on multiple perspectives instead of only topical relevance investigated in typical QA tasks, common answer selection algorithms are insufficient for tackling this task. In this paper, we propose the Review-guided Answer Helpfulness Prediction (RAHP) model that not only considers the interactions between QA pairs but also investigates the opinion coherence between the answer and crowds’ opinions reflected in the reviews, which is another important factor to identify helpful answers. Moreover, we tackle the task of determining opinion coherence as a language inference problem and explore the utilization of pre-training strategy to transfer the textual inference knowledge obtained from a specifically designed trained network. Extensive experiments conducted on real-world data across seven product categories show that our proposed model achieves superior performance on the prediction task.

【Keywords】:

245. How Much and When Do We Need Higher-order Information in Hypergraphs? A Case Study on Hyperedge Prediction.

Paper Link】 【Pages】:2627-2633

【Authors】: Se-eun Yoon ; HyungSeok Song ; Kijung Shin ; Yung Yi

【Abstract】: Hypergraphs provide a natural way of representing group relations, whose complexity motivates an extensive array of prior work to adopt some form of abstraction and simplification of higher-order interactions. However, the following question has yet to be addressed: How much abstraction of group interactions is sufficient in solving a hypergraph task, and how different such results become across datasets? This question, if properly answered, provides a useful engineering guideline on how to trade off between complexity and accuracy of solving a downstream task. To this end, we propose a method of incrementally representing group interactions using a notion of n-projected graph whose accumulation contains information on up to n-way interactions, and quantify the accuracy of solving a task as n grows for various datasets. As a downstream task, we consider hyperedge prediction, an extension of link prediction, which is a canonical task for evaluating graph models. Through experiments on 15 real-world datasets, we draw the following messages: (a) Diminishing returns: small n is enough to achieve accuracy comparable with near-perfect approximations, (b) Troubleshooter: as the task becomes more challenging, larger n brings more benefit, and (c) Irreducibility: datasets whose pairwise interactions do not tell much about higher-order interactions lose much accuracy when reduced to pairwise abstractions.

【Keywords】: Information systems; Information systems applications; Data mining
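
The abstract above contrasts pairwise abstractions of a hypergraph with richer n-way representations, but it does not define the n-projected graph in detail. The sketch below therefore covers only the familiar pairwise projection, which is assumed here to be the simplest case: two nodes are linked with a weight equal to the number of hyperedges containing both. The function name `pairwise_projection` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_projection(hyperedges):
    """Project a hypergraph onto a weighted pairwise graph.

    hyperedges: iterable of node collections.
    Returns Counter mapping (u, v) with u < v to the number of
    hyperedges that contain both u and v.
    """
    weights = Counter()
    for edge in hyperedges:
        for u, v in combinations(sorted(set(edge)), 2):
            weights[(u, v)] += 1
    return weights


# One 3-way interaction versus three separate 2-way interactions.
one_group = pairwise_projection([("a", "b", "c")])
three_pairs = pairwise_projection([("a", "b"), ("b", "c"), ("a", "c")])
```

Both toy hypergraphs project onto the same weighted triangle, so a model that sees only pairwise weights cannot tell a single three-way group from three independent pairs. This is the kind of information loss that motivates asking how much higher-order information a task needs.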

246. Multi-Context Attention for Entity Matching.

Paper Link】 【Pages】:2634-2640

【Authors】: Dongxiang Zhang ; Yuyang Nie ; Sai Wu ; Yanyan Shen ; Kian-Lee Tan

【Abstract】: Entity matching (EM) is a classic research problem that identifies data instances referring to the same real-world entity. A recent technical trend in this area is to take advantage of deep learning (DL) to automatically extract discriminative features. DeepER and DeepMatcher have emerged as two pioneering DL models for EM. However, these two state-of-the-art solutions simply incorporate vanilla RNNs and straightforward attention mechanisms. In this paper, we fully exploit the semantic context of embedding vectors for the pair of entity text descriptions. In particular, we propose an integrated multi-context attention framework that takes into account self-attention, pair-attention and global-attention from three types of context. The idea is further extended to incorporate attribute attention in order to support structured datasets. We conduct extensive experiments with 7 benchmark datasets that are publicly accessible. The experimental results clearly establish the superiority of our approach over DeepER and DeepMatcher on all the datasets.

【Keywords】:

247. Dolphin: A Spoken Language Proficiency Assessment System for Elementary Education.

Paper Link】 【Pages】:2641-2647

【Authors】: Zitao Liu ; Guowei Xu ; Tianqiao Liu ; Weiping Fu ; Yubi Qi ; Wenbiao Ding ; Yujia Song ; Chaoyou Guo ; Cong Kong ; Songfan Yang ; Gale Yan Huang

【Abstract】: Spoken language proficiency is critically important for children’s growth and personal development. Due to the limited and imbalanced educational resources in China, elementary students barely have chances to improve their oral language skills in classes. Verbal fluency tasks (VFTs) were invented to let the students practice their spoken language proficiency after school. VFTs are simple but concrete math-related questions that ask students not only to report answers but also to speak out the entire thinking process. In spite of the great success of VFTs, they bring a heavy grading burden to elementary teachers. To alleviate this problem, we develop Dolphin, a spoken language proficiency assessment system for Chinese elementary education. Dolphin is able to automatically evaluate both phonological fluency and semantic relevance of students’ VFT answers. We conduct a wide range of offline and online experiments to demonstrate the effectiveness of Dolphin. In our offline experiments, we show that Dolphin improves both phonological fluency and semantic relevance evaluation performance when compared to state-of-the-art baselines on real-world educational data sets. In our online A/B experiments, we test Dolphin with 183 teachers from 2 major cities (Hangzhou and Xi’an) in China for 10 weeks and the results show that VFT assignments grading coverage is improved by 22%.

【Keywords】:

248. Twitter User Location Inference Based on Representation Learning and Label Propagation.

Paper Link】 【Pages】:2648-2654

【Authors】: Hechan Tian ; Meng Zhang ; Xiangyang Luo ; Fenlin Liu ; Yaqiong Qiao

【Abstract】: Social network user location inference technology has been widely used in various geospatial applications like public health monitoring and local advertising recommendation. Due to insufficient consideration of relationships between users and location indicative words, most existing inference methods estimate label propagation probabilities solely based on statistical features, resulting in large location inference errors. In this paper, a Twitter user location inference method based on representation learning and label propagation is proposed. Firstly, the heterogeneous connection relation graph is constructed based on relationships between Twitter users and relationships between users and location indicative words, and relationships unrelated to geographic attributes are filtered. Then, vector representations of users are learnt from the connection relation graph. Finally, label propagation probabilities between adjacent users are calculated based on vector representations, and the locations of unknown users are predicted through iterative label propagation. Experiments on two representative Twitter datasets, GeoText and TwUs, show that the proposed method can accurately calculate label propagation probabilities based on vector representations and improve the accuracy of location inference. Compared with existing typical Twitter user location inference methods, GCN and MLP-TXT+NET, the median error distance of the proposed method is reduced by 18% and 16%, respectively.

【Keywords】:

249. The Structure of Social Influence in Recommender Networks.

Paper Link】 【Pages】:2655-2661

【Authors】: Pantelis Pipergias Analytis ; Daniel Barkoczi ; Philipp Lorenz-Spreen ; Stefan Herzog

【Abstract】: People’s ability to influence others’ opinion on matters of taste varies greatly—both offline and in recommender systems. What are the mechanisms underlying these striking differences? Using the weighted k-nearest neighbors algorithm (k-nn) to represent an array of social learning strategies, we show—leveraging methods from network science—how the k-nn algorithm gives rise to networks of social influence in six real-world domains of taste. We show three novel results that apply both to offline advice taking and online recommender settings. First, influential individuals have mainstream tastes and high dispersion in their taste similarity with others. Second, the fewer people an individual or algorithm consults (i.e., the lower k is) or the larger the weight placed on the opinions of more similar others, the smaller the group of people with substantial influence. Third, the influence networks emerging from deploying the k-nn algorithm are hierarchically organized. Our results shed new light on classic empirical findings in communication and network science and can help improve the understanding of social influence offline and online.

【Keywords】:
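
The weighted k-nn strategy described in the abstract above can be illustrated with a minimal sketch. The similarity function (inverse of one plus the mean absolute rating difference), the exponent `rho` that controls how strongly more similar others are weighted, and all ratings are assumptions made for illustration; they are not taken from the paper.

```python
def similarity(a, b):
    """Similarity of two rating dicts over their shared items (assumed measure)."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0
    mean_abs_diff = sum(abs(a[i] - b[i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)


def knn_predict(target, others, item, k, rho=1.0):
    """Predict target's rating of `item` from the k most similar other users."""
    scored = [(similarity(target, ratings), ratings[item])
              for ratings in others.values() if item in ratings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    # A larger rho concentrates the weight on the most similar neighbours.
    weights = [sim ** rho for sim, _ in top]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * rating for w, (_, rating) in zip(weights, top)) / total


# Hypothetical ratings on a 1-5 scale.
target = {"x": 5, "y": 1}
others = {
    "ann": {"x": 5, "y": 1, "z": 4},
    "bob": {"x": 1, "y": 5, "z": 2},
    "cy": {"x": 4, "y": 2, "z": 5},
}
single = knn_predict(target, others, "z", k=1)
crowd = knn_predict(target, others, "z", k=3)
```

With `k=1` only the most similar user contributes; raising `k` or lowering `rho` spreads the weight over more users, which is the knob the abstract relates to the size of the influential group.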

250. A Multi-task Learning Framework for Road Attribute Updating via Joint Analysis of Map Data and GPS Traces.

Paper Link】 【Pages】:2662-2668

【Authors】: Yifang Yin ; Jagannadan Varadarajan ; Guanfeng Wang ; Xueou Wang ; Dhruva Sahrawat ; Roger Zimmermann ; See-Kiong Ng

【Abstract】: The quality of a digital map is of utmost importance for geo-aware services. However, maintaining an accurate and up-to-date map is a highly challenging task that usually involves a substantial amount of manual work. To reduce the manual efforts, methods have been proposed to automatically derive road attributes by mining GPS traces. However, previous methods always modeled each road attribute separately based on intuitive hand-crafted features extracted from GPS traces. This observation motivates us to propose a machine learning based method to learn joint features not only from GPS traces but also from map data. To model the relations among the target road attributes, we extract low-level shared feature embeddings via multi-task learning, while still being able to generate task-specific fused representations by applying attention-based feature fusion. To model the relations between the target road attributes and other contextual information that is available from a digital map, we propose to leverage map tiles at road centers as visual features that capture the information of the surrounding geographic objects around the roads. We perform extensive experiments on the OpenStreetMap where state-of-the-art classification accuracy has been obtained compared to existing road attribute detection approaches.

【Keywords】: Information systems; Information systems applications; Data mining

251. Predicting Drug Demand with Wikipedia Views: Evidence from Darknet Markets.

Paper Link】 【Pages】:2669-2675

【Authors】: Sam Miller ; Abeer El-Bahrawy ; Martin Dittus ; Mark Graham ; Joss Wright

【Abstract】: Rapid changes in illicit drug demand, such as the Fentanyl epidemic, are a major public health issue. Policymakers currently rely on annual surveys to monitor public consumption, which are arguably too infrequent to detect rapid shifts in drug use. We present a novel method to predict drug use based on high-frequency sales data from darknet markets. We show that models based on historic trades alone cannot accurately predict drug demand. However, augmenting these models with data on Wikipedia page views for each drug greatly improves predictive accuracy, particularly for less popular drugs, suggesting such models may be particularly useful for detecting newly emerging substances. These results hold out-of-sample at high time frequency, across a range of drugs and countries. Therefore Wikipedia data may enable us to build a high-frequency measure of drug demand, which could help policymakers respond more quickly to future drug crises.

【Keywords】:

Paper Link】 【Pages】:2676-2682

【Authors】: Xiaotie Deng ; Tao Lin ; Tao Xiao

【Abstract】: In this paper, we revisit the sponsored search auction as a repeated auction. We view it as a learning and exploiting task of the seller against the private data distribution of the buyers. We model such a game between the seller and buyers by a Private Data Manipulation (PDM) game: the auction seller first announces an auction for which allocation and payment rules are based on the value distributions submitted by buyers. The seller’s expected revenue depends on the design of the protocol and the game played among the buyers in their choice on the submitted (fake) value distributions.

【Keywords】: Applied computing; Law, social and behavioral sciences; Economics

253. Active Domain Transfer on Network Embedding.

Paper Link】 【Pages】:2683-2689

【Authors】: Lichen Jin ; Yizhou Zhang ; Guojie Song ; Yilun Jin

【Abstract】: Recent works show that end-to-end, (semi-)supervised network embedding models can generate satisfactory vectors to represent network topology, and are even applicable to unseen graphs by inductive learning. However, domain mismatch between training and testing networks for inductive learning, as well as a lack of labeled data, often compromises the outcome of such methods. To make matters worse, while transfer learning and active learning techniques, which can solve these respective problems, have been well studied on regular i.i.d. data, relatively little attention has been paid to networks. Consequently, we propose in this paper a method for active transfer learning on networks named active-transfer network embedding, abbreviated ATNE. In ATNE we jointly consider the influence of each node on the network from the perspectives of transfer and active learning, and hence design novel and effective influence scores combining both aspects in the training process to facilitate node selection. We demonstrate that ATNE is efficient and decoupled from the actual model used. Further extensive experiments show that ATNE outperforms state-of-the-art active node selection methods and shows versatility in different situations.

【Keywords】:

254. To be Tough or Soft: Measuring the Impact of Counter-Ad-blocking Strategies on User Engagement.

Paper Link】 【Pages】:2690-2696

【Authors】: Shuai Zhao ; Achir Kalra ; Cristian Borcea ; Yi Chen

【Abstract】: The fast-growing usage of ad-blockers results in a large revenue decrease for ad-supported websites. Facing this problem, many online publishers choose either to cooperate with ad-blocker software companies to show acceptable ads or to build a wall that requires users to whitelist the site for content access. However, there is a lack of studies on the impact of these two counter-ad-blocking strategies on user behavior. To address this issue, we conduct a randomized field experiment on the website of Forbes Media, a major US media publisher. The ad-blocker users are divided into a treatment group, which receives the wall strategy, and a control group, which receives the acceptable ads strategy. We utilize the difference-in-differences method to estimate the causal effects. Our study shows that the wall strategy has an overall negative impact on user engagement. However, it has no statistically significant effect on high-engaged users as they would view the pages no matter what strategy is used. It has a big impact on low-engaged users, who have no loyalty to the site. Our study also shows that revisiting behavior decreases over time, but the ratio of session whitelisting increases over time as the remaining users have relatively high loyalty and high engagement. The paper concludes with discussions of managerial insights for publishers when determining counter-ad-blocking strategies.

【Keywords】:
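
The difference-in-differences method named in the abstract above compares the before/after change in a treatment group with the same change in a control group. The sketch below shows only that basic estimator; the group names and all page-view numbers are hypothetical and are not results from the paper.

```python
from statistics import mean


def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Return (change in treated mean) minus (change in control mean)."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(ctrl_post) - mean(ctrl_pre)
    return treated_change - control_change


# Hypothetical page views per user, before and after the intervention.
wall_pre = [10.0, 12.0, 8.0]    # treatment group (wall strategy)
wall_post = [6.0, 7.0, 5.0]
ads_pre = [9.0, 11.0, 10.0]     # control group (acceptable ads strategy)
ads_post = [8.0, 10.0, 9.0]

effect = diff_in_diff(wall_pre, wall_post, ads_pre, ads_post)
```

Subtracting the control group's change removes trends that affect both groups alike, so the remaining difference can be attributed to the treatment under the usual parallel-trends assumption.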

255. Just SLaQ When You Approximate: Accurate Spectral Distances for Web-Scale Graphs.

Paper Link】 【Pages】:2697-2703

【Authors】: Anton Tsitsulin ; Marina Munkhoeva ; Bryan Perozzi

【Abstract】: Graph comparison is a fundamental operation in data mining and information retrieval. Due to the combinatorial nature of graphs, it is hard to balance the expressiveness of the similarity measure and its scalability. Spectral analysis provides quintessential tools for studying the multi-scale structure of graphs and is a well-suited foundation for reasoning about differences between graphs. However, computing the full spectrum of large graphs is computationally prohibitive; thus, spectral graph comparison methods often rely on rough approximation techniques with weak error guarantees.

【Keywords】:

256. Heterogeneous Graph Transformer.

Paper Link】 【Pages】:2704-2710

【Authors】: Ziniu Hu ; Yuxiao Dong ; Kuansan Wang ; Yizhou Sun

【Abstract】: Recent years have witnessed the emerging success of graph neural networks (GNNs) for modeling structured data. However, most GNNs are designed for homogeneous graphs, in which all nodes and edges belong to the same types, making it infeasible to represent heterogeneous structures. In this paper, we present the Heterogeneous Graph Transformer (HGT) architecture for modeling Web-scale heterogeneous graphs. To model heterogeneity, we design node- and edge-type dependent parameters to characterize the heterogeneous attention over each edge, empowering HGT to maintain dedicated representations for different types of nodes and edges. To handle Web-scale graph data, we design the heterogeneous mini-batch graph sampling algorithm, HGSampling, for efficient and scalable training. Extensive experiments on the Open Academic Graph of 179 million nodes and 2 billion edges show that the proposed HGT model consistently outperforms all the state-of-the-art GNN baselines by 9%–21% on various downstream tasks. The dataset and source code of HGT are publicly available at https://github.com/acbull/pyHGT.

【Keywords】:

257. On the Robustness of Cascade Diffusion under Node Attacks.

Paper Link】 【Pages】:2711-2717

【Authors】: Alvis Logins ; Yuchen Li ; Panagiotis Karras

【Abstract】: How can we assess a network’s ability to maintain its functionality under attacks? Network robustness has been studied extensively in the case of deterministic networks. However, applications such as online information diffusion and the behavior of networked public raise a question of robustness in probabilistic networks. We propose three novel robustness measures for networks hosting a diffusion under the Independent Cascade (IC) model, susceptible to node attacks. The outcome of such a process depends on the selection of its initiators, or seeds, by the seeder, as well as on two factors outside the seeder’s discretion: the attack strategy and the probabilistic diffusion outcome. We consider three levels of seeder awareness regarding these two uncontrolled factors, and evaluate the network’s viability aggregated over all possible extents of node attacks. We introduce novel algorithms from building blocks found in previous works to evaluate the proposed measures. A thorough experimental study with synthetic and real, scale-free and homogeneous networks establishes that these algorithms are effective and efficient, while the proposed measures highlight differences among networks in terms of robustness and the surprise they furnish when attacked. Last, we devise a new measure of diffusion entropy that can inform the design of probabilistically robust networks.

【Keywords】:

258. Certified Robustness of Community Detection against Adversarial Structural Perturbation via Randomized Smoothing.

Paper Link】 【Pages】:2718-2724

【Authors】: Jinyuan Jia ; Binghui Wang ; Xiaoyu Cao ; Neil Zhenqiang Gong

【Abstract】: Community detection plays a key role in understanding graph structure. However, several recent studies showed that community detection is vulnerable to adversarial structural perturbation. In particular, via adding or removing a small number of carefully selected edges in a graph, an attacker can manipulate the detected communities. However, to the best of our knowledge, there are no studies on certifying robustness of community detection against such adversarial structural perturbation. In this work, we aim to bridge this gap. Specifically, we develop the first certified robustness guarantee of community detection against adversarial structural perturbation. Given an arbitrary community detection method, we build a new smoothed community detection method via randomly perturbing the graph structure. We theoretically show that the smoothed community detection method provably groups a given arbitrary set of nodes into the same community (or different communities) when the number of edges added/removed by an attacker is bounded. Moreover, we show that our certified robustness is tight. We also empirically evaluate our method on multiple real-world graphs with ground truth communities.

【Keywords】:

Paper Link】 【Pages】:2725-2732

【Authors】: Ruilin Li ; Zhen Qin ; Xuanhui Wang ; Suming J. Chen ; Donald Metzler

【Abstract】: Neural search ranking models have been not only actively studied in the information retrieval community, but also widely adopted in real-world industrial applications. However, due to the non-convexity and stochastic training of neural model formulations, the obtained models are unstable in the sense that model predictions can vary a lot for two models trained with the same configuration. In practice, new features are continuously introduced and new model architectures are explored to improve model effectiveness. In these cases, the instability of neural models leads to unnecessary document ranking changes for a large portion of queries. Such changes not only lead to inconsistent user experience, but also add noise to online experimentation and can slow down model improvement cycles. How to stabilize neural search ranking models during model update is an important but largely unexplored problem. Motivated by trigger analysis, we suggest balancing the trade-off between performance improvement and the number of affected queries. Concretely, we formulate it as an optimization problem with the objective as maximizing the average effect over the affected queries. We propose two heuristics and one theory-guided stabilization method to solve the optimization problem. Our proposed methods are evaluated on two of the world’s largest personal search services: Gmail search and Google Drive search. Empirical results show that our proposed methods are very effective in optimizing the proposed objective and are applicable to different model update scenarios.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

260. Evolution of a Web-Scale Near Duplicate Image Detection System.

Paper Link】 【Pages】:2733-2739

【Authors】: Andrey Gusev ; Jiajing Xu

【Abstract】: Detecting near duplicate images is fundamental to the content ecosystem of photo sharing web applications. However, such a task is challenging when involving a web-scale image corpus containing billions of images. In this paper, we present an efficient system for detecting near duplicate images across 8 billion images. Our system consists of three stages: candidate generation, candidate selection, and clustering. We also demonstrate that this system can be used to greatly improve the quality of recommendations and search results across a number of real-world applications.

【Keywords】:

261. A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback.

Paper Link】 【Pages】:2740-2746

【Authors】: Shota Yasui ; Gota Morishita ; Komei Fujita ; Masashi Shibata

【Abstract】: In display advertising, predicting the conversion rate, that is, the probability that a user takes a predefined action on an advertiser’s website, such as purchasing goods, is fundamental in estimating the value of displaying the advertisement. However, there is a relatively long time delay between a click and its resultant conversion. Because of the delayed feedback, some positive instances in the training period are labeled as negative because some conversions have not yet occurred when training data are gathered. As a result, the conditional label distributions differ between the training data and the production environment. This situation is referred to as a feedback shift. We address this problem by using an importance weight approach typically used for covariate shift correction. We prove its consistency for the feedback shift. Results in both offline and online experiments show that our proposed method outperforms the existing method.

【Keywords】: Computing methodologies; Machine learning

262. Deconstruct Densest Subgraphs.

Paper Link】 【Pages】:2747-2753

【Authors】: Lijun Chang ; Miao Qiao

【Abstract】: In this paper, we aim to understand the distribution of the densest subgraphs of a given graph under the density notion of average-degree. We show that the structures, the relationships and the distributions of all the densest subgraphs of a graph G can be encoded in O(L) space in an index called the ds-Index. Here L denotes the maximum output size of a densest subgraph of G. More importantly, ds-Index can report all the minimal densest subgraphs of G collectively in O(L) time and can enumerate all the densest subgraphs of G with an O(L) delay. Besides, the construction of ds-Index costs no more than finding a single densest subgraph using the state-of-the-art approach. Our empirical study shows that for web-scale graphs with one billion edges, the ds-Index can be constructed in several minutes on an ordinary commercial machine.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms
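
To make the notion of "all the densest subgraphs" in the abstract above concrete, the sketch below enumerates them by brute force on a toy graph. It is not the ds-Index: it takes exponential time and is only meant to show that a graph can have several densest subgraphs. Density is taken here as edges divided by vertices (half the average degree, so the maximizers are the same); the example graph is an assumption chosen for illustration.

```python
from fractions import Fraction
from itertools import combinations


def density(vertices, edges):
    """Edges inside the vertex set divided by the number of vertices."""
    inside = set(vertices)
    count = sum(1 for u, v in edges if u in inside and v in inside)
    return Fraction(count, len(inside))


def densest_subgraphs(nodes, edges):
    """Brute-force search over every non-empty vertex subset (tiny graphs only)."""
    best, winners = Fraction(0), []
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            d = density(subset, edges)
            if d > best:
                best, winners = d, [set(subset)]
            elif d == best:
                winners.append(set(subset))
    return best, winners


# Toy graph: two disjoint 4-cliques.
clique_a = list(combinations([1, 2, 3, 4], 2))
clique_b = list(combinations([5, 6, 7, 8], 2))
best, winners = densest_subgraphs(range(1, 9), clique_a + clique_b)
```

In this toy graph each clique is a minimal densest subgraph and their union is a third densest subgraph, which is the kind of structure among densest subgraphs that the abstract says the index encodes.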

263. Anchored Model Transfer and Soft Instance Transfer for Cross-Task Cross-Domain Learning: A Study Through Aspect-Level Sentiment Classification.

Paper Link】 【Pages】:2754-2760

【Authors】: Yaowei Zheng ; Richong Zhang ; Suyuchen Wang ; Samuel Mensah ; Yongyi Mao

【Abstract】: Supervised learning relies heavily on readily available labelled data to infer an effective classification function. However, proposed methods under the supervised learning paradigm are faced with the scarcity of labelled data within domains, and are not generalized enough to adapt to other tasks. Transfer learning has proved to be a worthy choice to address these issues, by allowing knowledge to be shared across domains and tasks. In this paper, we propose two transfer learning methods Anchored Model Transfer (AMT) and Soft Instance Transfer (SIT), which are both based on multi-task learning, and account for model transfer and instance transfer, and can be combined into a common framework. We demonstrate the effectiveness of AMT and SIT for aspect-level sentiment classification showing the competitive performance against baseline models on benchmark datasets. Interestingly, we show that the integration of both methods AMT+SIT achieves state-of-the-art performance on the same task.

【Keywords】:

264. Scaling PageRank to 100 Billion Pages.

Paper Link】 【Pages】:2761-2767

【Authors】: Stergios Stergiou

【Abstract】: Distributed graph processing frameworks formulate tasks as sequences of supersteps within which communication is performed asynchronously by sending messages over the graph edges. PageRank’s communication pattern is identical across all its supersteps since each vertex sends messages to all its edges. We exploit this pattern to develop a new communication paradigm that allows us to exchange messages that include only edge payloads, dramatically reducing bandwidth requirements. Experiments on a web graph of 38 billion vertices and 3.1 trillion edges yield execution times of 34.4 seconds per iteration, suggesting more than an order of magnitude improvement over the state-of-the-art.

【Keywords】:
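
The per-superstep pattern mentioned in the abstract above, where every vertex sends its contribution along all of its outgoing edges, is the core of the standard PageRank iteration. The sketch below is a plain single-machine version for reference only; it does not implement the paper's communication paradigm, and the damping factor, iteration count and toy graph are assumptions.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Single-machine PageRank; out_edges maps each vertex to its successors."""
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation share that every vertex receives in each iteration.
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_edges[v]
            if targets:
                # Same pattern every iteration: one message per outgoing edge.
                share = damping * rank[v] / len(targets)
                for u in targets:
                    nxt[u] += share
            else:
                # Dangling vertex: spread its rank uniformly over all vertices.
                share = damping * rank[v] / n
                for u in nodes:
                    nxt[u] += share
        rank = nxt
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Because the set of edges used for messaging never changes between iterations, only the numeric payload differs from one superstep to the next, which is the regularity the abstract says the authors exploit.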

265. Exploiting Aesthetic Preference in Deep Cross Networks for Cross-domain Recommendation.

Paper Link】 【Pages】:2768-2774

【Authors】: Jian Liu ; Pengpeng Zhao ; Fuzhen Zhuang ; Yanchi Liu ; Victor S. Sheng ; Jiajie Xu ; Xiaofang Zhou ; Hui Xiong

【Abstract】: Visual aesthetics of products plays an important role in the decision process when purchasing appearance-first products, e.g., clothes. Indeed, user’s aesthetic preference, which serves as a personality trait and a basic requirement, is domain independent and could be used as a bridge between domains for knowledge transfer. However, existing work has rarely considered the aesthetic information in product images for cross-domain recommendation. To this end, in this paper, we propose a new deep Aesthetic Cross-Domain Networks (ACDN), in which parameters characterizing personal aesthetic preferences are shared across networks to transfer knowledge between domains. Specifically, we first leverage an aesthetic network to extract aesthetic features. Then, we integrate these features into a cross-domain network to transfer users’ domain independent aesthetic preferences. Moreover, network cross-connections are introduced to enable dual knowledge transfer across domains. Finally, the experimental results on real-world datasets show that our proposed model ACDN outperforms benchmark methods in terms of recommendation accuracy.

【Keywords】: Information systems; Information retrieval; Retrieval tasks and goals; Recommender systems

266. Large-scale Causal Approaches to Debiasing Post-click Conversion Rate Estimation with Multi-task Learning.

Paper Link】 【Pages】:2775-2781

【Authors】: Wenhao Zhang ; Wentian Bao ; Xiao-Yang Liu ; Keping Yang ; Quan Lin ; Hong Wen ; Ramin Ramezani

【Abstract】: Post-click conversion rate (CVR) estimation is a critical task in e-commerce recommender systems. This task is deemed quite challenging in industrial settings because of two major issues: 1) selection bias caused by user self-selection, and 2) data sparsity due to the rare click events. A successful conversion typically has the following sequential events: “exposure → click → conversion”. Conventional CVR estimators are trained in the click space, but inference is done in the entire exposure space. They fail to account for the causes of the missing data and treat them as missing at random. Hence, their estimates are highly likely to deviate from the real values by a large margin. In addition, the data sparsity issue can also handicap many industrial CVR estimators, which usually have large parameter spaces.

【Keywords】:

267. ROSE: Role-based Signed Network Embedding.

Paper Link】 【Pages】:2782-2788

【Authors】: Amin Javari ; Tyler Derr ; Pouya Esmailian ; Jiliang Tang ; Kevin Chen-Chuan Chang

【Abstract】: In real-world networks, nodes might have more than one type of relationship. Signed networks are an important class of such networks consisting of two types of relations: positive and negative. Recently, embedding signed networks has attracted increasing attention and is more challenging than classic networks since nodes are connected by paths with multi-types of links. Existing works capture the complex relationships by relying on social theories. However, this approach has major drawbacks, including the incompleteness/inaccurateness of such theories. Thus, we propose network transformation based embedding to address these shortcomings. The core idea is that rather than directly finding the similarities of two nodes from the complex paths connecting them, we can obtain their similarities through simple paths connecting their different roles. We employ this idea to build our proposed embedding technique that can be described in three steps: (1) the input directed signed network is transformed into an unsigned bipartite network with each node mapped to a set of nodes we denote as role-nodes. Each role-node captures a certain role that a node in the original network plays; (2) the network of role-nodes is embedded; and (3) the original network is encoded by aggregating the embedding vectors of role-nodes. Our experiments show the novel proposed technique substantially outperforms existing models.

【Keywords】: Information systems; Information systems applications; Data mining

268. Natural Key Discovery in Wikipedia Tables.

Paper Link】 【Pages】:2789-2795

【Authors】: Leon Bornemann ; Tobias Bleifuß ; Dmitri V. Kalashnikov ; Felix Naumann ; Divesh Srivastava

【Abstract】: Wikipedia is the largest encyclopedia to date. Scattered among its articles, there is an enormous number of tables that contain structured, relational information. In contrast to database tables, these webtables lack metadata, making it difficult to automatically interpret the knowledge they harbor. The natural key is a particularly important piece of metadata, which acts as a primary key and consists of attributes inherent to an entity. Determining natural keys is crucial for many tasks, such as information integration, table augmentation, or tracking changes to entities over time.

【Keywords】:

269. Negative Purchase Intent Identification in Twitter.

Paper Link】 【Pages】:2796-2802

【Authors】: Samed Atouati ; Xiao Lu ; Mauro Sozio

【Abstract】: Social network users often express their discontent with a product or a service from a company on social media. Such a reaction is more pronounced in the aftermath of a corporate scandal such as a corruption scandal or food poisoning in a chain restaurant. In our work, we focus on identifying negative purchase intent in a tweet, i.e. the intent of a user of not purchasing any product or consuming any service from a company. We develop a binary classifier for such a task, which consists of a generalization of logistic regression leveraging the locality of purchase intent in posts from Twitter. We conduct an extensive experimental evaluation against state-of-the-art approaches on a large collection of tweets, showing the effectiveness of our approach in terms of F1 score. We also provide some preliminary results on which kinds of corporate scandals might affect the purchase intent of customers the most.

【Keywords】:

270. War of Words: The Competitive Dynamics of Legislative Processes.

Paper Link】 【Pages】:2803-2809

【Authors】: Victor Kristof ; Matthias Grossglauser ; Patrick Thiran

【Abstract】: A body of law is an example of a dynamic corpus of text documents that are jointly maintained by a group of editors who compete and collaborate in complex constellations. Our goal is to develop predictive models for this process, thereby shedding light on the competitive dynamics of parliamentarians who make laws. For this purpose, we curated a dataset of 450000 legislative edits introduced by European parliamentarians over the last ten years. An edit modifies the status quo of a law, and could be in competition with another edit if it modifies the same part of that law. We propose a model for predicting the success of such edits, in the face of both the inertia of the status quo and the competition between overlapping edits. The parameters of this model can be interpreted in terms of the influence of parliamentarians and of the controversy of laws.

【Keywords】:

271. Deep Rating Elicitation for New Users in Collaborative Filtering.

Paper Link】 【Pages】:2810-2816

【Authors】: Wonbin Kweon ; SeongKu Kang ; Junyoung Hwang ; Hwanjo Yu

【Abstract】: Recent recommender systems started to use rating elicitation, which asks new users to rate a small seed itemset for inferring their preferences, to improve the quality of initial recommendations. The key challenge of the rating elicitation is to choose the seed items which can best infer the new users’ preference. This paper proposes a novel end-to-end Deep learning framework for Rating Elicitation (DRE), that chooses all the seed items at a time with consideration of the non-linear interactions. To this end, it first defines categorical distributions to sample seed items from the entire itemset, then it trains both the categorical distributions and a neural reconstruction network to infer users’ preferences on the remaining items from CF information of the sampled seed items. Through the end-to-end training, the categorical distributions are learned to select the most representative seed items while reflecting the complex non-linear interactions. Experimental results show that DRE outperforms the state-of-the-art approaches in the recommendation quality by accurately inferring the new users’ preferences and its seed itemset better represents the latent space than the seed itemset obtained by the other methods.

【Keywords】:

Paper Link】 【Pages】:2817-2823

【Authors】: Unmesh Joshi ; Jacopo Urbani

【Abstract】: Embedding-based models of Knowledge Graphs (KGs) can be used to predict the existence of missing links by ranking the entities according to some likelihood scores. An exhaustive computation of all likelihood scores is very expensive if the KG is large. To counter this problem, we propose a technique to reduce the search space by identifying smaller subsets of promising entities. Our technique first creates embeddings of subgraphs using the embeddings from the model. Then, it ranks the subgraphs with some proposed ranking functions and considers only the entities in the top k subgraphs. Our experiments show that our technique is able to reduce the search space significantly while maintaining a good recall.

【Keywords】:

273. Asymptotic Behavior of Sequence Models.

Paper Link】 【Pages】:2824-2830

【Authors】: Flavio Chierichetti ; Ravi Kumar ; Andrew Tomkins

【Abstract】: In this paper we study the limiting dynamics of a sequential process that generalizes Pólya’s urn. This process has been studied also in the context of language generation, discrete choice, repeat consumption, and models for the web graph. The process we study generates future items by copying from past items. It is parameterized by a sequence of weights describing how much to prefer copying from recent versus more distant locations. We show that, if the weight sequence follows a power law with exponent α ∈ [0, 1), then the sequences generated by the model tend toward a limiting behavior in which the eventual frequency of each token in the alphabet attains a limit. Moreover, in the case α > 2, we show that the sequence converges to a token being chosen infinitely often, and each other token being chosen only constantly many times.

【Keywords】:
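
The copying process described in the abstract above can be simulated in a few lines. The sketch assumes that the weight of copying from distance d back is proportional to d^(-α) and that the process starts from a fixed seed sequence; the exact normalization and initial conditions used in the paper may differ, and the function and parameter names are hypothetical.

```python
import random
from collections import Counter


def generate(seed, steps, alpha, rng):
    """Extend `seed` by repeatedly copying a past item chosen by distance."""
    seq = list(seed)
    for _ in range(steps):
        t = len(seq)
        distances = range(1, t + 1)
        # Assumed power-law weights: distance d gets weight d ** (-alpha).
        weights = [d ** (-alpha) for d in distances]
        d = rng.choices(distances, weights=weights)[0]
        seq.append(seq[t - d])  # d = 1 copies the most recent item
    return seq


rng = random.Random(0)
sequence = generate(["a", "b", "c"], steps=2000, alpha=0.5, rng=rng)
frequencies = Counter(sequence)
```

Running the simulation for several values of `alpha` and inspecting `frequencies` gives an informal feel for the two regimes the abstract distinguishes; it is not a substitute for the paper's analysis.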

274. Clustering with a faulty oracle.

Paper Link】 【Pages】:2831-2834

【Authors】: Kasper Green Larsen ; Michael Mitzenmacher ; Charalampos E. Tsourakakis

【Abstract】: Clustering, i.e., finding groups in the data, is a problem that permeates multiple fields of science and engineering. Recently, the problem of clustering with a noisy oracle has drawn attention due to various applications including crowdsourced entity resolution [33], and predicting signs of interactions in large-scale online social networks [20, 21]. Here, we consider the following fundamental model for two clusters as proposed by Mitzenmacher and Tsourakakis [28], and Mazumdar and Saha [25]: there exist n items, belonging to two unknown groups. We are allowed to query any pair of nodes whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability q. Let 1 > δ = 1 − 2q > 0 be the bias.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms
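
The query model in the abstract above can be simulated directly: each "same cluster?" answer is flipped with probability q, so the oracle is correct with probability 1 − q and the quantity 2·accuracy − 1 estimates the bias δ = 1 − 2q. The sketch below only simulates the oracle; it does not implement any clustering algorithm from the paper, and the value of q, the number of items and the number of trials are assumptions.

```python
import random


def noisy_same_cluster(label_u, label_v, q, rng):
    """Answer 'same cluster?' but flip the true answer with probability q."""
    truth = label_u == label_v
    return (not truth) if rng.random() < q else truth


rng = random.Random(0)
q = 0.3
labels = [rng.randrange(2) for _ in range(200)]  # two hidden groups

trials = 20000
correct = 0
for _ in range(trials):
    u, v = rng.sample(range(len(labels)), 2)
    answer = noisy_same_cluster(labels[u], labels[v], q, rng)
    correct += answer == (labels[u] == labels[v])

accuracy = correct / trials
empirical_bias = 2 * accuracy - 1  # should be close to 1 - 2 * q
```

As q approaches 1/2 the bias δ approaches 0 and single answers carry almost no information, which is why the bias is the natural parameter of the problem.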

275. Matching Cross Network for Learning to Rank in Personal Search.

Paper Link】 【Pages】:2835-2841

【Authors】: Zhen Qin ; Zhongliang Li ; Michael Bendersky ; Donald Metzler

【Abstract】: Recent neural ranking algorithms focus on learning semantic matching between query and document terms. However, practical learning to rank systems typically rely on a wide range of side information beyond query and document textual features, like location, user context, etc. It is common practice to concatenate all of these features and rely on deep models to learn a complex representation.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

276. RLIRank: Learning to Rank with Reinforcement Learning for Dynamic Search.

Paper Link】 【Pages】:2842-2848

【Authors】: Jianghong Zhou ; Eugene Agichtein

【Abstract】: To support complex search tasks, where the initial information requirements are complex or may change during the search, a search engine must adapt the information delivery as the user’s information requirements evolve. To support this dynamic ranking paradigm effectively, search result ranking must incorporate both the user feedback received and the information displayed so far. To address this problem, we introduce a novel reinforcement learning-based approach, RLIRank. We first build an adapted reinforcement learning framework to integrate the key components of the dynamic search. Then, we implement a new Learning to Rank (LTR) model for each iteration of the dynamic search, using a recurrent Long Short Term Memory neural network (LSTM), which estimates the gain for each next result, learning from each previously ranked document. To incorporate the user’s feedback, we develop a word-embedding variation of the classic Rocchio Algorithm, to help guide the ranking towards the high-value documents. These innovations enable RLIRank to outperform the previously reported methods from the TREC Dynamic Domain Tracks 2017 and exceed all the methods in 2016 TREC Dynamic Domain after multiple search iterations, advancing the state of the art for dynamic search.

【Keywords】:
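
For readers unfamiliar with the classic Rocchio algorithm mentioned in the abstract above, the following is a minimal sketch of the textbook update on plain vectors. It is not the word-embedding variation that the paper develops. The function name and the default weights `alpha`, `beta`, and `gamma` are illustrative assumptions.

```python
def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio: move the query toward relevant and away from non-relevant vectors.

    new_query = alpha * query + beta * mean(relevant) - gamma * mean(non_relevant)
    The default weights are illustrative, not values taken from the paper.
    """
    dim = len(query)

    def centroid(vectors):
        # An empty feedback set contributes a zero vector.
        if not vectors:
            return [0.0] * dim
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    return [alpha * query[i] + beta * rel_c[i] - gamma * non_c[i] for i in range(dim)]


updated = rocchio_update(
    query=[1.0, 0.0],
    relevant=[[0.0, 1.0], [0.0, 3.0]],
    non_relevant=[[2.0, 0.0]],
    alpha=1.0,
    beta=0.5,
    gamma=0.5,
)
print(updated)
```

In this toy call the relevant documents lie along the second axis and the non-relevant one along the first, so the updated query shifts weight from the first coordinate to the second.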

277. Reducing Disparate Exposure in Ranking: A Learning To Rank Approach.

Paper Link】 【Pages】:2849-2855

【Authors】: Meike Zehlike ; Carlos Castillo

【Abstract】: Ranked search results have become the main mechanism by which we find content, products, places, and people online. Thus their ordering contributes not only to the satisfaction of the searcher, but also to career and business opportunities, educational placement, and even social success of those being ranked. Researchers have become increasingly concerned with systematic biases in data-driven ranking models, and various post-processing methods have been proposed to mitigate discrimination and inequality of opportunity. This approach, however, has the disadvantage that it still allows an unfair ranking model to be trained.

【Keywords】:

278. Natural Language Annotations for Search Engine Optimization.

Paper Link】 【Pages】:2856-2862

【Authors】: Porter Jenkins ; Jennifer Zhao ; Heath Vinicombe ; Anant Subramanian ; Arun Prasad ; Atillia Dobi ; Eileen Li ; Yunsong Guo

【Abstract】: Understanding content at scale is a difficult but important problem for many platforms. Many previous studies focus on content understanding to optimize engagement with existing users. However, little work studies how to leverage better content understanding to attract new users. In this work, we build a framework for generating natural language content annotations and show how they can be used for search engine optimization. The proposed framework relies on an XGBoost model that labels “pins” with high probability phrases, and a logistic regression layer that learns to rank aggregated annotations for groups of content. The pipeline identifies keywords that are descriptive and contextually meaningful. We perform a large-scale production experiment deployed on the Pinterest platform and show that natural language annotations cause a 1-2% increase in traffic from leading search engines. This increase is statistically significant. Finally, we explore and interpret the characteristics of our annotations framework.

【Keywords】:

279. Graph Enhanced Representation Learning for News Recommendation.

Paper Link】 【Pages】:2863-2869

【Authors】: Suyu Ge ; Chuhan Wu ; Fangzhao Wu ; Tao Qi ; Yongfeng Huang

【Abstract】: With the explosion of online news, personalized news recommendation becomes increasingly important for online news platforms to help their users find interesting information. Existing news recommendation methods achieve personalization by building accurate news representations from news content and user representations from their direct interactions with news (e.g., click), while ignoring the high-order relatedness between users and news. Here we propose a news recommendation method which can enhance the representation learning of users and news by modeling their relatedness in a graph setting. In our method, users and news are both viewed as nodes in a bipartite graph constructed from historical user click behaviors. For news representations, a transformer architecture is first exploited to build news semantic representations. Then we combine it with the information from neighbor news in the graph via a graph attention network. For user representations, we not only represent users from their historically clicked news, but also attentively incorporate the representations of their neighbor users in the graph. Improved performances on a large-scale real-world dataset validate the effectiveness of our proposed method.

【Keywords】:

280. End-to-End Deep Attentive Personalized Item Retrieval for Online Content-sharing Platforms.

Paper Link】 【Pages】:2870-2877

【Authors】: Jyun-Yu Jiang ; Tao Wu ; Georgios Roumpos ; Heng-Tze Cheng ; Xinyang Yi ; Ed Chi ; Harish Ganapathy ; Nitin Jindal ; Pei Cao ; Wei Wang

【Abstract】: Modern online content-sharing platforms host billions of items like music, videos, and products uploaded by various providers for users to discover items of interest. To satisfy the information needs, the task of effective item retrieval (or item search ranking) given user search queries has become one of the most fundamental problems to online content-sharing platforms. Moreover, the same query can represent different search intents for different users, so personalization is also essential for providing more satisfactory search results. Different from other similar research tasks, such as ad-hoc retrieval and product retrieval with copious words and reviews, items in content-sharing platforms usually lack sufficient descriptive information and related meta-data as features. In this paper, we propose the end-to-end deep attentive model (EDAM) to deal with personalized item retrieval for online content-sharing platforms using only discrete personal item history and queries. Each discrete item in the personal item history of a user and its content provider are first mapped to embedding vectors as continuous representations. A query-aware attention mechanism is then applied to identify the relevant contexts in the user history and construct the overall personal representation for a given query. Finally, an extreme multi-class softmax classifier aggregates the representations of both query and personal item history to provide personalized search results. We conduct extensive experiments on a large-scale real-world dataset with hundreds of millions of users from a large video media platform at Google. The experimental results demonstrate that our proposed approach significantly outperforms several competitive baseline methods. It is also worth mentioning that this work utilizes a massive dataset from a real-world commercial content-sharing platform for personalized item retrieval to provide more insightful analysis from the industrial aspects.

【Keywords】:
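
The query-aware attention step described in the abstract above can be sketched in a few lines. This is a generic illustration, not the EDAM architecture: scoring each history item by its dot product with the query embedding is an assumption, since the abstract does not specify the scoring function. The function name `attend` and the toy vectors are also hypothetical.

```python
import math


def attend(query_vec, history_vecs):
    """Return (weights, context) for one query over a user's item history.

    weights: softmax over dot-product scores between the query and each item.
    context: attention-weighted sum of the item embeddings.
    Dot-product scoring is an illustrative assumption.
    """
    scores = [sum(q * h for q, h in zip(query_vec, item)) for item in history_vecs]
    peak = max(scores)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query_vec)
    context = [
        sum(w * item[i] for w, item in zip(weights, history_vecs))
        for i in range(dim)
    ]
    return weights, context


query = [1.0, 0.0]
history = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]  # toy item embeddings
weights, context = attend(query, history)
print(weights, context)
```

History items whose embeddings align with the query receive larger weights, so the resulting context vector depends on the query rather than being a fixed summary of the user.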

281. Multimodal Post Attentive Profiling for Influencer Marketing.

Paper Link】 【Pages】:2878-2884

【Authors】: Seungbae Kim ; Jyun-Yu Jiang ; Masaki Nakada ; Jinyoung Han ; Wei Wang

【Abstract】: Influencer marketing has become a key marketing method for brands in recent years. Hence, brands have been increasingly utilizing influencers’ social networks to reach niche markets, and researchers have been studying various aspects of influencer marketing. However, brands have often struggled to find and hire the right influencers with specific interests/topics for their marketing due to a lack of available influencer data and/or limited capacity of marketing agencies. This paper proposes a multimodal deep learning model that uses text and image information from social media posts (i) to classify influencers into specific interests/topics (e.g., fashion, beauty) and (ii) to classify their posts into certain categories. We use the attention mechanism to select the posts that are more relevant to the topics of influencers, thereby generating useful influencer representations. We conduct experiments on the dataset crawled from Instagram, which is the most popular social media for influencer marketing. The experimental results show that our proposed model significantly outperforms existing user profiling methods by achieving 98% and 96% accuracy in classifying influencers and their posts, respectively. We release our influencer dataset of 33,935 influencers labeled with specific topics based on 10,180,500 posts to facilitate future research.

【Keywords】: Information systems; Information systems applications; Data mining

282. Voice-based Reformulation of Community Answers.

Paper Link】 【Pages】:2885-2891

【Authors】: Simone Filice ; Nachshon Cohen ; David Carmel

【Abstract】: Community Question Answering (CQA) websites, such as Stack Exchange or Quora, allow users to freely ask questions and obtain answers from other users, i.e., the community. Personal assistants, such as Amazon Alexa or Google Home, can also exploit CQA data to answer a broader range of questions and increase customers’ engagement. However, the voice-based interaction poses new challenges to the Question Answering scenario. Even assuming that we are able to retrieve a previously asked question that perfectly matches the user’s query, we cannot simply read its answer to the user. A major limitation is the answer length. Reading these answers to the user is cumbersome and boring. Furthermore, many answers contain non-voice-friendly parts, such as images, or URLs.

【Keywords】:

283. VRoC: Variational Autoencoder-aided Multi-task Rumor Classifier Based on Text.

Paper Link】 【Pages】:2892-2898

【Authors】: Mingxi Cheng ; Shahin Nazarian ; Paul Bogdan

【Abstract】: Social media became popular and percolated almost all aspects of our daily lives. While online posting proves very convenient for individual users, it also fosters fast-spreading of various rumors. The rapid and wide percolation of rumors can cause persistent adverse or detrimental impacts. Therefore, researchers invest great efforts on reducing the negative impacts of rumors. Towards this end, the rumor classification system aims to detect, track, and verify rumors in social media. Such systems typically include four components: (i) a rumor detector, (ii) a rumor tracker, (iii) a stance classifier, and (iv) a veracity classifier. In order to improve the state-of-the-art in rumor detection, tracking, and verification, we propose VRoC, a tweet-level variational autoencoder-based rumor classification system. VRoC consists of a co-train engine that trains variational autoencoders (VAEs) and rumor classification components. The co-train engine helps the VAEs to tune their latent representations to be classifier-friendly. We also show that VRoC is able to classify unseen rumors with high levels of accuracy. For the PHEME dataset, VRoC consistently outperforms several state-of-the-art techniques, on both observed and unobserved rumors, by up to 26.9%, in terms of macro-F1 scores.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

284. Detecting Undisclosed Paid Editing in Wikipedia.

Paper Link】 【Pages】:2899-2905

【Authors】: Nikesh Joshi ; Francesca Spezzano ; Mayson Green ; Elijah Hill

【Abstract】: Wikipedia, the free and open-collaboration based online encyclopedia, has millions of pages that are maintained by thousands of volunteer editors. As per Wikipedia’s fundamental principles, pages on Wikipedia are written with a neutral point of view and maintained by volunteer editors for free with well-defined guidelines in order to avoid or disclose any conflict of interest. However, there have been several known incidents where editors intentionally violate such guidelines in order to get paid (or even extort money) for maintaining promotional spam articles without disclosing such.

【Keywords】: Human-centered computing; Human computer interaction (HCI); Interaction paradigms; Web-based interaction

285. Bursts of Activity: Temporal Patterns of Help-Seeking and Support in Online Mental Health Forums.

Paper Link】 【Pages】:2906-2912

【Authors】: Taisa Kushner ; Amit Sharma

【Abstract】: Recent years have seen a rise in social media platforms that provide peer-to-peer support to individuals suffering from mental distress. Studies on the impact of these platforms have focused on either short-term scales of single-post threads, or long-term changes over arbitrary periods of time (months or years). While important, such periods of time do not necessarily follow users’ progressions through acute periods of distress. Using data from Talklife, a mental health platform, we find that user activity follows a distinct pattern of high activity periods with interleaving periods of no activity, and propose a method for identifying such bursts & breaks in activity. We then show how studying activity during bursts can provide a personalized, medium-term analysis for a key question in online mental health communities: What characteristics of user activity lead some users to find support and help, while others fall short? Using two independent outcome metrics, moments of cognitive change and self-reported changes in mood during a burst of activity, we identify two actionable features that can improve outcomes for users: persistence within bursts, and giving complex emotional support to others. Our results demonstrate the value of considering bursts as a natural unit of analysis for psychosocial change in online mental health communities.

【Keywords】:

286. Envy, Regret, and Social Welfare Loss.

Paper Link】 【Pages】:2913-2919

【Authors】: Riccardo Colini-Baldeschi ; Stefano Leonardi ; Okke Schrijvers ; Eric Sodomka

【Abstract】: Incentive compatibility (IC) is a desirable property for any auction mechanism, including those used in online advertising. However, in real world applications practical constraints and complex environments often result in mechanisms that lack incentive compatibility. Recently, several papers investigated the problem of deploying black-box statistical tests to determine if an auction mechanism is incentive compatible by using the notion of IC-Regret that measures the regret of a truthful bidder. Unfortunately, most of those methods are computationally intensive, since they require the execution of many counterfactual experiments.

【Keywords】: Applied computing; Law, social and behavioral sciences; Economics; Theory of computation; Design and analysis of algorithms

287. ShapeVis: High-dimensional Data Visualization at Scale.

Paper Link】 【Pages】:2920-2926

【Authors】: Nupur Kumari ; Siddarth R. ; Akash Rupela ; Piyush Gupta ; Balaji Krishnamurthy

【Abstract】: We present ShapeVis, a scalable visualization technique for point cloud data inspired by topological data analysis. Our method captures the underlying geometric and topological structure of the data in a compressed graphical representation. Much success has been reported by the data visualization technique Mapper, that discretely approximates the Reeb graph of a filter function on the data. However, when using standard dimensionality reduction algorithms as the filter function, Mapper suffers from considerable computational cost. This makes it difficult to scale to high-dimensional data. Our proposed technique relies on finding a subset of points called landmarks along the data manifold to construct a weighted witness-graph over it. This graph captures the structural characteristics of the point cloud, and its weights are determined using a Finite Markov Chain. We further compress this graph by applying induced maps from standard community detection algorithms. Using techniques borrowed from manifold tearing, we prune and reinstate edges in the induced graph based on their modularity to summarize the shape of data. We empirically demonstrate how our technique captures the structural characteristics of real and synthetic data sets. Further, we compare our approach with Mapper using various filter functions like t-SNE, UMAP, LargeVis and show that our algorithm scales to millions of data points while preserving the quality of data visualization.

【Keywords】: Information systems; Information systems applications; Data mining

288. Using Cliques with Higher-order Spectral Embeddings Improves Graph Visualizations.

Paper Link】 【Pages】:2927-2933

【Authors】: Huda Nassar ; Caitlin Kennedy ; Shweta Jain ; Austin R. Benson ; David F. Gleich

【Abstract】: In the simplest setting, graph visualization is the problem of producing a set of two-dimensional coordinates for each node that meaningfully shows connections and latent structure in a graph. Among other uses, having a meaningful layout is often useful to help interpret the results from network science tasks such as community detection and link prediction. There are several existing graph visualization techniques in the literature that are based on spectral methods, graph embeddings, or optimizing graph distances. Despite the large number of methods, it is still often challenging or extremely time consuming to produce meaningful layouts of graphs with hundreds of thousands of vertices. Existing methods often either fail to produce a visualization in a meaningful time window, or produce a layout colorfully called a “hairball”, which does not illustrate any internal structure in the graph. Here, we show that adding higher-order information based on cliques to a classic eigenvector based graph visualization technique enables it to produce meaningful plots of large graphs. We further evaluate these visualizations along a number of graph visualization metrics and we find that it outperforms existing techniques on a metric that uses random walks to measure the local structure. Finally, we show many examples of how our algorithm successfully produces layouts of large networks. Code to reproduce our results is available.

【Keywords】: Mathematics of computing; Discrete mathematics; Graph theory; Graph algorithms

289. Distant Supervision for Multi-Stage Fine-Tuning in Retrieval-Based Question Answering.

Paper Link】 【Pages】:2934-2940

【Authors】: Yuqing Xie ; Wei Yang ; Luchen Tan ; Kun Xiong ; Nicholas Jing Yuan ; Baoxing Huai ; Ming Li ; Jimmy Lin

【Abstract】: We tackle the problem of question answering directly on a large document collection, combining simple “bag of words” passage retrieval with a BERT-based reader for extracting answer spans. In the context of this architecture, we present a data augmentation technique using distant supervision to automatically annotate paragraphs as either positive or negative examples to supplement existing training data, which are then used together to fine-tune BERT. We explore a number of details that are critical to achieving high accuracy in this setup: the proper sequencing of different datasets during fine-tuning, the balance between “difficult” vs. “easy” examples, and different approaches to gathering negative examples. Experimental results show that, with the appropriate settings, we can achieve large gains in effectiveness on two English and two Chinese QA datasets. We are able to achieve results at or near the state of the art without any modeling advances, which once again affirms the cliché “there’s no data like more data”.

【Keywords】:

290. NCVis: Noise Contrastive Approach for Scalable Visualization.

Paper Link】 【Pages】:2941-2947

【Authors】: Aleksandr Artemenkov ; Maxim Panov

【Abstract】: Modern methods for data visualization via dimensionality reduction, such as t-SNE, usually have performance issues that prohibit their application to large amounts of high-dimensional data. In this work, we propose NCVis – a high-performance dimensionality reduction method built on a sound statistical basis of noise contrastive estimation. We show that NCVis outperforms state-of-the-art techniques in terms of speed while preserving the representation quality of other methods. In particular, the proposed approach successfully processes a large dataset of more than 1 million news headlines in several minutes and presents the underlying structure in a human-readable way. Moreover, it provides results consistent with classical methods like t-SNE on more straightforward datasets like images of hand-written digits. We believe that the broader usage of such software can significantly simplify the large-scale data analysis and lower the entry barrier to this area.

【Keywords】: Computing methodologies; Machine learning

291. I've Got Your Packages: Harvesting Customers' Delivery Order Information using Package Tracking Number Enumeration Attacks.

Paper Link】 【Pages】:2948-2954

【Authors】: Simon S. Woo ; Hanbin Jang ; Woojung Ji ; Hyoungshick Kim

【Abstract】: A package tracking number (PTN) is widely used to monitor and track a shipment. Through the lenses of security and privacy, however, a package tracking number can possibly reveal certain personal information, leading to security and privacy breaches. In this work, we examine the privacy issues associated with online package tracking systems used in the top three most popular package delivery service providers (FedEx, DHL, and UPS) in the world and found that those websites inadvertently leak users’ personal data with a PTN. Moreover, we discovered that PTNs are highly structured and predictable. Therefore, customers’ personal data can be massively collected via PTN enumeration attacks. We analyzed more than one million package tracking records obtained from FedEx, DHL, and UPS, and showed that within 5 attempts, an attacker can efficiently guess more than 90% of PTNs for FedEx and DHL, and close to 50% of PTNs for UPS. In addition, we present two practical attack scenarios: 1) to infer business transactions information and 2) to uniquely identify recipients. Also, we found that more than 109 recipients can be uniquely identified with fewer than 10 comparisons by linking the PTN information with the online people search service, Whitepages.

【Keywords】: Security and privacy; Human and societal aspects of security and privacy

292. Crowdsourcing Detection of Sampling Biases in Image Datasets.

Paper Link】 【Pages】:2955-2961

【Authors】: Xiao Hu ; Haobo Wang ; Anirudh Vegesana ; Somesh Dube ; Kaiwen Yu ; Gore Kao ; Shuo-Han Chen ; Yung-Hsiang Lu ; George K. Thiruvathukal ; Ming Yin

【Abstract】: Despite many exciting innovations in computer vision, recent studies reveal a number of risks in existing computer vision systems, suggesting results of such systems may be unfair and untrustworthy. Many of these risks can be partly attributed to the use of a training image dataset that exhibits sampling biases and thus does not accurately reflect the real visual world. Being able to detect potential sampling biases in the visual dataset prior to model development is thus essential for mitigating the fairness and trustworthiness concerns in computer vision. In this paper, we propose a three-step crowdsourcing workflow to get humans into the loop for facilitating bias discovery in image datasets. Through two sets of evaluation studies, we find that the proposed workflow can effectively organize the crowd to detect sampling biases in both datasets that are artificially created with designed biases and real-world image datasets that are widely used in computer vision research and system development.

【Keywords】:

293. Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing.

Paper Link】 【Pages】:2962-2968

【Authors】: Subendhu Rongali ; Luca Soldaini ; Emilio Monti ; Wael Hamza

【Abstract】: Virtual assistants such as Amazon Alexa, Apple Siri, and Google Assistant often rely on a semantic parsing component to understand which action(s) to execute for an utterance spoken by their users. Traditionally, rule-based or statistical slot-filling systems have been used to parse “simple” queries; that is, queries that contain a single action and can be decomposed into a set of non-overlapping entities. More recently, shift-reduce parsers have been proposed to process more complex utterances. These methods, while powerful, impose specific limitations on the type of queries that can be parsed; namely, they require a query to be representable as a parse tree.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing; Machine learning; Machine learning approaches; Neural networks

294. Addressing the Target Customer Distortion Problem in Recommender Systems.

Paper Link】 【Pages】:2969-2975

【Authors】: Xing Zhao ; Ziwei Zhu ; Majid Alfifi ; James Caverlee

【Abstract】: Predicting the potential target customers for a product is essential. However, traditional recommender systems typically aim to optimize an engagement metric without considering the overall distribution of target customers, thereby leading to serious distortion problems. In this paper, we conduct a data-driven study to reveal several distortions that arise from conventional recommenders. Toward overcoming these issues, we propose a target customer re-ranking algorithm to adjust the population distribution and composition in the Top-k target customers of an item while maintaining recommendation quality. By applying this proposed algorithm onto a real-world dataset, we find the proposed method can effectively make the class distribution of items’ target customers close to the desired distribution, thereby mitigating distortion.

【Keywords】: Information systems; Information retrieval; Retrieval tasks and goals; Recommender systems

295. Quantifying Community Characteristics of Maternal Mortality Using Social Media.

Paper Link】 【Pages】:2976-2983

【Authors】: Rediet Abebe ; Salvatore Giorgi ; Anna Tedijanto ; Anneke Buffone ; H. Andrew Schwartz

【Abstract】: While most mortality rates have decreased in the US, maternal mortality has increased and is among the highest of any OECD nation. Extensive public health research is ongoing to better understand the characteristics of communities with relatively high or low rates. In this work, we explore the role that social media language can play in providing insights into such community characteristics. Analyzing pregnancy-related tweets generated in US counties, we reveal a diverse set of latent topics including Morning Sickness, Celebrity Pregnancies, and Abortion Rights. We find that rates of mentioning these topics on Twitter predict maternal mortality rates with higher accuracy than standard socioeconomic and risk variables such as income, race, and access to health-care, holding even after reducing the analysis to six topics chosen for their interpretability and connections to known risk factors. We then investigate psychological dimensions of community language, finding the use of less trustful, more stressed, and more negative affective language is significantly associated with higher mortality rates, while trust and negative affect also explain a significant portion of racial disparities in maternal mortality. We discuss the potential for these insights to inform actionable health interventions at the community-level.

【Keywords】:

296. Adaptive Hierarchical Translation-based Sequential Recommendation.

Paper Link】 【Pages】:2984-2990

【Authors】: Yin Zhang ; Yun He ; Jianling Wang ; James Caverlee

【Abstract】: We propose an adaptive hierarchical translation-based sequential recommendation called HierTrans that first extends traditional item-level relations to the category-level, to help capture dynamic sequence patterns that can generalize across users and time. Then unlike item-level based methods, we build a novel hierarchical temporal graph that contains item multi-relations at the category-level and user dynamic sequences at the item-level. Based on the graph, HierTrans adaptively aggregates the high-order multi-relations among items and dynamic user preferences to capture the dynamic joint influence for next-item recommendation. Specifically, the user translation vector in HierTrans can adaptively change based on both a user’s previous interacted items and the item relations inside the user’s sequences, as well as the user’s personal dynamic preference. Experiments on public datasets demonstrate the proposed model HierTrans consistently outperforms state-of-the-art sequential recommendation methods.

【Keywords】:

297. What Sparks Joy: The AffectVec Emotion Database.

Paper Link】 【Pages】:2991-2997

【Authors】: Shahab Raji ; Gerard de Melo

【Abstract】: Affective analysis of textual data is instrumental in understanding human communication in the modern era of social media. A number of resources have been proposed in attempts to characterize the emotions tied to words in a text. In this work, we show that we can obtain a database that goes beyond the common binary scores for emotion classification provided by past work. Instead, we harness the power of Big Data by using neural vector space models trained with large-scale supervision from co-occurrence patterns. We modify the vector space to better account for emotional associations, which then enables us to induce AffectVec, a new emotion database providing graded emotion intensity scores for English language words with regard to a fine-grained inventory of over 200 different emotion categories. Our experiments show that AffectVec outperforms existing emotion lexicons by substantial margins in intrinsic evaluations as well as for affective text classification.

【Keywords】: Computing methodologies; Artificial intelligence; Natural language processing

298. LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew.

Paper Link】 【Pages】:2998-3004

【Authors】: Cyrus Rashtchian ; Aneesh Sharma ; David P. Woodruff

【Abstract】: All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. Traditionally, similarity search has focused on discovering very similar pairs, for which a variety of efficient algorithms are known. However, recent work has highlighted the importance of discovering pairs of sets with relatively small intersection sizes. For example, in a recommender system, two users may be alike even though their interests only overlap on a small percentage of items. In such systems, it is also common that some dimensions are highly-skewed, because they are very popular. Together, these two properties render previous approaches infeasible for large input sizes. To address this problem, we present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. The core of our algorithm is a randomized selection procedure based on Locality Sensitive Filtering. In particular, our method deviates from prior approximate algorithms, which are based on Locality Sensitive Hashing. Theoretically, we show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets. We prove guarantees on the communication, work, and maximum load of LSF-Join, and we also experimentally demonstrate its accuracy on multiple graphs.

【Keywords】:

299. Analyzing the Use of Audio Messages in WhatsApp Groups.

Paper Link】 【Pages】:3005-3011

【Authors】: Alexandre Maros ; Jussara M. Almeida ; Fabrício Benevenuto ; Marisa Vasconcelos

【Abstract】: WhatsApp is a free messaging app with more than one billion active monthly users which has become one of the main communication platforms in many countries, including Saudi Arabia, Germany, and Brazil. In addition to allowing the direct exchange of messages among pairs of users, the app also enables group conversations, where multiple people can interact with one another. A number of recent studies have shown that WhatsApp groups play an important role as an information dissemination platform, especially during important social mobilization events. In this paper, we build upon those prior efforts by taking a first look into the use of audio messages in WhatsApp groups, a type of content that is becoming increasingly important in the platform. We present a methodology to analyze audio messages shared in WhatsApp groups, characterizing content properties (e.g., topics and language characteristics), their propagation dynamics and the impact of different types of audios (e.g., speech versus music) on such dynamics.

【Keywords】:

300. Understanding User Behavior For Document Recommendation.

Paper Link】 【Pages】:3012-3018

【Authors】: Xuhai Xu ; Ahmed Hassan Awadallah ; Susan T. Dumais ; Farheen Omar ; Bogdan Popp ; Robert Rounthwaite ; Farnaz Jahanbakhsh

【Abstract】: Personalized document recommendation systems aim to provide users with a quick shortcut to the documents they may want to access next, usually with an explanation about why the document is recommended. Previous work explored various methods for better recommendations and better explanations in different domains. However, there are few efforts that closely study how users react to the recommended items in a document recommendation scenario. We conducted a large-scale log study of users’ interaction behavior with the explainable recommendation on one of the largest cloud document platforms office.com. Our analysis reveals a number of factors, including display position, file type, authorship, recency of last access, and most importantly, the recommendation explanations, that are associated with whether users will recognize or open the recommended documents. Moreover, we specifically focus on explanations and conduct an online experiment to investigate the influence of different explanations on user behavior. Our analysis indicates that the recommendations help users access their documents significantly faster, but sometimes users miss a recommendation and resort to other more complicated methods to open the documents. Our results suggest opportunities to improve explanations and more generally the design of systems that provide and explain recommendations for documents.

【Keywords】:

301. Influence Function based Data Poisoning Attacks to Top-N Recommender Systems.

Paper Link】 【Pages】:3019-3025

【Authors】: Minghong Fang ; Neil Zhenqiang Gong ; Jia Liu

【Abstract】: Recommender system is an essential component of web services to engage users. Popular recommender systems model user preferences and item properties using a large amount of crowdsourced user-item interaction data, e.g., rating scores; then top-N items that match the best with a user’s preference are recommended to the user. In this work, we show that an attacker can launch a data poisoning attack to a recommender system to make recommendations as the attacker desires via injecting fake users with carefully crafted user-item interaction data. Specifically, an attacker can trick a recommender system to recommend a target item to as many normal users as possible. We focus on matrix factorization based recommender systems because they have been widely deployed in industry. Given the number of fake users the attacker can inject, we formulate the crafting of rating scores for the fake users as an optimization problem. However, this optimization problem is challenging to solve as it is a non-convex integer programming problem. To address the challenge, we develop several techniques to approximately solve the optimization problem. For instance, we leverage influence function to select a subset of normal users who are influential to the recommendations and solve our formulated optimization problem based on these influential users. Our results show that our attacks are effective and outperform existing methods.

【Keywords】:

302. Continuous-Time Link Prediction via Temporal Dependent Graph Neural Network.

Paper Link】 【Pages】:3026-3032

【Authors】: Liang Qu ; Huaisheng Zhu ; Qiqi Duan ; Yuhui Shi

【Abstract】: Recently, graph neural networks (GNNs) have been shown to be an effective tool for learning the node representations of networks and have achieved good performance on the semi-supervised node classification task. However, most existing GNN methods fail to take the networks’ temporal information into account and therefore cannot be well applied to dynamic network applications such as the continuous-time link prediction task. To address this problem, we propose a Temporal Dependent Graph Neural Network (TDGNN), a simple yet effective dynamic network representation learning framework which incorporates the network temporal information into GNNs. TDGNN introduces a novel Temporal Aggregator (TDAgg) to aggregate the neighbor nodes’ features and edges’ temporal information to obtain the target node representations. Specifically, it assigns the neighbor nodes aggregation weights using an exponential distribution to bias different edges’ temporal information. The performance of the proposed method has been validated on six real-world dynamic network datasets for the continuous-time link prediction task. The experimental results show that the proposed method outperforms several state-of-the-art baselines.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks
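
As an illustration of the temporal aggregation idea described in the abstract above, the following minimal sketch weights each neighbor's features by an exponential function of the age of the connecting edge and then normalizes the weights. The function name `temporal_aggregate`, the `decay` parameter, and the exact weighting form are assumptions made for illustration; this is not the authors' TDAgg implementation.

```python
import math


def temporal_aggregate(neighbor_feats, edge_times, t_now, decay=1.0):
    """Weighted mean of neighbor feature vectors (hypothetical helper).

    More recent edges receive larger weights: each raw weight is
    exp(-decay * (t_now - t_edge)), and the weights are normalized to sum to 1.
    """
    if not neighbor_feats:
        raise ValueError("need at least one neighbor")
    if len(neighbor_feats) != len(edge_times):
        raise ValueError("one timestamp per neighbor is required")

    # Unnormalized exponential weights that decay with edge age.
    raw = [math.exp(-decay * (t_now - t)) for t in edge_times]
    total = sum(raw)
    weights = [w / total for w in raw]

    # Weighted sum over neighbors, computed per feature dimension.
    dim = len(neighbor_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, neighbor_feats))
        for d in range(dim)
    ]
```

With equal timestamps the result reduces to a plain mean of the neighbor features; as one edge becomes much older than the others, its neighbor contributes less and less to the aggregate.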

303. Are These Comments Triggering? Predicting Triggers of Toxicity in Online Discussions.

Paper Link】 【Pages】:3033-3040

【Authors】: Hind Almerekhi ; Haewoon Kwak ; Joni Salminen ; Bernard J. Jansen

【Abstract】: Understanding the causes or triggers of toxicity adds a new dimension to the prevention of toxic behavior in online discussions. In this research, we define toxicity triggers in online discussions as non-toxic comments that lead to toxic replies. Then, we build a neural network-based prediction model for toxicity triggers. The prediction model incorporates text-based features and derived features from previous studies that pertain to shifts in sentiment, topic flow, and discussion context. Our findings show that triggers of toxicity contain identifiable features and that, by incorporating shift features with the discussion context, they can be detected with a ROC-AUC score of 0.87. We discuss implications for online communities and also possible further analysis of online toxicity and its root causes.

【Keywords】:

304. Sampling Query Variations for Learning to Rank to Improve Automatic Boolean Query Generation in Systematic Reviews.

Paper Link】 【Pages】:3041-3048

【Authors】: Harrisen Scells ; Guido Zuccon ; Mohamed A. Sharaf ; Bevan Koopman

【Abstract】: Searching medical literature for synthesis in a systematic review is a complex and labour-intensive task. In this context, expert searchers construct lengthy Boolean queries. The universe of possible query variations can be massive: a single query can be composed of hundreds of field-restricted search terms/phrases or ontological concepts, each grouped by a logical operator and sometimes nested five or more levels deep. With the many choices about how to construct a query, it is difficult to both formulate and recognise effective queries. To address this challenge, automatic methods have recently been explored for generating and selecting effective Boolean query variations for systematic reviews. The limiting factor of these methods is that it is computationally infeasible to process all query variations for training the methods. To overcome this, we propose novel query variation sampling methods for training Learning to Rank models to rank queries. Our results show that query sampling methods do directly impact the ability of a Learning to Rank model to effectively identify good query variations. Thus, selecting appropriate query sampling methods is a key problem for the automatic reformulation of effective Boolean queries for systematic review literature search. We find that the best sampling strategies are those which balance the diversity of queries with the quantity of queries.

【Keywords】: Information systems; Information retrieval; Information retrieval query processing

305. Learning Temporal Interaction Graph Embedding via Coupled Memory Networks.

Paper Link】 【Pages】:3049-3055

【Authors】: Zhen Zhang ; Jiajun Bu ; Martin Ester ; Jianfeng Zhang ; Chengwei Yao ; Zhao Li ; Can Wang

【Abstract】: Graph embedding has become a research focus in both academic and industrial communities due to its powerful capabilities. The majority of existing work learns node embeddings in the context of static, plain or attributed, homogeneous graphs. However, many real-world applications frequently involve bipartite graphs with temporal and attributed interaction edges, named temporal interaction graphs. The temporal interactions usually imply different facets of interest and might even evolve over time, thus posing huge challenges for learning effective node representations. In this paper, we propose a novel framework named TigeCMN to learn node representations from a sequence of temporal interactions. Specifically, we devise two coupled memory networks to store and update node embeddings in external matrices explicitly and dynamically, which forms deep matrix representations and could enhance the expressiveness of the node embeddings. We conduct experiments on two real-world datasets and the experimental results empirically demonstrate that TigeCMN can outperform state-of-the-art methods with different gains.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

306. Beyond Clicks: Modeling Multi-Relational Item Graph for Session-Based Target Behavior Prediction.

Paper Link】 【Pages】:3056-3062

【Authors】: Wen Wang ; Wei Zhang ; Shukai Liu ; Qi Liu ; Bo Zhang ; Leyu Lin ; Hongyuan Zha

【Abstract】: Session-based target behavior prediction aims to predict the next item to be interacted with under a specific behavior type (e.g., clicking). Although existing methods for session-based behavior prediction leverage powerful representation learning approaches to encode items’ sequential relevance in a low-dimensional space, they suffer from several limitations. Firstly, they focus on only utilizing the same type of user behavior for prediction, but ignore the potential of taking other behavior data as auxiliary information. This is particularly crucial when the target behavior is sparse but important (e.g., buying or sharing an item). Secondly, item-to-item relations are modeled separately and locally in one behavior sequence, and they lack a principled way to globally encode these relations more effectively. To overcome these limitations, we propose a novel Multi-relational Graph Neural Network model for Session-based target behavior Prediction, MGNN-SPred for short. Specifically, we build a Multi-Relational Item Graph (MRIG) based on all behavior sequences from all sessions, involving target and auxiliary behavior types. Based on MRIG, MGNN-SPred learns global item-to-item relations and further obtains user preferences w.r.t. current target and auxiliary behavior sequences, respectively. In the end, MGNN-SPred leverages a gating mechanism to adaptively fuse user representations for predicting the next item interacted with under the target behavior. Extensive experiments on two real-world datasets demonstrate the superiority of MGNN-SPred over state-of-the-art session-based prediction methods, validating the benefits of leveraging auxiliary behavior and learning item-to-item relations over MRIG.

【Keywords】: Computing methodologies; Machine learning; Machine learning approaches; Neural networks

307. An Empirical Study of Android Security Bulletins in Different Vendors.

Paper Link】 【Pages】:3063-3069

【Authors】: Sadegh Farhang ; Mehmet Bahadir Kirdan ; Aron Laszka ; Jens Grossklags

【Abstract】: Mobile devices encroach on almost every part of our lives, including work and leisure, and contain a wealth of personal and sensitive information. It is, therefore, imperative that these devices uphold high security standards. A key aspect is the security of the underlying operating system. In particular, Android plays a critical role due to being the most dominant platform in the mobile ecosystem with more than one billion active devices and due to its openness, which allows vendors to adopt and customize it. Similar to other platforms, Android maintains security by providing monthly security patches and announcing them via the Android security bulletin. To absorb this information successfully across the Android ecosystem, impeccable coordination by many different vendors is required.

【Keywords】:

308. One2Multi Graph Autoencoder for Multi-view Graph Clustering.

Paper Link】 【Pages】:3070-3076

【Authors】: Shaohua Fan ; Xiao Wang ; Chuan Shi ; Emiao Lu ; Ken Lin ; Bai Wang

【Abstract】: Multi-view graph clustering, which seeks a partition of the graph with multiple views that often provide more comprehensive yet complex information, has received considerable attention in recent years. Although some efforts have been made for multi-view graph clustering and have achieved decent performance, most of them employ shallow models to deal with the complex relations within multi-view graphs, which may seriously restrict the capacity for modeling multi-view graph information. In this paper, we make the first attempt to employ deep learning techniques for attributed multi-view graph clustering, and propose a novel task-guided One2Multi graph autoencoder clustering framework. The One2Multi graph autoencoder is able to learn node embeddings by employing one informative graph view and content data to reconstruct multiple graph views. Hence, the shared feature representation of multiple graphs can be well captured. Furthermore, a self-training clustering objective is proposed to iteratively improve the clustering results. By integrating the self-training and autoencoder’s reconstruction into a unified framework, our model can jointly optimize the cluster label assignments and embeddings suitable for graph clustering. Experiments on real-world attributed multi-view graph datasets validate the effectiveness of our model.

【Keywords】:

309. Improved Touch-screen Inputting Using Sequence-level Prediction Generation.

Paper Link】 【Pages】:3077-3083

【Authors】: Xin Wang ; Xu Li ; Jinxing Yu ; Mingming Sun ; Ping Li

【Abstract】: Recent years have witnessed the continuing growth of people’s dependence on touchscreen devices. As a result, input speed with the onscreen keyboard has become crucial to communication efficiency and user experience. In this work, we formally discuss the general problem of input expectation prediction with a touch-screen input method editor (IME). Taking input efficiency as the optimization target, we propose a neural end-to-end candidate generation solution to handle automatic correction, reordering, insertion, deletion as well as completion. Evaluation metrics are also discussed based on real use scenarios. For a more thorough comparison, we also provide a statistical strategy for mapping touch coordinate sequences to text input candidates. The proposed model and baselines are evaluated on a real-world dataset. The experiment (conducted on the PaddlePaddle deep learning platform) shows that the proposed model outperforms the baselines.

【Keywords】:

310. P-Simrank: Extending Simrank to Scale-Free Bipartite Networks.

Paper Link】 【Pages】:3084-3090

【Authors】: Prasenjit Dey ; Kunal Goel ; Rahul Agrawal

【Abstract】: The measure of similarity between nodes in a graph is a useful tool in many areas of computer science. SimRank, proposed by Jeh and Widom [7], is a classic measure of the similarity of nodes in a graph that has both theoretical and intuitive properties and has been extensively studied and used in many applications such as query rewriting, link prediction, collaborative filtering and so on. Existing works based on Simrank primarily focus on preserving the microscopic structure, such as the second and third order proximity of the vertices, while the macroscopic scale-free property is largely ignored. The scale-free property is a critical property of real-world web graphs, where the vertex degrees follow a heavy-tailed distribution. In this paper, we introduce P-Simrank, which extends the idea of Simrank to scale-free bipartite networks. To study the efficacy of the proposed solution on a real-world problem, we tested it on the well-known query-rewriting problem in the sponsored search domain using a bipartite click graph, similar to Simrank++ [1], which acts as our baseline. We show that Simrank++ produces sub-optimal similarity scores in the case of bipartite graphs where the degree distribution of vertices follows a power law. We also show how P-Simrank can be optimized for large real-world graphs. Finally, we experimentally evaluate the P-Simrank algorithm against Simrank++, using actual click graphs obtained from Bing, and show that P-Simrank outperforms Simrank++ on a variety of metrics.

【Keywords】:
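
For readers unfamiliar with the baseline measure named in the abstract above, the following is a minimal sketch of the classic SimRank iteration of Jeh and Widom: two distinct nodes are similar to the extent that their in-neighbors are similar. It is plain SimRank only, not P-Simrank or Simrank++, whose details the abstract does not give. The function name, the `decay` default of 0.8, and the dictionary-based graph encoding are assumptions made for illustration.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive SimRank on a graph given as {node: [in-neighbor, ...]}.

    Every node that appears as an in-neighbor must also be a key.
    Runs in O(iterations * n^2 * d^2) time, so it only suits small graphs.
    """
    nodes = list(in_neighbors)
    # Initialization: a node is fully similar to itself, not at all to others.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}

    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # Nodes without in-neighbors have similarity 0 to others.
                    new[a][b] = 0.0
                    continue
                # Average similarity over all pairs of in-neighbors, decayed.
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim
```

For an undirected bipartite click graph, each node's in-neighbor list can simply be its neighbor list. For example, with `{"q1": ["ad"], "q2": ["ad"], "ad": ["q1", "q2"]}`, the two queries share their only neighbor, so their similarity equals the decay factor.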

311. Using Facebook Data to Measure Cultural Distance between Countries: The Case of Brazilian Cuisine.

Paper Link】 【Pages】:3091-3097

【Authors】: Carolina Coimbra Vieira ; Filipe Ribeiro ; Pedro Olmo Stancioli Vaz de Melo ; Fabrício Benevenuto ; Emilio Zagheni

【Abstract】: Measuring the affinity to a particular culture has been an active area of research. Countries and their residents can be characterized by many cultural aspects, such as clothing, music, art and food. As one of the central aspects, the cuisine of a country can reflect one of the dominant aspects of its culture. As such, the number of people interested in a typical national dish can be used to estimate the prevalence of that culture inside the host region. In this study, we measure the global spread of Brazilian culture across countries by exploring Facebook users’ preferences for typical Brazilian dishes through the Facebook Advertising Platform. To decide which dishes are considered typical of Brazil, we made use of spatial analysis to understand the distribution of interests around the world and to quantify how typical a dish is in Brazil and among Brazilian immigrants. This methodology can be generalized to other countries to infer cultural elements that emigrants usually take to and preserve in the countries they migrate to. Also, the interest in typical Brazilian dishes can be used to characterize countries in terms of Brazilian cultural exposure. While evaluating the cultural distance between Brazil and the countries with more Brazilian immigrants, we explore several measures of distance to compare these in the context of affinity to Brazilian cuisine. Our results revealed that these cultural distance measures can complement other metrics of distance applied to gravity-type models, for example, in order to explain flows of people between countries.

【Keywords】:

312. Structure-Feature based Graph Self-adaptive Pooling.

Paper Link】 【Pages】:3098-3104

【Authors】: Liang Zhang ; Xudong Wang ; Hongsheng Li ; Guangming Zhu ; Peiyi Shen ; Ping Li ; Xiaoyuan Lu ; Syed Afaq Ali Shah ; Mohammed Bennamoun

【Abstract】: Various methods to deal with graph data have been proposed in recent years. However, most of these methods focus on graph feature aggregation rather than graph pooling. Besides, the existing top-k selection graph pooling methods have a few problems. First, to construct the pooled graph topology, current top-k selection methods evaluate the importance of the node from a single perspective only, which is simplistic and unobjective. Second, the feature information of unselected nodes is directly lost during the pooling process, which inevitably leads to a massive loss of graph feature information. To solve the problems mentioned above, we propose a novel graph self-adaptive pooling method with the following objectives: (1) to construct a reasonable pooled graph topology, structure and feature information of the graph are considered simultaneously, which provides additional veracity and objectivity in node selection; and (2) to make the pooled nodes contain sufficiently effective graph information, node feature information is aggregated before discarding the unimportant nodes; thus, the selected nodes contain information from neighbor nodes, which can enhance the use of features of the unselected nodes. Experimental results on four different datasets demonstrate that our method is effective in graph classification and outperforms state-of-the-art graph pooling methods.

【Keywords】:

313. Solving Billion-Scale Knapsack Problems.

Paper Link】 【Pages】:3105-3111

【Authors】: Xingwen Zhang ; Feng Qi ; Zhigang Hua ; Shuang Yang

【Abstract】: Knapsack problems (KPs) are common in industry, but solving KPs is known to be NP-hard and has been tractable only at a relatively small scale. This paper examines KPs in a slightly generalized form and shows that they can be solved nearly optimally at scale via distributed algorithms. The proposed approach can be implemented fairly easily with off-the-shelf distributed computing frameworks (e.g., MPI, Hadoop, Spark). As an example, our implementation leads to one of the most efficient KP solvers known to date, capable of solving KPs at an unprecedented scale (e.g., KPs with 1 billion decision variables and 1 billion constraints can be solved within 1 hour). The system has been deployed to production and is called on a daily basis, yielding significant business impacts at Ant Financial.

【Keywords】: Mathematics of computing; Mathematical analysis; Mathematical optimization; Theory of computation; Design and analysis of algorithms; Mathematical optimization

314. MineThrottle: Defending against Wasm In-Browser Cryptojacking.

Paper Link】 【Pages】:3112-3118

【Authors】: Weikang Bian ; Wei Meng ; Mingxue Zhang

【Abstract】: In-browser cryptojacking is an urgent threat to web users, where an attacker abuses the users’ computing resources without obtaining their consent. In-browser mining programs are usually developed in WebAssembly (Wasm) for its great performance. Several prior works have measured cryptojacking in the wild and proposed detection methods using static features and dynamic features. However, there exists no good defense mechanism within the user’s browser to stop the malicious drive-by mining behavior.

【Keywords】:

315. One Picture Is Worth a Thousand Words? The Pricing Power of Images in e-Commerce.

Paper Link】 【Pages】:3119-3125

【Authors】: Christof Naumzik ; Stefan Feuerriegel

【Abstract】: In e-commerce, product presentations, and particularly images, are known to provide important information for user decision-making, and yet the relationship between images and prices has not been studied. To close this research gap, we suggest a tailored web mining framework, since one must quantify the relative contribution of image content in describing prices ceteris paribus. That is, one must account for the fact that such images inherently depict heterogeneous products. In order to isolate the pricing power of image content, we suggest a three-stage framework involving deep learning and statistical inference.

【Keywords】:

316. Learning Model-Agnostic Counterfactual Explanations for Tabular Data.

Paper Link】 【Pages】:3126-3132

【Authors】: Martin Pawelczyk ; Klaus Broelemann ; Gjergji Kasneci

【Abstract】: Counterfactual explanations can be obtained by identifying the smallest change made to an input vector to influence a prediction in a positive way from a user’s viewpoint; for example, from ’loan rejected’ to ’awarded’ or from ’high risk of cardiovascular disease’ to ’low risk’. Previous approaches do not ensure that the produced counterfactuals are proximate (i.e., not local outliers) and connected to regions with substantial data density (i.e., close to correctly classified observations), two requirements known as counterfactual faithfulness. Our contribution is twofold. First, drawing ideas from the manifold learning literature, we develop a framework, called C-CHVAE, that generates faithful counterfactuals. Second, we suggest complementing the catalog of counterfactual quality measures with a criterion to quantify the degree of difficulty for a certain counterfactual suggestion. Our real-world experiments suggest that faithful counterfactuals come at the cost of higher degrees of difficulty.

【Keywords】: Computing methodologies; Machine learning

317. Domain Adaptation with Category Attention Network for Deep Sentiment Analysis.

Paper Link】 【Pages】:3133-3139

【Authors】: Dongbo Xi ; Fuzhen Zhuang ; Ganbin Zhou ; Xiaohu Cheng ; Fen Lin ; Qing He

【Abstract】: Domain adaptation tasks such as cross-domain sentiment classification aim to utilize existing labeled data in the source domain and unlabeled or few labeled data in the target domain to improve the performance in the target domain by reducing the shift between the data distributions. Existing cross-domain sentiment classification methods need to distinguish pivots, i.e., the domain-shared sentiment words, and non-pivots, i.e., the domain-specific sentiment words, for excellent adaptation performance. In this paper, we first design a Category Attention Network (CAN), and then propose a model named CAN-CNN to integrate CAN and a Convolutional Neural Network (CNN). On the one hand, the model regards pivots and non-pivots as unified category attribute words and can automatically capture them to improve the domain adaptation performance; on the other hand, the model makes an attempt at interpretability to learn the transferred category attribute words. Specifically, the optimization objective of our model has three different components: 1) the supervised classification loss; 2) the distribution loss of category feature weights; 3) the domain invariance loss. Finally, the proposed model is evaluated on three public sentiment analysis datasets and the results demonstrate that CAN-CNN can outperform various baseline methods.

【Keywords】:

Keynote Talk 3

318. Embedding the Scientific Record on the Web: Towards Automating Scientific Discoveries.

Paper Link】 【Pages】:3140

【Authors】: Yolanda Gil

【Abstract】: Future AI systems will be key contributors to science, but this is unlikely to happen unless we reinvent our current publications and embed our scientific records in the Web as structured Web objects. This implies that our scientific papers of the future will be complemented with explicit, structured descriptions of the experiments, software, data, and workflows used to reach new findings. These scientific papers of the future will not only culminate the promise of open science and reproducible research, but also enable the creation of AI systems that can ingest and organize scientific methods and processes, re-run experiments and re-analyze results, and explore their own hypotheses in systematic and unbiased ways. In this talk, I will describe guidelines for writing scientific papers of the future that embed the scientific record on the Web, and our progress on AI systems capable of using them to systematically explore experiments. I will also outline a research agenda with seven key characteristics for creating AI scientists that will exploit the Web to independently make new discoveries [1]. AI scientists have the potential to transform science and the processes of scientific discovery [2, 3].

【Keywords】: Computing methodologies; Artificial intelligence; Philosophical/theoretical foundations of artificial intelligence

319. Architectures for Autonomy: Towards an Equitable Web of Data in the Age of AI.

Paper Link】 【Pages】:3141-3142

【Authors】: Nigel Shadbolt

【Abstract】: Today, the Web connects over half the world's population, many of whom use it to stay connected to a multiplicity of vital digital public and private services, impacting every aspect of their lives. Access to the Web and underlying Internet is seen as essential for all—even a fundamental human right [7]. However, many contend that the power structure on large swaths of the Web has become inverted; they argue that instead of being run for and by users, it has been made to serve the platforms themselves, and the powerful actors that sponsor such platforms to run targeted advertising on their behalf. In such an ad-driven platform ecosystem, users, including their beliefs, data, and attention, have become traded commodities [13].

【Keywords】:

320. Democratizing Content Creation and Dissemination through AI Technology.

Paper Link】 【Pages】:3143

【Authors】: Wei-Ying Ma

【Abstract】: With the rise of mobile video, user-generated content, and social networks, there is a massive opportunity for disruptive innovations in the media and content industry. It is now a fast-changing landscape with rapid advances in AI-powered content creation, dissemination and interaction technologies. I believe the current trends are leading us towards a world where everyone is equally empowered to produce high-quality content in video, music, augmented reality or more, and to share their information, knowledge, and stories with a large global audience. This new AI-powered content platform can further lead to innovations in advertising, e-commerce, online education, and productivity. I will share the current research efforts at ByteDance connected to this emerging new platform through products such as Douyin and TikTok, and discuss the challenges and the direction of our future research.

【Keywords】: