26. CIKM 2017: Singapore

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06 - 10, 2017. ACM 【DBLP Link】

Paper Num: 350 || Session Num: 53

Demonstrations (alphabetical by lead authors' last names) 29
Keynote & Invited Talks 4
Session 1A: Multimedia 4
Session 1B: IR evaluation 4
Session 1C: Sentiment 4
Session 1D: Network Embedding 1 4
Session 1E: Web/App data 4
Session 1F: Graph data 4
Session 2A: Ranking 4
Session 2B: Crowdsourcing 1 4
Session 2C: Recommendation 1 4
Session 2D: Network Embedding 2 4
Session 2E: Skyline Queries 4
Session 2F: Social Media Analysis 4
Session 3A: Spatiotemporal 4
Session 3B: Short text retrieval 4
Session 3C: Community Detection 4
Session 3D: Time Series 4
Session 3E: Query processing 4
Session 3F: Temporal data 4
Session 4A: Evaluation 4
Session 4B: News and credibility 4
Session 4C: Outliers and Anomaly Detection 4
Session 4D: Graph Mining 1 4
Session 4E: Online learning, Stream mining 3
Session 5A: Tensor analysis 4
Session 5B: Application driven mining 4
Session 5C: Deep Learning 1 4
Session 6A: Crowdsourcing 2 4
Session 6B: User behavior and targeting 4
Session 6C: Deep Learning 2 4
Session 7A: Health Analytics 1 4
Session 7B: Privacy Preserving Data Mining 4
Session 7C: Social Networks 1 4
Session 7D: Application driven analysis 4
Session 7E: Text Mining 4
Session 7F: Efficient Learning 4
Session 7G: Recommendation 2 4
Session 8A: Recommendation 3 4
Session 8B: Text analysis 4
Session 8C: Adversarial IR 4
Session 8D: Health Analytics 2/ Top-k 3
Session 8E: Social Networks 2 4
Session 8F: Feature/Entity Selection 4
Session 8G: Graph Mining 2 4
Session 9A: Queries 4
Session 9B: Representation learning 4
Session 9C: Graph Mining 3 4
Session 9D: Relational Mining 4
Session 9E: User characteristics 4
Session 9F: Engagement 4
Short Papers (alphabetical by lead authors' last names) 119
Workshops 4

Keynote & Invited Talks 4

1. Machine Learning @ Amazon.

【Paper Link】【Pages】:1

【Authors】: Rajeev Rastogi

【Abstract】: In this talk, I will first provide an overview of key problem areas where we are applying Machine Learning (ML) techniques within Amazon such as product demand forecasting, product search, and information extraction from reviews, and associated technical challenges. I will then talk about three specific applications where we use a variety of methods to learn semantically rich representations of data: question answering where we use deep learning techniques, product size recommendations where we use probabilistic models, and fake reviews detection where we use tensor factorization algorithms.

【Keywords】: bayesian inference; deep learning; question answering; recommendations

2. Deception Detection: When Computers Become Better than Humans.

【Paper Link】【Pages】:3

【Authors】: Rada Mihalcea

【Abstract】: Whether we like it or not, deception happens every day and everywhere: thousands of trials taking place daily around the world; little white lies: "I'm busy that day!" even if your calendar is blank; news "with a twist" (a.k.a. fake news) meant to attract the readers' attention and get some advertisement clicks on the side; portrayed identities, on dating sites and elsewhere. Can a computer automatically detect deception in written accounts or in video recordings? In this talk, I will describe our work in building linguistic and multimodal algorithms for deception detection, targeting deceptive statements, trial videos, fake news, identity deceptions, and also going after deception in multiple cultures. I will also show how these algorithms can provide insights into what makes a good lie - and thus teach us how we can spot a liar. As it turns out, computers can be trained to identify lies in many different contexts, and they can do it much better than humans do!

【Keywords】: Keynote Talk

3. When Deep Learning Meets Transfer Learning.

【Paper Link】【Pages】:5

【Authors】: Qiang Yang

【Abstract】: Deep learning has achieved great success as evidenced by many practical applications and contests. However, deep learning developed so far also has some inherent limitations. In particular, deep learning is not yet very adaptable to different related domains and cannot handle small data. In this talk, I will give an overview of how transfer learning can help alleviate these problems. In particular, I will present some recent progress on integrating deep learning and transfer learning together and show some interesting applications in sentiment analysis, image processing and urban computing.

【Keywords】: machine learning; transfer learning

4. A Hyper-connected World.

【Paper Link】【Pages】:7

【Authors】: K. Ananth Krishnan

【Abstract】: As the world gets hyper-connected, cities are evolving into complex ecosystems, technically and behaviourally. Machines and humans interact continually, generating streams of data and behavior patterns. To be a true smart city in a hyper-connected world, cities today have to use technology like a modern enterprise: build a digital spine; become intelligent and leverage automation. However, this technology core should be people centric. In a multiple stakeholder ecosystem, city administrators, industries and citizens, will look at the city from a different perspective and expect different experiences. Finally citizen experience will be the determinant of success of a smart city. While articulating this vision, Ananth will highlight how differently businesses must orient themselves in this environment.

【Keywords】: enterprise digital spine; hyper-connected; smart cities

Session 1A: Multimedia 4

5. Jointly Modeling Static Visual Appearance and Temporal Pattern for Unsupervised Video Hashing.

【Paper Link】【Pages】:9-17

【Authors】: Chao Li ; Yang Yang ; Jiewei Cao ; Zi Huang

【Abstract】: Recently, hashing has been evidenced as an efficient and effective method to facilitate large-scale video retrieval. Most existing hashing methods are based on visual features, which are expected to capture the appearance of videos. The intrinsic temporal pattern embedded in videos has also shown its discriminative power for similarity search, and is explored and utilised in some recent studies. However, how to leverage the strengths in both aspects remains unknown. In this paper, we propose to jointly model static visual appearance and temporal pattern for video hash code generation, as both of them are believed to be carrying important information for learning an effective hash function. A novel unsupervised video hashing framework is designed correspondingly, where its hash function is comprised of two encoders including the temporal encoder and the appearance encoder. The two encoders are learned by self-supervision and designed to be able to reconstruct the temporal pattern of videos and visual appearance of frames respectively. Last but not least, for joint learning of the two encoders, we impose three learning criteria including minimal binarization loss, balanced hash codes and independent hash codes. From the extensive experiments conducted on two large-scale video datasets (i.e. FCVID and ActivityNet), we have confirmed the superior performance of our method compared to the state-of-the-art video hashing methods.

【Keywords】: deep learning; learning to hash; lstm; video hashing; visual content retrieval
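The three learning criteria named in the abstract above (minimal binarization loss, balanced hash codes, independent hash codes) can be illustrated with a small NumPy sketch. The formulas below are the forms commonly used in the learning-to-hash literature and are an assumption here; the abstract does not give the paper's exact losses, and the function name hashing_regularizers is hypothetical.

```python
import numpy as np

def hashing_regularizers(h):
    """Return (binarization, balance, independence) penalties.

    h: array of shape (n_videos, n_bits) holding real-valued encoder
       outputs in [-1, 1] (assumed relaxed hash codes).
    """
    n, k = h.shape
    b = np.where(h >= 0, 1.0, -1.0)            # binarized codes in {-1, +1}

    # Binarization loss: relaxed codes should sit close to their binary version.
    binarization = np.mean((b - h) ** 2)

    # Balance: each bit should be +1 for about half of the videos,
    # so the per-bit mean should be near zero.
    balance = np.sum(np.mean(h, axis=0) ** 2)

    # Independence: different bits should be uncorrelated,
    # so (1/n) h^T h should be near the identity matrix.
    corr = h.T @ h / n
    independence = np.sum((corr - np.eye(k)) ** 2)

    return binarization, balance, independence

# Four 2-bit codes that are already binary, balanced and uncorrelated.
codes = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(hashing_regularizers(codes))
```

In a training setup these three terms would typically be added, with weights, to the reconstruction objectives of the two encoders; the weighting is not specified in the abstract.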

6. Construction of a National Scale ENF Map using Online Multimedia Data.

【Paper Link】【Pages】:19-28

【Authors】: Hyunsoo Kim ; Young Bae Jeon ; Ji Won Yoon

【Abstract】: The frequency of power distribution networks in a power grid is called electrical network frequency (ENF). Because it provides the spatio-temporal changes of the power grid in a particular location, ENF is used in many application domains including the prediction of grid instability and blackouts, detection of system breakup, and even digital forensics. In order to build high performing applications and systems, it is necessary to capture a large-scale nationwide or worldwide ENF map. Consequently, many studies have been conducted on the distribution of specialized physical devices that capture the ENF signals. However, this approach is not practical because it requires significant effort from design to setup, moreover, it has a limitation in its efficiency to monitor and stably retain the collection equipment distributed throughout the world. Furthermore, this approach requires a significant budget. In this paper, we proposed a novel approach to constructing the worldwide ENF map by analyzing streaming data obtained by online multimedia services, such as "Youtube", "Earthcam", and "Ustream" instead of expensive specialized hardware. However, extracting accurate ENF from the streaming data is not a straightforward process because multimedia has its own noise and uncertainty. By applying several signal processing techniques, we can reduce noise and uncertainty, and improve the quality of the restored ENF. For the evaluation of this process, we compared the performance between the ENF signals restored by our proposed approach and collected by the frequency disturbance recorder (FDR) from FNET/GridEye. The experimental results show that our proposed approach outperforms in stable acquisition and management of the ENF signals compared to the conventional approach.

【Keywords】: electrical network frequency; frequency domain; multimedia data; power grid

7. Dual Learning for Cross-domain Image Captioning.

【Paper Link】【Pages】:29-38

【Authors】: Wei Zhao ; Wei Xu ; Min Yang ; Jianbo Ye ; Zhou Zhao ; Yabing Feng ; Yu Qiao

【Abstract】: Recent AI research has witnessed increasing interest in automatically generating image descriptions in text, which is coined as the image captioning problem. Significant progress has been made in domains where plenty of labeled training data (i.e. image-text pairs) are readily available or collected. However, obtaining rich annotated data is a time-consuming and expensive process, creating a substantial barrier for applying image captioning methods to a new domain. In this paper, we propose a cross-domain image captioning approach that uses a novel dual learning mechanism to overcome this barrier. First, we model the alignment between the neural representations of images and that of natural languages in the source domain where one can access sufficient labeled data. Second, we adjust the pre-trained model based on examining limited data (or unpaired data) in the target domain. In particular, we introduce a dual learning mechanism with a policy gradient method that generates highly rewarded captions. The mechanism simultaneously optimizes two coupled objectives: generating image descriptions in text and generating plausible images from text descriptions, with the hope that by explicitly exploiting their coupled relation, one can safeguard the performance of image captioning in the target domain. To verify the effectiveness of our model, we use MSCOCO dataset as the source domain and two other datasets (Oxford-102 and Flickr30k) as the target domains. The experimental results show that our model consistently outperforms previous methods for cross-domain image captioning.

【Keywords】: dual learning; image captioning; image synthesis; reinforcement learning

8. A New Approach to Compute CNNs for Extremely Large Images.

【Paper Link】【Pages】:39-48

【Authors】: Sai Wu ; Mengdan Zhang ; Gang Chen ; Ke Chen

【Abstract】: CNN (Convolution Neural Network) is widely used in visual analysis and achieves exceptionally high performances in image classification, face detection, object recognition, image recoloring, and other learning jobs. Using deep learning frameworks, such as Torch and Tensorflow, CNN can be efficiently computed by leveraging the power of GPU. However, one drawback of GPU is its limited memory which prohibits us from handling large images. Passing a 4K resolution image to the VGG network will result in an out-of-memory exception for Titan-X GPU. In this paper, we propose a new approach that adopts the BSP (bulk synchronous parallel) model to compute CNNs for images of any size. Before being fed to a specific CNN layer, the image is split into smaller pieces which go through the neural network separately. Then, a specific padding and normalization technique is adopted to merge sub-images back into one image. Our approach can be easily extended to support distributed multi-GPUs. In this paper, we use neural style network as our example to illustrate the effectiveness of our approach. We show that using one Titan-X GPU, we can transfer the style of an image with 10,000×10,000 pixels within 1 minute.

【Keywords】: convolution neural network; multi-gpus; style transfer
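The split-and-merge idea in the abstract above can be illustrated for a single convolution layer. The sketch below is not the paper's method (its padding and normalization technique is not described in the abstract); it only shows, under the assumption of a square, odd-sized kernel and zero padding, that processing tiles with a border ("halo") of neighbouring pixels reproduces the full-image result exactly. All function names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2D cross-correlation of image x with kernel k."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * x[i:i + out_h, j:j + out_w]
    return out

def conv2d_same(x, k):
    """Reference: zero-pad the whole image, then convolve once."""
    r = k.shape[0] // 2
    return conv2d_valid(np.pad(x, r), k)

def conv2d_tiled(x, k, tile):
    """Convolve tile by tile; each tile carries a halo of r extra pixels."""
    r = k.shape[0] // 2
    padded = np.pad(x, r)
    out = np.empty(x.shape, dtype=float)
    for top in range(0, x.shape[0], tile):
        for left in range(0, x.shape[1], tile):
            h = min(tile, x.shape[0] - top)
            w = min(tile, x.shape[1] - left)
            # The patch covers the tile plus r pixels on every side.
            patch = padded[top:top + h + 2 * r, left:left + w + 2 * r]
            out[top:top + h, left:left + w] = conv2d_valid(patch, k)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(37, 53))
kernel = rng.normal(size=(3, 3))
print(np.allclose(conv2d_tiled(image, kernel, tile=16), conv2d_same(image, kernel)))
```

Only one tile plus its halo needs to be resident at a time, which is what bounds the memory footprint. For a stack of layers the halo has to grow with the receptive field, and layers that use global statistics need extra handling, which is presumably where the paper's normalization step comes in.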

Session 1B: IR evaluation 4

9. Active Sampling for Large-scale Information Retrieval Evaluation.

【Paper Link】【Pages】:49-58

【Authors】: Dan Li ; Evangelos Kanoulas

【Abstract】: Evaluation is crucial in Information Retrieval. The development of models, tools and methods has significantly benefited from the availability of reusable test collections formed through a standardized and thoroughly tested methodology, known as the Cranfield paradigm. Constructing these collections requires obtaining relevance judgments for a pool of documents, retrieved by systems participating in an evaluation task; thus involves immense human labor. To alleviate this effort different methods for constructing collections have been proposed in the literature, falling under two broad categories: (a) sampling, and (b) active selection of documents. The former devises a smart sampling strategy by choosing only a subset of documents to be assessed and inferring evaluation measure on the basis of the obtained sample; the sampling distribution is being fixed at the beginning of the process. The latter recognizes that systems contributing documents to be judged vary in quality, and actively selects documents from good systems. The quality of systems is measured every time a new document is being judged. In this paper we seek to solve the problem of large-scale retrieval evaluation combining the two approaches. We devise an active sampling method that avoids the bias of the active selection methods towards good systems, and at the same time reduces the variance of the current sampling approaches by placing a distribution over systems, which varies as judgments become available. We validate the proposed method using TREC data and demonstrate the advantages of this new method compared to past approaches.

【Keywords】: cranfield; evaluation; horvitz-thompson estimator; sampling with varying probabilities
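The keywords above mention the Horvitz-Thompson estimator, which underlies evaluation from documents sampled with unequal probabilities. A minimal sketch follows, assuming independent (Poisson) sampling in which document i is judged with probability inclusion_probs[i]; this is a simplification and not the paper's adaptive sampling design. The function names are hypothetical.

```python
import itertools
import numpy as np

def horvitz_thompson_total(values, inclusion_probs, sampled):
    """Estimate sum(values) from the sampled items only.

    Each sampled value is divided by its inclusion probability, so that
    rarely sampled documents count for more.
    """
    values = np.asarray(values, dtype=float)
    inclusion_probs = np.asarray(inclusion_probs, dtype=float)
    sampled = np.asarray(sampled, dtype=bool)
    return float(np.sum(values[sampled] / inclusion_probs[sampled]))

def exact_expectation(values, inclusion_probs):
    """Average the estimator over every possible sample (small n only)."""
    expectation = 0.0
    for mask in itertools.product([False, True], repeat=len(values)):
        prob = 1.0
        for included, p in zip(mask, inclusion_probs):
            prob *= p if included else (1.0 - p)
        expectation += prob * horvitz_thompson_total(values, inclusion_probs, mask)
    return expectation

relevance = [1, 0, 1, 1]          # hypothetical binary relevance labels
probs = [0.5, 0.25, 0.9, 0.5]     # hypothetical inclusion probabilities
print(horvitz_thompson_total(relevance, probs, [True, False, True, False]))
print(exact_expectation(relevance, probs))  # close to the true total of 3 relevant documents
```

The second call enumerates all samples to show that the estimator is unbiased as long as every inclusion probability is positive. How much the estimate varies depends on how well the probabilities track the values.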

10. Intent Based Relevance Estimation from Click Logs.

【Paper Link】【Pages】:59-66

【Authors】: Prakash Mandayam Comar ; Srinivasan H. Sengamedu

【Abstract】: Estimating the relevance of documents based on the user feedback is an essential component of search, retrieval and ranking problems. User click modeling in search has focused primarily on factoring out the position bias. It is easy to see that the query type (generic queries vs specific queries) and user intent (purchase vs exploration) also introduce a bias in the click signal. In other words, the results not matching with the user intent will not be clicked. In this paper, we outline a technique to model the interplay of query, user intent and position bias with respect to the relevance of the retrieved search results. In particular, we define two intents namely purchase and explore, and estimate the relevance of the documents with respect to these two intents. We also relate them to the relevance estimates from considering only the position bias. We empirically demonstrate the effectiveness of the proposed approach by comparing its performance against the well-known CoEC measure and the recently proposed factor model approach for relevance estimation.

【Keywords】: user intent; relevance; ranking; search; click logs
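The CoEC baseline mentioned in the abstract above (clicks over expected clicks) corrects for position bias by dividing observed clicks by the clicks an average result would have received at the same positions. A minimal sketch, assuming a table of average click-through rates per position is available; the function name, the data layout and the numbers are hypothetical.

```python
def coec(impressions, position_ctr):
    """Clicks over expected clicks for one query-document pair.

    impressions: list of (position, clicked) tuples, one per time the
                 document was shown.
    position_ctr: dict mapping position -> average click-through rate
                  at that position across all results.
    """
    clicks = sum(1 for _, clicked in impressions if clicked)
    expected = sum(position_ctr[pos] for pos, _ in impressions)
    return clicks / expected if expected > 0 else float("nan")

# Shown twice at position 1 and twice at position 3, clicked once at each.
shown = [(1, True), (1, False), (3, True), (3, False)]
baseline = {1: 0.4, 3: 0.1}
print(coec(shown, baseline))
```

A value above 1 means the document attracts more clicks than its positions alone would explain. The measure corrects for position only, which is the gap the abstract's intent-aware model is aimed at.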

11. A Comparison of Nuggets and Clusters for Evaluating Timeline Summaries.

【Paper Link】【Pages】:67-76

【Authors】: Gaurav Baruah ; Richard McCreadie ; Jimmy Lin

【Abstract】: There is growing interest in systems that generate timeline summaries by filtering high-volume streams of documents to retain only those that are relevant to a particular event or topic. Continued advances in algorithms and techniques for this task depend on standardized and reproducible evaluation methodologies for comparing systems. However, timeline summary evaluation is still in its infancy, with competing methodologies currently being explored in international evaluation forums such as TREC. One area of active exploration is how to explicitly represent the units of information that should appear in a "good" summary. Currently, there are two main approaches, one based on identifying nuggets in an external "ground truth", and the other based on clustering system outputs. In this paper, by building test collections that have both nugget and cluster annotations, we are able to compare these two approaches. Specifically, we address questions related to evaluation effort, differences in the final evaluation products, and correlations between scores and rankings generated by both approaches. We summarize advantages and disadvantages of nuggets and clusters to offer recommendations for future system evaluations.

【Keywords】: meta-evaluation; real-time summarization; summary evaluation; temporal summarization; trec

12. Sensitive and Scalable Online Evaluation with Theoretical Guarantees.

【Paper Link】【Pages】:77-86

【Authors】: Harrie Oosterhuis ; Maarten de Rijke

【Abstract】: Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. Our contribution is two-fold. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliable correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and scalable with the number of rankers being compared.

【Keywords】: information retrieval; multileaving; online evaluation; ranker evaluation; theoretical guarantees

Session 1C: Sentiment 4

【Paper Link】【Pages】:87-96

【Authors】: Thibaut Thonet ; Guillaume Cabanac ; Mohand Boughanem ; Karen Pinel-Sauvagnat

【Abstract】: Social media platforms such as weblogs and social networking sites provide Internet users with an unprecedented means to express their opinions and debate on a wide range of issues. Concurrently with their growing importance in public communication, social media platforms may foster echo chambers and filter bubbles: homophily and content personalization lead users to be increasingly exposed to conforming opinions. There is therefore a need for unbiased systems able to identify and provide access to varied viewpoints. To address this task, we propose in this paper a novel unsupervised topic model, the Social Network Viewpoint Discovery Model (SNVDM). Given a specific issue (e.g., U.S. policy) as well as the text and social interactions from the users discussing this issue on a social networking site, SNVDM jointly identifies the issue's topics, the users' viewpoints, and the discourse pertaining to the different topics and viewpoints. In order to overcome the potential sparsity of the social network (i.e., some users interact with only a few other users), we propose an extension to SNVDM based on the Generalized Pólya Urn sampling scheme (SNVDM-GPU) to leverage "acquaintances of acquaintances" relationships. We benchmark the different proposed models against three baselines, namely TAM, SN-LDA, and VODUM, on a viewpoint clustering task using two real-world datasets. We thereby provide evidence that our model SNVDM and its extension SNVDM-GPU significantly outperform state-of-the-art baselines, and we show that utilizing social interactions greatly improves viewpoint clustering performance.

【Keywords】: social networks; topic modeling; viewpoint discovery

14. Aspect-level Sentiment Classification with HEAT (HiErarchical ATtention) Network.

【Paper Link】【Pages】:97-106

【Authors】: Jiajun Cheng ; Shenglin Zhao ; Jiani Zhang ; Irwin King ; Xin Zhang ; Hui Wang

【Abstract】: Aspect-level sentiment classification is a fine-grained sentiment analysis task, which aims to predict the sentiment of a text in different aspects. One key point of this task is to allocate the appropriate sentiment words for the given aspect. Recent work exploits attention neural networks to allocate sentiment words and achieves the state-of-the-art performance. However, the prior work only attends to the sentiment information and ignores the aspect-related information in the text, which may cause mismatching between the sentiment words and the aspects when an unrelated sentiment word is semantically meaningful for the given aspect. To solve this problem, we propose a HiErarchical ATtention (HEAT) network for aspect-level sentiment classification. The HEAT network contains a hierarchical attention module, consisting of aspect attention and sentiment attention. The aspect attention extracts the aspect-related information to guide the sentiment attention to better allocate aspect-specific sentiment words of the text. Moreover, the HEAT network supports extracting the aspect terms together with aspect-level sentiment classification by introducing the Bernoulli attention mechanism. To verify the proposed method, we conduct experiments on restaurant and laptop review data sets from SemEval at both the sentence level and the review level. The experimental results show that our model better allocates appropriate sentiment expressions for a given aspect benefiting from the guidance of aspect terms. Moreover, our method achieves better performance on aspect-level sentiment classification than state-of-the-art models.

【Keywords】: aspect; hierarchical attention network; sentiment classification

15. Dyadic Memory Networks for Aspect-based Sentiment Analysis.

【Paper Link】【Pages】:107-116

【Authors】: Yi Tay ; Luu Anh Tuan ; Siu Cheung Hui

【Abstract】: This paper proposes Dyadic Memory Networks (DyMemNN), a novel extension of end-to-end memory networks (memNN) for aspect-based sentiment analysis (ABSA). Originally designed for question answering tasks, memNN operates via a memory selection operation in which relevant memory pieces are adaptively selected based on the input query. In the problem of ABSA, this is analogous to aspects and documents in which the relationship between each word in the document is compared with the aspect vector. In the standard memory networks, simple dot products or feed forward neural networks are used to model the relationship between aspect and words, which lacks representation learning capability. As such, our dyadic memory networks ameliorate this weakness by enabling rich dyadic interactions between aspect and word embeddings by integrating either parameterized neural tensor compositions or holographic compositions into the memory selection operation. To this end, we propose two variations of our dyadic memory networks, namely the Tensor DyMemNN and Holo DyMemNN. Overall, our two models are end-to-end neural architectures that enable rich dyadic interaction between aspect and document which intuitively leads to better performance. Via extensive experiments, we show that our proposed models achieve the state-of-the-art performance and outperform many neural architectures across six benchmark datasets.

【Keywords】: aspect; aspect-based sentiment analysis; aspect-level sentiment analysis; deep learning; neural networks; sentiment analysis
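The holographic composition mentioned in the abstract above is usually taken to mean circular correlation of two embedding vectors, which can be computed with the FFT. The sketch below assumes that reading and shows how such a composition could replace the dot product when scoring each word against the aspect. The projection vector w and the function names are hypothetical, and this is not the paper's full architecture.

```python
import numpy as np

def circular_correlation(a, b):
    """[a * b]_k = sum_i a_i * b_{(i + k) mod d}, computed via the FFT."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def circular_correlation_naive(a, b):
    """Direct O(d^2) evaluation of the same definition, for checking."""
    d = len(a)
    return np.array([sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)])

def holographic_attention(aspect, words, w):
    """Softmax attention weights over words from composed aspect-word vectors.

    aspect: (d,) aspect embedding; words: (n, d) word embeddings;
    w: (d,) projection vector mapping a composed vector to a scalar score.
    """
    scores = np.array([w @ circular_correlation(aspect, word) for word in words])
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(0)
aspect = rng.normal(size=8)
words = rng.normal(size=(5, 8))
w = rng.normal(size=8)
print(np.allclose(circular_correlation(aspect, words[0]),
                  circular_correlation_naive(aspect, words[0])))
print(holographic_attention(aspect, words, w))
```

Unlike a dot product, the composed vector keeps d dimensions of pairwise interaction between aspect and word while adding no parameters beyond the projection.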

16. Modeling Language Discrepancy for Cross-Lingual Sentiment Analysis.

【Paper Link】【Pages】:117-126

【Authors】: Qiang Chen ; Chenliang Li ; Wenjie Li

【Abstract】: Language discrepancy is inherent in, and part of, human languages. Thereby, the same sentiment would be expressed in different patterns across different languages. Unfortunately, the language discrepancy is overlooked by existing works of cross-lingual sentiment analysis. How to accommodate the inherent language discrepancy in sentiment for better cross-lingual sentiment analysis is still an open question. In this paper, we aim to model the language discrepancy in sentiment expressions as intrinsic bilingual polarity correlations (IBPCs) for better cross-lingual sentiment analysis. Specifically, given a document of source language and its translated counterpart, we firstly devise a sentiment representation learning phase to extract monolingual sentiment representation for each document in this pair separately. Then, the two sentiment representations are transferred to be the points in a shared latent space, named hybrid sentiment space. The language discrepancy is then modeled as a fixed transfer vector under each particular polarity between the source and target languages in this hybrid sentiment space. Two relation-based bilingual sentiment transfer models (i.e., RBST-s, RBST-hp) are proposed to learn the fixed transfer vectors. The sentiment of a target-language document is then determined based on the transfer vector between it and its translated counterpart in the hybrid sentiment space. Extensive experiments over a real-world benchmark dataset demonstrate the superiority of the proposed models against several state-of-the-art alternatives.

【Keywords】: bilingual sentiment transfer model; cross-lingual sentiment analysis; language discrepancy

Session 1D: Network Embedding 1 4

17. Multi-view Clustering with Graph Embedding for Connectome Analysis.

【Paper Link】【Pages】:127-136

【Authors】: Guixiang Ma ; Lifang He ; Chun-Ta Lu ; Weixiang Shao ; Philip S. Yu ; Alex D. Leow ; Ann B. Ragin

【Abstract】: Multi-view clustering has become a widely studied problem in the area of unsupervised learning. It aims to integrate multiple views by taking advantage of the consensus and complementary information from multiple views. Most of the existing works in multi-view clustering utilize the vector-based representation for features in each view. However, in many real-world applications, instances are represented by graphs, where those vector-based models cannot fully capture the structure of the graphs from each view. To solve this problem, in this paper we propose a Multi-view Clustering framework on graph instances with Graph Embedding (MCGE). Specifically, we model the multi-view graph data as tensors and apply tensor factorization to learn the multi-view graph embeddings, thereby capturing the local structure of graphs. We build an iterative framework by incorporating multi-view graph embedding into the multi-view clustering task on graph instances, jointly performing multi-view clustering and multi-view graph embedding simultaneously. The multi-view clustering results are used for refining the multi-view graph embedding, and the updated multi-view graph embedding results further improve the multi-view clustering. Extensive experiments on two real brain network datasets (i.e., HIV and Bipolar) demonstrate the superior performance of the proposed MCGE approach in multi-view connectome analysis for clinical investigation and application.

【Keywords】: connectome analysis; graph embedding; multi-view clustering

18. Attributed Signed Network Embedding.

【Paper Link】【Pages】:137-146

【Authors】: Suhang Wang ; Charu C. Aggarwal ; Jiliang Tang ; Huan Liu

【Abstract】: The major task of network embedding is to learn low-dimensional vector representations of social-network nodes. It facilitates many analytical tasks such as link prediction and node clustering and thus has attracted increasing attention. The majority of existing embedding algorithms are designed for unsigned social networks. However, many social media networks have both positive and negative links, for which unsigned algorithms have little utility. Recent findings in signed network analysis suggest that negative links have distinct properties and added value over positive links. This brings about both challenges and opportunities for signed network embedding. In addition, user attributes, which encode properties and interests of users, provide complementary information to network structures and have the potential to improve signed network embedding. Therefore, in this paper, we study the novel problem of signed social network embedding with attributes. We propose a novel framework SNEA, which exploits the network structure and user attributes simultaneously for network representation learning. Experimental results on link prediction and node clustering with real-world datasets demonstrate the effectiveness of SNEA.

【Keywords】: network embedding; node attributes; signed social networks

19. Enhancing the Network Embedding Quality with Structural Similarity.

【Paper Link】【Pages】:147-156

【Authors】: Tianshu Lyu ; Yuan Zhang ; Yan Zhang

【Abstract】: Neural network techniques are widely used in network embedding, boosting the result of node classification, link prediction, visualization and other tasks in both aspects of efficiency and quality. All the state-of-the-art algorithms put effort into the neighborhood information and try to make full use of it. However, it is hard to recognize core-periphery structures simply based on neighborhood. In this paper, we first discuss the influence brought by random-walk based sampling strategies to the embedding results. Theoretical and experimental evidence shows that random-walk based sampling strategies fail to fully capture structural equivalence. We present a new method, SNS, that performs network embeddings using structural information (namely graphlets) to enhance its quality. SNS effectively utilizes both neighbor information and local-subgraphs similarity to learn node embeddings. This is the first framework that combines these two aspects as far as we know, positively merging two important areas in graph mining and machine learning. Moreover, we investigate what kinds of local-subgraph features matter the most on the node classification task, which enables us to further improve the embedding quality. Experiments show that our algorithm outperforms other unsupervised and semi-supervised neural network embedding algorithms on several real-world datasets.

【Keywords】: graphlet; latent representation; network embedding

20. On Embedding Uncertain Graphs.

【Paper Link】【Pages】:157-166

【Authors】: Jiafeng Hu ; Reynold Cheng ; Zhipeng Huang ; Yixiang Fang ; Siqiang Luo

【Abstract】: Graph data are prevalent in communication networks, social media, and biological networks. These data, which are often noisy or inexact, can be represented by uncertain graphs, whose edges are associated with probabilities to indicate the chances that they exist. Recently, researchers have studied various algorithms (e.g., clustering, classification, and k-NN) for uncertain graphs. These solutions face two problems: (1) high dimensionality: uncertain graphs are often highly complex, which can affect the mining quality; and (2) low reusability, where an existing mining algorithm has to be redesigned to deal with uncertain graphs. To tackle these problems, we propose a solution called URGE, or UnceRtain Graph Embedding. Given an uncertain graph G, URGE generates G's embedding, or a set of low-dimensional vectors, which carry the proximity information of nodes in G. This embedding enables the dimensionality of G to be reduced, without destroying node proximity information. Due to its simplicity, existing mining solutions can be used on the embedding. We investigate two low- and high-order node proximity measures in the embedding generation process, and develop novel algorithms to enable fast evaluation. To our best knowledge, there is no prior study on the use of embedding for uncertain graphs. We have further performed extensive experiments for clustering, classification, and k-NN on several uncertain graph datasets. Our results show that URGE attains better effectiveness than current uncertain data mining algorithms, as well as state-of-the-art embedding solutions. The embedding and mining performance is also highly efficient in our experiments.

【Keywords】: graph embedding; uncertain data mining; uncertain graph
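
The uncertain-graph model described in the abstract above can be illustrated with a minimal sketch. It assumes, as is common for uncertain graphs but not stated in the abstract, that edges exist independently of one another. The function names and the toy edge probabilities are hypothetical, and this is not the URGE algorithm itself, only the data model it operates on.

```python
import random

def sample_possible_world(edges, rng):
    """Draw one deterministic graph from an uncertain graph.

    edges maps an (u, v) pair to its existence probability.
    Assumption: every edge exists independently of the others.
    """
    return [edge for edge, p in edges.items() if rng.random() < p]

def expected_degree(edges, node):
    """Expected number of edges incident to `node` (sum of edge probabilities)."""
    return sum(p for (u, v), p in edges.items() if node in (u, v))

# Hypothetical toy graph: probabilities are made up for illustration.
toy_edges = {("a", "b"): 0.5, ("a", "c"): 0.9, ("b", "c"): 0.2}
world = sample_possible_world(toy_edges, random.Random(0))
print(world, expected_degree(toy_edges, "a"))
```

Each call to `sample_possible_world` yields one "possible world"; quantities such as node proximity can then be estimated by averaging over many sampled worlds, which is one simple (and slow) baseline that an embedding approach aims to avoid.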

Session 1E: Web/App data 4

21. A Large Scale Prediction Engine for App Install Clicks and Conversions.

【Paper Link】【Pages】:167-175

【Authors】: Narayan Bhamidipati ; Ravi Kant ; Shaunak Mishra

【Abstract】: Predicting the probability of users clicking on app install ads and installing those apps comes with its own specific challenges. In this paper, we describe (a) how we built a scalable machine learning pipeline from scratch to predict the probability of users clicking and installing apps in response to ad impressions, (b) the novel features we developed to improve our model performance, (c) the training and scoring pipelines that were put into production, (d) our A/B testing process along with the metrics used to determine significant improvements, and (e) the results of our experiments. Our algorithmic improvements resulted in a 3X improvement in satisfaction for app install advertisers on our ad platform. In addition, we dive into how sequential model training, deep learning, and transfer learning resulted in a further 7% lift in conversion rate and 11% lift in revenue. Finally, we share the scientific, data-related, and product-related challenges that we encountered -- we expect others across the industry would greatly benefit from these considerations and our experiences when they kick-start similar efforts.

【Keywords】: advertiser satisfaction; app install; cross feature; deep learning; feature engineering; neural network; scoring latency; sequential training; transfer learning

22. Building Natural Language Interfaces to Web APIs.

【Paper Link】【Pages】:177-186

【Authors】: Yu Su ; Ahmed Hassan Awadallah ; Madian Khabsa ; Patrick Pantel ; Michael Gamon ; Mark J. Encarnación

【Abstract】: As the Web evolves towards a service-oriented architecture, application program interfaces (APIs) are becoming an increasingly important way to provide access to data, services, and devices. We study the problem of natural language interface to APIs (NL2APIs), with a focus on web APIs for web services. Such NL2APIs have many potential benefits, for example, facilitating the integration of web services into virtual assistants. We propose the first end-to-end framework to build an NL2API for a given web API. A key challenge is to collect training data, i.e., NL command-API call pairs, from which an NL2API can learn the semantic mapping from ambiguous, informal NL commands to formal API calls. We propose a novel approach to collect training data for NL2API via crowdsourcing, where crowd workers are employed to generate diversified NL commands. We optimize the crowdsourcing process to further reduce the cost. More specifically, we propose a novel hierarchical probabilistic model for the crowdsourcing process, which guides us to allocate budget to those API calls that have a high value for training NL2APIs. We apply our framework to real-world APIs, and show that it can collect high-quality training data at a low cost, and build NL2APIs with good performance from scratch. We also show that our modeling of the crowdsourcing process can improve its effectiveness, such that the training data collected via our approach leads to better performance of NL2APIs than a strong baseline.

【Keywords】: crowdsourcing; hierarchical probabilistic model; natural language interface; web api

23. UFeed: Refining Web Data Integration Based on User Feedback.

【Paper Link】【Pages】:187-196

【Authors】: Ahmed El-Roby ; Ashraf Aboulnaga

【Abstract】: One of the main challenges in large-scale data integration for relational schemas is creating an accurate mediated schema, and generating accurate semantic mappings between heterogeneous data sources and this mediated schema. Some applications can start with a moderately accurate mediated schema and mappings and refine them over time, which is referred to as the pay-as-you-go approach to data integration. Creating the mediated schema and mappings automatically to bootstrap the pay-as-you-go approach has been extensively studied. However, refining the mediated schema and mappings is still an open challenge because the data sources are usually heterogeneous and use diverse and sometimes ambiguous vocabularies. In this paper, we introduce UFeed, a system that refines relational mediated schemas and mappings based on user feedback over query answers. UFeed translates user actions into refinement operations that are applied to the mediated schema and mappings to improve their quality. We experimentally verify that UFeed improves the quality of query answers over real heterogeneous data sources extracted from the web.

【Keywords】: data integration; holistic schema matching; probabilistic schema matching; schema mapping; schema matching; user feedback

24. Extracting Records from the Web Using a Signal Processing Approach.

【Paper Link】【Pages】:197-206

【Authors】: Roberto Panerai Velloso ; Carina F. Dorneles

【Abstract】: Extracting records from web pages enables a number of important applications and has immense value due to the amount and diversity of available information that can be extracted. This problem, although vastly studied, remains open because it is not a trivial one. Due to the scale of data, a feasible approach must be both automatic and efficient (and of course effective). We present here a novel approach, fully automatic and computationally efficient, using signal processing techniques to detect regularities and patterns in the structure of web pages. Our approach segments the web page, detects the data regions within it, identifies the records boundaries and aligns the records. Results show high f-score and linearithmic time complexity behaviour.

【Keywords】: information retrieval; record alignment; record extraction; structure detection; web mining

Session 1F: Graph data 4

25. A Scalable Graph-Coarsening Based Index for Dynamic Graph Databases.

【Paper Link】【Pages】:207-216

【Authors】: Akshay Kansal ; Francesca Spezzano

【Abstract】: A graph database D is a collection of graphs. To speed up subgraph query answering on graph databases, indexes are commonly used. State-of-the-art graph database indexes do not adapt or scale well to dynamic graph database use; they are static, and their ability to prune possible search responses to meet user needs worsens over time as databases change and grow. Users can re-mine indexes to gain some improvement, but it is time consuming. Users must also tune numerous parameters on an ongoing basis to optimize performance and can inadvertently worsen the query response time if they do not choose parameters wisely. Recently, a one-pass algorithm has been developed to enhance the performance of frequent-subgraph-based indexes by using the algorithm to update them regularly. However, there are some drawbacks, most notably the need to make updates as the query workload changes. In this paper, we propose a new index based on graph coarsening to speed up subgraph query answering time in dynamic graph databases. Our index is parameter-free, query-independent, scalable, small enough to store in main memory, and simpler and less costly to maintain under database updates. Experimental results show that our index outperforms hybrid indexes (i.e., indexes updated with one-pass) for query answering time in the case of social network databases, and is comparable with these indexes for frequent and infrequent queries on chemical databases. Our index can be updated up to 60 times faster in comparison to one-pass on dynamic graph databases. Moreover, our index is independent of the query workload for index update and is up to 15 times faster after hybrid indexes are attuned to the query workload.

【Keywords】: dynamic graph databases; graph coarsening; indexing; subgraph query processing

26. Natural Language Question/Answering: Let Users Talk With The Knowledge Graph.

【Paper Link】【Pages】:217-226

【Authors】: Weiguo Zheng ; Hong Cheng ; Lei Zou ; Jeffrey Xu Yu ; Kangfei Zhao

【Abstract】: The ever-increasing knowledge graphs impose an urgent demand of providing effective and easy-to-use query techniques for end users. Structured query languages, such as SPARQL, offer a powerful expression ability to query RDF datasets. However, they are difficult to use. Keywords are simple but have a very limited expression ability. Natural language question (NLQ) is promising on querying knowledge graphs. A huge challenge is how to understand the question clearly so as to translate the unstructured question into a structured query. In this paper, we present a data + oracle approach to answer NLQs over knowledge graphs. We let users verify the ambiguities during the query understanding. To reduce the interaction cost, we formalize an interaction problem and design an efficient strategy to solve the problem. We also propose a query prefetch technique by exploiting the latency in the interactions with users. Extensive experiments over the QALD dataset demonstrate that our proposed approach is effective as it outperforms state-of-the-art methods in terms of both precision and recall.

【Keywords】: interactive query; knowledge graph; natural language question and answering

27. Keyword Search on RDF Graphs - A Query Graph Assembly Approach.

【Paper Link】【Pages】:227-236

【Authors】: Shuo Han ; Lei Zou ; Jeffrey Xu Yu ; Dongyan Zhao

【Abstract】: Keyword search provides ordinary users an easy-to-use interface for querying RDF data. Given the input keywords, in this paper, we study how to assemble a query graph that represents the user's query intention accurately and efficiently. Based on the input keywords, we first obtain the elementary query graph building blocks, such as entity/class vertices and predicate edges. Then, we formally define the query graph assembly (QGA) problem. Unfortunately, we prove theoretically that QGA is an NP-complete problem. In order to solve it, we design some heuristic lower bounds and propose a bipartite graph matching-based best-first search algorithm. The algorithm's time complexity is $O(k^{2l} \cdot l^{3l})$, where l is the number of keywords and k is a tunable parameter, i.e., the maximum number of candidate entity/class vertices and predicate edges allowed to match each keyword. Although QGA is intractable, both l and k are small in practice. Furthermore, the algorithm's time complexity does not depend on the RDF graph size, which guarantees the good scalability of our system on large RDF graphs. Experiments on DBpedia and Freebase confirm the superiority of our system in both effectiveness and efficiency.

【Keywords】: graph data management; keyword search; rdf

28. Region Representation Learning via Mobility Flow.

【Paper Link】【Pages】:237-246

【Authors】: Hongjian Wang ; Zhenhui Li

【Abstract】: An increasing amount of urban data is being accumulated and released to the public; this enables us to study urban dynamics and address urban issues such as crime, traffic, and quality of living. In this paper, we are interested in learning vector representations for regions using large-scale taxi flow data. These representations could help us better measure the relationship strengths between regions, and the relationships can be used to better model the region properties. Different from existing studies, we propose to consider both temporal dynamics and multi-hop transitions in learning the region representations. We propose to jointly learn the representations from a flow graph and a spatial graph. Such a combined graph could simulate individual movements and also address the data sparsity issue. We demonstrate the effectiveness of our method using three different real datasets.

【Keywords】: graph embedding; mobility flow; spatial-temporal data

Session 2A: Ranking 4

29. Learning Visual Features from Snapshots for Web Search.

【Paper Link】【Pages】:247-256

【Authors】: Yixing Fan ; Jiafeng Guo ; Yanyan Lan ; Jun Xu ; Liang Pang ; Xueqi Cheng

【Abstract】: When applying learning to rank algorithms to Web search, a large number of features are usually designed to capture the relevance signals. Most of these features are computed based on the extracted textual elements, link analysis, and user logs. However, Web pages are not solely linked texts, but have structured layout organizing a large variety of elements in different styles. Such layout itself can convey useful visual information, indicating the relevance of a Web page. For example, the query-independent layout (i.e., raw page layout) can help identify the page quality, while the query-dependent layout (i.e., page rendered with matched query words) can further tell rich structural information (e.g., size, position and proximity) of the matching signals. However, such visual information of layout has been seldom utilized in Web search in the past. In this work, we propose to learn rich visual features automatically from the layout of Web pages (i.e., Web page snapshots) for relevance ranking. Both query-independent and query-dependent snapshots are considered as the new inputs. We then propose a novel visual perception model inspired by human's visual search behaviors on page viewing to extract the visual features. This model can be learned end-to-end together with traditional human-crafted features. We also show that such visual features can be efficiently acquired in the online setting with an extended inverted indexing scheme. Experiments on benchmark collections demonstrate that learning visual features from Web page snapshots can significantly improve the performance of relevance ranking in ad-hoc Web retrieval tasks.

【Keywords】: snapshot; visual feature; web search

30. DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval.

【Paper Link】【Pages】:257-266

【Authors】: Liang Pang ; Yanyan Lan ; Jiafeng Guo ; Jun Xu ; Jingfang Xu ; Xueqi Cheng

【Abstract】: This paper concerns a deep learning approach to relevance ranking in information retrieval (IR). Existing deep IR models such as DSSM and CDSSM directly apply neural networks to generate ranking scores, without an explicit understanding of relevance. According to the human judgment process, a relevance label is generated by the following three steps: 1) relevant locations are detected; 2) local relevances are determined; 3) local relevances are aggregated to output the relevance label. In this paper we propose a new deep learning architecture, namely DeepRank, to simulate the above human judgment process. Firstly, a detection strategy is designed to extract the relevant contexts. Then, a measure network is applied to determine the local relevances by utilizing a convolutional neural network (CNN) or two-dimensional gated recurrent units (2D-GRU). Finally, an aggregation network with sequential integration and a term gating mechanism is used to produce a global relevance score. DeepRank captures important IR characteristics well, including exact/semantic matching signals, proximity heuristics, query term importance, and diverse relevance requirements. Experiments on both the benchmark LETOR dataset and large-scale clickthrough data show that DeepRank can significantly outperform learning to rank methods and existing deep learning methods.

【Keywords】: deep learning; information retrieval; ranking; text matching

31. Learning to Un-Rank: Quantifying Search Exposure for Users in Online Communities.

【Paper Link】【Pages】:267-276

【Authors】: Asia J. Biega ; Azin Ghazimatin ; Hakan Ferhatosmanoglu ; Krishna P. Gummadi ; Gerhard Weikum

【Abstract】: Search engines in online communities such as Twitter or Facebook not only return matching posts, but also provide links to the profiles of the authors. Thus, when a user appears in the top-k results for a sensitive keyword query, she becomes widely exposed in a sensitive context. The effects of such exposure can result in a serious privacy violation, ranging from embarrassment all the way to becoming a victim of organizational discrimination. In this paper, we propose the first model for quantifying search exposure on the service provider side, casting it into a reverse k-nearest-neighbor problem. Moreover, since a single user can be exposed by a large number of queries, we also devise a learning-to-rank method for identifying the most critical queries and thus making the warnings user-friendly. We develop efficient algorithms, and present experiments with a large number of user profiles from Twitter that demonstrate the practical viability and effectiveness of our framework.

【Keywords】: information retrieval; privacy; ranking exposure; search exposure; social search

32. Balancing Speed and Quality in Online Learning to Rank for Information Retrieval.

【Paper Link】【Pages】:277-286

【Authors】: Harrie Oosterhuis ; Maarten de Rijke

【Abstract】: In Online Learning to Rank (OLTR) the aim is to find an optimal ranking model by interacting with users. When learning from user behavior, systems must interact with users while simultaneously learning from those interactions. Unlike other Learning to Rank (LTR) settings, existing research in this field has been limited to linear models. This is due to the speed-quality tradeoff that arises when selecting models: complex models are more expressive and can find the best rankings but need more user interactions to do so, a requirement that risks frustrating users during training. Conversely, simpler models can be optimized on fewer interactions and thus provide a better user experience, but they will converge towards suboptimal rankings. This tradeoff creates a deadlock, since novel models will not be able to improve either the user experience or the final convergence point, without sacrificing the other. Our contribution is twofold. First, we introduce a fast OLTR model called Sim-MGD that addresses the speed aspect of the speed-quality tradeoff. Sim-MGD ranks documents based on similarities with reference documents. It converges rapidly and, hence, gives a better user experience but it does not converge towards the optimal rankings. Second, we contribute Cascading Multileave Gradient Descent (C-MGD) for OLTR that directly addresses the speed-quality tradeoff by using a cascade that enables combinations of the best of two worlds: fast learning and high quality final convergence. C-MGD can provide the better user experience of Sim-MGD while maintaining the same convergence as the state-of-the-art MGD model. This opens the door for future work to design new models for OLTR without having to deal with the speed-quality tradeoff.

【Keywords】: information retrieval; learning to rank; online learning to rank; reinforcement learning; user behaviour

Session 2B: Crowdsourcing 1 4

33. Crowd-enabled Pareto-Optimal Objects Finding Employing Multi-Pairwise-Comparison Questions.

【Paper Link】【Pages】:287-295

【Authors】: Chang Liu ; Yinan Zhang ; Lei Liu ; Lizhen Cui ; Dong Yuan ; Chunyan Miao

【Abstract】: Today, Pareto-optimal objects finding has been applied in various fields, such as group decision making and opinion collection. Many of the existing solutions to this problem require explicit attributes for objects. However, these attributes cannot be obtained sometimes. To address this issue, we propose an algorithm, which uses preference relations given by crowdsourcing, to find Pareto-optimal objects with shorter latency and lower monetary costs. It employs two multi-pairwise-comparison question models: BEST-form and BETTER-form questions. Multiple BEST (or BETTER) questions can be sent to crowds concurrently. Extensive experimental results show that the number of questions reduces greatly. In addition, the numerical results show that the latency is significantly shortened at a reasonable monetary cost, compared with the existing methods.

【Keywords】: crowdsourcing; pareto-optimal objects finding

34. Destination-aware Task Assignment in Spatial Crowdsourcing.

【Paper Link】【Pages】:297-306

【Authors】: Yan Zhao ; Yang Li ; Yu Wang ; Han Su ; Kai Zheng

【Abstract】: With the proliferation of GPS-enabled smart devices and the increased availability of wireless networks, spatial crowdsourcing (SC) has recently been proposed as a framework to automatically request workers (i.e., smart device carriers) to perform location-sensitive tasks (e.g., taking scenic photos, reporting events). In this paper we study a destination-aware task assignment problem that concerns the optimal strategy of assigning each task to a proper worker such that the total number of completed tasks is maximized while all workers can reach their destinations before their deadlines after performing the assigned tasks. Finding the globally optimal assignment turns out to be an intractable problem, since it does not imply an optimal assignment for each individual worker. Observing that the task assignment dependency only exists amongst subsets of workers, we utilize a tree-decomposition technique to separate workers into independent clusters and develop an efficient depth-first search algorithm with progressive bounds to prune non-promising assignments. Our empirical studies demonstrate that our proposed technique is effective and solves the problem well.

【Keywords】: spatial crowdsourcing; spatial task assignment; user mobility

35. Crowdsourced Selection on Multi-Attribute Data.

【Paper Link】【Pages】:307-316

【Authors】: Xueping Weng ; Guoliang Li ; Huiqi Hu ; Jianhua Feng

【Abstract】: Crowdsourced selection asks the crowd to select entities that satisfy a query condition, e.g., selecting the photos of people wearing sunglasses from a given set of photos. Existing studies focus on a single query predicate; in this paper we study the crowdsourced selection problem on multi-attribute data, e.g., selecting the photos of females with dark eyes who are wearing sunglasses. A straightforward method asks the crowd to answer every entity by checking every predicate in the query. Obviously, this method involves a huge monetary cost. Instead, we can select an optimized predicate order and ask the crowd to answer the entities following that order. If an entity does not satisfy a predicate, we can prune the entity without asking about the remaining predicates, and thus this method can reduce the cost. There are two challenges in finding the optimized predicate order. The first is how to detect the predicate order and the second is how to capture the correlation among different predicates. To address these challenges, we propose a predicate-order-based framework to reduce the monetary cost. Firstly, we define an expectation tree to store selectivities on predicates and estimate the best predicate order. In each iteration, we estimate the best predicate order from the expectation tree, and then choose a predicate as a question to ask the crowd. After getting the result for the current predicate, we choose the next predicate to ask until we get the result for the entity. We then update the expectation tree using the answers obtained from the crowd and continue to the next iteration. We also study the problem of answering multiple queries simultaneously, and reduce its cost using the correlation between queries. Finally, we propose a confidence-based method to improve the quality. The experimental results show that our predicate-order-based algorithm is effective and can reduce the cost significantly compared with baseline approaches.

【Keywords】:
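
The pruning argument in the abstract above (stop asking about an entity as soon as one predicate fails) can be made concrete with a small sketch. It assumes that predicates are independent and that every question costs the same; the paper's expectation tree, which also captures correlation between predicates, is not modeled here. The function names and selectivity values are hypothetical.

```python
def expected_questions(selectivities):
    """Expected number of crowd questions for one entity.

    selectivities[i] is the (assumed) probability that an entity satisfies
    the i-th predicate in the chosen order. A predicate is only asked if
    all earlier predicates were satisfied. Assumes independent predicates.
    """
    total = 0.0
    reach = 1.0  # probability that the entity survives to this predicate
    for s in selectivities:
        total += reach
        reach *= s
    return total

def greedy_order(selectivity_by_predicate):
    """Ask the most selective (lowest pass probability) predicate first."""
    return sorted(selectivity_by_predicate, key=selectivity_by_predicate.get)

# Hypothetical selectivities for the photo example in the abstract.
sel = {"sunglasses": 0.1, "female": 0.5, "dark_eyes": 0.9}
order = greedy_order(sel)
print(order, expected_questions([sel[p] for p in order]))
```

Under these assumptions, asking the predicates in ascending order of selectivity needs about 1.15 questions per entity on average, against 3 for the straightforward method that checks every predicate.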

36. Select Your Questions Wisely: For Entity Resolution With Crowd Errors.

【Paper Link】【Pages】:317-326

【Authors】: Vijaya Krishna Yalavarthi ; Xiangyu Ke ; Arijit Khan

【Abstract】: Crowdsourcing is becoming increasingly important in entity resolution tasks due to their inherent complexity, such as clustering of images and natural language processing. Humans can provide more insightful information for these difficult problems compared to machine-based automatic techniques. Nevertheless, human workers can make mistakes due to lack of domain expertise or seriousness, ambiguity, or even malicious intent. The bulk of the literature usually deals with human errors via majority voting or by assigning a universal error rate over crowd workers. However, such approaches are incomplete, and often inconsistent, because the expertise of crowd workers is diverse with possible biases, thereby making it largely inappropriate to assume a universal error rate for all workers over all crowdsourcing tasks. We mitigate the above challenges by considering an uncertain graph model, where the edge probability between two records A and B denotes the ratio of crowd workers who voted YES on the question of whether A and B are the same entity. To reflect independence across different crowdsourcing tasks, we apply the notion of possible worlds, and develop parameter-free algorithms for both the next-crowdsourcing and entity resolution tasks. In particular, for next crowdsourcing, we identify the record pair that maximally increases the reliability of the current clustering. Since reliability takes into account the connectedness inside and across all clusters, this metric is more effective in deciding the next questions than state-of-the-art works, which consider local features, such as individual edges, paths, or nodes, to select the next crowdsourcing questions. Based on detailed empirical analysis over real-world datasets, we find that our proposed solution, PERC (probabilistic entity resolution with imperfect crowd), improves the quality by 15% and reduces the overall cost by 50% for crowdsourcing-based entity resolution.

【Keywords】: crowdsourcing; entity resolution; human error; reliability; uncertain graphs
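
The edge-probability definition in the abstract above (the ratio of crowd workers who voted YES for a record pair) can be sketched as follows. The function names and the vote data are hypothetical, and the sketch covers only the construction of the uncertain graph, not PERC's reliability-based question selection.

```python
def edge_probability(votes):
    """Fraction of workers who answered YES (True) for one record pair."""
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(1 for v in votes if v) / len(votes)

def build_uncertain_graph(answers):
    """Map each record pair to its YES ratio.

    answers maps a (record_a, record_b) pair to a list of boolean votes.
    """
    return {pair: edge_probability(votes) for pair, votes in answers.items()}

# Hypothetical crowd answers for two record pairs.
crowd_answers = {
    ("r1", "r2"): [True, True, False, True],
    ("r2", "r3"): [False, False, True, False],
}
print(build_uncertain_graph(crowd_answers))
```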

Session 2C: Recommendation 1 4

37. Reply With: Proactive Recommendation of Email Attachments.

【Paper Link】【Pages】:327-336

【Authors】: Christophe Van Gysel ; Bhaskar Mitra ; Matteo Venanzi ; Roy Rosemarin ; Grzegorz Kukla ; Piotr Grudzien ; Nicola Cancedda

【Abstract】: Email responses often contain items---such as a file or a hyperlink to an external document---that are attached to or included inline in the body of the message. Analysis of an enterprise email corpus reveals that 35% of the time when users include these items as part of their response, the attachable item is already present in their inbox or sent folder. A modern email client can proactively retrieve relevant attachable items from the user's past emails based on the context of the current conversation, and recommend them for inclusion, to reduce the time and effort involved in composing the response. In this paper, we propose a weakly supervised learning framework for recommending attachable items to the user. As email search systems are commonly available, we constrain the recommendation task to formulating effective search queries from the context of the conversations. The query is submitted to an existing IR system to retrieve relevant items for attachment. We also present a novel strategy for generating labels from an email corpus---without the need for manual annotations---that can be used to train and evaluate the query formulation model. In addition, we describe a deep convolutional neural network that demonstrates satisfactory performance on this query formulation task when evaluated on the publicly available Avocado dataset and a proprietary dataset of internal emails obtained through an employee participation program.

【Keywords】: email attachment recommendation; email overload; neural networks; proactive retrieval; query extraction; query formulation

38. Learning and Transferring Social and Item Visibilities for Personalized Recommendation.

【Paper Link】【Pages】:337-346

【Authors】: Xiao Lin ; Min Zhang ; Yongfeng Zhang ; Yiqun Liu ; Shaoping Ma

【Abstract】: User feedback in the form of movie-watching history, item ratings, or product consumption is very helpful in training recommender systems. However, relatively few interactions between items and users can be observed. Instances of missing user--item entries are caused by the user not seeing the item (although the actual preference to the item could still be positive) or the user seeing the item but not liking it. Separating these two cases enables missing interactions to be modeled with finer granularity, and thus reflects user preferences more accurately. However, most previous studies on the modeling of missing instances have not fully considered the case where the user has not seen the item. Social connections are known to be helpful for modeling users' potential preferences more extensively, although a similar visibility problem exists in accurately identifying social relationships. That is, when two users are unaware of each other's existence, they have no opportunity to connect. In this paper, we propose a novel user preference model for recommender systems that considers the visibility of both items and social relationships. Furthermore, the two kinds of information are coordinated in a unified model inspired by the idea of transfer learning. Extensive experiments have been conducted on three real-world datasets in comparison with five state-of-the-art approaches. The encouraging performance of the proposed system verifies the effectiveness of social knowledge transfer and the modeling of both item and social visibilities.

【Keywords】: implicit feedback; recommender system; social network

39. Joint Topic-Semantic-aware Social Recommendation for Online Voting.

【Paper Link】【Pages】:347-356

【Authors】: Hongwei Wang ; Jia Wang ; Miao Zhao ; Jiannong Cao ; Minyi Guo

【Abstract】: Online voting is an emerging feature in social networks, in which users can express their attitudes toward various issues and show their unique interest. Online voting imposes new challenges on recommendation, because the propagation of votings heavily depends on the structure of social networks as well as the content of votings. In this paper, we investigate how to utilize these two factors in a comprehensive manner when doing voting recommendation. First, due to the fact that existing text mining methods such as topic model and semantic model cannot well process the content of votings that is typically short and ambiguous, we propose a novel Topic-Enhanced Word Embedding (TEWE) method to learn word and document representation by jointly considering their topics and semantics. Then we propose our Joint Topic-Semantic-aware social Matrix Factorization (JTS-MF) model for voting recommendation. JTS-MF model calculates similarity among users and votings by combining their TEWE representation and structural information of social networks, and preserves this topic-semantic-social similarity during matrix factorization. To evaluate the performance of TEWE representation and JTS-MF model, we conduct extensive experiments on real online voting dataset. The results prove the efficacy of our approach against several state-of-the-art baselines.

【Keywords】: matrix factorization; online voting; recommender systems; topic-enhanced word embedding

40. Interactive Social Recommendation.

【Paper Link】【Pages】:357-366

【Authors】: Xin Wang ; Steven C. H. Hoi ; Chenghao Liu ; Martin Ester

【Abstract】: Social recommendation has been an active research topic over the last decade, based on the assumption that social information from friendship networks is beneficial for improving recommendation accuracy, especially when dealing with cold-start users who lack sufficient past behavior information for accurate recommendation. However, it is nontrivial to use such information, since some of a person's friends may share similar preferences in certain aspects, but others may be totally irrelevant for recommendations. Thus one challenge is to explore and exploit the extent to which a user trusts his/her friends when utilizing social information to improve recommendations. On the other hand, most existing social recommendation models are non-interactive in that their algorithmic strategies are based on batch learning methodology, which trains the model in an offline manner from a collection of training data accumulated from users' historical interactions with the recommender systems. In the real world, new users may leave the system because they are recommended boring items before enough data is collected to train a good model, which results in inefficient customer retention. To tackle these challenges, we propose a novel method for interactive social recommendation, which not only simultaneously explores user preferences and exploits the effectiveness of personalization in an interactive way, but also adaptively learns different weights for different friends. In addition, we give analyses of the complexity and regret of the proposed model. Extensive experiments on three real-world datasets illustrate the improvement of our proposed method over state-of-the-art algorithms.

【Keywords】: exploration-exploitation; personalization; recommender systems; social recommendation; user behavior modeling

Session 2D: Network Embedding 2 4

41. From Properties to Links: Deep Network Embedding on Incomplete Graphs.

【Paper Link】【Pages】:367-376

【Authors】: Dejian Yang ; Senzhang Wang ; Chaozhuo Li ; Xiaoming Zhang ; Zhoujun Li

【Abstract】: As an effective way of learning node representations in networks, network embedding has attracted increasing research interests recently. Most existing approaches use shallow models and only work on static networks by extracting local or global topology information of each node as the algorithm input. It is challenging for such approaches to learn a desirable node representation on incomplete graphs with a large number of missing links or on dynamic graphs with new nodes joining in. It is even challenging for them to deeply fuse other types of data such as node properties into the learning process to help better represent the nodes with insufficient links. In this paper, we for the first time study the problem of network embedding on incomplete networks. We propose a Multi-View Correlation-learning based Deep Network Embedding method named MVC-DNE to incorporate both the network structure and the node properties for more effectively and efficiently perform network embedding on incomplete networks. Specifically, we consider the topology structure of the network and the node properties as two correlated views. The insight is that the learned representation vector of a node should reflect its characteristics in both views. Under a multi-view correlation learning based deep autoencoder framework, the structure view and property view embeddings are integrated and mutually reinforced through both self-view and cross-view learning. As MVC-DNE can learn a representation mapping function, it can directly generate the representation vectors for the new nodes without retraining the model. Thus it is especially more efficient than previous methods. Empirically, we evaluate MVC-DNE over three real network datasets on two data mining applications, and the results demonstrate that MVC-DNE significantly outperforms state-of-the-art methods.

【Keywords】: deep learning; incomplete graph; network embedding

42. Learning Community Embedding with Community Detection and Node Embedding on Graphs.

【Paper Link】【Pages】:377-386

【Authors】: Sandro Cavallari ; Vincent W. Zheng ; HongYun Cai ; Kevin Chen-Chuan Chang ; Erik Cambria

【Abstract】: In this paper, we study an important yet largely under-explored setting of graph embedding, i.e., embedding communities instead of each individual nodes. We find that community embedding is not only useful for community-level applications such as graph visualization, but also beneficial to both community detection and node classification. To learn such embedding, our insight hinges upon a closed loop among community embedding, community detection and node embedding. On the one hand, node embedding can help improve community detection, which outputs good communities for fitting better community embedding. On the other hand, community embedding can be used to optimize the node embedding by introducing a community-aware high-order proximity. Guided by this insight, we propose a novel community embedding framework that jointly solves the three tasks together. We evaluate such a framework on multiple real-world datasets, and show that it improves graph visualization and outperforms state-of-the-art baselines in various application tasks, e.g., community detection and node classification.

【Keywords】: community embedding; graph embedding

43. Attributed Network Embedding for Learning in a Dynamic Environment.

【Paper Link】【Pages】:387-396

【Authors】: Jundong Li ; Harsh Dani ; Xia Hu ; Jiliang Tang ; Yi Chang ; Huan Liu

【Abstract】: Network embedding leverages the node proximity manifested to learn a low-dimensional node vector representation for each node in the network. The learned embeddings could advance various learning tasks such as node classification, network clustering, and link prediction. Most, if not all, of the existing works, are overwhelmingly performed in the context of plain and static networks. Nonetheless, in reality, network structure often evolves over time with addition/deletion of links and nodes. Also, a vast majority of real-world networks are associated with a rich set of node attributes, and their attribute values are also naturally changing, with the emerging of new content patterns and the fading of old content patterns. These changing characteristics motivate us to seek an effective embedding representation to capture network and attribute evolving patterns, which is of fundamental importance for learning in a dynamic environment. To our best knowledge, we are the first to tackle this problem with the following two challenges: (1) the inherently correlated network and node attributes could be noisy and incomplete, it necessitates a robust consensus representation to capture their individual properties and correlations; (2) the embedding learning needs to be performed in an online fashion to adapt to the changes accordingly. In this paper, we tackle this problem by proposing a novel dynamic attributed network embedding framework - DANE. In particular, DANE first provides an offline method for a consensus embedding and then leverages matrix perturbation theory to maintain the freshness of the end embedding results in an online manner. We perform extensive experiments on both synthetic and real attributed networks to corroborate the effectiveness and efficiency of the proposed framework.

【Keywords】: attributed networks; dynamic networks; network embedding

44. Learning Node Embeddings in Interaction Graphs.

【Paper Link】【Pages】:397-406

【Authors】: Yao Zhang ; Yun Xiong ; Xiangnan Kong ; Yangyong Zhu

【Abstract】: Node embedding techniques have gained prominence since they produce continuous and low-dimensional features, which are effective for various tasks. Most existing approaches learn node embeddings by exploring the structure of networks and are mainly focused on static non-attributed graphs. However, many real-world applications, such as stock markets and public review websites, involve bipartite graphs with dynamic and attributed edges, called attributed interaction graphs. Different from conventional graph data, attributed interaction graphs involve two kinds of entities (e.g. investors/stocks and users/businesses) and edges of temporal interactions with attributes (e.g. transactions and reviews). In this paper, we study the problem of node embedding in attributed interaction graphs. Learning embeddings in interaction graphs is highly challenging due to the dynamics and heterogeneous attributes of edges. Different from conventional static graphs, in attributed interaction graphs, each edge can have totally different meanings when the interaction is at different times or associated with different attributes. We propose a deep node embedding method called IGE (Interaction Graph Embedding). IGE is composed of three neural networks: an encoding network is proposed to transform attributes into a fixed-length vector to deal with the heterogeneity of attributes; then encoded attribute vectors interact with nodes multiplicatively in two coupled prediction networks that investigate the temporal dependency by treating incident edges of a node as the analogy of a sentence in word embedding methods. The encoding network can be specifically designed for different datasets as long as it is differentiable, in which case it can be trained together with prediction networks by back-propagation. We evaluate our proposed method and various comparing methods on four real-world datasets. The experimental results prove the effectiveness of the learned embeddings by IGE on both node clustering and classification tasks.

【Keywords】: attributed interaction graph; graph mining; node embedding; representation learning

Session 2E: Skyline Queries 4

45. Efficient Computation of Subspace Skyline over Categorical Domains.

【Paper Link】【Pages】:407-416

【Authors】: Md Farhadur Rahman ; Abolfazl Asudeh ; Nick Koudas ; Gautam Das

【Abstract】: Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed the way we search for accommodation, restaurants, etc. The underlying datasets in such applications have numerous attributes that are mostly Boolean or Categorical. Discovering the skyline of such datasets over a subset of attributes would identify entries that stand out while enabling numerous applications. There are only a few algorithms designed to compute the skyline over categorical attributes, yet are applicable only when the number of attributes is small. In this paper, we place the problem of skyline discovery over categorical attributes into perspective and design efficient algorithms for two cases. (i) In the absence of indices, we propose two algorithms, ST-S and ST-P, that exploit the categorical characteristics of the datasets, organizing tuples in a tree data structure, supporting efficient dominance tests over the candidate set. (ii) We then consider the existence of widely used precomputed sorted lists. After discussing several approaches, and studying their limitations, we propose TA-SKY, a novel threshold style algorithm that utilizes sorted lists. Moreover, we further optimize TA-SKY and explore its progressive nature, making it suitable for applications with strict interactive requirements. In addition to the extensive theoretical analysis of the proposed algorithms, we conduct a comprehensive experimental evaluation of the combination of real (including the entire AirBnB data collection) and synthetic datasets to study the practicality of the proposed algorithms. The results showcase the superior performance of our techniques, outperforming applicable approaches by orders of magnitude.

【Keywords】: categorical domains; sorted list; subspace skyline computation; tree

46. Fast Algorithms for Pareto Optimal Group-based Skyline.

【Paper Link】【Pages】:417-426

【Authors】: Wenhui Yu ; Zheng Qin ; Jinfei Liu ; Li Xiong ; Xu Chen ; Huidi Zhang

【Abstract】: Skyline, aiming at finding a Pareto optimal subset of points in a multi-dimensional dataset, has gained great interest due to its extensive use for multi-criteria analysis and decision making. Skyline consists of all points that are not dominated by, or not worse than other points. It is a candidate set of optimal solution, which depends on a specific evaluation criterion for optimum. However, conventional skyline queries, which return individual points, are inadequate in group querying case since optimal combinations are required. To address this gap, we study the skyline computation in group case and propose fast methods to find the group-based skyline (G-skyline), which contains Pareto optimal groups. For computing the front k skyline layers, we lay out an efficient approach that does the search concurrently on each dimension and investigates each point in subspace. After that, we present a novel structure to construct the G-skyline with a queue of combinations of the first-layer points. Experimental results show that our algorithms are several orders of magnitude faster than the previous work.

【Keywords】: combination queue; concurrent search; group skyline; multiple skyline layers; subspace skyline
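
As a reading aid for the skyline definition quoted in the abstract above, the following is a minimal sketch of point dominance, the skyline, and skyline layers. It is a naive quadratic illustration, not the paper's fast algorithms. It assumes that smaller values are better in every dimension and that layers are obtained by repeatedly peeling off the skyline of the remaining points; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def skyline_layers(points: List[Point], k: int) -> List[List[Point]]:
    """First k layers, assuming layer i is the skyline of what remains
    after layers 1..i-1 have been removed."""
    remaining = list(points)
    layers: List[List[Point]] = []
    while remaining and len(layers) < k:
        layer = skyline(remaining)
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers


# Hypothetical data: (price, distance to the beach), both to be minimized.
hotels = [(50, 3.0), (80, 1.0), (60, 2.0), (90, 2.5), (70, 2.0)]
print(skyline(hotels))
print(skyline_layers(hotels, 3))
```

Here (90, 2.5) and (70, 2.0) are both dominated by (60, 2.0), so they fall into later layers.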

47. Probabilistic Skyline on Incomplete Data.

【Paper Link】【Pages】:427-436

【Authors】: Kaiqi Zhang ; Hong Gao ; Xixian Han ; Zhipeng Cai ; Jianzhong Li

【Abstract】: The skyline query is important in database community. In recent years, the researches on incomplete data have been increasingly considered, especially for the skyline query. However, the existing skyline definition on incomplete data cannot provide users with valuable references. In this paper, we propose a novel skyline definition utilizing probabilistic model on incomplete data where each point has a probability to be in the skyline. In particular, it returns K points with the highest skyline probabilities. Meanwhile, it is a big challenge to compute probabilistic skyline on incomplete data. We propose an efficient algorithm PISkyline, which utilizes two pruning strategies to reduce the number of points and adopts two optimizations to accelerate probability computation for each point. Nevertheless, PISkyline is susceptible to the order of input data and there is still a great deal of room for optimization. We develop a point-level sorting technique by adjusting the order of accessing points to further improve the efficiency of PISkyline. Our experimental results demonstrate that our algorithms are tens of times faster than the naive algorithm on both synthetic and real datasets.

【Keywords】: incomplete data; probabilistic skyline; query processing

48. Communication-Efficient Distributed Skyline Computation.

【Paper Link】【Pages】:437-446

【Authors】: Haoyu Zhang ; Qin Zhang

【Abstract】: In this paper we study skyline queries in the distributed computational model, where we have s remote sites and a central coordinator; each site holds a piece of data, and the coordinator wants to compute the skyline of the union of the s datasets. The computation is in terms of rounds, and the goal is to minimize both the total communication cost and the round cost. We first give an algorithm with a small communication cost but potentially a large round cost; we show information-theoretically that the communication cost is optimal even if we allow an infinite number of communication rounds. We next give algorithms with smooth communication-round tradeoffs. We also show a strong lower bound for the communication cost if we can only use one round of communication. Finally, we demonstrate the superiority of our algorithms over existing ones by an extensive set of experiments on both synthetic and real world datasets.

【Keywords】: communication-efficient algorithms; distributed computation; skyline computation
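
To make the coordinator model in the abstract above concrete, here is a minimal sketch of a simple one-round baseline: each site sends only its local skyline, and the coordinator takes the skyline of what it receives. This is not the paper's algorithm, only a common starting point for comparison. It assumes smaller is better in every dimension and measures communication as the number of points sent; all names and data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """Assumption: smaller is better in every dimension."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def one_round_skyline(sites: List[List[Point]]) -> Tuple[List[Point], int]:
    """Each site sends its local skyline once; the coordinator merges.

    A point dominated inside its own site is also dominated in the union,
    so discarding it locally cannot change the global skyline.
    """
    local = [skyline(site) for site in sites]
    points_sent = sum(len(sky) for sky in local)  # communication cost in points
    merged = skyline([p for sky in local for p in sky])
    return merged, points_sent


# Hypothetical data held by three remote sites.
sites = [
    [(1, 5), (2, 6), (4, 4)],
    [(3, 3), (5, 1), (6, 2)],
    [(2, 4), (7, 7)],
]
global_skyline, cost = one_round_skyline(sites)
print(global_skyline, cost)
```

In this toy run five of the eight points are sent, and (4, 4) is discarded only at the coordinator because the point that dominates it, (3, 3), lives on a different site.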

49. Bringing Salary Transparency to the World: Computing Robust Compensation Insights via LinkedIn Salary.

【Paper Link】【Pages】:447-455

【Authors】: Krishnaram Kenthapadi ; Stuart Ambler ; Liang Zhang ; Deepak Agarwal

【Abstract】: The recently launched LinkedIn Salary product has been designed with the goal of providing compensation insights to the world's professionals and thereby helping them optimize their earning potential. We describe the overall design and architecture of the statistical modeling system underlying this product. We focus on the unique data mining challenges while designing and implementing the system, and describe the modeling components such as Bayesian hierarchical smoothing that help to compute and present robust compensation insights to users. We report on extensive evaluation with nearly one year of de-identified compensation data collected from over one million LinkedIn users, thereby demonstrating the efficacy of the statistical models. We also highlight the lessons learned through the deployment of our system at LinkedIn.

【Keywords】: bayesian hierarchical smoothing; linkedin salary; outlier detection; privacy-preserving statistical modeling; robust compensation insights

50. Efficient Document Filtering Using Vector Space Topic Expansion and Pattern-Mining: The Case of Event Detection in Microposts.

【Paper Link】【Pages】:457-466

【Authors】: Julia Proskurnia ; Ruslan Mavlyutov ; Carlos Castillo ; Karl Aberer ; Philippe Cudré-Mauroux

【Abstract】: Automatically extracting information from social media is challenging given that social content is often noisy, ambiguous, and inconsistent. However, as many stories break on social channels first before being picked up by mainstream media, developing methods to better handle social content is of utmost importance. In this paper, we propose a robust and effective approach to automatically identify microposts related to a specific topic defined by a small sample of reference documents. Our framework extracts clusters of semantically similar microposts that overlap with the reference documents, by extracting combinations of key features that define those clusters through frequent pattern mining. This allows us to construct compact and interpretable representations of the topic, dramatically decreasing the computational burden compared to classical clustering and k-NN-based machine learning techniques and producing highly-competitive results even with small training sets (less than 1'000 training objects). Our method is efficient and scales gracefully with large sets of incoming microposts. We experimentally validate our approach on a large corpus of over 60M microposts, showing that it significantly outperforms state-of-the-art techniques.

【Keywords】: event detection; frequent patterns mining; microposts; semantic attributes

51. LARM: A Lifetime Aware Regression Model for Predicting YouTube Video Popularity.

【Paper Link】【Pages】:467-476

【Authors】: Changsha Ma ; Zhisheng Yan ; Chang Wen Chen

【Abstract】: Online content popularity prediction provides substantial value to a broad range of applications in the end-to-end social media systems, from network resource allocation to targeted advertising. While using historical popularity can predict the near-term popularity with a reasonable accuracy, the bursty nature of online content popularity evolution makes it difficult to capture the correlation between historical data and future data in the long term. Although various existing efforts have been made toward long-term prediction, they need to accumulate a long enough historical data before the prediction and their model assumptions cannot be applied to the complex YouTube networks with inherent unpredictability. In this paper, we aim to achieve fast prediction of long-term video popularity in the complex YouTube networks. We propose LARM, a lifetime aware regression model, representing the first work that leverages content lifetime to compensate the insufficiency of historical data without assumptions of network structure. The proposed LARM is empowered by a lifetime metric that is both predictable via early-accessible features and adaptable to different observation intervals, as well as a set of specialized regression models to handle different classes of videos with different lifetime. We validate LARM on two YouTube data sets with hourly and daily observation intervals. Experimental results indicate that LARM outperforms several non-trivial baselines from the literature by up to 20% and 18% of prediction error reduction in the two data sets.

【Keywords】: popularity prediction; regression model; social media; time series analysis; youtube

52. Modeling Affinity based Popularity Dynamics.

【Paper Link】【Pages】:477-486

【Authors】: Minkyoung Kim ; Daniel A. McFarland ; Jure Leskovec

【Abstract】: Information items draw collective attention across a heterogeneous social system, leading to great disparities of popularity. Unveiling underlying diffusion processes is very challenging, since a social system consists of time-evolving subgroups interacting and exerting disproportionate influences on an individual item's popularity. In this study, we propose the Affinity Poisson Process model (APP) which models popularity dynamics, by incorporating (1) affinities between subgroups, (2) heterogeneous preferential attachment, and (3) subgroup-level time decay. As a case study, we apply our proposed model to scholarly publications in computer science. Our model outperforms the state of the art approach in predicting citation volumes of individual papers. More importantly, the proposed model enables us to uncover popularity dynamics driven by intra- and inter-subgroup interactions, which has been neglected in prior work. We expect that our model can afford interpretable insights on the attention economy in terms of affinity and aging effect.

【Keywords】: affinity; diffusion process; interdisciplinary citations; poisson process; popularity dynamics

Session 3A: Spatiotemporal 4

53. Scenic Routes Now: Efficiently Solving the Time-Dependent Arc Orienteering Problem.

【Paper Link】【Pages】:487-496

【Authors】: Ying Lu ; Gregor Jossé ; Tobias Emrich ; Ugur Demiryurek ; Matthias Renz ; Cyrus Shahabi ; Matthias Schubert

【Abstract】: Due to the availability of large transportation (e.g., road network sensor data) and transportation-related (e.g., pollution, crime) data as well as the ubiquity of car navigation systems, recent route planning techniques need to optimize for multiple criteria (e.g., travel time or distance, utility/value such as safety or attractiveness). In this paper, we introduce a novel problem called Twofold Time-Dependent Arc Orienteering Problem (2TD-AOP), which seeks to find a path from a source to a destination maximizing an accumulated value (e.g., attractiveness of the path) while not exceeding a cost budget (e.g., total travel time). 2TD-AOP has many applications in spatial crowdsourcing, real-time delivery, and online navigation systems (e.g., safest path, most scenic path). Although 2TD-AOP can be framed as a variant of AOP, existing AOP approaches cannot solve 2TD-AOP accurately as they assume that travel-times and values of network edges are constant. However, in real-world the travel-times and values are time-dependent, where the actual travel time and utility of an edge depend on the arrival time to the edge. We first discuss the practicality of this novel problem by demonstrating the benefits of considering time-dependency, empirically. Subsequently, we show that optimal solutions are infeasible (NP-hard) and solutions to the static problem are often invalid (i.e., exceed the cost budget). Therefore, we propose an efficient approximate solution with spatial pruning techniques, optimized for fast response systems. Experiments on a large-scale, fine-grained, real-world road network demonstrate that our approach always produces valid paths, is orders of magnitude faster than any optimal solution with acceptable accumulated value.

【Keywords】: arc orienteering problem; road network; scenic path; time dependent
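
The abstract above notes that a path planned with constant edge costs can exceed the budget once travel times depend on the arrival time at each edge. The sketch below illustrates only that observation; it is not the 2TD-AOP solver. The edge functions `rush_hour_time` and `constant_value`, the budget, and the three-node path are hypothetical.

```python
from typing import Callable, List, Tuple

# An edge function maps (from_node, to_node, arrival_time) to a number.
EdgeFn = Callable[[str, str, float], float]


def evaluate_path(path: List[str], travel_time: EdgeFn, value: EdgeFn,
                  start_time: float = 0.0) -> Tuple[float, float]:
    """Return (total travel time, accumulated value) of a path, evaluating
    each edge at the time the traveller actually arrives at it."""
    t = start_time
    total_value = 0.0
    for u, v in zip(path, path[1:]):
        total_value += value(u, v, t)
        t += travel_time(u, v, t)
    return t - start_time, total_value


def rush_hour_time(u: str, v: str, t: float) -> float:
    """Hypothetical: every edge takes 10 minutes before t=15 and 20 after."""
    return 10.0 if t < 15.0 else 20.0


def constant_value(u: str, v: str, t: float) -> float:
    """Hypothetical: every edge contributes one unit of value."""
    return 1.0


budget = 25.0
static_cost = 10.0 * 2  # what a static planner would assume for two edges
cost, gained = evaluate_path(["a", "b", "c"], rush_hour_time, constant_value,
                             start_time=10.0)
print(static_cost <= budget, cost <= budget)  # static estimate fits, actual path does not
```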

54. Modeling Temporal-Spatial Correlations for Crime Prediction.

【Paper Link】【Pages】:497-506

【Authors】: Xiangyu Zhao ; Jiliang Tang

【Abstract】: Crime prediction plays a crucial role in improving public security and reducing the financial loss of crimes. The vast majority of traditional algorithms performed the prediction by leveraging demographic data, which could fail to capture the dynamics of crimes in urban. In the era of big data, we have witnessed advanced ways to collect and integrate fine-grained urban, mobile, and public service data that contains various crime-related sources and rich temporal-spatial information. Such information provides better understandings about the dynamics of crimes and has potentials to advance crime prediction. In this paper, we exploit temporal-spatial correlations in urban data for crime prediction. In particular, we validate the existence of temporal-spatial correlations in crime and develop a principled approach to model these correlations into the coherent framework TCP for crime prediction. The experimental results on real-world data demonstrate the effectiveness of the proposed framework. Further experiments have been conducted to understand the importance of temporal-spatial correlations in crime prediction.

【Keywords】: crime prediction; crime prevention; temporal-spatial correlation

55. Spatiotemporal Event Forecasting from Incomplete Hyper-local Price Data.

【Paper Link】【Pages】:507-516

【Authors】: Xuchao Zhang ; Liang Zhao ; Arnold P. Boedihardjo ; Chang-Tien Lu ; Naren Ramakrishnan

【Abstract】: Hyper-local pricing data, e.g., about foods and commodities, exhibit subtle spatiotemporal variations that can be useful as crucial precursors of future events. Three major challenges in modeling such pricing data include: i) temporal dependencies underlying features; ii) spatiotemporal missing values; and iii) constraints underlying economic phenomena. These challenges hinder traditional event forecasting models from being applied effectively. This paper proposes a novel spatiotemporal event forecasting model that concurrently addresses the above challenges. Specifically, given continuous price data, a new soft time-lagged model is designed to select temporally dependent features. To handle missing values, we propose a data tensor completion method based on price domain knowledge. The parameters of the new model are optimized using a novel algorithm based on the Alternative Direction Methods of Multipliers (ADMM). Extensive experimental evaluations on multiple datasets demonstrate the effectiveness of our proposed approach.

【Keywords】: event forecasting; hyper-local price; optimization; spatiotemporal data mining

56. Exploiting Spatio-Temporal User Behaviors for User Linkage.

【Paper Link】【Pages】:517-526

【Authors】: Wei Chen ; Hongzhi Yin ; Weiqing Wang ; Lei Zhao ; Wen Hua ; Xiaofang Zhou

【Abstract】: Cross-device and cross-domain user linkage have been attracting a lot of attention recently. An important branch of the study is to achieve user linkage with spatio-temporal data generated by the ubiquitous GPS-enabled devices. The main task in this problem is twofold, i.e., how to extract the representative features of a user; how to measure the similarities between users with the extracted features. To tackle the problem, we propose a novel model STUL (Spatio-Temporal User Linkage) that consists of the following two components. 1) Extract users' spatial features with a density based clustering method, and extract the users' temporal features with the Gaussian Mixture Model. To link user pairs more precisely, we assign different weights to the extracted features, by lightening the common features and highlighting the discriminative features. 2) Propose novel approaches to measure the similarities between users based on the extracted features, and return the pair-wise users with similarity scores higher than a predefined threshold. We have conducted extensive experiments on three real-world datasets, and the results demonstrate the superiority of our proposed STUL over the state-of-the-art methods.

【Keywords】: cross-domain; spatio-temporal behaviors; user linkage

Session 3B: Short text retrieval 4

57. Similarity-based Distant Supervision for Definition Retrieval.

【Paper Link】【Pages】:527-536

【Authors】: Jiepu Jiang ; James Allan

【Abstract】: Recognizing definition sentences from free text corpora often requires hand-crafted patterns or explicitly labeled training instances. We present a distant supervision approach addressing this challenge without using explicitly labeled data. We use plausibly good but imperfect definition sentences from Wikipedia as references to annotate sentences in a target corpus based on text similarity measures such as ROUGE. Experimental results show our approach is highly effective, generating noisy but large, useful, and localized training instances. Definition sentence retrieval models trained using the synthesized training examples are more effective than those learned from manual judgments of a few thousand sentences. We also examine different text similarity measures for annotation, including both unsupervised and supervised ones. We show that our method can significantly benefit from supervised text similarity measures learned from either external training data (from the SemEval Semantic Text Similarity task) or local ones (a few hundred judged sentences on the target corpus). Our method offers a cheap, effective, and flexible solution to this task and can benefit a broad range of applications such as web search engines and QA systems.

【Keywords】: definition sentence retrieval; definitional question answering; distant supervision; semantic textual similarity
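
The annotation step described in the abstract above (scoring target-corpus sentences against an imperfect reference definition with a text similarity measure) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: it uses a bare unigram-overlap recall in the spirit of ROUGE-1 without stemming or stopword handling, and the `threshold` value, function names, and example sentences are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams (with multiplicity) found in the candidate."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())


def distant_labels(sentences: List[str], reference: str,
                   threshold: float = 0.5) -> List[Tuple[str, bool]]:
    """Label a sentence as a (noisy) positive if it is similar enough to the
    reference definition. The threshold is an arbitrary illustrative choice."""
    return [(s, rouge1_recall(s, reference) >= threshold) for s in sentences]


reference = "An autoencoder is a neural network trained to reconstruct its input."
corpus = [
    "The autoencoder is a neural network that learns to reconstruct its input.",
    "We trained the model for ten epochs.",
]
for sentence, is_definition in distant_labels(corpus, reference):
    print(is_definition, sentence)
```

The resulting noisy labels would then serve as training instances for a retrieval model, which is outside the scope of this sketch.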

58. Hybrid BiLSTM-Siamese network for FAQ Assistance.

【Paper Link】【Pages】:537-545

【Authors】: Prerna Khurana ; Puneet Agarwal ; Gautam Shroff ; Lovekesh Vig ; Ashwin Srinivasan

【Abstract】: We describe an automated assistant for answering frequently asked questions; our system has been deployed, and is currently answering HR-related queries in two different areas (leave management and health insurance) to a large number of users. The needs of a large global corporate lead us to model a frequently asked question (FAQ) to be an equivalence class of actually asked questions, for which there is a common answer (certified as being consistent with the organization's policy). When a new question is posed to our system, it finds the class of question, and responds with the answer for the class. At this point, the system is either correct (gives correct answer); or incorrect (gives wrong answer); or incomplete (says "I don't know"). We employ a hybrid deep-learning architecture in which a BiLSTM-based classifier is combined with second BiLSTM-based Siamese network in an iterative manner: Questions for which the classifier makes an error during training are used to generate a set of misclassified question-question pairs. These, along with correct pairs, are used to train the Siamese network to drive apart the (hidden) representations of the misclassified pairs. We present experimental results from our deployment showing that our iteratively trained hybrid network: (a) results in better performance than using just a classifier network, or just a Siamese network; (b) performs better than state-of-the-art sentence classifiers in the two areas in which it has been deployed, in terms of both accuracy as well as precision-recall tradeoff; and (c) also performs well on a benchmark public dataset. We also observe that using question-question pairs in our hybrid network, results in marginally better performance than using question-to-answer pairs. Finally, estimates of precision and recall from the deployment of our automated assistant suggest that we can expect the burden on our HR department to drop from answering about 6000 queries a day to about 1000.

【Keywords】: bilstm; chatbot; deep learning; siamese network

59. Regularized and Retrofitted models for Learning Sentence Representation with Context.

【Paper Link】【Pages】:547-556

【Authors】: Tanay Kumar Saha ; Shafiq R. Joty ; Naeemul Hassan ; Mohammad Al Hasan

【Abstract】: Vector representation of sentences is important for many text processing tasks that involve classifying, clustering, or ranking sentences. For solving these tasks, bag-of-word based representation has been used for a long time. In recent years, distributed representation of sentences learned by neural models from unlabeled data has been shown to outperform traditional bag-of-words representations. However, most existing methods belonging to the neural models consider only the content of a sentence, and disregard its relations with other sentences in the context. In this paper, we first characterize two types of contexts depending on their scope and utility. We then propose two approaches to incorporate contextual information into content-based models. We evaluate our sentence representation models in a setup, where context is available to infer sentence vectors. Experimental results demonstrate that our proposed models outshine existing models on three fundamental tasks, such as, classifying, clustering, and ranking sentences.

【Keywords】: classification; clustering; discourse; distributed representation of sentences; feature learning; ranking; retrofitting; sen2vec

60. Talking to Your TV: Context-Aware Voice Search with Hierarchical Recurrent Neural Networks.

【Paper Link】【Pages】:557-566

【Authors】: Jinfeng Rao ; Ferhan Türe ; Hua He ; Oliver Jojic ; Jimmy Lin

【Abstract】: We tackle the novel problem of navigational voice queries posed against an entertainment system, where viewers interact with a voice-enabled remote controller to specify the TV program to watch. This is a difficult problem for several reasons: such queries are short, even shorter than comparable voice queries in other domains, which offers fewer opportunities for deciphering user intent. Furthermore, ambiguity is exacerbated by underlying speech recognition errors. We address these challenges by integrating word- and character-level query representations and by modeling voice search sessions to capture the contextual dependencies in query sequences. Both are accomplished with a probabilistic framework in which recurrent and feedforward neural network modules are organized in a hierarchical manner. From a raw dataset of 32M voice queries from 2.5M viewers on the Comcast Xfinity X1 entertainment system, we extracted data to train and test our models. We demonstrate the benefits of our hybrid representation and context-aware model, which significantly outperforms competitive baselines that use learning to rank as well as neural networks.

【Keywords】: context modeling; lstm; navigational voice queries; voice search sessions

Session 3C: Community Detection 4

61. GPU-Accelerated Graph Clustering via Parallel Label Propagation.

【Paper Link】【Pages】:567-576

【Authors】: Yusuke Kozawa ; Toshiyuki Amagasa ; Hiroyuki Kitagawa

【Abstract】: Graph clustering has recently attracted much attention as a technique to extract community structures from various kinds of graph data. Since available graph data becomes increasingly large, the acceleration of graph clustering is an important issue for handling large-scale graphs. To this end, this paper proposes a fast graph clustering method using GPUs. The proposed method is based on parallelization of label propagation, one of the fastest graph clustering algorithms. Our method has the following three characteristics: (1) efficient parallelization: the algorithm of label propagation is transformed into a sequence of data-parallel primitives; (2) load balance: the method takes into account load balancing by adopting the primitives that make the load among threads and blocks well balanced; and (3) out-of-core processing: we also develop algorithms to efficiently deal with large-scale datasets that do not fit into GPU memory. Moreover, this GPU out-of-core algorithm is extended to simultaneously exploit both CPUs and GPUs for further performance gain. Extensive experiments with real-world and synthetic datasets show that our proposed method outperforms an existing parallel CPU implementation by a factor of up to 14.3 without sacrificing accuracy.

【Keywords】: community detection; gpu; graph clustering; label propagation
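
As a point of reference for the abstract above, the following is a minimal sequential sketch of plain label propagation, the base algorithm the paper parallelizes. It is not the authors' GPU method: the function name `label_propagation`, the adjacency-dict input, and the deterministic "largest label wins ties" rule are assumptions made here for reproducibility (the classical algorithm breaks ties at random).

```python
from collections import Counter


def label_propagation(adj, max_iters=100):
    """Sequential label propagation on an undirected graph.

    adj maps each node to a list of its neighbours. Every node starts in
    its own community and repeatedly adopts the most frequent label among
    its neighbours. Ties go to the largest label (an assumption made here
    so that the result is deterministic).
    """
    labels = {v: v for v in adj}
    for _ in range(max_iters):
        changed = False
        for v in sorted(adj):  # fixed visiting order, updates applied in place
            if not adj[v]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            new_label = max(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break  # converged: no node changed its label in this pass
    return labels
```

On two triangles joined by a single edge, this sketch assigns one label to each triangle.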

62. Temporally Like-minded User Community Identification through Neural Embeddings.

【Paper Link】【Pages】:577-586

【Authors】: Hossein Fani ; Ebrahim Bagheri ; Weichang Du

【Abstract】: We propose a neural embedding approach to identify temporally like-minded user communities, i.e., those communities of users who have similar temporal alignment in their topics of interest. Like-minded user communities in social networks are usually identified by either considering explicit structural connections between users (link analysis), users' topics of interest expressed in their posted contents (content analysis), or both in tandem. In such communities, however, the users' rich temporal behavior towards topics of interest is overlooked. Only a few recent research efforts consider the time dimension and define like-minded user communities as groups of users who share not only similar topical interests but also similar temporal behavior. Temporally like-minded user communities find application in areas such as recommender systems where relevant items are recommended to the users at the right time. In this paper, we tackle the problem of identifying temporally like-minded user communities by leveraging unsupervised feature learning (embeddings). Specifically, we learn a mapping from the user space to a low-dimensional vector space of features that incorporate both topics of interest and their temporal nature. We demonstrate the efficacy of our proposed approach on a Twitter dataset in the context of three applications: news recommendation, user prediction and community selection, where our work is able to outperform the state-of-the-art on important information retrieval metrics.

【Keywords】: community detection; neural embedding; social network analysis

63. Community-Based Network Alignment for Large Attributed Network.

【Paper Link】【Pages】:587-596

【Authors】: Zheng Chen ; Xinli Yu ; Bo Song ; Jianliang Gao ; Xiaohua Hu ; Wei-Shih Yang

【Abstract】: Network alignment is becoming an active topic in network data analysis. Despite extensive research, we realize that efficient use of topological and attribute information for large attributed network alignment has not been sufficiently addressed in previous studies. In this paper, based on Stochastic Block Model (SBM) and Dirichlet-multinomial, we propose "divide-and-conquer" models CAlign that jointly consider network alignment, community discovery and community alignment in one framework for large networks with node attributes, in an effort to reduce both the computation time and memory usage while achieving better or competitive performance. It is provable that the algorithms derived from our model have sub-quadratic time complexity and linear space complexity on a network with small densification power, which is true for most real-world networks. Experiments show CAlign is superior to two recent state-of-art models in terms of accuracy, time and memory on large networks, and CAlign is capable of handling millions of nodes on a modern desktop machine.

【Keywords】: Attributed Network Alignment; Community Discovery; Dirichlet-Multinomial; Large Network; Stochastic Block Model

64. A Non-negative Symmetric Encoder-Decoder Approach for Community Detection.

【Paper Link】【Pages】:597-606

【Authors】: Bing-Jie Sun ; Huawei Shen ; Jinhua Gao ; Wentao Ouyang ; Xueqi Cheng

【Abstract】: Community detection or graph clustering is crucial to understanding the structure of complex networks and extracting relevant knowledge from networked data. Latent factor model, e.g., non-negative matrix factorization and mixed membership block model, is one of the most successful methods for community detection. Latent factor models for community detection aim to find a distributed and generally low-dimensional representation, or coding, that captures the structural regularity of network and reflects the community membership of nodes. Existing latent factor models are mainly based on reconstructing a network from the representation of its nodes, namely network decoder, while constraining the representation to have certain desirable properties. These methods, however, lack an encoder that transforms nodes into their representation. Consequently, they fail to give a clear explanation about the meaning of a community and suffer from undesired computational problems. In this paper, we propose a non-negative symmetric encoder-decoder approach for community detection. By explicitly integrating a decoder and an encoder into a unified loss function, the proposed approach achieves better performance over state-of-the-art latent factor models for community detection task. Moreover, different from existing methods that explicitly impose the sparsity constraint on the representation of nodes, the proposed approach implicitly achieves the sparsity of node representation through its symmetric and non-negative properties, making the optimization much easier than competing methods based on sparse matrix factorization.

【Keywords】: community detection; encoder-decoder; latent factor model

Session 3D: Time Series 4

65. Fast Word Recognition for Noise channel-based Models in Scenarios with Noise Specific Domain Knowledge.

【Paper Link】【Pages】:607-616

【Authors】: Marco Cristo ; Raíza Hanada ; André Luiz da Costa Carvalho ; Fernando Anglada Lores ; Maria da Graça Campos Pimentel

【Abstract】: Word recognition is a challenging task faced by many applications, especially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy channel, such that it is necessary to determine which known word of a lexicon is the received string. To be feasible, just a reduced set of candidate words is selected. They are usually chosen if they can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real use data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using minimal perfect hashing. The method allows the early processing of most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which takes advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, being about ten times faster.

【Keywords】: approximate searching; eye-based typing; noise channel model; word recognition
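
The candidate-selection step mentioned in the abstract above (words reachable within k character edit operations) can be illustrated with a naive lexicon scan. This is only a baseline sketch under assumptions: it uses unit-cost Levenshtein distance, scans every word, and the names `edit_distance` and `candidates` are hypothetical. The paper's contribution is precisely to avoid such a scan with a Mor-Fraenkel index and noise-aware distance estimates, neither of which is shown here.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidates(received, lexicon, k):
    """Lexicon words within k edits of the received string, closest first."""
    scored = [(edit_distance(received, w), w) for w in lexicon]
    return [w for dist, w in sorted(scored) if dist <= k]
```

For example, `candidates("helo", ["hello", "help", "world", "halo"], 1)` keeps the three words that are one edit away and drops `"world"`.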

66. Detecting Multiple Periods and Periodic Patterns in Event Time Sequences.

【Paper Link】【Pages】:617-626

【Authors】: Quan Yuan ; Jingbo Shang ; Xin Cao ; Chao Zhang ; Xinhe Geng ; Jiawei Han

【Abstract】: Periodicity is prevalent in the physical world, and many events involve more than one period, e.g., an individual's mobility, tide patterns, and massive transportation utilization. Knowing the true periods of events can benefit a number of applications, such as traffic prediction, time-aware recommendation and advertisement, and anomaly detection. However, detecting multiple periods is a very challenging task due to not only the interwoven periodic patterns but also the low quality of event tracking records. In this paper, we study the problem of discovering all true periods and the corresponding occurrence patterns of an event from a noisy and incomplete observation sequence. We devise a novel scoring function, by maximizing which we can identify the true periodic patterns involved in the sequence. We prove, however, that optimizing the objective function is an NP-hard problem. To address this challenge, we develop a heuristic algorithm named Timeslot Coverage Model (TiCom), for identifying the periods and periodic patterns approximately. The results of extensive experiments on both synthetic and real-life datasets show that our model outperforms the state-of-the-art baselines significantly in various tasks, including period detection, periodic pattern identification, and anomaly detection.

【Keywords】: anomaly detection; np hard; periodicity detection; sequence mining

67. Finding Periodic Discrete Events in Noisy Streams.

【Paper Link】【Pages】:627-636

【Authors】: Abhirup Ghosh ; Christopher Lucas ; Rik Sarkar

【Abstract】: Periodic phenomena are ubiquitous, but detecting and predicting periodic events can be difficult in noisy environments. We describe a model of periodic events that covers both idealized and realistic scenarios characterized by multiple kinds of noise. The model incorporates false-positive events and the possibility that the underlying period and phase of the events change over time. We then describe a particle filter that can efficiently and accurately estimate the parameters of the process generating periodic events intermingled with independent noise events. The system has a small memory footprint, and, unlike alternative methods, its computational complexity is constant in the number of events that have been observed. As a result, it can be applied in low-resource settings that require real-time performance over long periods of time. In experiments on real and simulated data we find that it outperforms existing methods in accuracy and can track changes in periodicity and other characteristics in dynamic event streams.

【Keywords】: particle filter; periodicity; temporal sequence mining

68. Fast and Accurate Time Series Classification with WEASEL.

【Paper Link】【Pages】:637-646

【Authors】: Patrick Schäfer ; Ulf Leser

【Abstract】: Time series (TS) occur in many scientific and commercial applications, ranging from earth surveillance to industry automation to smart grids. An important type of TS analysis is classification, which can, for instance, improve energy load forecasting in smart grids by detecting the types of electronic devices based on their energy consumption profiles recorded by automatic sensors. Such sensor-driven applications are very often characterized by (a) very long TS and (b) very large TS datasets needing classification. However, current methods for time series classification (TSC) cannot cope with such data volumes at acceptable accuracy; they are either scalable but offer only inferior classification quality, or they achieve state-of-the-art classification quality but cannot scale to large data volumes. In this paper, we present WEASEL (Word ExtrAction for time SEries cLassification), a novel TSC method which is both fast and accurate. Like other state-of-the-art TSC methods, WEASEL transforms time series into feature vectors, using a sliding-window approach, which are then analyzed through a machine learning classifier. The novelty of WEASEL lies in its specific method for deriving features, resulting in a much smaller yet much more discriminative feature set. On the popular UCR benchmark of 85 TS datasets, WEASEL is more accurate than the best current non-ensemble algorithms at orders-of-magnitude lower classification and training times, and it is almost as accurate as ensemble classifiers, whose computational complexity makes them inapplicable even for mid-size datasets. The outstanding robustness of WEASEL is also confirmed by experiments on two real smart grid datasets, where it out-of-the-box achieves almost the same accuracy as highly tuned, domain-specific methods.

【Keywords】: bag-of-patterns; classification; feature selection; time series; word co-occurrences

Session 3E: Query processing 4

69. QLever: A Query Engine for Efficient SPARQL+Text Search.

【Paper Link】【Pages】:647-656

【Authors】: Hannah Bast ; Björn Buchhold

【Abstract】: We present QLever, a query engine for efficient combined search on a knowledge base and a text corpus, in which named entities from the knowledge base have been identified (that is, recognized and disambiguated). The query language is SPARQL extended by two QLever-specific predicates ql:contains-entity and ql:contains-word, which can express the occurrence of an entity or word (the object of the predicate) in a text record (the subject of the predicate). We evaluate QLever on two large datasets, including FACC (the ClueWeb12 corpus linked to Freebase). We compare against three state-of-the-art query engines for knowledge bases with varying support for text search: RDF-3X, Virtuoso, Broccoli. Query times are competitive and often faster on the pure SPARQL queries, and several orders of magnitude faster on the SPARQL+Text queries. Index size is larger for pure SPARQL queries, but smaller for SPARQL+Text queries.

【Keywords】: efficiency; indexing; sparql+text

70. A Study of Main-Memory Hash Joins on Many-core Processor: A Case with Intel Knights Landing Architecture.

【Paper Link】【Pages】:657-666

【Authors】: Xuntao Cheng ; Bingsheng He ; Xiaoli Du ; Chiew Tong Lau

【Abstract】: Advanced processor architectures have been driving new designs, implementations and optimizations of main-memory hash join algorithms recently. The newly released Intel Xeon Phi many-core processor of the Knights Landing architecture (KNL) embraces interesting hardware features such as many low-frequency out-of-order cores connected on a 2D mesh, and high-bandwidth multi-channel memory (MCDRAM). In this paper, we experimentally revisit the state-of-the-art main-memory hash join algorithms to study how the new hardware features of KNL affect the algorithmic design and tuning as well as to identify the opportunities for further performance improvement on KNL. Our experiments show that, although many existing optimizations are still valid on KNL with proper tuning, even the state-of-the-art algorithms have severely underutilized the memory bandwidth and other hardware resources.

【Keywords】: database operators; hash join algorithms; many-core processor

71. PQBF: I/O-Efficient Approximate Nearest Neighbor Search by Product Quantization.

【Paper Link】【Pages】:667-676

【Authors】: Yingfan Liu ; Hong Cheng ; Jiangtao Cui

【Abstract】: Approximate nearest neighbor (ANN) search in high-dimensional space plays an essential role in many multimedia applications. Recently, product quantization (PQ) based methods for ANN search have attracted enormous attention in the community of computer vision, due to its good balance between accuracy and space requirement. PQ based methods embed a high-dimensional vector into a short binary code (called PQ code), and the squared Euclidean distance is estimated by asymmetric quantizer distance (AQD) with pretty high precision. Thus, ANN search in the original space can be converted to similarity search on AQD using the PQ approach. All existing PQ methods are in-memory solutions, which may not handle massive data if they cannot fit entirely in memory. In this paper, we propose an I/O-efficient PQ based solution for ANN search. We design an index called PQB+-forest to support efficient similarity search on AQD. PQB+-forest first creates a number of partitions of the PQ codes by a coarse quantizer and then builds a B+-tree, called PQB+-tree, for each partition. The search process is greatly expedited by focusing on a few selected partitions that are closest to the query, as well as by the pruning power of PQB+-trees. According to the experiments conducted on two large-scale data sets containing up to 1 billion vectors, our method outperforms its competitors, including the state-of-the-art PQ method and the state-of-the-art LSH methods for ANN search.

【Keywords】: approximate nearest neighbor search; b+-tree; product quantization
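
To make the abstract's notion of estimating squared Euclidean distance from a PQ code concrete, here is a small in-memory sketch of product-quantization encoding and the asymmetric distance (the query stays unquantized, the database vector is replaced by its codewords). The toy codebooks and the names `pq_encode` and `asymmetric_distance` are assumptions for illustration; codebook training and the paper's PQB+-forest index are not shown.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def pq_encode(vector, codebooks):
    """Replace each sub-vector by the index of its nearest codeword.

    codebooks[j] is the list of codewords for the j-th sub-space; the
    vector length is assumed to be divisible by len(codebooks).
    """
    sub = len(vector) // len(codebooks)
    code = []
    for j, centroids in enumerate(codebooks):
        part = vector[j * sub:(j + 1) * sub]
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(part, centroids[c]))
        code.append(nearest)
    return code


def asymmetric_distance(query, code, codebooks):
    """Estimate the squared distance between a raw query and an encoded vector."""
    sub = len(query) // len(codebooks)
    return sum(sq_dist(query[j * sub:(j + 1) * sub], codebooks[j][c])
               for j, c in enumerate(code))
```

In practice the per-sub-space distances from the query to every codeword are usually precomputed once into a lookup table, so that each encoded vector costs only one table lookup per sub-space.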

72. ANS-Based Index Compression.

【Paper Link】【Pages】:677-686

【Authors】: Alistair Moffat ; Matthias Petri

【Abstract】: Techniques for effectively representing the postings lists associated with inverted indexes have been studied for many years. Here we combine the recently developed "asymmetric numeral systems" (ANS) approach to entropy coding and a range of previous index compression methods, including VByte, Simple, and Packed. The ANS mechanism allows each of them to provide markedly improved compression effectiveness, at the cost of slower decoding rates. Using the 426 GB GOV2 collection, we show that the combination of blocking and ANS-based entropy-coding against a set of 16 magnitude-based probability models yields compression effectiveness superior to most previous mechanisms, while still providing reasonable decoding speed.

【Keywords】: asymmetric numeral systems; entropy coder; index compression; inverted index; postings list
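
For readers unfamiliar with the byte-aligned codes the abstract above builds on, the following sketch shows one common VByte variant applied to the gaps of a postings list. It illustrates the VByte baseline only, not the ANS-based coders of the paper. The convention used (seven payload bits per byte, least significant group first, high bit set on every byte except the last) is an assumption; other VByte variants flip the flag or the byte order.

```python
def to_gaps(postings):
    """Turn a strictly increasing docid list into first id plus differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    postings, total = [], 0
    for g in gaps:
        total += g
        postings.append(total)
    return postings


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order group first."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)  # high bit clear: last byte of this number
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers
```

Small gaps fit in a single byte, which is why gap encoding precedes the byte code.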

Session 3F: Temporal data 4

73. Covering the Optimal Time Window Over Temporal Data.

【Paper Link】【Pages】:687-696

【Authors】: Bin Cao ; Chenyu Hou ; Jing Fan

【Abstract】: In this paper, we propose a new problem: covering the optimal time window over temporal data. Given a duration constraint d and a set of users where each user has multiple time intervals, the goal is to find all time windows which (1) are greater than or equal to the duration d, and (2) can be covered by the intervals from as many users as possible. This problem can be applied to real scenarios where people need to determine the best time for maximizing the number of people to be involved in an activity, e.g., meeting organization and online live video broadcasting. As far as we know, there is no existing algorithm that can solve the problem directly. In this paper, we propose two algorithms to solve the problem. The first one is a baseline algorithm called sliding time window (STW), where we utilize the start and end points of all users' intervals to construct time windows satisfying duration d, and then calculate the number of users whose intervals can cover the current time window. The second method, named TLI, is designed based on the data structures from the Timeline Index in SAP HANA. In the TLI algorithm, we conduct three consecutive phases to improve efficiency, namely construction of the Timeline Index, calculation of the valid user set and calculation of time windows. Within the third phase, we prune the number of time windows by keeping track of the number of users in the current optimal time window, which can help shrink the search space. Through extensive experimental evaluations, we find that the TLI algorithm outperforms STW by two orders of magnitude in terms of querying time.

【Keywords】: optimal time window covering; temporal data; time interval; timeline index
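
The problem statement in the abstract above can be illustrated with a brute-force sketch in the spirit of the sliding-window baseline. It is not the authors' STW or TLI algorithm. Assumptions: a user covers a window when a single one of that user's intervals contains it, only windows of length exactly d are enumerated (a longer window can never be covered by more users), and the function name `best_windows` is hypothetical.

```python
def best_windows(user_intervals, d):
    """Return (max_users, windows) over candidate windows of length d.

    user_intervals maps a user to a list of (start, end) pairs. Candidate
    windows start at the start point of any interval that is at least d
    long; the number of covering users can only peak at such a point.
    """
    starts = sorted({a for ivs in user_intervals.values()
                     for a, b in ivs if b - a >= d})
    best, windows = 0, []
    for s in starts:
        # Count users having at least one interval that contains [s, s + d].
        count = sum(any(a <= s and s + d <= b for a, b in ivs)
                    for ivs in user_intervals.values())
        if count > best:
            best, windows = count, [(s, s + d)]
        elif count == best:
            windows.append((s, s + d))
    return best, windows
```

With three users available at 9-12, 10-13, and 8-11 plus 14-16, and d = 1, the only window covered by all three starts at 10.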

74. Scaling Probabilistic Temporal Query Evaluation.

【Paper Link】【Pages】:697-706

【Authors】: Melisachew Wudage Chekol

【Abstract】: Open information extraction has driven automatic construction of (temporal) knowledge graphs (e.g. YAGO) that maintain probabilistic (temporal) facts and inference rules. One of the most important tasks in these knowledge graphs is query evaluation. This task is well known to be #P-hard. One of the bottlenecks of probabilistic (temporal) query evaluation is finding efficient ways of grounding the query and inference rules, to generate a factor graph that can be used for approximate query evaluation or to retrieve lineages of queries for exact evaluation. In this work, we propose the PRATiQUE (PRobAbilistic Temporal QUery Evaluation) framework for scalable temporal query evaluation. It harnesses the structure of temporal inference rules for efficient in-database grounding, i.e., it uses partitions to store structurally equivalent rules. Besides, PRATiQUE leverages a state-of-the-art Gibbs sampler to compute marginal probabilities of query answers. We report on an extensive experimental evaluation, which confirms the efficiency of our proposal.

【Keywords】: probabilistic; query evaluation; temporal knowledge graphs

75. Efficient Discovery of Abnormal Event Sequences in Enterprise Security Systems.

【Paper Link】【Pages】:707-715

【Authors】: Boxiang Dong ; Zhengzhang Chen ; Wendy Hui Wang ; Lu An Tang ; Kai Zhang ; Ying Lin ; Zhichun Li ; Haifeng Chen

【Abstract】: An intrusion detection system (IDS) is an important part of enterprise security system architecture. In particular, anomaly-based IDS has been widely applied to detect single abnormal process events that deviate from the majority. However, intrusion activity usually consists of a series of low-level heterogeneous events. The gap between low-level process events and high-level intrusion activities makes it particularly challenging to identify process events that are truly involved in a real malicious activity, especially considering the massive 'noisy' events filling the event sequences. Hence, existing work that focuses on detecting single events can hardly achieve high detection accuracy. In this work, we formulate a novel problem in intrusion detection - suspicious event sequence discovery, and propose GID, an efficient graph-based intrusion detection technique that can identify abnormal event sequences from massive heterogeneous process traces with high accuracy. We fully implement GID and deploy it into a real-world enterprise security system, and it greatly helps detect the advanced threats and optimize the incident response. Executing GID on both static and streaming data shows that GID is efficient (processes about 2 million records per minute) and accurate for intrusion detection.

【Keywords】: anomaly detection; enterprise security system; graph modeling; intrusion detection

76. Temporal Analog Retrieval using Transformation over Dual Hierarchical Structures.

【Paper Link】【Pages】:717-726

【Authors】: Yating Zhang ; Adam Jatowt ; Katsumi Tanaka

【Abstract】: In recent years, we have witnessed a rapid increase of text content stored in digital archives such as newspaper archives or web archives. Many old documents have been converted to digital form and made accessible online. Due to the passage of time, it is however difficult to effectively perform search within such collections. Users, especially younger ones, may have problems in finding appropriate keywords to perform effective search due to the terminology gap arising between their knowledge and the unfamiliar domain of archival collections. In this paper, we provide a general framework to bridge different domains across time and, by this, to facilitate search and comparison as if carried out in the user's familiar domain (i.e., the present). In particular, we propose to find analogical terms across temporal text collections by applying a series of transformation procedures. We develop a cluster-biased transformation technique which makes use of hierarchical cluster structures built on the temporally distributed document collections. Our methods do not need any specially prepared training data and can be applied to diverse collections and time periods. We test the performance of the proposed approaches on the collections separated by both short (e.g., 20 years) and long time gaps (70 years), and we report improvements in the range of 18%-27% over short and 56%-92% over long periods when compared to state-of-the-art baselines.

【Keywords】: cluster-biased; dual hierarchical structure; heterogeneous document collections; temporal analog

Session 4A: Evaluation 4

77. Does That Mean You're Happy?: RNN-based Modeling of User Interaction Sequences to Detect Good Abandonment.

【Paper Link】【Pages】:727-736

【Authors】: Kyle Williams ; Imed Zitouni

【Abstract】: Queries for which there are no clicks are known as abandoned queries. Differentiating between good and bad abandonment queries has become an important task in search engine evaluation since it allows for better measurement of search engine features that do not require users to click. Examples of these features include answers on the SERP and detailed Web result snippets. In this paper, we investigate how sequences of user interactions on the SERP differ between good and bad abandonment. To do this, we study the behavior patterns on a labeled dataset of abandoned queries and find that they differ in several ways, such as in the number of user interactions and the nature of those interactions. Based on this insight, we frame good abandonment detection as a sequence classification problem. We use a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) to model the sequence of user interactions and show that it performs significantly better than other baselines when detecting good abandonment, achieving 71% accuracy. Our findings have implications for search engine evaluation.

【Keywords】: good abandonment; lstm; mouse movements; satisfaction; user interaction modeling

78. Deep Sequential Models for Task Satisfaction Prediction.

【Paper Link】【Pages】:737-746

【Authors】: Rishabh Mehrotra ; Ahmed Hassan Awadallah ; Milad Shokouhi ; Emine Yilmaz ; Imed Zitouni ; Ahmed El Kholy ; Madian Khabsa

【Abstract】: Detecting and understanding implicit signals of user satisfaction are essential for experimentation aimed at predicting searcher satisfaction. As retrieval systems have advanced, search tasks have steadily emerged as accurate units not only for capturing searchers' goals but also for understanding how well a system is able to help the user achieve that goal. However, a major portion of existing work on modeling searcher satisfaction has focused on query level satisfaction. The few existing approaches for task satisfaction prediction have narrowly focused on simple tasks aimed at solving atomic information needs. In this work we go beyond such atomic tasks and consider the problem of predicting a user's satisfaction when engaged in complex search tasks composed of many different queries and subtasks. We begin by considering a holistic view of user interactions with the search engine result page (SERP) and extract detailed interaction sequences of their activity. We then look at query level abstraction and propose a novel deep sequential architecture which leverages the extracted interaction sequences to predict query level satisfaction. Further, we enrich this model with auxiliary features which have been traditionally used for satisfaction prediction and propose a unified multi-view model which combines the benefit of user interaction sequences with auxiliary features. Finally, we go beyond query level abstraction and consider query sequences issued by the user in order to complete a complex task, to make task level satisfaction predictions. We propose a number of functional composition techniques which take into account query level satisfaction estimates along with the query sequence to predict task level satisfaction. Through rigorous experiments, we demonstrate that the proposed deep sequential models significantly outperform established baselines at both query and task satisfaction prediction. Our findings have implications for metric development for gauging user satisfaction and for designing systems which help users accomplish complex search tasks.

【Keywords】: lstm; search tasks; user interactions

79. Adaptive Persistence for Search Effectiveness Measures.

【Paper Link】【Pages】:747-756

【Authors】: Jiepu Jiang ; James Allan

【Abstract】: Many search effectiveness evaluation measures penalize the importance of results at lower ranks. This is usually explained as an attempt to model users' persistence when sequentially examining results---lower ranked results are less important because users are less likely persistent enough to read them. The persistence parameters are usually set to cope with the target cohort and tasks. But during a particular evaluation round, the same parameters are applied to evaluate different ranked lists. In contrast, we present work that adapts the persistence factor according to the ranking and relevance of the ranked lists being evaluated. This is to model that rational users change their browsing behavior according to the search result page, e.g., users avoid wasting time (a low persistence level) if the results look apparently off-topic. Experimental results show that this approach better fits observed user behavior and correlates with users' ratings on their search performance.

【Keywords】: persistence; search effectiveness evaluation measure; user model
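
As background for the abstract above, rank-biased precision (RBP) is one well-known measure with a fixed persistence parameter p: the result at rank i receives weight (1 - p) * p^(i - 1). The abstract does not name RBP, so using it here is an assumption made for illustration, and the adaptive scheme proposed in the paper is not shown.

```python
def rbp(relevances, p):
    """Rank-biased precision for relevance gains listed in rank order.

    p in (0, 1) is the persistence: the assumed probability that a user
    who examined one result continues on to the next.
    """
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))
```

With the same p applied to every ranked list, a single relevant result at rank 1 scores 0.5 under p = 0.5 but only about 0.1 under p = 0.9, which shows how strongly the fixed parameter shapes the score.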

80. Beyond Success Rate: Utility as a Search Quality Metric for Online Experiments.

【Paper Link】【Pages】:757-765

【Authors】: Widad Machmouchi ; Ahmed Hassan Awadallah ; Imed Zitouni ; Georg Buscher

【Abstract】: User satisfaction metrics are an integral part of search engine development as they help system developers to understand and evaluate the quality of the user experience. Research to date has mostly focused on predicting success or frustration as a proxy for satisfaction. However, users' search experience is more complex than merely being either successful or not. As such, using success rate as a measure of satisfaction can be limiting. In this work, we propose the use of utility as a measure of searcher satisfaction. This concept represents the fulfillment a user receives from consuming a service and explains how users aim to gain optimal overall satisfaction. Our utility metrics measure the user satisfaction by aggregating all their interaction with the search engine. These interactions are represented as a timeline of actions and their dwelltimes, where each action is classified as having a positive or negative effect on the user. We examine sessions mined from Bing logs, with multi-point scale assessment of searcher satisfaction and show that utility is a better proxy for satisfaction compared to success. Leveraging that data, we design metrics of searcher satisfaction that assess the overall utility accumulated by a user during her search session. We use real user traffic from millions of users in an A/B setting to compare utility metrics to success rate metrics. We show that utility is a better metric for evaluating searcher satisfaction with the search engine, and a more sensitive and accurate metric when compared to predicting success. These metrics are currently adopted as the top-level metric for evaluating the thousands of A/B experiments that are run on Bing each year.

【Keywords】: Search satisfaction; effort; session; evaluation; utility

Session 4B: News and credibility 4

81. Linking News across Multiple Streams for Timeliness Analysis.

【Paper Link】【Pages】:767-776

【Authors】: Ida Mele ; Seyed Ali Bahrainian ; Fabio Crestani

【Abstract】: Linking multiple news streams based on the reported events and analyzing the streams' temporal publishing patterns are two very important tasks for information analysis, discovering newsworthy stories, studying the event evolution, and detecting untrustworthy sources of information. In this paper, we propose techniques for cross-linking news streams based on the reported events with the purpose of analyzing the temporal dependencies among streams. Our research tackles two main issues: (1) how news streams are connected as reporting an event or the evolution of the same event and (2) how timely the newswires report related events using different publishing platforms. Our approach is based on dynamic topic modeling for detecting and tracking events over the timeline and on clustering news according to the events. We leverage the event-based clustering to link news across different streams and present two scoring functions for ranking the streams based on their timeliness in publishing news about a specific event.

【Keywords】: dynamic topic modeling; event mining; news streams; temporal analysis

82. Growing Story Forest Online from Massive Breaking News.

【Paper Link】【Pages】:777-785

【Authors】: Bang Liu ; Di Niu ; Kunfeng Lai ; Linglong Kong ; Yu Xu

【Abstract】: We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. In solving these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. We conducted extensive evaluation based on 60 GB of real-world Chinese news data, although our ideas are not language-dependent and can easily be extended to other languages, through detailed pilot user experience studies. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks.

【Keywords】: information retrieval; online story tree; text clustering

83. iFACT: An Interactive Framework to Assess Claims from Tweets.

【Paper Link】【Pages】:787-796

【Authors】: Wee-Yong Lim ; Mong-Li Lee ; Wynne Hsu

【Abstract】: Posts by users on microblogs such as Twitter provide diverse real-time updates to major events. Unfortunately, not all of this information is credible. Previous works that assess the credibility of information in Twitter have focused on extracting features from the Tweets. In this work, we present an interactive framework called iFACT for assessing the credibility of claims from tweets. The proposed framework collects independent evidence from web search results (WSR) and identifies the dependencies between claims. It utilizes features from the search results to determine the probabilities that a claim is credible, not credible or inconclusive. Finally, the dependencies between claims are used to adjust the likelihood estimates of a claim being credible, not credible or inconclusive. iFACT allows users to be engaged in the credibility assessment process by providing feedback as to whether the web search results are relevant, support or contradict a claim. Experimental results on multiple real-world datasets demonstrate the effectiveness of WSR features and their ability to generalize to claims of new events. Case studies show the usefulness of claim dependencies and how the proposed approach can explain the credibility assessment process.

【Keywords】: credibility

84. CSI: A Hybrid Deep Model for Fake News Detection.

【Paper Link】【Pages】:797-806

【Authors】: Natali Ruchansky ; Sungyong Seo ; Yan Liu

【Abstract】: The topic of fake news has drawn attention both from the public and the academic communities. Such misinformation has the potential of affecting public opinion, providing an opportunity for malicious parties to manipulate the outcomes of public events such as elections. Because such high stakes are at play, automatically detecting fake news is an important, yet challenging problem that is not yet well understood. Nevertheless, there are three generally agreed upon characteristics of fake news: the text of an article, the user response it receives, and the source users promoting it. Existing work has largely focused on tailoring solutions to one particular characteristic which has limited their success and generality. In this work, we propose a model that combines all three characteristics for a more accurate and automated prediction. Specifically, we incorporate the behavior of both parties, users and articles, and the group behavior of users who propagate fake news. Motivated by the three characteristics, we propose a model called CSI which is composed of three modules: Capture, Score, and Integrate. The first module is based on the response and text; it uses a Recurrent Neural Network to capture the temporal pattern of user activity on a given article. The second module learns the source characteristic based on the behavior of users, and the two are integrated with the third module to classify an article as fake or not. Experimental analysis on real-world data demonstrates that CSI achieves higher accuracy than existing models, and extracts meaningful latent representations of both users and articles.

【Keywords】: deep learning; fake news detection; group anomaly detection; neural network; social networks; temporal analysis

Session 4C: Outliers and Anomaly Detection 4

85. Selective Value Coupling Learning for Detecting Outliers in High-Dimensional Categorical Data.

【Paper Link】【Pages】:807-816

【Authors】: Guansong Pang ; Hongzuo Xu ; Longbing Cao ; Wentao Zhao

【Abstract】: This paper introduces a novel framework, namely SelectVC and its instance POP, for learning selective value couplings (i.e., interactions between the full value set and a set of outlying values) to identify outliers in high-dimensional categorical data. Existing outlier detection methods work on a full data space or feature subspaces that are identified independently from subsequent outlier scoring. As a result, they are significantly challenged by overwhelming irrelevant features in high-dimensional data due to the noise brought by the irrelevant features and its huge search space. In contrast, SelectVC works on a clean and condensed data space spanned by selective value couplings by jointly optimizing outlying value selection and value outlierness scoring. Its instance POP defines a value outlierness scoring function by modeling a partial outlierness propagation process to capture the selective value couplings. POP further defines a top-k outlying value selection method to ensure its scalability to the huge search space. We show that POP (i) significantly outperforms five state-of-the-art full space- or subspace-based outlier detectors and their combinations with three feature selection methods on 12 real-world high-dimensional data sets with different levels of irrelevant features; and (ii) obtains good scalability, stable performance w.r.t. k, and fast convergence rate.

【Keywords】: categorical data; coupling learning; feature selection; high-dimensional data; outlier detection

86. Outlier Detection in Sparse Data with Factorization Machines.

【Paper Link】【Pages】:817-826

【Authors】: Mengxiao Zhu ; Charu C. Aggarwal ; Shuai Ma ; Hui Zhang ; Jinpeng Huai

【Abstract】: In sparse data, a large fraction of the entries take on zero values. Some examples of sparse data include short text snippets (such as tweets in Twitter) or some feature representations of categorical data sets with a large number of values, in which traditional methods for outlier detection typically fail because of the difficulty of computing distances. To address this, it is important to use the latent relations between such values. Factorization machines represent a natural methodology for this, and are naturally designed for the massive-domain setting because of their emphasis on sparse data sets. In this study, we propose an outlier detection approach for sparse data with factorization machines. Factorization machines are also efficient due to their linear complexity in the number of non-zero values. In fact, because of their efficiency, they can even be extended to traditional settings for numerical data by an appropriate feature engineering effort. We show that our approach is both effective and efficient for sparse categorical, short text and numerical data by an extensive experimental study.

【Keywords】: factorization machines; outlier detection; sparse data

87. Anomaly Detection in Dynamic Networks using Multi-view Time-Series Hypersphere Learning.

【Paper Link】【Pages】:827-836

【Authors】: Xian Teng ; Yu-Ru Lin ; Xidao Wen

【Abstract】: Detecting anomalous patterns from dynamic and multi-attributed network systems has been a challenging problem due to the complication of temporal dynamics and the variations reflected in multiple data sources. We propose a Multi-view Time-Series Hypersphere Learning (MTHL) approach that leverages multi-view learning and support vector description to tackle this problem. Given a dynamic network with time-varying edge and node properties, MTHL projects multi-view time-series data into a shared latent subspace, and then learns a compact hypersphere surrounding normal samples with soft constraints. The learned hypersphere allows for effectively distinguishing normal and abnormal cases. We further propose an efficient, two-stage alternating optimization algorithm as a solution to the MTHL. Extensive experiments are conducted on both synthetic and real datasets. Results demonstrate that our method outperforms the state-of-the-art baseline methods in detecting three types of events that involve (i) time-varying features alone, (ii) time-aggregated features alone, as well as (iii) both features. Moreover, our approach exhibits consistent and good performance in face of issues including noises, anomaly pollution in training phase and data imbalance.

【Keywords】: anomaly detection; dynamic networks; multi-view learning; time-series mining; urban computing

88. A Fast Trajectory Outlier Detection Approach via Driving Behavior Modeling.

【Paper Link】【Pages】:837-846

【Authors】: Hao Wu ; Weiwei Sun ; Baihua Zheng

【Abstract】: Trajectory outlier detection is a fundamental building block for many location-based service (LBS) applications, with a large application base. This paper is dedicated to detecting outliers from vehicle trajectories efficiently and effectively. In addition, we want our solution to be able to issue an alarm early when an outlier trajectory is only partially observed (i.e., the trajectory has not yet reached the destination). Most existing works study the problem on general Euclidean trajectories and require access to the historical trajectory database or computations on the distance metric that are very expensive. Furthermore, few existing works consider the specific characteristics of vehicle trajectories (e.g., their movements are constrained by the underlying road networks), and the majority of them require complete trajectories as input. Motivated by this, we propose a vehicle outlier detection approach, namely DB-TOD, which is based on a probabilistic model that captures the driving behavior/preferences from the set of historical trajectories. We design outlier detection algorithms for both complete and partial trajectories. Our probabilistic model-based approach makes trajectory outlier detection extremely efficient while preserving effectiveness, thanks to the relatively accurate model of driving behavior. We conduct comprehensive experiments using real datasets and the results justify both the effectiveness and efficiency of our approach.

【Keywords】: driving behavior; inverse reinforcement learning; outlier detection; trajectory data processing

Session 4D: Graph Mining 1 4

89. BL-ECD: Broad Learning based Enterprise Community Detection via Hierarchical Structure Fusion.

【Paper Link】【Pages】:859-868

【Authors】: Jiawei Zhang ; Limeng Cui ; Philip S. Yu ; Yuanhua Lv

【Abstract】: Employees in companies can be divided into different social communities, and those who frequently socialize with each other will be treated as close friends and are grouped in the same community. In the enterprise context, a large amount of information about the employees is available in both (1) offline company internal sources and (2) online enterprise social networks (ESNs). Each of the information sources also contains multiple categories of employees' socialization activities at the same time. In this paper, we propose to detect the social communities of the employees in companies based on the broad learning setting with both these online and offline information sources simultaneously, and the problem is formally called the "Broad Learning based Enterprise Community Detection" (BL-ECD) problem. To address the problem, a novel broad learning based community detection framework named "HeterogeneoUs Multi-sOurce ClusteRing" (HUMOR) is introduced in this paper. Based on the various enterprise social intimacy measures introduced in this paper, HUMOR detects a set of micro community structures of the employees based on each of the socialization activities respectively. To obtain the (globally) consistent community structure of employees in the company, HUMOR further fuses these micro community structures via two broad learning phases: (1) intra-fusion of micro community structures to obtain the online and offline (locally) consistent communities respectively, and (2) inter-fusion of the online and offline communities to achieve the (globally) consistent community structure of employees. Extensive experiments conducted on real-world enterprise datasets demonstrate that our method can perform very well in addressing the BL-ECD problem.

【Keywords】: aligned social networks; broad learning; community detection; enterprise social networks; heterogeneous information network; network embedding

90. Highly Efficient Mining of Overlapping Clusters in Signed Weighted Networks.

【Paper Link】【Pages】:869-878

【Authors】: Tuan-Anh Hoang ; Ee-Peng Lim

【Abstract】: In many practical contexts, networks are weighted as their links are assigned numerical weights representing relationship strengths or intensities of inter-node interaction. Moreover, the links' weight can be positive or negative, depending on the relationship or interaction between the connected nodes. The existing methods for network clustering however are not ideal for handling very large signed weighted networks. In this paper, we present a novel method called LPOCSIN (short for "Linear Programming based Overlapping Clustering on Signed Weighted Networks") for efficient mining of overlapping clusters in signed weighted networks. Different from existing methods that rely on computationally expensive cluster cohesiveness measures, LPOCSIN utilizes a simple yet effective one. Using this measure, we transform the cluster assignment problem into a series of alternating linear programs, and further propose a highly efficient procedure for solving those alternating problems. We evaluate LPOCSIN and other state-of-the-art methods by extensive experiments covering a wide range of synthetic and real networks. The experiments show that LPOCSIN significantly outperforms the other methods in recovering ground-truth clusters while being an order of magnitude faster than the most efficient state-of-the-art method.

【Keywords】: overlapping clustering; signed network; weighted network

91. To Be Connected, or Not to Be Connected: That is the Minimum Inefficiency Subgraph Problem.

【Paper Link】【Pages】:879-888

【Authors】: Natali Ruchansky ; Francesco Bonchi ; David García-Soriano ; Francesco Gullo ; Nicolas Kourtellis

【Abstract】: We study the problem of extracting a selective connector for a given set of query vertices Q subset of V in a graph G = (V,E). A selective connector is a subgraph of G which exhibits some cohesiveness property, and contains the query vertices but does not necessarily connect them all. Relaxing the connectedness requirement allows the connector to detect multiple communities and to be tolerant to outliers. We achieve this by introducing the new measure of network inefficiency and by instantiating our search for a selective connector as the problem of finding the minimum inefficiency subgraph. We show that the minimum inefficiency subgraph problem is NP-hard, and devise efficient algorithms to approximate it. By means of several case studies in a variety of application domains (such as human brain, cancer, and food networks), we show that our minimum inefficiency subgraph produces high-quality solutions, exhibiting all the desired behaviors of a selective connector.

【Keywords】: biomedical and healthcare applications; brain network; community; data mining; graph mining; seed set expansion; social network analysis

92. MGAE: Marginalized Graph Autoencoder for Graph Clustering.

【Paper Link】【Pages】:889-898

【Authors】: Chun Wang ; Shirui Pan ; Guodong Long ; Xingquan Zhu ; Jing Jiang

【Abstract】: Graph clustering aims to discover community structures in networks, the task being fundamentally challenging mainly because the topology structure and the content of the graphs are difficult to represent for clustering analysis. Recently, graph clustering has moved from traditional shallow methods to deep learning approaches, thanks to the unique feature representation learning capability of deep learning. However, existing deep approaches for graph clustering can only exploit the structure information, while ignoring the content information associated with the nodes in a graph. In this paper, we propose a novel marginalized graph autoencoder (MGAE) algorithm for graph clustering. The key innovation of MGAE is that it advances the autoencoder to the graph domain, so graph representation learning can be carried out not only in a purely unsupervised setting by leveraging structure and content information, it can also be stacked in a deep fashion to learn effective representation. From a technical viewpoint, we propose a marginalized graph convolutional network to corrupt network node content, allowing node content to interact with network features, and marginalizes the corrupted features in a graph autoencoder context to learn graph feature representations. The learned features are fed into the spectral clustering algorithm for graph clustering. Experimental results on benchmark datasets demonstrate the superior performance of MGAE, compared to numerous baselines.

【Keywords】: autoencoder; graph autoencoder; graph clustering; graph convolutional network; network representation

Session 4E: Online learning, Stream mining 3

93. BoostVHT: Boosting Distributed Streaming Decision Trees.

【Paper Link】【Pages】:899-908

【Authors】: Theodore Vasiloudis ; Foteini Beligianni ; Gianmarco De Francisci Morales

【Abstract】: Online boosting improves the accuracy of classifiers for unbounded streams of data by chaining them into an ensemble. Due to its sequential nature, boosting has proven hard to parallelize, even more so in the online setting. This paper introduces BoostVHT, a technique to parallelize online boosting algorithms. Our proposal leverages a recently developed model-parallel learning algorithm for streaming decision trees as a base learner. This design neatly separates the model boosting from its training. As a result, BoostVHT provides a flexible learning framework which can employ any existing online boosting algorithm, while at the same time it can leverage the computing power of modern parallel and distributed cluster environments. We implement our technique on Apache SAMOA, an open-source platform for mining big data streams that can be run on several distributed execution engines, and demonstrate order-of-magnitude speedups compared to the state-of-the-art.

【Keywords】: boosting; decision trees; distributed systems; online learning

94. Stream Aggregation Through Order Sampling.

【Paper Link】【Pages】:909-918

【Authors】: Nick G. Duffield ; Yunhong Xu ; Liangzhen Xia ; Nesreen K. Ahmed ; Minlan Yu

【Abstract】: This paper introduces a new single-pass reservoir weighted-sampling stream aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling is a powerful and efficient method for weighted sampling from a stream of uniquely keyed items, there is no current algorithm that realizes the benefits of order sampling in the context of stream aggregation over non-unique keys. A naive approach that order samples regardless of key and then aggregates the results is hopelessly inefficient. In contrast, our proposed algorithm uses a single persistent random variable across the lifetime of each key in the cache, and maintains unbiased estimates of the key aggregates that can be queried at any point in the stream. The basic approach can be supplemented with a Sample and Hold pre-sampling stage with a sampling rate adaptation controlled by PBA. This approach represents a considerable reduction in computational complexity compared with the state of the art in adapting Sample and Hold to operate with a fixed cache size. Concerning statistical properties, we prove that PBA provides unbiased estimates of the true aggregates. We analyze the computational complexity of PBA and its variants, and provide a detailed evaluation of its accuracy on synthetic and trace data. Weighted relative error is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive Sample and Hold; there is also substantial improvement for rank queries.

【Keywords】: Aggregation; Heavy Hitters; Priority Sampling; Subset Sums
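
The order sampling that this abstract builds on can be illustrated for the simpler case of uniquely keyed items. The sketch below is classic priority sampling, not the paper's PBA algorithm: each item gets priority weight/u with u uniform on (0, 1], the k highest priorities are kept, and each kept item is estimated as max(weight, tau), where tau is the (k+1)-th highest priority. The function name `priority_sample` and its parameters are hypothetical, and positive weights are assumed.

```python
import heapq
import random


def priority_sample(stream, k, rng=None):
    """Priority sampling over (key, weight) pairs with unique keys.

    Returns {key: estimated_weight} for at most k sampled keys.
    Illustrative sketch only; assumes every weight is positive.
    """
    rng = rng or random.Random()
    heap = []  # min-heap holding the k+1 largest (priority, key, weight)
    for key, weight in stream:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        item = (weight / u, key, weight)
        if len(heap) <= k:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)  # drop the smallest priority
    if len(heap) <= k:
        # Fewer than k+1 items seen: everything is kept with its exact weight.
        return {key: weight for _, key, weight in heap}
    tau = heap[0][0]  # (k+1)-th largest priority acts as the threshold
    return {key: max(weight, tau) for _, key, weight in heap[1:]}
```

Summing the returned estimates over any subset of keys gives an estimate of that subset's total weight; handling repeated keys is exactly the aggregation problem the abstract says this simple scheme does not address.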

95. FUSION: An Online Method for Multistream Classification.

【Paper Link】【Pages】:919-928

【Authors】: Ahsanul Haque ; Zhuoyi Wang ; Swarup Chandra ; Bo Dong ; Latifur Khan ; Kevin W. Hamlen

【Abstract】: Traditional data stream classification assumes that data is generated from a single non-stationary process. On the contrary, multistream classification problem involves two independent non-stationary data generating processes. One of them is the source stream that continuously generates labeled data. The other one is the target stream that generates unlabeled test data from the same domain. The distribution represented by the source stream data is biased compared to that of the target stream. Moreover, these streams may have asynchronous concept drifts between them. The multistream classification problem is to predict the class labels of target stream instances by utilizing labeled data from the source stream. This kind of scenario is often observed in real-world applications due to scarcity of labeled data. The only existing approach for multistream classification uses separate drift detection on the streams for addressing the asynchronous concept drift problem. If a concept drift is detected in any of the streams, it uses an expensive batch technique for data shift adaptation. These add significant execution overhead, and limit its usability. In this paper, we propose an efficient solution for multistream classification by fusing drift detection into online data shift adaptation. We study the theoretical convergence rate and computational complexity of the proposed approach. Moreover, empirical results on benchmark data sets indicate significantly improved performance over the baseline methods.

【Keywords】: asynchronous concept drift; data shift adaptation; direct density ratio estimation; multistream classification

Session 5A: Tensor analysis 4

96. Maintaining Densest Subsets Efficiently in Evolving Hypergraphs.

【Paper Link】【Pages】:929-938

【Authors】: Shuguang Hu ; Xiaowei Wu ; T.-H. Hubert Chan

【Abstract】: In this paper we study the densest subgraph problem, which plays a key role in many graph mining applications. The goal of the problem is to find a subset of nodes that induces a graph with maximum average degree. The problem has been extensively studied in the past few decades under a variety of different settings. Several exact and approximation algorithms were proposed. However, as a normal graph can only model objects with pairwise relationships, the densest subgraph problem fails in identifying communities under relationships that involve more than 2 objects, e.g., in a network connecting authors by publications. We consider in this work the densest subgraph problem in hypergraphs, which generalizes the problem to a wider class of networks in which edges might have different cardinalities and contain more than 2 nodes. We present two exact algorithms and a near-linear time r-approximation algorithm for the problem, where r is the maximum cardinality of an edge in the hypergraph. We also consider the dynamic version of the problem, in which an adversary can insert or delete an edge from the hypergraph in each round and the goal is to maintain efficiently an approximation of the densest subgraph. We present two dynamic approximation algorithms in this paper with amortized polog update time, for any ε > 0. For the case when there are only insertions, the approximation ratio we maintain is r(1+ε), while for the fully dynamic case, the ratio is r^2(1+ε). Extensive experiments are performed on large real datasets to validate the effectiveness and efficiency of our algorithms.

【Keywords】: densest subgraph; dynamic data structure; graph mining
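
As an illustration of the static problem described above, here is a minimal sketch of greedy peeling on a hypergraph: repeatedly remove a minimum-degree node and remember the densest node set seen, where density is the number of fully contained hyperedges divided by the number of nodes. Greedy peeling is a standard heuristic that is usually analyzed as an r-approximation; that it matches the paper's own r-approximation algorithm is an assumption, and the function name `densest_subhypergraph_peel` is hypothetical.

```python
def densest_subhypergraph_peel(nodes, edges):
    """Greedy peeling for the densest subset of a hypergraph.

    nodes: iterable of hashable node ids; edges: iterable of node collections.
    Returns (best_node_set, best_density). Illustrative sketch only.
    """
    alive = set(nodes)
    live_edges = {i: frozenset(e) for i, e in enumerate(edges)}
    incident = {v: set() for v in alive}  # node -> ids of live edges containing it
    for i, e in live_edges.items():
        for v in e:
            incident[v].add(i)

    best_density, best_set = 0.0, set(alive)
    while alive:
        density = len(live_edges) / len(alive)
        if density > best_density:
            best_density, best_set = density, set(alive)
        # Pick a minimum-degree node (ties broken by string form for determinism).
        v = min(alive, key=lambda u: (len(incident[u]), str(u)))
        # Removing v destroys every hyperedge that contains it.
        for i in list(incident[v]):
            for u in live_edges[i]:
                if u != v:
                    incident[u].discard(i)
            del live_edges[i]
        del incident[v]
        alive.remove(v)
    return best_set, best_density
```

This version rescans all remaining nodes to find the minimum, so it runs in quadratic time; a bucket queue keyed by degree would be needed to approach the near-linear running time the abstract mentions.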

97. Coupled Sparse Matrix Factorization for Response Time Prediction in Logistics Services.

【Paper Link】【Pages】:939-947

【Authors】: Yuqi Wang ; Jiannong Cao ; Lifang He ; Wengen Li ; Lichao Sun ; Philip S. Yu

【Abstract】: Nowadays, there is an emerging way of connecting logistics orders and van drivers, where it is crucial to predict the order response time. Accurate prediction of order response time would not only facilitate decision making on order dispatching, but also pave ways for applications such as supply-demand analysis and driver scheduling, leading to high system efficiency. In this work, we forecast order response time on current day by fusing data from order history and driver historical locations. Specifically, we propose Coupled Sparse Matrix Factorization (CSMF) to deal with the heterogeneous fusion and data sparsity challenges raised in this problem. CSMF jointly learns from multiple heterogeneous sparse data through the proposed weight setting mechanism therein. Experiments on real-world datasets demonstrate the effectiveness of our approach, compared to various baseline methods. The performances of many variants of the proposed method are also presented to show the effectiveness of each component.

【Keywords】: coupled matrix factorization; logistics services; response time prediction; sparse matrix factorization

98. Tensor Rank Estimation and Completion via CP-based Nuclear Norm.

【Paper Link】【Pages】:949-958

【Authors】: Qiquan Shi ; Haiping Lu ; Yiu-ming Cheung

【Abstract】: Tensor completion (TC) is a challenging problem of recovering missing entries of a tensor from its partial observation. One main TC approach is based on CP/Tucker decomposition. However, this approach often requires the determination of a tensor rank a priori. This rank estimation problem is difficult in practice. Several Bayesian solutions have been proposed but they often under/over-estimate the tensor rank while being quite slow. To address this problem of rank estimation with missing entries, we view the weight vector of the orthogonal CP decomposition of a tensor to be analogous to the vector of singular values of a matrix. Subsequently, we define a new CP-based tensor nuclear norm as the $L_1$-norm of this weight vector. We then propose Tensor Rank Estimation based on $L_1$-regularized orthogonal CP decomposition (TREL1) for both CP-rank and Tucker-rank. Specifically, we incorporate a regularization with CP-based tensor nuclear norm when minimizing the reconstruction error in TC to automatically determine the rank of an incomplete tensor. Experimental results on both synthetic and real data show that: 1) Given sufficient observed entries, TREL1 can estimate the true rank (both CP-rank and Tucker-rank) of incomplete tensors well; 2) The rank estimated by TREL1 can consistently improve recovery accuracy of decomposition-based TC methods; 3) TREL1 is not sensitive to its parameters in general and more efficient than existing rank estimation methods.

【Keywords】: cp decomposition; cp-based tensor nuclear norm; tensor completion; tensor rank estimation
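
The analogy drawn in this abstract can be written out as follows. The notation is an assumption rather than the paper's own: $\mathcal{X}$ is an $N$-way tensor, $\lambda_r$ are the CP weights, and $\mathbf{a}_r^{(n)}$ are unit-norm factor vectors (the exact orthogonality constraints are not stated in the abstract).

```latex
% Matrix case: the nuclear norm is the L_1-norm of the singular values.
\|\mathbf{M}\|_{*} = \sum_{i} \sigma_i(\mathbf{M})

% Tensor case (assumed notation): CP decomposition with weight vector \lambda.
\mathcal{X} = \sum_{r=1}^{R} \lambda_r \,
  \mathbf{a}_r^{(1)} \circ \mathbf{a}_r^{(2)} \circ \cdots \circ \mathbf{a}_r^{(N)},
\qquad
\|\mathcal{X}\|_{\mathrm{CP}*} = \|\boldsymbol{\lambda}\|_{1} = \sum_{r=1}^{R} |\lambda_r|
```

Under this reading, an $L_1$ penalty on $\boldsymbol{\lambda}$ pushes some weights to zero, and the number of weights that remain nonzero serves as the rank estimate.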

99. Smart Infrastructure Maintenance Using Incremental Tensor Analysis: Extended Abstract.

【Paper Link】【Pages】:959-967

【Authors】: Nguyen Lu Dang Khoa ; Ali Anaissi ; Yang Wang

【Abstract】: Civil infrastructures are key to the flow of people and goods in urban environments. Structural Health Monitoring (SHM) is a condition-based maintenance technology, which provides and predicts actionable information on the current and future states of infrastructures. SHM data are usually multi-way data which are produced by multiple highly correlated sensors. Tensor decomposition allows learning from such data in temporal, spatial and feature modes at the same time. However, to facilitate a real-time response for online learning, incremental tensor updates need to be used when new data come in, rather than doing the decomposition in a batch manner. This work proposed a method called onlineCP-ALS to incrementally update tensor component matrices, followed by a self-tuning one-class support vector machine for online damage identification. Moreover, a robust clustering technique was applied on the tensor space for online substructure grouping and anomaly detection. These methods were applied to data from lab-based structures and also data collected from the Sydney Harbour Bridge in Australia. We obtained high damage detection accuracies for all these datasets. Damage locations were also captured correctly, and different levels of damage severity were well estimated. Furthermore, the clustering technique was able to detect spatial anomalies, which were associated with sensor and instrumentation issues. Our proposed method was efficient and much faster than the batch approach.

【Keywords】: anomaly detection; clustering; incremental tensor analysis; smart infrastructures; structural health monitoring

Session 5B: Application driven mining 4

100. Collaborative Filtering as a Case-Study for Model Parallelism on Bulk Synchronous Systems.

【Paper Link】【Pages】:969-977

【Authors】: Ariyam Das ; Ishan Upadhyaya ; Xiangrui Meng ; Ameet Talwalkar

【Abstract】: Industrial-scale machine learning applications often train and maintain massive models that can be on the order of hundreds of millions to billions of parameters. Model parallelism thus plays a significant role to support these machine learning tasks. Recent work in this area has been dominated by parameter server architectures that follow an asynchronous computation model, introducing added complexity and approximation in order to scale to massive workloads. In this work, we explore model parallelism in the distributed bulk-synchronous parallel (BSP) setting, leveraging some recent progress made in the area of high performance computing, in order to address these complexity and approximation issues. Using collaborative filtering as a case-study, we introduce an efficient model parallel industrial scale algorithm for alternating least squares (ALS), along with a highly optimized implementation of ALS that serves as the default implementation in MLlib, Apache Spark's machine learning library. Our extensive empirical evaluation demonstrates that our implementation in MLlib compares favorably to the leading open-source parameter server framework, and our implementation scales to massive problems on the order of 50 billion ratings and close to 1 billion parameters.

【Keywords】: alternating least squares (als); apache spark; bulk synchronous parallel (bsp) systems; collaborative filtering; model parallelism; parameter servers
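
As a rough single-machine illustration of the alternating least squares (ALS) updates this abstract refers to, the sketch below factorizes a small ratings matrix with NumPy. It is not the MLlib implementation and does no model-parallel or BSP partitioning; the function names `als` and `observed_rmse`, the regularization weight `reg`, and the boolean `mask` convention are assumptions made for illustration.

```python
import numpy as np

def als(ratings, mask, rank=2, reg=0.1, iters=20, seed=0):
    """Factorize ratings ~= U @ V.T using only entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)
    for _ in range(iters):
        # Hold V fixed and solve one small ridge regression per user.
        for u in range(n_users):
            idx = mask[u]
            A = V[idx].T @ V[idx] + eye
            b = V[idx].T @ ratings[u, idx]
            U[u] = np.linalg.solve(A, b)
        # Hold U fixed and solve one small ridge regression per item.
        for i in range(n_items):
            idx = mask[:, i]
            A = U[idx].T @ U[idx] + eye
            b = U[idx].T @ ratings[idx, i]
            V[i] = np.linalg.solve(A, b)
    return U, V

def observed_rmse(ratings, mask, U, V):
    """Root-mean-square error over the observed entries only."""
    err = (ratings - U @ V.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))
```

Each user (or item) update depends only on the other factor matrix and that user's (or item's) own ratings, which is the property that makes the per-row solves independent of one another within a half-iteration.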

101. Modeling Student Learning Styles in MOOCs.

【Paper Link】【Pages】:979-988

【Authors】: Yuling Shi ; Zhiyong Peng ; Hongning Wang

【Abstract】: The recorded student activities in Massive Open Online Course (MOOC) provide us a unique opportunity to model their learning behaviors, identify their particular learning intents, and enable personalized assistance and guidance in online education. In this work, based on a thorough qualitative study of students' behaviors recorded in two MOOC courses with large student enrollments, we develop a non-parametric Bayesian model to capture students' sequential learning activities in a generative manner. Homogeneity of students' learning behaviors is captured by clustering them into latent student groups, where shared model structure characterizes the transitional patterns, intensity and temporal distribution of their learning activities. In the meanwhile, heterogeneity is captured by clustering students into different groups. Both qualitative and quantitative studies on those two MOOC courses confirmed the effectiveness of the proposed model in identifying students' learning behavior patterns and clustering them into related groups for predictive analysis. The identified student groups accurately predict student retention, course satisfaction and demographics.

【Keywords】: behavior modeling; moocs; probabilistic modeling; sequential data mining

102. Tracking Knowledge Proficiency of Students with Educational Priors.

【Paper Link】【Pages】:989-998

【Authors】: Yuying Chen ; Qi Liu ; Zhenya Huang ; Le Wu ; Enhong Chen ; Run-ze Wu ; Yu Su ; Guoping Hu

【Abstract】: Diagnosing students' knowledge proficiency, i.e., the mastery degrees of a particular knowledge point in exercises, is a crucial issue for numerous educational applications, e.g., targeted knowledge training and exercise recommendation. Educational theories agree that students learn and forget knowledge from time to time. Thus, it is necessary to track their mastery of knowledge over time. However, traditional methods in this area either ignored the explanatory power of the diagnosis results on knowledge points or relied on a static assumption. To this end, in this paper, we devise an explanatory probabilistic approach to track the knowledge proficiency of students over time by leveraging educational priors. Specifically, we first associate each exercise with a knowledge vector in which each element represents an explicit knowledge point by leveraging educational priors (i.e., Q-matrix). Correspondingly, each student is represented as a knowledge vector at each time in the same knowledge space. Second, given the student knowledge vector over time, we borrow two classical educational theories (i.e., Learning curve and Forgetting curve) as priors to capture the change of each student's proficiency over time. After that, we design a probabilistic matrix factorization framework by combining student and exercise priors for tracking student knowledge proficiency. Extensive experiments on three real-world datasets demonstrate both the effectiveness and explanatory power of our proposed model.

【Keywords】: dynamic modeling; educational priors; explanatory power; knowledge diagnosis

103. Spreadsheet Property Detection With Rule-assisted Active Learning.

【Paper Link】【Pages】:999-1008

【Authors】: Zhe Chen ; Sasha Dadiomov ; Richard Wesley ; Gang Xiao ; Daniel Cory ; Michael J. Cafarella ; Jock Mackinlay

【Abstract】: Spreadsheets are a critical and widely-used data management tool. Converting spreadsheet data into relational tables would bring benefits to a number of fields, including public policy, public health, and economics. Research to date has focused on designing domain-specific languages to describe transformation processes or automatically converting a specific type of spreadsheets. To handle a larger variety of spreadsheets, we have to identify various spreadsheet properties, which correspond to a series of transformation programs that contribute towards a general framework that converts spreadsheets to relational tables. In this paper, we focus on the problem of spreadsheet property detection. We propose a hybrid approach of building a variety of spreadsheet property detectors to reduce the amount of required human labeling effort. Our approach integrates an active learning framework with crude, easy-to-write, user-provided rules to save human labeling effort by generating additional high-quality labeled data especially in the initial training stage. Using a bagging-like technique, our approach can also tolerate lower-quality user-provided rules. Our experiments show that when compared to a standard active learning approach, we reduced the training data needed to reach the performance plateau by 34-44% when a human provides relatively high-quality rules, and by a comparable amount with low-quality rules. A study on a large-scale web-crawled spreadsheet dataset demonstrates that it is crucial to detect a variety of spreadsheet properties in order to transform a large portion of the spreadsheets into a relational form.

【Keywords】: active learning; data cleaning; spreadsheets

Session 5C: Deep Learning 1 4

104. Learning Knowledge Embeddings by Combining Limit-based Scoring Loss.

【Paper Link】【Pages】:1009-1018

【Authors】: Xiaofei Zhou ; Qiannan Zhu ; Ping Liu ; Li Guo

【Abstract】: In knowledge graph embedding models, the margin-based ranking loss as the common loss function is usually used to encourage discrimination between golden triplets and incorrect triplets, which has proved effective in many translation-based models for knowledge graph embedding. However, we find that the loss function cannot ensure the fact that the scoring of correct triplets must be low enough to fulfill the translation. In this paper, we present a limit-based scoring loss to provide lower scoring of a golden triplet, and then to extend two basic translation models TransE and TransH, separately to TransE-RS and TransH-RS by combining limit-based scoring loss with margin-based ranking loss. Both the presented models have low complexities of parameters benefiting for application on large scale graphs. In experiments, we evaluate our models on two typical tasks including triplet classification and link prediction, and also analyze the scoring distributions of positive and negative triplets by different models. Experimental results show that the introduced limit-based scoring loss is effective to improve the capacities of knowledge graph embedding.

【Keywords】: data mining; image recognition; machine learning; nlp
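
To make the distinction in this abstract concrete, the sketch below combines a margin-based ranking term with a limit-based term on a TransE-style score. It is one plausible reading of the abstract, not the paper's exact objective: the L1 norm, the names `gamma1`, `gamma2` and `lam`, and the way the two terms are summed are all assumptions.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE dissimilarity: small when h + r is close to t (L1 norm assumed)."""
    return np.linalg.norm(h + r - t, ord=1, axis=-1)

def combined_loss(pos, neg, gamma1=1.0, gamma2=0.5, lam=1.0):
    """pos and neg are (h, r, t) tuples of arrays with shape (batch, dim)."""
    f_pos = transe_score(*pos)
    f_neg = transe_score(*neg)
    # Ranking term: only asks that golden triplets score lower than
    # corrupted ones by at least gamma1.
    ranking = np.maximum(0.0, f_pos + gamma1 - f_neg)
    # Limit term: additionally asks that golden triplets score below gamma2.
    limit = np.maximum(0.0, f_pos - gamma2)
    return float(np.sum(ranking + lam * limit))
```

With a golden triplet scoring 1.0 and a corrupted one scoring 2.0, the ranking term is already zero at `gamma1=1.0`, yet the limit term still contributes 0.5 because the golden score exceeds `gamma2`. That is the gap the abstract describes: a satisfied margin does not by itself force correct triplets to score low.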

105. Length Adaptive Recurrent Model for Text Classification.

【Paper Link】【Pages】:1019-1027

【Authors】: Zhengjie Huang ; Zi Ye ; Shuangyin Li ; Rong Pan

【Abstract】: In recent years, recurrent neural networks have been widely used for various text classification tasks. However, most of the recurrent architectures will not assign a class label to a text until they read the last word, while human beings are able to determine the text class before reading the whole text. In this paper, we propose a Length Adaptive Recurrent Model (LARM) which can automatically determine the minimum text length that is necessary to perform the classification. With three parts including Reader, Predictor and Agent, our model is designed to read a text word by word, and terminate the process when the adequate information has been caught for the text classification task. The experimental results show that our model has comparable or even better performance compared to the vanilla LSTM when both are fed with partial text input. Besides, we can speed up text classification by truncating the text when sufficient evidence is found for classification. Furthermore, we also visualize our model and show that our model works like human beings, who can gradually come up with the general idea of a text while reading texts sequentially.

【Keywords】: recurrent neural network; text classification

106. Multi-Task Neural Network for Non-discrete Attribute Prediction in Knowledge Graphs.

【Paper Link】【Pages】:1029-1038

【Authors】: Yi Tay ; Luu Anh Tuan ; Minh C. Phan ; Siu Cheung Hui

【Abstract】: Many popular knowledge graphs such as Freebase, YAGO or DBPedia maintain a list of non-discrete attributes for each entity. Intuitively, these attributes such as height, price or population count are able to richly characterize entities in knowledge graphs. This additional source of information may help to alleviate the inherent sparsity and incompleteness problem that are prevalent in knowledge graphs. Unfortunately, many state-of-the-art relational learning models ignore this information due to the challenging nature of dealing with non-discrete data types in the inherently binary-natured knowledge graphs. In this paper, we propose a novel multi-task neural network approach for both encoding and prediction of non-discrete attribute information in a relational setting. Specifically, we train a neural network for triplet prediction along with a separate network for attribute value regression. Via multi-task learning, we are able to learn representations of entities, relations and attributes that encode information about both tasks. Moreover, such attributes are not only central to many predictive tasks as an information source but also as a prediction target. Therefore, models that are able to encode, incorporate and predict such information in a relational learning context are highly attractive as well. We show that our approach outperforms many state-of-the-art methods for the tasks of relational triplet classification and attribute value prediction.

【Keywords】: artificial intelligence; entities; knowledge graphs; machine learning; neural networks; relational learning

107. Movie Fill in the Blank with Adaptive Temporal Attention and Description Update.

【Paper Link】【Pages】:1039-1048

【Authors】: Jie Chen ; Jie Shao ; Fumin Shen ; Chengkun He ; Lianli Gao ; Heng Tao Shen

【Abstract】: Recently, a new type of video understanding task called Movie-Fill-in-the-Blank (MovieFIB) has attracted much research attention. Given a pair of movie clip and description with one blank word as input, MovieFIB aims to automatically predict the blank word. Because of the advantage in processing sequence data, Long-Short Term Memory (LSTM) has been used as a key component in existing MovieFIB methods to generate representations of videos and descriptions. However, most of these methods fail to emphasize the salient parts of videos. To address this problem, in this paper we propose to use a novel LSTM network called LSTM with Linguistic gate (LSTMwL), which exploits adaptive temporal attention for MovieFIB. Specifically, we first use LSTM to produce video features, which are then used to update the text representation. Finally, we put the updated text into two opposite directional LSTMwL layers to infer the blank word. Experimental results demonstrate that our approach outperforms state-of-the-art models for MovieFIB.

【Keywords】: adaptive temporal attention; description update; question answering

Session 6A: Crowdsourcing 2 4

【Paper Link】【Pages】:1049-1057

【Authors】: Rupinder Paul Khandpur ; Taoran Ji ; Steve T. K. Jan ; Gang Wang ; Chang-Tien Lu ; Naren Ramakrishnan

【Abstract】: Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDoS) attacks, data breaches, and account hijacking) in a weakly supervised manner using just a small set of seed event triggers and requires no training or labeled samples. A new query expansion strategy based on convolution kernels and dependency parses helps model semantic structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods.

【Keywords】: cyber attacks; cyber security; dynamic query expansion; event detection; social media; twitter

109. Budgeted Task Scheduling for Crowdsourced Knowledge Acquisition.

【Paper Link】【Pages】:1059-1068

【Authors】: Tao Han ; Hailong Sun ; Yangqiu Song ; Zizhe Wang ; Xudong Liu

【Abstract】: Knowledge acquisition (e.g. through labeling) is one of the most successful applications in crowdsourcing. In practice, collecting as specific as possible knowledge via crowdsourcing is very useful since specific knowledge can be generalized easily if we have a knowledge base, but it is difficult to infer specific knowledge from general knowledge. Meanwhile, tasks for acquiring more specific knowledge can be more difficult for workers, thus need more answers to infer high-quality results. Given a limited budget, assigning workers to difficult tasks will be more effective for the goal of specific knowledge acquisition. However, existing crowdsourcing task scheduling cannot incorporate the specificity of workers' answers. In this paper, we present a new framework for task scheduling with the limited budget, targeting an effective solution to more specific knowledge acquisition. We propose novel criteria for evaluating the quality of specificity-dependent answers and result inference algorithms to aggregate more specific answers with budget constraints. We have implemented our framework with real crowdsourcing data and platform, and have achieved significant performance improvement compared with existing approaches.

【Keywords】: crowdsourcing; knowledge acquisition; task scheduling

110. Hyper Questions: Unsupervised Targeting of a Few Experts in Crowdsourcing.

【Paper Link】【Pages】:1069-1078

【Authors】: Jiyi Li ; Yukino Baba ; Hisashi Kashima

【Abstract】: Quality control is one of the major problems in crowdsourcing. One of the primary approaches to rectify this issue is to assign the same task to different workers and then aggregate their answers to obtain a reliable answer. In addition to simple aggregation approaches such as majority voting, various sophisticated probabilistic models have been proposed. However, given that most of the existing methods operate by strengthening the opinions of the majority, these models often fail when the tasks require highly specialized knowledge and the ability of a large majority of the workers is inadequate. In this paper, we focus on an important class of answer aggregation problems in which majority voting fails and propose the concept of hyper questions to devise effective aggregation methods. A hyper question is a set of single questions, and our key idea is that experts are more likely to provide correct answers to all of the single questions included in a hyper question than non-experts. Thus, experts are more likely to reach consensus on the hyper questions than non-experts, which strengthens their influence. We incorporate the concept of hyper questions into existing answer aggregation methods. The results of our experiments conducted using both synthetic datasets and real datasets demonstrate that our simple and easily usable approach works effectively in cases where only a few experts are available.

【Keywords】: answer aggregation; crowdsourcing; heterogeneous-answer multiple-choice questions; hyper question
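
The hyper-question idea can be illustrated with plain majority voting as the underlying aggregator. The sketch below is an interpretation of the abstract rather than the authors' procedure: the function name `hyper_question_vote`, the use of all size-`k` subsets as hyper questions, and the decoding step (each winning tuple casts one vote per member question) are assumptions. It expects `k` to be no larger than the number of questions.

```python
from collections import Counter
from itertools import combinations

def hyper_question_vote(answers, k=2):
    """answers: dict mapping worker -> list of answers, one per single question.
    Returns one aggregated answer per single question."""
    n_questions = len(next(iter(answers.values())))
    votes = [Counter() for _ in range(n_questions)]
    for group in combinations(range(n_questions), k):
        # A worker's answer to a hyper question is the tuple of their
        # answers to the single questions it contains.
        tally = Counter(tuple(a[q] for q in group) for a in answers.values())
        winner, _ = tally.most_common(1)[0]
        # Decode: the winning tuple votes for each member question.
        for q, ans in zip(group, winner):
            votes[q][ans] += 1
    return [v.most_common(1)[0][0] for v in votes]

# Toy data: two experts agree on A, B, C; three non-experts share one
# wrong answer on the first question but disagree everywhere else.
workers = {
    "expert1": ["A", "B", "C"],
    "expert2": ["A", "B", "C"],
    "novice1": ["B", "A", "A"],
    "novice2": ["B", "C", "B"],
    "novice3": ["B", "D", "D"],
}
print(hyper_question_vote(workers, k=1))  # plain majority voting
print(hyper_question_vote(workers, k=2))  # pairs of questions
```

With `k=1` the function reduces to per-question majority voting, which follows the three non-experts on the first question. With `k=2` the experts' tuples coincide while the non-experts' tuples do not, so the experts' answers win every hyper question.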

【Paper Link】【Pages】:1079-1088

【Authors】: Yusan Lin ; Peifeng Yin ; Wang-Chien Lee

【Abstract】: Offering products in the forms of menu bundles is a common practice in marketing to attract customers and maximize revenues. In crowdfunding platforms such as Kickstarter, rewards also play an important part in influencing project success. Designing rewards consisting of the appropriate items is a challenging yet crucial task for the project creators. However, prior research has not considered the strategies project creators take to offer and bundle the rewards, making it hard to study the impact of reward designs on project success. In this paper, we raise a novel research question: understanding project creators' decisions of reward designs to level their chance to succeed. We approach this by modeling the design behavior of project creators, and identifying the behaviors that lead to project success. We propose a probabilistic generative model, Menu-Offering-Bundle (MOB) model, to capture the offering and bundling decisions of project creators based on collected data of 14K crowdfunding projects and their 149K reward bundles across a half-year period. Our proposed model is shown to capture the offering and bundling topics and to outperform the baselines in predicting reward designs. We also find that the learned offering and bundling topics carry distinguishable meanings and provide insights into key factors of project success.

【Keywords】: bundling decision; crowdfunding; menu bundle; offering decision

Session 6B: User behavior and targeting 4

112. Forecasting Ad-Impressions on Online Retail Websites using Non-homogeneous Hawkes Processes.

【Paper Link】【Pages】:1089-1098

【Authors】: Krunal Parmar ; Samuel Bushi ; Sourangshu Bhattacharya ; Surender Kumar

【Abstract】: Promotional listing of products or advertisements is a major source of revenue for online retail companies. These advertisements are often sold in the guaranteed delivery market, serving of which critically depends on the ability to predict supply or potential impressions from a target segment of users. In this paper, we study the problem of predicting user visits or potential ad-impressions to online retail websites, based on historical time-stamps. We explore the time-series and temporal point process models. We find that a successful model must encompass three properties of the data: (1) temporally non-homogeneous rates, (2) self excitation and (3) handling special events. We propose a novel non-homogeneous Hawkes process based model for the same, and a new algorithm for fitting this model without overfitting the self-excitation part. We validate the proposed model and algorithm using multiple large-scale ad-serving datasets from a top online retail company in India.

【Keywords】: hawkes processes; non-stationary time-series forecasting; online advertising; supply forecasting
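
For readers unfamiliar with the first two properties listed in this abstract, the sketch below evaluates the conditional intensity of a Hawkes process whose base rate varies with time. The exponential kernel, the parameters `alpha` and `beta`, and the sinusoidal `daily_rate` are textbook choices assumed here for illustration; the paper's own model, its handling of special events, and its fitting algorithm are not reproduced.

```python
import math

def intensity(t, events, base_rate, alpha=0.5, beta=1.0):
    """Conditional intensity at time t.

    base_rate: callable giving the time-varying (non-homogeneous) rate.
    events: earlier event times; each one raises the intensity by alpha,
    and that boost decays exponentially at rate beta (self excitation).
    """
    excitation = sum(alpha * math.exp(-beta * (t - s)) for s in events if s < t)
    return base_rate(t) + excitation

def daily_rate(t):
    """Hypothetical base rate with a 24-hour cycle (t measured in hours)."""
    return 2.0 + math.sin(2.0 * math.pi * t / 24.0)

# Intensity shortly after a burst of visits versus long after it.
burst = [10.0, 10.2, 10.4]
print(intensity(10.5, burst, daily_rate))
print(intensity(20.0, burst, daily_rate))
```

The first term captures predictable variation in traffic over the day, while the second makes recent visits temporarily raise the expected rate of further visits.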

113. Volume Ranking and Sequential Selection in Programmatic Display Advertising.

【Paper Link】【Pages】:1099-1107

【Authors】: Yuxuan Song ; Kan Ren ; Han Cai ; Weinan Zhang ; Yong Yu

【Abstract】: Programmatic display advertising, which enables advertisers to make real-time decisions on individual ad display opportunities so as to achieve a precise audience marketing, has become a key technique for online advertising. However, the constrained budget setting still restricts unlimited ad impressions. As a result, a smart strategy for ad impression selection is necessary for the advertisers to maximize positive user responses such as clicks or conversions, under the constraints of both ad volume and campaign budget. In this paper, we borrow in the idea of top-N ranking and filtering techniques from information retrieval and propose an effective ad impression volume ranking method for each ad campaign, followed by a sequential selection strategy considering the remaining ad volume and budget, to smoothly deliver the volume filtering while maximizing campaign efficiency. The extensive experiments on two benchmarking datasets and a commercial ad platform demonstrate large performance superiority of our proposed solution over traditional methods, especially under tight budgets.

【Keywords】: article recommendation; noise contrastive estimation; text representation; transfer learning; word2vec

114. On Migratory Behavior in Video Consumption.

【Paper Link】【Pages】:1109-1118

【Authors】: Huan Yan ; Tzu-Heng Lin ; Gang Wang ; Yong Li ; Haitao Zheng ; Depeng Jin ; Ben Y. Zhao

【Abstract】: Today's video streaming market is crowded with various content providers (CPs). For individual CPs, understanding user behavior, in particular how users migrate among different CPs, is crucial for improving users' on-site experience and the CP's chance of success. In this paper, we take a data-driven approach to analyze and model user migration behavior in video streaming, i.e., users switching content provider during active sessions. Based on a large ISP dataset over two months (6 major content providers, 3.8 million users, and 315 million video requests), we study common migration patterns and reasons of migration. We find that migratory behavior is prevalent: 66% of users switch CPs with an average switching frequency of 13%. In addition, migration behaviors are highly diverse: regardless of whether they are large or small, CPs all have dedicated groups of users who like to switch to them for certain types of videos. Regarding reasons of migration, we find CP service quality rarely causes migration, while a few popular videos play a bigger role. Nearly 60% of cross-site migrations land on the top 0.14% of videos. Finally, we validate our findings by building an accurate regression model to predict user migration frequency, and discuss the implications of our results to CPs.

【Keywords】: migratory behavior; video consumption

115. FM-Hawkes: A Hawkes Process Based Approach for Modeling Online Activity Correlations.

【Paper Link】【Pages】:1119-1128

【Authors】: Sha Li ; Xiaofeng Gao ; Weiming Bao ; Guihai Chen

【Abstract】: Understanding and predicting user behavior on online platforms has proved to be of significant value, with applications spanning from targeted advertising, political campaigning, anomaly detection to user self-monitoring. With the growing functionality and flexibility of online platforms, users can now accomplish a variety of tasks online. This advancement has rendered many previous works that focus on modeling a single type of activity obsolete. In this work, we target this new problem by modeling the interplay between the time series of different types of activities and apply our model to predict future user behavior. Our model, FM-Hawkes, stands for Fourier-based kernel multi-dimensional Hawkes process. Specifically, we model the multiple activity time series as a multi-dimensional Hawkes process. The correlations between different types of activities are then captured by the influence factor. As for the temporal triggering kernel, we observe that the intensity function consists of numerous kernel functions with time shift. Thus, we employ a Fourier transformation based non-parametric estimation. Our model is not bound to any particular platform and explicitly interprets the causal relationship between actions. By applying our model to real-life datasets, we confirm that the mutual excitation effect between different activities prevails among users. Prediction results show our superiority over models that do not consider action types and flexible kernels.

【Keywords】: point processes; time series analysis; user activity modeling

Session 6C: Deep Learning 2 4

116. Deep Learning Based Forecasting of Critical Infrastructure Data.

【Paper Link】【Pages】:1129-1138

【Authors】: Zahra Zohrevand ; Uwe Glässer ; Mohammad A. Tayebi ; Hamed Yaghoubi Shahir ; Mehdi Shirmaleki ; Amir Yaghoubi Shahir

【Abstract】: Intelligent monitoring and control of critical infrastructure such as electric power grids, public water utilities and transportation systems produces massive volumes of time series data from heterogeneous sensor networks. Time Series Forecasting (TSF) is essential for system safety and security, and also for improving the efficiency and quality of service delivery. Being highly dependent on various external factors, the observed system behavior is usually stochastic, which makes the next value prediction a tricky and challenging task that usually needs customized methods. In this paper we propose a novel deep learning based framework for time series analysis and prediction by ensembling parametric and nonparametric methods. Our approach takes advantage of extracting features at different time scales, which improves accuracy without compromising reliability in comparison with the state-of-the-art methods. Our experimental evaluation using real-world SCADA data from a municipal water management system shows that our proposed method outperforms the baseline methods evaluated here.

【Keywords】: critical infrastructure protection; deep learning; multimodal learning; time series forecasting

117. Augmented Variational Autoencoders for Collaborative Filtering with Auxiliary Information.

【Paper Link】【Pages】:1139-1148

【Authors】: Wonsung Lee ; Kyungwoo Song ; Il-Chul Moon

【Abstract】: Recommender systems offer critical services in the age of mass information. A good recommender system selects a certain item for a specific user by recognizing why the user might like the item. This awareness implies that the system should model the background of the items and the users. This background modeling for recommendation is tackled through the various models of collaborative filtering with auxiliary information. This paper presents variational approaches for collaborative filtering to deal with auxiliary information. The proposed methods encompass variational autoencoders through augmenting structures to model the auxiliary information and to model the implicit user feedback. This augmentation includes the ladder network and the generative adversarial network to extract the low-dimensional representations influenced by the auxiliary information. These two augmentations are the first trial in the venue of the variational autoencoders, and we demonstrate their significant improvement on the performances in the applications of the collaborative filtering.

【Keywords】: collaborative filtering; deep learning; generative adversarial networks; recommender systems; variational autoencoders

118. DeepHawkes: Bridging the Gap between Prediction and Understanding of Information Cascades.

【Paper Link】【Pages】:1149-1158

【Authors】: Qi Cao ; Huawei Shen ; Keting Cen ; Wentao Ouyang ; Xueqi Cheng

【Abstract】: Online social media remarkably facilitates the production and delivery of information, intensifying the competition among vast information for users' attention and highlighting the importance of predicting the popularity of information. Existing approaches for popularity prediction fall into two paradigms: feature-based approaches and generative approaches. Feature-based approaches extract various features (e.g., user, content, structural, and temporal features), and predict the future popularity of information by training a regression/classification model. Their predictive performance heavily depends on the quality of hand-crafted features. In contrast, generative approaches devote to characterizing and modeling the process that a piece of information accrues attentions, offering us high ease to understand the underlying mechanisms governing the popularity dynamics of information cascades. But they have less desirable predictive power since they are not optimized for popularity prediction. In this paper, we propose DeepHawkes to combat the defects of existing methods, leveraging end-to-end deep learning to make an analogy to interpretable factors of Hawkes process --- a widely-used generative process to model information cascade. DeepHawkes inherits the high interpretability of Hawkes process and possesses the high predictive power of deep learning methods, bridging the gap between prediction and understanding of information cascades. We verify the effectiveness of DeepHawkes by applying it to predict retweet cascades of Sina Weibo and citation cascades of a longitudinal citation dataset. Experimental results demonstrate that DeepHawkes outperforms both feature-based and generative approaches.

【Keywords】: end-to-end deep learning; hawkes process; information cascade; interpretable factors; popularity prediction

119. CNN-IETS: A CNN-based Probabilistic Approach for Information Extraction by Text Segmentation.

【Paper Link】【Pages】:1159-1168

【Authors】: Meng Hu ; Zhixu Li ; Yongxin Shen ; An Liu ; Guanfeng Liu ; Kai Zheng ; Lei Zhao

【Abstract】: Information Extraction by Text Segmentation (IETS) aims at segmenting text inputs to extract implicit data values contained in them. The state-of-art IETS approaches mainly rely on machine learning techniques, either supervised or unsupervised. However, while the supervised approaches require a large labelled training data, the performance of the unsupervised ones could be unstable on different data sets. To overcome their weaknesses, this paper introduces CNN-IETS, a novel unsupervised probabilistic approach that takes the advantages of pre-existing data and a Convolution Neural Network (CNN)-based probabilistic classification model. While using the CNN model can ease the burden of selecting high-quality features in associating text segments with attributes of a given domain, the pre-existing data as a domain knowledge base can provide training data with a comprehensive list of features for building the CNN model. Given an input text, we do initial segmentation (according to the occurrences of these words in the knowledge base) to generate text segments for CNN classification with probabilities. Then, based on the probabilistic CNN classification results, we work on finding the most probable labelling way to the whole input text. As a complementary, a bidirectional sequencing model learned on-demand from test data is finally deployed to do further adjustment to some problematic labelled segments. Our experimental study conducted on several real data collections shows that CNN-IETS improves the extraction quality of state-of-art approaches by more than 10%.

【Keywords】: convolution neural network; iets; information extraction

Session 7A: Health Analytics 1 4

120. A Personalized Predictive Framework for Multivariate Clinical Time Series via Adaptive Model Selection.

【Paper Link】【Pages】:1169-1177

【Authors】: Zitao Liu ; Milos Hauskrecht

【Abstract】: Building of an accurate predictive model of clinical time series for a patient is critical for understanding of the patient condition, its dynamics, and optimal patient management. Unfortunately, this process is not straightforward. First, patient-specific variations are typically large and population-based models derived or learned from many different patients are often unable to support accurate predictions for each individual patient. Moreover, time series observed for one patient at any point in time may be too short and insufficient to learn a high-quality patient-specific model just from the patient's own data. To address these problems we propose, develop and experiment with a new adaptive forecasting framework for building multivariate clinical time series models for a patient and for supporting patient-specific predictions. The framework relies on the adaptive model switching approach that at any point in time selects the most promising time series model out of the pool of many possible models, and consequently, combines advantages of the population, patient-specific and short-term individualized predictive models. We demonstrate that the adaptive model switching framework is a very promising approach to support personalized time series prediction, and that it is able to outperform predictions based on pure population and patient-specific models, as well as other patient-specific model adaptation strategies.

【Keywords】: forecasting; multivariate time series; personalized medicine

121. DiagTree: Diagnostic Tree for Differential Diagnosis.

【Paper Link】【Pages】:1179-1188

【Authors】: Yejin Kim ; Jingyun Choi ; Yosep Chong ; Xiaoqian Jiang ; Hwanjo Yu

【Abstract】: Differential diagnosis is detection of one disease among similar diseases using evidence such as pathologic tests. A Partially Observed Markov Decision Process (POMDP) formulates the complex differential diagnosis process into a probabilistic decision-making model. However, differential diagnosis is not often fully formulated as POMDP because model construction does not consider the cost (or time) to finish the diagnosis process, or the practical convention on clinical tests. We propose a Diagnostic Tree (DiagTree), a new framework for diagnosing diseases, which combines several tests to reduce the diagnosis time and to incorporate real-world constraints into discrete optimization. DiagTree consists of multiple tests in internal nodes and posterior probabilities ("confidences") that the patient suffers the disease listed at each leaf node. The confidences are computed after a series of test results is applied in internal nodes. DiagTree is built to maximize the confidences at leaf nodes and to minimize the decision process time. We formulate this problem as integer programming and solve it by the Branch-and-Bound method and a greedy approach. We apply DiagTree to immunohistochemistry profiles to detect lymphoid neoplasms. We evaluate the accuracy and cost of the diagnosis rules from DiagTree compared to those obtained using rules that clinicians derived from their experience. DiagTree detected diseases with high accuracy and also reduced the diagnosis cost (or time) compared to the existing rules of clinicians. DiagTree can support clinicians by suggesting a simple diagnosis process with high accuracy and low cost among test candidates.

【Keywords】: decision process; discrete optimization; early diagnosis
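
The leaf "confidences" described above are posterior probabilities given test results. As a hedged illustration, the sketch below applies Bayes' rule for a single test result. The disease names, prior, and likelihood values are made-up numbers, and `update_posterior` is a hypothetical helper, not part of DiagTree.

```python
def update_posterior(prior, likelihood):
    """Bayes' rule: posterior[d] is proportional to prior[d] * P(result | d)."""
    unnormalized = {d: prior[d] * likelihood[d] for d in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed result has zero probability under every disease")
    return {d: p / total for d, p in unnormalized.items()}


# Hypothetical example: two candidate diseases and one positive test result.
prior = {"disease_A": 0.5, "disease_B": 0.5}
p_positive_given_disease = {"disease_A": 0.9, "disease_B": 0.1}

posterior = update_posterior(prior, p_positive_given_disease)
print(posterior)
```

Applying the same update once per test along a root-to-leaf path gives the confidence at that leaf, assuming the test results are conditionally independent given the disease.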

122. Fine-grained Patient Similarity Measuring using Deep Metric Learning.

【Paper Link】【Pages】:1189-1198

【Authors】: Jiazhi Ni ; Jie Liu ; Chenxin Zhang ; Dan Ye ; Zhirou Ma

【Abstract】: Patient similarity measuring plays a significant role in many healthcare applications, such as cohort study and treatment comparative effectiveness research. Existing methods mainly rely on supervised metric learning methods to study patient similarity from Electronic Health Records (EHRs), facing the challenge of differentiating patients with a large number of fine-grained disease categories. Deep metric learning has gained noticeable success in the fine-grained image categorization problem; however, it cannot be directly applied to classification of patients with hierarchical disease labels. In this paper, we present a novel three-layer patient similarity deep metric learning framework (PSDML) by optimizing a quadruple loss improved from triplet loss, to learn an embedding distance for disease classification among the patients. The context semantic relation of multi-diagnosis labels encoded by ICD-10 is taken into account to compute the supervised distance of patients. To solve the diagnosis class imbalance, patient tuples that violate the loss constraints of the deep metric learning framework are chosen first as samples to accelerate the convergence of the neural network. We conducted a KNN multi-label classification experiment using the learned similarity metric on real EHRs about stroke disease collected by the Chinese Stroke Data Center. The results demonstrate substantial improvement over the baselines.

【Keywords】: Deep Metric Learning; Distance Metric Learning; Multi Label Classification; Patient Similarity
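
For context, the abstract says the quadruple loss is improved from triplet loss. The sketch below shows one common form of the standard triplet loss on plain Python vectors. It does not implement the paper's quadruple loss, whose definition is not given here, and the margin value is an arbitrary assumption.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-d patient embeddings.
anchor = [0.0, 0.0]
same_disease = [0.0, 1.0]
far_patient = [3.0, 4.0]
near_patient = [0.0, 1.5]

print(triplet_loss(anchor, same_disease, far_patient))   # constraint satisfied
print(triplet_loss(anchor, same_disease, near_patient))  # constraint violated
```

Triplets with a positive loss are the ones that violate the constraint, which matches the abstract's idea of preferring violating tuples as training samples.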

123. Differentially Private Regression for Discrete-Time Survival Analysis.

【Paper Link】【Pages】:1199-1208

【Authors】: Thông T. Nguyên ; Siu Cheung Hui

【Abstract】: In survival analysis, regression models are used to understand the effects of explanatory variables (e.g., age, sex, weight, etc.) on the survival probability. However, for sensitive survival data such as medical data, there are serious concerns about the privacy of individuals in the data set when medical data is used to fit the regression models. The closest work addressing such privacy concerns is the work on Cox regression which linearly projects the original data to a lower dimensional space. However, the weakness of this approach is that there is no formal privacy guarantee for such projection. In this work, we aim to propose solutions for the regression problem in survival analysis with the protection of differential privacy, which is a gold standard of privacy protection in data privacy research. To this end, we extend the Output Perturbation and Objective Perturbation approaches which were originally proposed to protect differential privacy for the Empirical Risk Minimization (ERM) problems. In addition, we also propose a novel sampling approach based on the Markov Chain Monte Carlo (MCMC) method to practically guarantee differential privacy with better accuracy. We show that our proposed approaches achieve good accuracy as compared to the non-private results while guaranteeing differential privacy for individuals in the private data set.

【Keywords】: differential privacy; discrete-time models; regression models; survival analysis
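
As background for the discrete-time setting, the sketch below shows how per-interval hazards turn into a survival curve, S(t) = (1 - h_1)(1 - h_2)...(1 - h_t). The logistic form of the hazard is a common modelling choice assumed here for illustration. It is not confirmed to be the paper's model, and no privacy mechanism is included.

```python
import math


def logistic_hazard(baseline, coefficients, covariates):
    """Assumed hazard model: h = sigmoid(baseline + coefficients . covariates)."""
    score = baseline + sum(c * x for c, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-score))


def survival_curve(hazards):
    """Probability of surviving past each interval, given per-interval hazards."""
    curve, surviving = [], 1.0
    for h in hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("each hazard must be a probability in [0, 1]")
        surviving *= 1.0 - h
        curve.append(surviving)
    return curve


# Hypothetical hazards for two consecutive intervals.
print(survival_curve([0.1, 0.2]))
# Hypothetical covariates (age in decades, indicator variable).
print(logistic_hazard(-3.0, [0.2, 0.5], [6.0, 1.0]))
```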

Session 7B: Privacy Preserving Data Mining 4

【Paper Link】【Pages】:1209-1218

【Authors】: Huandong Wang ; Chen Gao ; Yong Li ; Zhi-Li Zhang ; Depeng Jin

【Abstract】: It is well-known that online services resort to various cookies to track users through users' online service identifiers (IDs) - in other words, when users access online services, various "fingerprints" are left behind in the cyberspace. As they roam around in the physical world while accessing online services via mobile devices, users also leave a series of "footprints" - i.e., hints about their physical locations - in the physical world. This poses a potent new threat to user privacy: one can potentially correlate the "fingerprints" left by the users in the cyberspace with "footprints" left in the physical world to infer and reveal leakage of user physical world privacy, such as frequent user locations or mobility trajectories in the physical world - we refer to this problem as user physical world privacy leakage via user cyberspace privacy leakage. In this paper we address the following fundamental question: what kind - and how much - of user physical world privacy might be leaked if we could get hold of such diverse network datasets even without any physical location information? In order to conduct an in-depth investigation of this question, we utilize the network data collected via a DPI system at the routers within one of the largest Internet operators in Shanghai, China over a duration of one month. We decompose the fundamental question into three problems: i) linkage of various online user IDs belonging to the same person via mobility pattern mining; ii) physical location classification via aggregate user mobility patterns over time; and iii) tracking user physical mobility. By developing novel and effective methods for solving each of these problems, we demonstrate that the question of user physical world privacy leakage via user cyberspace privacy leakage is not hypothetical, but indeed poses a real and potent threat to user privacy.

【Keywords】: cookie; cyber-physical systems; privacy; trajectories

125. Privacy-Preserving Collaborative Deep Learning with Application to Human Activity Recognition.

【Paper Link】【Pages】:1219-1228

【Authors】: Lingjuan Lyu ; Xuanli He ; Yee Wei Law ; Marimuthu Palaniswami

【Abstract】: The proliferation of wearable devices has contributed to the emergence of mobile crowdsensing, which leverages the power of the crowd to collect and report data to a third party for large-scale sensing and collaborative learning. However, since the third party may not be honest, privacy poses a major concern. In this paper, we address this concern with a two-stage privacy-preserving scheme called RG-RP: the first stage is designed to mitigate maximum a posteriori (MAP) estimation attacks by perturbing each participant's data through a nonlinear function called repeated Gompertz (RG), while the second stage aims to maintain accuracy and reduce transmission energy by projecting high-dimensional data to a lower dimension, using a row-orthogonal random projection (RP) matrix. The proposed RG-RP scheme delivers better recovery resistance to MAP estimation attacks than most state-of-the-art techniques on both synthetic and real-world datasets. For collaborative learning, we propose a novel LSTM-CNN model combining the merits of Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). Our experiments on two representative movement datasets captured by wearable sensors demonstrate that the proposed LSTM-CNN model outperforms standalone LSTM, CNN and Deep Belief Network. Together, RG-RP and LSTM-CNN provide a privacy-preserving collaborative learning framework that is both accurate and privacy-preserving.

【Keywords】: collaborative learning; deep learning; mobile crowdsensing; privacy-preserving
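
The second stage described above projects data with a row-orthogonal random matrix. A minimal sketch of that step follows. It assumes Gaussian random rows orthonormalized with Gram-Schmidt, which is one way to obtain such a matrix and not necessarily the construction used in the paper. The repeated Gompertz perturbation is not shown because its exact form is not given in the abstract.

```python
import math
import random


def row_orthogonal_matrix(k, d, seed=0):
    """Build a k x d matrix whose rows are orthonormal (requires k <= d)."""
    if k > d:
        raise ValueError("cannot have more orthonormal rows than columns")
    rng = random.Random(seed)
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Gram-Schmidt: remove the components along the rows found so far.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def project(matrix, x):
    """Map a d-dimensional vector x to k dimensions."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


R = row_orthogonal_matrix(3, 8, seed=1)
print(project(R, [1.0] * 8))
```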

126. Privacy Aware Temporal Profiling of Emails in Distributed Setup.

【Paper Link】【Pages】:1229-1238

【Authors】: Sutapa Mondal ; Manish Shukla ; Sachin Lodha

【Abstract】: The enterprise email promises to be a rich source for knowledge discovery. This is made possible due to the direct nature of communication, support for diverse media types, active participation of entities and presence of chronological ordering of messages. Also, the enterprise emails are more trustworthy than external emails due to their formal nature. This data source has not been fully tapped. In fact, the existing work on profiling of emails focuses primarily on expertise identification and retrieval. Even in these studies, the researchers have made some restrictive assumptions. For instance, in many of the formulations, the underlying system assumes a centralized data repository, and the communication network is complete. They do not account for individual biases in an email while mining and aggregating results. Furthermore, email holds a fair amount of personal and organizational sensitive information. None of the existing work on email profiling suggests anything on alleviating the individual and organizational privacy concerns. In this paper, we propose a system for building an individual's perceived knowledge profile ("What she knows?"), trends profile ("In which direction and how far her expertise has grown?"), and team profile ("What all her teammates know?"). The proposed system operates in a distributed network and performs analysis of emails residing on a time-varying local email database, with no prior assumptions about the environment. It also takes care of missing nodes in a partial communication network, by deducing their profile from perceived profiles of its peers and their common interest. We developed a two-pass aggregation algorithm for combining results from individual nodes and drawing useful insights. A graph-based algorithm is used for calculating spread (reach) and popularity (recall) for further improving the output of the aggregation algorithm. The results show that the two-pass aggregation step is sufficient in the majority of cases, and a hybrid of email content and graph-based approach works well in a distributed setup.

【Keywords】: data privacy; distributed retrieval; enterprise search; information extraction; temporal profiling

127. Name Disambiguation in Anonymized Graphs using Network Embedding.

【Paper Link】【Pages】:1239-1248

【Authors】: Baichuan Zhang ; Mohammad Al Hasan

【Abstract】: In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval, web search, and more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed which aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction, or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because instead of using attributes pertaining to a real-life person, our method leverages only relational data in the form of anonymized graphs. In the methodological aspect, the proposed method uses a novel representation learning model to embed each document in a low dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than the existing name disambiguation methods working in a similar setting.

【Keywords】: clustering; name disambiguation; neural network embedding

Session 7C: Social Networks 1 4

【Paper Link】【Pages】:1249-1258

【Authors】: Rui Dong ; Yizhou Sun ; Lu Wang ; Yupeng Gu ; Yuan Zhong

【Abstract】: Social media websites have become a popular outlet for online users to express their opinions on controversial issues, such as gun control and abortion. Understanding users' stances and their arguments is a critical task for policy-making process and public deliberation. Existing methods rely on large amounts of human annotation for predicting stance on issues of interest, which is expensive and hard to scale to new problems. In this work, we present a weakly-guided user stance modeling framework which simultaneously considers two types of information: what do you say (via stance-based content generative model) and how do you behave (via social interaction-based graph regularization). We experiment with two types of social media data: news comments and discussion forum posts. Our model uniformly outperforms a logistic regression-based supervised method on stance-based link prediction for unseen users on news comments. Our method also achieves better or comparable stance prediction performance for discussion forum users, when compared with state-of-the-art supervised systems. Meanwhile, separate word distributions are learned for users of opposite stances. This potentially helps with better understanding and interpretation of conflicting arguments for controversial issues.

【Keywords】: online behavior mining; social computing; social media; user stance prediction

【Paper Link】【Pages】:1259-1267

【Authors】: Yujie Fan ; Yiming Zhang ; Yanfang Ye ; Xin Li ; Wanhong Zheng

【Abstract】: Opioid (e.g., heroin and morphine) addiction has become one of the largest and deadliest epidemics in the United States. To combat such a deadly epidemic, there is an urgent need for novel tools and methodologies to gain new insights into the behavioral processes of opioid abuse and addiction. The role of social media in biomedical knowledge mining has become increasingly significant in recent years. In this paper, we propose a novel framework named AutoDOA to automatically detect opioid addicts from Twitter, which can potentially assist in sharpening our understanding of the behavioral process of opioid abuse and addiction. In AutoDOA, to model the users and posted tweets as well as their rich relationships, a structured heterogeneous information network (HIN) is first constructed. Then a meta-path based approach is used to formulate similarity measures over users, and different similarities are aggregated using Laplacian scores. Based on the HIN and the combined meta-path, to reduce the cost of acquiring labeled examples for supervised learning, a transductive classification model is built for automatic opioid addict detection. To the best of our knowledge, this is the first work to apply transductive classification in HIN to the drug-addiction domain. Comprehensive experiments on real sample collections from Twitter are conducted to validate the effectiveness of our developed system AutoDOA in opioid addict detection by comparisons with other alternative methods. The results and case studies also demonstrate that knowledge from daily-life social media data mining could support a better practice of opioid addiction prevention and treatment.

【Keywords】: heterogeneous information network; opioid addict detection; social media; transductive classification

【Paper Link】【Pages】:1269-1278

【Authors】: Zhiwei Wang ; Tyler Derr ; Dawei Yin ; Jiliang Tang

【Abstract】: It has become increasingly popular to use mobile social networking applications for weight loss and management. Users can not only create profiles and maintain their records but also perform a variety of social activities that shatter the barrier to sharing or seeking information. Due to their open and connected nature, these applications produce massive data that consists of rich weight-related information, which offers immense opportunities for us to enable advanced research on weight loss. In this paper, we conduct an initial investigation to understand weight loss with a large-scale mobile social networking dataset with nearly 10 million users. In particular, we study individual and social factors related to weight loss and reveal a number of interesting findings that help us build a meaningful model to predict weight loss automatically. The experimental results demonstrate the effectiveness of the proposed model and the significance of social factors in weight loss.

【Keywords】: mobile applications; social network analysis; weight loss

131. Tweet Geolocation: Leveraging Location, User and Peer Signals.

【Paper Link】【Pages】:1279-1288

【Authors】: Wen-Haw Chong ; Ee-Peng Lim

【Abstract】: Which venue is a tweet posted from? We refer to this as fine-grained geolocation. To solve this problem effectively, we develop novel techniques to exploit each posting user's content history. This is motivated by our finding that most users do not share their visitation history, but have ample content history from tweet posts. We formulate fine-grained geolocation as a ranking problem whereby given a test tweet, we rank candidate venues. We propose several models that leverage three types of signals from locations, users and peers. Firstly, the location signals are words that are indicative of venues. We propose a location-indicative weighting scheme to capture this. Next we exploit user signals from each user's content history to enrich the very limited content of their tweets which have been targeted for geolocation. The intuition is that the user's other tweets may have been from the test venue or related venues, thus providing informative words. In this regard, we propose query expansion as the enrichment approach. Finally, we exploit the signals from peer users who have similar content history and thus potentially similar visitation behavior as the users of the test tweets. This suggests collaborative filtering where visitation information is propagated via content similarities. We propose several models incorporating different combinations of the three signals. Our experiments show that the best model incorporates all three signals. It performs 6% to 40% better than the baselines depending on the metric and dataset.

【Keywords】: collaborative filtering; query expansion; tweet geolocation

Session 7D: Application driven analysis 4

132. A Two-step Information Accumulation Strategy for Learning from Highly Imbalanced Data.

【Paper Link】【Pages】:1289-1298

【Authors】: Bin Liu ; Min Zhang ; Weizhi Ma ; Xin Li ; Yiqun Liu ; Shaoping Ma

【Abstract】: Highly imbalanced data is common in the real world and it is important but difficult to train an effective classifier. In this paper, our major point is that the imbalance is the observed phenomenon but not the cause of the problem. The challenge is that useful information is overshadowed in the large scale of data in both majority and minority classes. We propose a novel two-step strategy, Information Accumulation, which first selects the most discriminative data by the Zooming-in phase, and then leverages unlabeled data by pseudo active learning and self-training in the phase of Learning from Learned Results. Comparative experiments are conducted on large-scale highly imbalanced real customer service data on a complaint detection task (where less than 2% of data is positive). The results on eight state-of-the-art classification algorithms show that significant improvements are observed on the performances of all algorithms with Information Accumulation (for example, the F-Measure score of Xgboost is increased by 197% from 0.115 to 0.347), which demonstrates the effectiveness and general applicability of the proposed strategy. This work explores a new idea on dealing with highly imbalanced data: we do not aim to balance the training examples as usual, but focus on finding the most discriminative information from labeled data and the learning results of unlabeled data.

【Keywords】: complaint call detection; imbalanced learning; text classification

133. Understanding Database Performance Inefficiencies in Real-world Web Applications.

【Paper Link】【Pages】:1299-1308

【Authors】: Cong Yan ; Alvin Cheung ; Junwen Yang ; Shan Lu

【Abstract】: Many modern database-backed web applications are built upon Object Relational Mapping (ORM) frameworks. While such frameworks ease application development by abstracting persistent data as objects, such convenience comes with a performance cost. In this paper, we studied 27 real-world open-source applications built on top of the popular Ruby on Rails ORM framework, with the goal to understand the database-related performance inefficiencies in these applications. We discovered a number of inefficiencies ranging from physical design issues to how queries are expressed in the application code. We applied static program analysis to identify and measure how prevalent these issues are, then suggested techniques to alleviate these issues and measured the potential performance gain as a result. These techniques significantly reduce database query time (up to 91%) and the webpage response time (up to 98%). Our study provides guidance to the design of future database engines and ORM frameworks to support database applications that are performant yet without sacrificing programmability.

【Keywords】:

134. Data Driven Chiller Plant Energy Optimization with Domain Knowledge.

【Paper Link】【Pages】:1309-1317

【Authors】: Hoang Dung Vu ; Kok-Soon Chai ; Bryan Keating ; Nurislam Tursynbek ; Boyan Xu ; Kaige Yang ; Xiaoyan Yang ; Zhenjie Zhang

【Abstract】: Refrigeration and chiller optimization is an important and well studied topic in mechanical engineering, mostly taking advantage of physical models of the equipment, designed on top of over-simplified assumptions. Conventional optimization techniques using physical models make decisions of online parameter tuning, based on very limited information of hardware specifications and external conditions, e.g., outdoor weather. In recent years, a new generation of sensors is becoming an essential part of new chiller plants, for the first time allowing the system administrators to continuously monitor the running status of all equipment in a timely and accurate way. The explosive growth of data flowing to databases, driven by the increasing analytical power of machine learning and data mining, unveils new possibilities of data-driven approaches for real-time chiller plant optimization. This paper presents our research and industrial experience on the adoption of data models and optimizations on chiller plants and discusses the lessons learnt from our practice on real world plants. Instead of employing complex machine learning models, we emphasize the incorporation of appropriate domain knowledge into data analysis tools, which turns out to be the key performance improver over state-of-the-art deep learning techniques by a significant margin. Our empirical evaluation on a real world chiller plant achieves savings of more than 7% on daily power consumption.

【Keywords】: chiller plant; data driven; energy saving; optimization

135. Partitioning Orders in Online Shopping Services.

【Paper Link】【Pages】:1319-1328

【Authors】: Sreenivas Gollapudi ; Ravi Kumar ; Debmalya Panigrahi ; Rina Panigrahy

【Abstract】: The rapid growth of the Internet has led to the widespread use of newer and richer models of online shopping and delivery services. The race to efficient large scale on-demand delivery has transformed such services into complex networks of shoppers (typically working in the stores), stores, and consumers. The efficiency of processing orders in stores is critical to the profitability of the business model. Motivated by this setting, we consider the following problem: given a set of shopping orders each consisting of a few items, how to best partition the orders among a given number of shoppers working for an online shopping service? Formulating this as an optimization problem, we propose a family of simple and efficient algorithms that admit natural constraints such as number of items a shopper can process in this setting. In addition to showing provable guarantees for the algorithms, we also demonstrate their efficiency in practice on real-world data, outperforming strong baselines.

【Keywords】: e-commerce; order partitioning; vehicle routing

Session 7E: Text Mining 4

136. Taxonomy Induction Using Hypernym Subsequences.

【Paper Link】【Pages】:1329-1338

【Authors】: Amit Gupta ; Rémi Lebret ; Hamza Harkous ; Karl Aberer

【Abstract】: We propose a novel, semi-supervised approach towards domain taxonomy induction from an input vocabulary of seed terms. Unlike all previous approaches, which typically extract direct hypernym edges for terms, our approach utilizes a novel probabilistic framework to extract hypernym subsequences. Taxonomy induction from extracted subsequences is cast as an instance of the minimum-cost flow problem on a carefully designed directed graph. Through experiments, we demonstrate that our approach outperforms state-of-the-art taxonomy induction approaches across four languages. Importantly, we also show that our approach is robust to the presence of noise in the input vocabulary. To the best of our knowledge, this robustness has not been empirically proven in any previous approach.

【Keywords】: algorithms; flow networks; knowledge acquisition; minimum-cost flow optimization; taxonomy induction; term taxonomies

137. Unsupervised Concept Categorization and Extraction from Scientific Document Titles.

【Paper Link】【Pages】:1339-1348

【Authors】: Adit Krishnan ; Aravind Sankar ; Shi Zhi ; Jiawei Han

【Abstract】: This paper studies the automated categorization and extraction of scientific concepts from titles of scientific articles, in order to gain a deeper understanding of their key contributions and facilitate the construction of a generic academic knowledgebase. Towards this goal, we propose an unsupervised, domain-independent, and scalable two-phase algorithm to type and extract key concept mentions into aspects of interest (e.g., Techniques, Applications, etc.). In the first phase of our algorithm we propose PhraseType, a probabilistic generative model which exploits textual features and limited POS tags to broadly segment text snippets into aspect-typed phrases. We extend this model to simultaneously learn aspect-specific features and identify academic domains in multi-domain corpora, since the two tasks mutually enhance each other. In the second phase, we propose an approach based on adaptor grammars to extract fine grained concept mentions from the aspect-typed phrases without the need for any external resources or human effort, in a purely data-driven manner. We apply our technique to study literature from diverse scientific domains and show significant gains over state-of-the-art concept extraction techniques. We also present a qualitative analysis of the results obtained.

【Keywords】: adaptor grammar; concept extraction; probabilistic model

138. MIKE: Keyphrase Extraction by Integrating Multidimensional Information.

【Paper Link】【Pages】:1349-1358

【Authors】: Yuxiang Zhang ; Yaocheng Chang ; Xiaoqing Liu ; Sujatha Das Gollapalli ; Xiaoli Li ; Chunjing Xiao

【Abstract】: Traditional supervised keyphrase extraction models depend on the features of labelled keyphrases while prevailing unsupervised models mainly rely on the structure of the word graph, with candidate words as nodes and edges capturing the co-occurrence information between words. However, systematically integrating all this multidimensional heterogeneous information into a unified model is relatively unexplored. In this paper, we focus on how to effectively exploit multidimensional information to improve the keyphrase extraction performance (MIKE). Specifically, we propose a random-walk parametric model, MIKE, that learns the latent representation for a candidate keyphrase that captures the mutual influences among all information, and simultaneously optimizes the parameters and ranking scores of candidates in the word graph. We use the gradient-descent algorithm to optimize our model and report comprehensive experiments with two publicly-available WWW and KDD datasets in Computer Science. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art graph-based keyphrase extraction approaches.

【Keywords】: graph-based keyphrase extraction approach; keyphrase extraction; multidimensional information; parametric model
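
The unsupervised word-graph setting that the abstract contrasts itself with can be sketched as a plain random walk (PageRank-style) over a co-occurrence graph. This is a generic baseline in the spirit of that description, not the MIKE model. The window size, damping factor, and iteration count are assumed values.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking words that appear within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                graph[word].add(tokens[j])
                graph[tokens[j]].add(word)
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Power iteration for random-walk scores on the word graph."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor passes on an equal share of its current score.
            incoming = sum(score[m] / len(graph[m]) for m in graph[n])
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        score = updated
    return score


tokens = "graph ranking of graph words uses graph structure".split()
scores = rank_words(cooccurrence_graph(tokens))
print(sorted(scores, key=scores.get, reverse=True))
```

In practice candidate words would first be filtered (for example by part of speech) and adjacent top-ranked words merged into phrases; those steps are omitted here.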

139. QALink: Enriching Text Documents with Relevant Q&A Site Contents.

【Paper Link】【Pages】:1359-1368

【Authors】: Yixuan Tang ; Weilong Huang ; Qi Liu ; Anthony K. H. Tung ; Xiaoli Wang ; Jisong Yang ; Beibei Zhang

【Abstract】: With rapid development of Q&A sites such as Quora and StackExchange, high quality question-answer pairs have been produced by users. These Q&A contents cover a wide range of topics, and they are useful for users to resolve queries and obtain new knowledge. Meanwhile, when people are reading digital documents, they may encounter reading problems such as lack of background information and unclear illustration of concepts. We believe that Q&A sites offer high-quality contents which can serve as rich supplements to digital documents. In this paper, we devise a rigorous formulation of the novel text enrichment problem, and design an end-to-end system named QALink which assigns the most relevant Q&A contents to the corresponding section of the document. We first present a new segmentation approach to model each document with a hierarchical structure. Based on the hierarchy, queries are constructed to retrieve and rank related question-answer pairs. Both syntactical and semantic features are adopted in our system. The empirical evaluation results indicate that QALink is able to effectively enrich text documents with relevant Q&A contents to help people better understand the documents.

【Keywords】: hierarchical text segmentation; learning to rank; probabilistic information retrieval; q&a sites; text enrichment

Session 7F: Efficient Learning 4

140. Sequence Modeling with Hierarchical Deep Generative Models with Dual Memory.

【Paper Link】【Pages】:1369-1378

【Authors】: Yanan Zheng ; Lijie Wen ; Jianmin Wang ; Jun Yan ; Lei Ji

【Abstract】: Deep Generative Models (DGMs) are able to extract high-level representations from massive unlabeled data and are explainable from a probabilistic perspective. Such characteristics favor sequence modeling tasks. However, it still remains a huge challenge to model sequences with DGMs. Unlike real-valued data that can be directly fed into models, sequence data consist of discrete elements and need to be transformed into certain representations first. This leads to the following two challenges. First, high-level features are sensitive to small variations of inputs as well as the way of representing data. Second, the models are more likely to lose long-term information during multiple transformations. In this paper, we propose a Hierarchical Deep Generative Model With Dual Memory to address the two challenges. Furthermore, we provide a method to efficiently perform inference and learning on the model. The proposed model extends basic DGMs with an improved hierarchically organized multi-layer architecture. Besides, our model incorporates memories along dual directions, respectively denoted as broad memory and deep memory. The model is trained end-to-end by optimizing a variational lower bound on data log-likelihood using the improved stochastic variational method. We perform experiments on several tasks with various datasets and obtain excellent results. The results of language modeling show our method significantly outperforms state-of-the-art results in terms of generative performance. Extended experiments, including document modeling and sentiment analysis, demonstrate the high effectiveness of the dual memory mechanism and latent representations. Random text generation provides a straightforward illustration of the advantages of our model.

【Keywords】: dual memory mechanism; hierarchical deep generative models; inference and learning; sequence modeling

141. Active Learning for Large-Scale Entity Resolution.

【Paper Link】【Pages】:1379-1388

【Authors】: Kun Qian ; Lucian Popa ; Prithviraj Sen

【Abstract】: Entity resolution (ER) is the task of identifying different representations of the same real-world object across datasets. Designing and tuning ER algorithms is an error-prone, labor-intensive process, which can significantly benefit from data-driven, automated learning methods. Our focus is on "big data" scenarios where the primary challenges include 1) identifying, out of a potentially massive set, a small subset of informative examples to be labeled by the user, 2) using the labeled examples to efficiently learn ER algorithms that achieve both high precision and high recall, and 3) executing the learned algorithm to determine duplicates at scale. Recent work on learning ER algorithms has employed active learning to partially address the above challenges by aiming to learn ER rules in the form of conjunctions of matching predicates, under precision guarantees. While successful in learning a single rule, prior work has been less successful in learning multiple rules that are sufficiently different from each other, thus missing opportunities for improving recall. In this paper, we introduce an active learning system that learns, at scale, multiple rules each having significant coverage of the space of duplicates, thus leading to high recall, in addition to high precision. We show the superiority of our system on real-world ER scenarios of sizes up to tens of millions of records, over state-of-the-art active learning methods that learn either rules or committees of statistical classifiers for ER, and even over sophisticated methods based on first-order probabilistic models.

【Keywords】: entity resolution; large-scale data cleansing

142. Indexable Bayesian Personalized Ranking for Efficient Top-k Recommendation.

【Paper Link】【Pages】:1389-1398

【Authors】: Dung D. Le ; Hady W. Lauw

【Abstract】: Top-k recommendation seeks to deliver a personalized recommendation list of k items to a user. The dual objectives are (1) accuracy in identifying the items a user is likely to prefer, and (2) efficiency in constructing the recommendation list in real time. One direction towards retrieval efficiency is to formulate retrieval as approximate k nearest neighbor (kNN) search aided by indexing schemes, such as locality-sensitive hashing, spatial trees, and inverted index. These schemes, applied on the output representations of recommendation algorithms, speed up the retrieval process by automatically discarding a large number of potentially irrelevant items when given a user query vector. However, many previous recommendation algorithms produce representations that may not necessarily align well with the structural properties of these indexing schemes, eventually resulting in a significant loss of accuracy post-indexing. In this paper, we introduce Indexable Bayesian Personalized Ranking (IBPR) that learns from ordinal preference to produce representation that is inherently compatible with the aforesaid indices. Experiments on publicly available datasets show superior performance of the proposed model compared to state-of-the-art methods on top-k recommendation retrieval task, achieving significant speedup while maintaining high accuracy.

【Keywords】: indexing; retrieval efficiency; top-k recommendation

143. Latency Reduction via Decision Tree Based Query Construction.

【Paper Link】【Pages】:1399-1407

【Authors】: Aman Grover ; Dhruv Arya ; Ganesh Venkataraman

【Abstract】: LinkedIn as a professional network serves the career needs of 450 Million plus members. The task of job recommendation system is to find the suitable job among a corpus of several million jobs and serve this in real time under tight latency constraints. Job search involves finding suitable job listings given a user, query and context. Typical scoring function for both search and recommendations involves evaluating a function that matches various fields in the job description with various fields in the member profile. This in turn translates to evaluating a function with several thousands of features to get the right ranking. In recommendations, evaluating all the jobs in the corpus for all members is not possible given the latency constraints. On the other hand, reducing the candidate set could potentially involve loss of relevant jobs. We present a way to model the underlying complex ranking function via decision trees. The branches within the decision trees are query clauses and hence the decision trees can be mapped on to real time queries. We developed an offline framework which evaluates the quality of the decision tree with respect to latency and recall. We tested the approach on job search and recommendations on LinkedIn and A/B tests show significant improvements in member engagement and latency. Our techniques helped reduce job search latency by over 67% and our recommendations latency by over 55%. Our techniques show 3.5% improvement in applications from job recommendations primarily due to reduced timeouts from upstream services. As of writing the approach powers all of job search and recommendations on LinkedIn.

【Keywords】: information retrieval; personalized search; recommender systems

Session 7G: Recommendation 2 4

144. Broad Learning based Multi-Source Collaborative Recommendation.

【Paper Link】【Pages】:1409-1418

【Authors】: Junxing Zhu ; Jiawei Zhang ; Lifang He ; Quanyuan Wu ; Bin Zhou ; Chenwei Zhang ; Philip S. Yu

【Abstract】: Anchor links connect information entities, such as entities of movies or products, across networks from different sources, and thus information in these networks can be transferred directly via anchor links. Therefore, anchor links have great value to many cross-network applications, such as cross-network social link prediction and cross-network recommendation. In this paper, we focus on studying the recommendation problem that can provide ratings of items or services. To address the problem, we propose a Cross-network Collaborative Matrix Factorization (CCMF) recommendation framework based on broad learning setting, which can effectively integrate multi-source information and alleviate the sparse information problem in each individual network. Based on item anchor links CCMF can fuse item similarity information and item latent information across networks from different sources. And different from most of the traditional works, CCMF can make multi-source recommendation tasks collaborate together via the information transfer based on the broad learning setting. During the transfer process, a novel cross-network similarity transfer method is applied to keep the consistency of item similarities between two different networks, and a domain adaptation matrix is used to overcome the domain difference problem. We conduct experiments to compare the proposed CCMF method with both classic and state-of-the-art recommendation techniques. The experimental results illustrate that CCMF outperforms other methods in different experimental circumstances, and has great advantages on dealing with different data sparse problems.

【Keywords】: anchor links; cross-network; matrix factorization; recommendation

145. Neural Attentive Session-based Recommendation.

【Paper Link】【Pages】:1419-1428

【Authors】: Jing Li ; Pengjie Ren ; Zhumin Chen ; Zhaochun Ren ; Tao Lian ; Jun Ma

【Abstract】: Given e-commerce scenarios that user profiles are invisible, session-based recommendation is proposed to generate recommendation results from short sessions. Previous work only considers the user's sequential behavior in the current session, whereas the user's main purpose in the current session is not emphasized. In this paper, we propose a novel neural networks framework, i.e., Neural Attentive Recommendation Machine (NARM), to tackle this problem. Specifically, we explore a hybrid encoder with an attention mechanism to model the user's sequential behavior and capture the user's main purpose in the current session, which are combined as a unified session representation later. We then compute the recommendation scores for each candidate item with a bi-linear matching scheme based on this unified session representation. We train NARM by jointly learning the item and session representations as well as their matchings. We carried out extensive experiments on two benchmark datasets. Our experimental results show that NARM outperforms state-of-the-art baselines on both datasets. Furthermore, we also find that NARM achieves a significant improvement on long sessions, which demonstrates its advantages in modeling the user's sequential behavior and main purpose simultaneously.

【Keywords】: attention mechanism; recurrent neural networks; sequential behavior; session-based recommendation
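
The bi-linear matching step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the score of a candidate item is the item embedding times a matrix times the session representation; the names `bilinear_score`, `rank_items` and the matrix `B` are hypothetical, and the toy vectors stand in for learned parameters. This is not the authors' implementation.

```python
from typing import Dict, List, Sequence


def bilinear_score(item_emb: Sequence[float],
                   B: Sequence[Sequence[float]],
                   session_repr: Sequence[float]) -> float:
    """Score one candidate item as item_emb^T * B * session_repr."""
    # First project the session representation through B ...
    projected = [sum(b * c for b, c in zip(row, session_repr)) for row in B]
    # ... then take the inner product with the item embedding.
    return sum(e * v for e, v in zip(item_emb, projected))


def rank_items(item_embs: Dict[str, Sequence[float]],
               B: Sequence[Sequence[float]],
               session_repr: Sequence[float]) -> List[str]:
    """Return item ids sorted by descending bilinear score."""
    scores = {item: bilinear_score(emb, B, session_repr)
              for item, emb in item_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy values only; in a trained model these would all be learned.
TOY_B = [[2.0, 0.0], [0.0, 3.0]]
TOY_ITEMS = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
TOY_SESSION = [1.0, 1.0]
```

Because `B` need not be square, this form lets item embeddings and session representations have different dimensions, which a plain inner product would not allow.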

146. A Deep Recurrent Collaborative Filtering Framework for Venue Recommendation.

【Paper Link】【Pages】:1429-1438

【Authors】: Jarana Manotumruksa ; Craig Macdonald ; Iadh Ounis

【Abstract】: Venue recommendation is an important application for Location-Based Social Networks (LBSNs), such as Yelp, and has been extensively studied in recent years. Matrix Factorisation (MF) is a popular Collaborative Filtering (CF) technique that can suggest relevant venues to users based on an assumption that similar users are likely to visit similar venues. In recent years, deep neural networks have been successfully applied to tasks such as speech recognition, computer vision and natural language processing. Building upon this momentum, various approaches for recommendation have been proposed in the literature to enhance the effectiveness of MF-based approaches by exploiting neural network models such as: word embeddings to incorporate auxiliary information (e.g. textual content of comments); and Recurrent Neural Networks (RNN) to capture sequential properties of observed user-venue interactions. However, such approaches rely on the traditional inner product of the latent factors of users and venues to capture the concept of collaborative filtering, which may not be sufficient to capture the complex structure of user-venue interactions. In this paper, we propose a Deep Recurrent Collaborative Filtering framework (DRCF) with a pairwise ranking function that aims to capture user-venue interactions in a CF manner from sequences of observed feedback by leveraging Multi-Layer Perceptron and Recurrent Neural Network architectures. Our proposed framework consists of two components: namely Generalised Recurrent Matrix Factorisation (GRMF) and Multi-Level Recurrent Perceptron (MLRP) models. In particular, GRMF and MLRP learn to model complex structures of user-venue interactions using element-wise and dot products as well as the concatenation of latent factors. In addition, we propose a novel sequence-based negative sampling approach that accounts for the sequential properties of observed feedback and geographical location of venues to enhance the quality of venue suggestions, as well as alleviate the cold-start users problem. Experiments on three large checkin and rating datasets show the effectiveness of our proposed framework by outperforming various state-of-the-art approaches.

【Keywords】: deep recurrent collaborative filtering framework; dynamic preferences; static preferences

147. Recommendation with Capacity Constraints.

【Paper Link】【Pages】:1439-1448

【Authors】: Konstantina Christakopoulou ; Jaya Kawale ; Arindam Banerjee

【Abstract】: In many recommendation settings, the candidate items for recommendation are associated with a maximum capacity, i.e., number of seats in a Point-of-Interest (POI) or number of item copies in the inventory. However, despite the prevalence of the capacity constraint in the recommendation process, the existing recommendation methods are not designed to optimize for respecting such a constraint. Towards closing this gap, we propose Recommendation with Capacity Constraints -- a framework that optimizes for both recommendation accuracy and expected item usage that respects the capacity constraints. We show how to apply our method to three state-of-the-art latent factor recommendation models: probabilistic matrix factorization (PMF), Bayesian personalized ranking (BPR) for item recommendation, and geographical matrix factorization (GeoMF) for POI recommendation. Our experiments indicate that our framework is effective for providing good recommendations while taking the limited resources into consideration. Interestingly, our methods are shown in some cases to further improve the top-N recommendation quality of the respective unconstrained models.

【Keywords】: capacity constraints; latent factor recommendation; point-of-interest recommendation; recommendation systems; user propensity

Session 8A: Recommendation 3 4

148. Joint Representation Learning for Top-N Recommendation with Heterogeneous Information Sources.

【Paper Link】【Pages】:1449-1458

【Authors】: Yongfeng Zhang ; Qingyao Ai ; Xu Chen ; W. Bruce Croft

【Abstract】: The Web has accumulated a rich source of information, such as text, image, rating, etc, which represent different aspects of user preferences. However, the heterogeneous nature of this information makes it difficult for recommender systems to leverage in a unified framework to boost the performance. Recently, the rapid development of representation learning techniques provides an approach to this problem. By translating the various information sources into a unified representation space, it becomes possible to integrate heterogeneous information for informed recommendation. In this work, we propose a Joint Representation Learning (JRL) framework for top-N recommendation. In this framework, each type of information source (review text, product image, numerical rating, etc) is adopted to learn the corresponding user and item representations based on available (deep) representation learning architectures. Representations from different sources are integrated with an extra layer to obtain the joint representations for users and items. In the end, both the per-source and the joint representations are trained as a whole using pair-wise learning to rank for top-N recommendation. We analyze how information propagates among different information sources in a gradient-descent learning paradigm, based on which we further propose an extendable version of the JRL framework (eJRL), which is rigorously extendable to new information sources to avoid model re-training in practice. By representing users and items into embeddings offline, and using a simple vector multiplication for ranking score calculation online, our framework also has the advantage of fast online prediction compared with other deep learning approaches to recommendation that learn a complex prediction network for online calculation.

【Keywords】: heterogeneous information processing; recommender systems; representation learning; top-n recommendation

149. Interacting Attention-gated Recurrent Networks for Recommendation.

【Paper Link】【Pages】:1459-1468

【Authors】: Wenjie Pei ; Jie Yang ; Zhu Sun ; Jie Zhang ; Alessandro Bozzon ; David M. J. Tax

【Abstract】: Capturing the temporal dynamics of user preferences over items is important for recommendation. Existing methods mainly assume that all time steps in user-item interaction history are equally relevant to recommendation, which however does not apply in real-world scenarios where user-item interactions can often happen accidentally. More importantly, they learn user and item dynamics separately, thus failing to capture their joint effects on user-item interactions. To better model user and item dynamics, we present the Interacting Attention-gated Recurrent Network (IARN) which adopts the attention model to measure the relevance of each time step. In particular, we propose a novel attention scheme to learn the attention scores of user and item history in an interacting way, thus to account for the dependencies between user and item dynamics in shaping user-item interactions. By doing so, IARN can selectively memorize different time steps of a user's history when predicting her preferences over different items. Our model can therefore provide meaningful interpretations for recommendation results, which could be further enhanced by auxiliary features. Extensive validation on real-world datasets shows that IARN consistently outperforms state-of-the-art methods.

【Keywords】: attention model; feature-based recommendation; recurrent neural network; user-item interaction

150. A Personalised Ranking Framework with Multiple Sampling Criteria for Venue Recommendation.

【Paper Link】【Pages】:1469-1478

【Authors】: Jarana Manotumruksa ; Craig Macdonald ; Iadh Ounis

【Abstract】: Recommending a ranked list of interesting venues to users based on their preferences has become a key functionality in Location-Based Social Networks (LBSNs) such as Yelp and Gowalla. Bayesian Personalised Ranking (BPR) is a popular pairwise recommendation technique that is used to generate the ranked list of venues of interest to a user, by leveraging the user's implicit feedback such as their check-ins as instances of positive feedback, while randomly sampling other venues as negative instances. To alleviate the sparsity that affects the usefulness of recommendations by BPR for users with few check-ins, various approaches have been proposed in the literature to incorporate additional sources of information such as the social links between users, the textual content of comments, as well as the geographical location of the venues. However, such approaches can only readily leverage one source of additional information for negative sampling. Instead, we propose a novel Personalised Ranking Framework with Multiple sampling Criteria (PRFMC) that leverages both geographical influence and social correlation to enhance the effectiveness of BPR. In particular, we apply a multi-centre Gaussian model and a power-law distribution method, to capture geographical influence and social correlation when sampling negative venues, respectively. Finally, we conduct comprehensive experiments using three large-scale datasets from the Yelp, Gowalla and Brightkite LBSNs. The experimental results demonstrate the effectiveness of fusing both geographical influence and social correlation in our proposed PRFMC framework and its superiority in comparison to BPR-based and other similar ranking approaches. Indeed, our PRFMC approach attains a 37% improvement in MRR over a recently proposed approach that identifies negative venues only from social links.

【Keywords】: geographical influences; negative sampling criterion; social correlation
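
The baseline BPR procedure that the abstract above builds on (check-ins as positive feedback, randomly sampled other venues as negatives, a pairwise objective) can be sketched as follows. This is a minimal illustration of plain BPR with uniform negative sampling, not the PRFMC sampler; the function names, learning rate and regularisation weight are assumptions chosen for the example.

```python
import math
import random
from typing import List, Sequence, Set


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(score_pos: float, score_neg: float) -> float:
    """Pairwise BPR loss: -log sigmoid(score_pos - score_neg)."""
    return -math.log(sigmoid(score_pos - score_neg))


def sample_negative(all_venues: Sequence[str], visited: Set[str],
                    rng: random.Random) -> str:
    """Uniformly sample a venue the user has not checked in to."""
    candidates = [v for v in all_venues if v not in visited]
    return rng.choice(candidates)


def sgd_step(user: List[float], pos: List[float], neg: List[float],
             lr: float = 0.05, reg: float = 0.01) -> None:
    """One in-place gradient step on the regularised BPR objective."""
    margin = dot(user, pos) - dot(user, neg)
    g = 1.0 - sigmoid(margin)  # derivative of log sigmoid(margin)
    for f in range(len(user)):
        u, p, n = user[f], pos[f], neg[f]
        user[f] += lr * (g * (p - n) - reg * u)
        pos[f] += lr * (g * u - reg * p)
        neg[f] += lr * (-g * u - reg * n)
```

Replacing `sample_negative` with a sampler informed by geography or social links is where approaches such as the one described above would differ from this uniform baseline.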

151. BayDNN: Friend Recommendation with Bayesian Personalized Ranking Deep Neural Network.

【Paper Link】【Pages】:1479-1488

【Authors】: Daizong Ding ; Mi Zhang ; Shao-Yuan Li ; Jie Tang ; Xiaotie Chen ; Zhi-Hua Zhou

【Abstract】: Friendship is the cornerstone to build a social network. In online social networks, statistics show that the leading reason for user to create a new friendship is due to recommendation. Thus the accuracy of recommendation matters. In this paper, we propose a Bayesian Personalized Ranking Deep Neural Network (BayDNN) model for friend recommendation in social networks. With BayDNN, we achieve significant improvement on two public datasets: Epinions and Slashdot. For example, on Epinions dataset, BayDNN significantly outperforms the state-of-the-art algorithms, with a 5% improvement on NDCG over the best baseline. The advantages of the proposed BayDNN mainly come from its underlying convolutional neural network (CNN), which offers a mechanism to extract latent deep structural feature representations of the complicated network data, and a novel Bayesian personalized ranking idea, which precisely captures the users' personal bias based on the extracted deep features. To get good parameter estimation for the neural network, we present a fine-tuned pre-training strategy for the proposed BayDNN model based on Poisson and Bernoulli probabilistic models.

【Keywords】: bayesian personalized ranking deep neural network; pre-training strategy; probabilistic model

Session 8B: Text analysis 4

152. A Topic Model Based on Poisson Decomposition.

【Paper Link】【Pages】:1489-1498

【Authors】: Haixin Jiang ; Rui Zhou ; Limeng Zhang ; Hua Wang ; Yanchun Zhang

【Abstract】: Determining appropriate statistical distributions for modeling text corpora is important for accurate estimation of numerical characteristics. Based on the validity of the test on a claim that the data conforms to Poisson distribution we propose Poisson decomposition model (PDM), a statistical model for modeling count data of text corpora, which can straightly capture each document's multidimensional numerical characteristics on topics. In PDM, each topic is represented as a parameter vector with multidimensional Poisson distribution, which can be easily normalized to multinomial term probabilities and each document is represented as measurements on topics and thereby reduced to a measurement vector on topics. We use gradient descent methods and sampling algorithm for parameter estimation. We carry out extensive experiments on the topics produced by our models. The results demonstrate our approach can extract more coherent topics and is competitive in document clustering by using the PDM-based features, compared to PLSI and LDA.

【Keywords】: poisson decomposition; statistical testing; text classification; topic coherence; topic model

153. A Matrix-Vector Recurrent Unit Model for Capturing Compositional Semantics in Phrase Embeddings.

【Paper Link】【Pages】:1499-1507

【Authors】: Rui Wang ; Wei Liu ; Chris McDonald

【Abstract】: The meaning of a multi-word phrase not only depends on the meaning of its constituent words, but also the rules of composing them to give the so-called compositional semantic. However, many deep learning models for learning compositional semantics target specific NLP tasks such as sentiment classification. Consequently, the word embeddings encode the lexical semantics, the weights of the networks are optimised for the classification task. Such models have no mechanisms to explicitly encode the compositional rules, and hence they are insufficient in capturing the semantics of phrases. We present a novel recurrent computational mechanism that specifically learns the compositionality by encoding the compositional rule of each word into a matrix. The network uses a recurrent architecture to capture the order of words for phrases with various lengths without requiring extra preprocessing such as part-of-speech tagging. The model is thoroughly evaluated on both supervised and unsupervised NLP tasks including phrase similarity, noun-modifier questions, sentiment distribution prediction, and domain specific term identification tasks. We demonstrate that our model consistently outperforms the LSTM and CNN deep learning models, simple algebraic compositions, and other popular baselines on different datasets.

【Keywords】: compositional semantics; deep learning; phrase embeddings

154. Words are Malleable: Computing Semantic Shifts in Political and Media Discourse.

【Paper Link】【Pages】:1509-1518

【Authors】: Hosein Azarbonyad ; Mostafa Dehghani ; Kaspar Beelen ; Alexandra Arkut ; Maarten Marx ; Jaap Kamps

【Abstract】: Recently, researchers started to pay attention to the detection of temporal shifts in the meaning of words. However, most (if not all) of these approaches restricted their efforts to uncovering change over time, thus neglecting other valuable dimensions such as social or political variability. We propose an approach for detecting semantic shifts between different viewpoints---broadly defined as a set of texts that share a specific metadata feature, which can be a time-period, but also a social entity such as a political party. For each viewpoint, we learn a semantic space in which each word is represented as a low dimensional neural embedded vector. The challenge is to compare the meaning of a word in one space to its meaning in another space and measure the size of the semantic shifts. We compare the effectiveness of a measure based on optimal transformations between the two spaces with a measure based on the similarity of the neighbors of the word in the respective spaces. Our experiments demonstrate that the combination of these two performs best. We show that the semantic shifts not only occur over time but also along different viewpoints in a short period of time. For evaluation, we demonstrate how this approach captures meaningful semantic shifts and can help improve other tasks such as the contrastive viewpoint summarization and ideology detection (measured as classification accuracy) in political texts. We also show that the two laws of semantic change which were empirically shown to hold for temporal shifts also hold for shifts across viewpoints. These laws state that frequent words are less likely to shift meaning while words with many senses are more likely to do so.

【Keywords】: ideology detection; semantic shifts; word embeddings; word stability
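
One of the two measures described in the abstract above compares the neighbors of a word in the two semantic spaces. A minimal sketch of that idea follows, assuming cosine similarity for neighbor retrieval and one minus the Jaccard overlap of the two neighbor sets as the shift score; both choices, and the toy vectors, are illustrative assumptions rather than the paper's exact formulation.

```python
import math
from typing import Dict, Sequence, Set

Space = Dict[str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word: str, space: Space, k: int) -> Set[str]:
    """The k words closest to `word` in one viewpoint's space."""
    others = [w for w in space if w != word]
    others.sort(key=lambda w: cosine(space[word], space[w]), reverse=True)
    return set(others[:k])


def neighbor_shift(word: str, space_a: Space, space_b: Space, k: int = 2) -> float:
    """1 - Jaccard overlap of neighbor sets; 0 = stable, 1 = fully shifted."""
    na = nearest_neighbors(word, space_a, k)
    nb = nearest_neighbors(word, space_b, k)
    union = na | nb
    if not union:
        return 0.0
    return 1.0 - len(na & nb) / len(union)


# Toy two-dimensional spaces for two hypothetical viewpoints.
SPACE_A = {"bank": [1.0, 0.0], "money": [0.9, 0.1],
           "river": [0.0, 1.0], "loan": [0.8, 0.2]}
SPACE_B = {"bank": [0.0, 1.0], "money": [1.0, 0.0],
           "river": [0.1, 0.9], "loan": [0.9, 0.1]}
```

A neighbor-based score needs no alignment between the two spaces, which is why it complements the transformation-based measure rather than duplicating it.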

155. A Neural Candidate-Selector Architecture for Automatic Structured Clinical Text Annotation.

【Paper Link】【Pages】:1519-1528

【Authors】: Gaurav Singh ; Iain James Marshall ; James Thomas ; John Shawe-Taylor ; Byron C. Wallace

【Abstract】: We consider the task of automatically annotating free texts describing clinical trials with concepts from a controlled, structured medical vocabulary. Specifically, we aim to build a model to infer distinct sets of (ontological) concepts describing complementary clinically salient aspects of the underlying trials: the populations enrolled, the interventions administered and the outcomes measured, i.e., the PICO elements. This important practical problem poses a few key challenges. One issue is that the output space is vast, because the vocabulary comprises many unique concepts. Compounding this problem, annotated data in this domain is expensive to collect and hence sparse. Furthermore, the outputs (sets of concepts for each PICO element) are correlated: specific populations (e.g., diabetics) will render certain intervention concepts likely (insulin therapy) while effectively precluding others (radiation therapy). Such correlations should be exploited. We propose a novel neural model that addresses these challenges. We introduce a Candidate-Selector architecture in which the model considers sets of candidate concepts for PICO elements, and assesses their plausibility conditioned on the input text to be annotated. This relies on a 'candidate set' generator, which may be learned or relies on heuristics. A conditional discriminative neural model then jointly selects candidate concepts, given the input text. We compare the predictive performance of our approach to strong baselines, and show that it outperforms them. Finally, we perform a qualitative evaluation of the generated annotations by asking domain experts to assess their quality.

【Keywords】: biomedical informatics; deep learning; text mining

Session 8C: Adversarial IR 4

156. Sybil Defense in Crowdsourcing Platforms.

【Paper Link】【Pages】:1529-1538

【Authors】: Dong Yuan ; Guoliang Li ; Qi Li ; Yudian Zheng

【Abstract】: Crowdsourcing platforms have been widely deployed to solve many computer-hard problems, e.g., image recognition and entity resolution. Quality control is an important issue in crowdsourcing, which has been extensively addressed by existing quality-control algorithms, e.g., voting-based algorithms and probabilistic graphical models. However, these algorithms cannot ensure quality under sybil attacks, which leverage a large number of sybil accounts to generate results for dominating answers of normal workers. To address this problem, we propose a sybil defense framework for crowdsourcing, which can help crowdsourcing platforms to identify sybil workers and defend against the sybil attack. We develop a similarity function to quantify worker similarity. Based on worker similarity, we cluster workers into different groups such that we can utilize a small number of golden questions to accurately identify the sybil groups. We also devise online algorithms to instantly detect sybil workers to throttle the attacks. Our method also has the ability to detect multi-attackers in one task. To the best of our knowledge, this is the first framework for sybil defense in crowdsourcing. Experimental results on real-world datasets demonstrate that our method can effectively identify and throttle sybil workers.

【Keywords】: crowdsourcing; sybil defense

157. HoloScope: Topology-and-Spike Aware Fraud Detection.

【Paper Link】【Pages】:1539-1548

【Authors】: Shenghua Liu ; Bryan Hooi ; Christos Faloutsos

【Abstract】: As online fraudsters invest more resources, including purchasing large pools of fake user accounts and dedicated IPs, fraudulent attacks become less obvious and their detection becomes increasingly challenging. Existing approaches such as average degree maximization suffer from the bias of including more nodes than necessary, resulting in lower accuracy and increased need for manual verification. Hence, we propose HoloScope, which introduces a novel metric "contrast suspiciousness" integrating information from graph topology and spikes to more accurately detect fraudulent users and objects. Contrast suspiciousness dynamically emphasizes the contrast patterns between fraudsters and normal users, making HoloScope capable of distinguishing the synchronized and anomalous behaviors of fraudsters on topology, bursts and drops, and rating scores. In addition, we provide theoretical bounds for how much this increases the time cost needed for fraudsters to conduct adversarial attacks. Moreover, HoloScope has a concise framework and sub-quadratic time complexity, making the algorithm reproducible and scalable. Extensive experiments showed that HoloScope achieved significant accuracy improvements on synthetic and real data, compared with state-of-the-art fraud detection methods.

【Keywords】: burst and drop; fraud detection; graph mining; time series

158. Building a Dossier on the Cheap: Integrating Distributed Personal Data Resources Under Cost Constraints.

【Paper Link】【Pages】:1549-1558

【Authors】: Imrul Chowdhury Anindya ; Harichandan Roy ; Murat Kantarcioglu ; Bradley Malin

【Abstract】: A wide variety of personal data is routinely collected by numerous organizations that, in turn, share and sell their collections for analytic investigations (e.g., market research). To preserve privacy, certain identifiers are often redacted, perturbed or even removed. A substantial number of attacks have shown that, if care is not taken, such data can be linked to external resources to determine the explicit identifiers (e.g., personal names) or infer sensitive attributes (e.g., income) for the individuals from whom the data was collected. As such, organizations increasingly rely upon record linkage methods to assess the risk such attacks pose and adopt countermeasures accordingly. Traditional linkage methods assume only two datasets would be linked (e.g., linking de-identified hospital discharge to identified voter registration lists), but with the advent of a multi-billion dollar data broker industry, modern adversaries have access to a massive data stash of multiple datasets that can be leveraged. Still, realistic adversaries have budget constraints that prevent them from obtaining and integrating all relevant datasets. Thus, in this work, we investigate a novel privacy risk assessment framework, based on adversaries who plan an integration of datasets for the most accurate estimate of targeted sensitive attributes under a certain budget. To solve this problem, we introduce a graph-based formulation of the problem and predictive modeling methods to prioritize data resources for linkage. We perform an empirical analysis using real world voter registration data from two different U.S. states and show that the methods can be used efficiently to accurately estimate potentially sensitive information disclosure risks even under a non-trivial amount of noise.

【Keywords】: data brokers; data integration; data privacy; heuristic approach; identity disclosure; probabilistic model; record linkage

159. DeMalC: A Feature-rich Machine Learning Framework for Malicious Call Detection.

【Paper Link】【Pages】:1559-1567

【Authors】: Yuhong Li ; Dongmei Hou ; Aimin Pan ; Zhiguo Gong

【Abstract】: Malicious phone call is a plague, in which unscrupulous salesmen or criminals make to acquire money illegally from the victims. As a result, there has been broad interest in developing systems to make the end-users vigilant when receiving such phone calls. Typically, these systems justify the phone numbers either by the crowd-generated blacklist or exploiting the features via machine learning techniques. However, the former is frail due to the rare and lazy crowd, while the latter suffers from the scarcity of effective features. In this work, we propose a solution named DeMalC to address those problems by applying the machine learning algorithm on a novel set of discriminative features. These features consist of properties and behaviors that are powerful enough to characterize phone numbers from different perspectives. We extensively evaluated our solution, i.e., DeMalC, using massive call detail records. The experimental result shows the effectiveness of our extracted features. Capable of achieving 91.86% overall accuracy and 79.34% F1-score on the detection of malicious phone numbers, the DeMalC has been deployed online and demonstrated to be a competitive solution for detecting malicious calls.

【Keywords】: Antifraud APP; Data Mining for Social Security; Malicious Call Detection

Session 8D: Health Analytics 2/ Top-k 3

160. FA*IR: A Fair Top-k Ranking Algorithm.

【Paper Link】【Pages】:1569-1578

【Authors】: Meike Zehlike ; Francesco Bonchi ; Carlos Castillo ; Sara Hajian ; Mohamed Megahed ; Ricardo A. Baeza-Yates

【Abstract】: In this work, we define and solve the Fair Top-k Ranking problem, in which we want to determine a subset of k candidates from a large pool of n » k candidates, maximizing utility (i.e., select the "best" candidates) subject to group fairness criteria. Our ranked group fairness definition extends group fairness using the standard notion of protected groups and is based on ensuring that the proportion of protected candidates in every prefix of the top-k ranking remains statistically above or indistinguishable from a given minimum. Utility is operationalized in two ways: (i) every candidate included in the top-k should be more qualified than every candidate not included; and (ii) for every pair of candidates in the top-k, the more qualified candidate should be ranked above. An efficient algorithm is presented for producing the Fair Top-k Ranking, and tested experimentally on existing datasets as well as new datasets released with this paper, showing that our approach yields small distortions with respect to rankings that maximize utility without considering fairness criteria. To the best of our knowledge, this is the first algorithm grounded in statistical tests that can mitigate biases in the representation of an under-represented group along a ranked list.

【Keywords】: algorithmic fairness; bias in computer systems; ranking; top-k selection
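
The prefix condition in the abstract above (the share of protected candidates in every prefix must not fall statistically below a given minimum) can be illustrated with a small check. This is a minimal sketch, not the authors' algorithm: it assumes a one-sided binomial test at significance level `alpha` against a target proportion `p`, and it makes no correction for testing many prefixes. The function names are hypothetical.

```python
from math import comb


def binom_cdf(x, n, p):
    """Return P[X <= x] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x + 1))


def min_protected(prefix_len, p, alpha):
    """Smallest count m with P[X <= m] >= alpha for X ~ Binomial(prefix_len, p).

    Assumption: a prefix with fewer than m protected candidates is treated
    as statistically below the target proportion p.
    """
    m = 0
    while m < prefix_len and binom_cdf(m, prefix_len, p) < alpha:
        m += 1
    return m


def satisfies_ranked_group_fairness(is_protected, p, alpha):
    """Check every prefix of a ranking given as a list of booleans."""
    seen = 0
    for prefix_len, flag in enumerate(is_protected, start=1):
        seen += int(flag)
        if seen < min_protected(prefix_len, p, alpha):
            return False
    return True
```

With `p = 0.5` and `alpha = 0.1`, the first three positions need no protected candidate under this test, while a prefix of length four needs at least one.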

161. Capturing Feature-Level Irregularity in Disease Progression Modeling.

【Paper Link】【Pages】:1579-1588

【Authors】: Kaiping Zheng ; Wei Wang ; Jinyang Gao ; Kee Yuan Ngiam ; Beng Chin Ooi ; James Wei Luen Yip

【Abstract】: Disease progression modeling (DPM) analyzes patients' electronic medical records (EMR) to predict the health state of patients, which facilitates accurate prognosis, early detection and treatment of chronic diseases. However, EMR are irregular because patients visit hospital irregularly based on the need of treatment. For each visit, they are typically given different diagnoses, prescribed various medications and lab tests. Consequently, EMR exhibit irregularity at the feature level. To handle this issue, we propose a model based on the Gated Recurrent Unit by decaying the effect of previous records using fine-grained feature-level time span information, and learn the decaying parameters for different features to take into account their different behaviours like decaying speeds under irregularity. Extensive experimental results in both an Alzheimer's disease dataset and a chronic kidney disease dataset demonstrate that our proposed model of capturing feature-level irregularity can effectively improve the accuracy of DPM.

【Keywords】: data analytics; gated recurrent unit; healthcare; time series

162. Health Forum Thread Recommendation Using an Interest Aware Topic Model.

【Paper Link】【Pages】:1589-1598

【Authors】: Kishaloy Halder ; Min-Yen Kan ; Kazunari Sugiyama

【Abstract】: We introduce a general, interest-aware topic model (IATM), in which known higher-level interests on topics expressed by each user can be modeled. We then specialize the IATM for use in consumer health forum thread recommendation by equating each user's self-reported medical conditions as interests and topics as symptoms of treatments for recommendation. The IATM additionally models the implicit interests embodied by users' textual descriptions in their profiles. To further enhance the personalized nature of the recommendations, we introduce jointly normalized collaborative topic regression (JNCTR) which captures how users interact with the various symptoms belonging to the same clinical condition. In our experiments on two real-world consumer health forums, our proposed model significantly outperforms competitive state-of-the-art baselines by over 10% in recall. Importantly, we show that our IATM+JNCTR pipeline also imbues the recommendation process with added transparency, allowing a recommendation system to justify its recommendation with respect to each user's interest in certain health conditions.

【Keywords】: collaborative filtering; graphical model; recommender systems; topic models

163. HotSpots: Failure Cascades on Heterogeneous Critical Infrastructure Networks.

【Paper Link】【Pages】:1599-1607

【Authors】: Liangzhe Chen ; Xinfeng Xu ; Sangkeun Lee ; Sisi Duan ; Alfonso G. Tarditi ; Supriya Chinthavali ; B. Aditya Prakash

【Abstract】: Critical Infrastructure Systems such as transportation, water and power grid systems are vital to our national security, economy, and public safety. Recent events, like the 2012 hurricane Sandy, show how the interdependencies among different CI networks lead to catastrophic failures among the whole system. Hence, analyzing these CI networks, and modeling failure cascades on them becomes a very important problem. However, traditional models either do not take multiple CIs or the dynamics of the system into account, or model it simplistically. In this paper, we study this problem using a heterogeneous network viewpoint. We first construct heterogeneous CI networks with multiple components using national-level datasets. Then we study novel failure maximization problems on these networks, to compute critical nodes in such systems. We then provide HotSpots, a scalable and effective algorithm for these problems, based on careful transformations. Finally, we conduct extensive experiments on real CIS data from multiple US states, and show that our method HotSpots outperforms non-trivial baselines, gives meaningful results and that our approach gives immediate benefits in providing situational-awareness during large-scale failures.

【Keywords】: critical infrastructure systems; failure cascade modeling; failure maximization; heterogeneous graph

164. SOPER: Discovering the Influence of Fashion and the Many Faces of User from Session Logs using Stick Breaking Process.

【Paper Link】【Pages】:1609-1618

【Authors】: Lucky Dhakad ; Mrinal Kanti Das ; Chiranjib Bhattacharyya ; Samik Datta ; Mihir Kale ; Vivek Mehta

【Abstract】: Recommending lifestyle articles is of immediate interest to the e-commerce industry and is beginning to attract research attention. Often-followed strategies, such as recommending popular items, are inadequate for this vertical for two reasons. Firstly, users have their own personal preference over items, referred to as personal styles, which lead to the long-tail phenomenon. Secondly, each user displays multiple personas, and each persona has a preference over items which could be dictated by a particular occasion, e.g., dressing for a party would be different from dressing to go to the office. Recommendation in this vertical is crucially dependent on discovering styles for each of the multiple personas. There is no literature which addresses this problem. We posit a generative model which describes each user by a Simplex Over PERsona, SOPER, where a persona is described as the individual's preferences over prevailing styles modelled as topics over items. The choice of simplex and the long-tail nature necessitate the use of a stick-breaking process. The main technical contribution is an efficient collapsed Gibbs sampling based algorithm for solving the attendant inference problem. Trained on large-scale interaction logs spanning more than half-a-million sessions collected from an e-commerce portal, SOPER outperforms previous baselines such as [9] by a large margin of 35% in identifying persona. Consequently, it outperforms several competitive baselines comprehensively on the task of recommending from a catalogue of roughly 150 thousand lifestyle articles, improving the recommendation quality as measured by AUC by a staggering 12.23%, in addition to aiding the interpretability of uncovered personal and fashionable styles, thus advancing our precise understanding of the underlying phenomena.

【Keywords】: bayesian nonparametrics; fashion; lifestyle; stick-breaking process; topic models
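
For readers unfamiliar with the stick-breaking process named in the abstract above, the sketch below draws a truncated set of mixture weights using the standard construction with Beta(1, alpha) fractions. The abstract does not say which variant SOPER uses, so the concentration parameter `alpha`, the truncation level, and the function name are illustrative assumptions.

```python
import random


def stick_breaking_weights(alpha, num_sticks, rng):
    """Draw truncated stick-breaking weights.

    Each step breaks off a Beta(1, alpha) fraction of the stick that is left.
    Returns the list of weights and the unassigned remainder of the stick.
    """
    remaining = 1.0
    weights = []
    for _ in range(num_sticks):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


weights, leftover = stick_breaking_weights(alpha=2.0, num_sticks=20, rng=random.Random(0))
```

The weights and the leftover mass always sum to one, and later sticks tend to be much smaller than earlier ones, which is what makes the construction a natural fit for long-tailed preferences.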

【Paper Link】【Pages】:1619-1628

【Authors】: Xin Zheng ; Aixin Sun ; Sibo Wang ; Jialong Han

【Abstract】: Twitter provides us a convenient channel to get access to the immediate information about major events. However, it is challenging to acquire a clean and complete set of event-related data due to the characteristics of tweets, e.g., being short and noisy. In this paper, we propose a semi-supervised method to obtain high quality event-related tweets from the Twitter stream, in terms of precision and recall. Specifically, candidate event-related tweets are selected based on a set of keywords. We propose to generate and update these keywords dynamically along the event development. To be included in this keyword set, words are evaluated based on single word properties, properties based on co-occurring words, and changes of word importance over time. Our solution is capable of capturing keywords of emerging aspects or aspects with increasing importance as the event evolves. By leveraging keyword importance information and a few labeled tweets, we propose a semi-supervised expectation maximization process to identify event-related tweets. This process significantly reduces human effort in acquiring high quality tweets. Experiments on three real world datasets show that our solution outperforms state-of-the-art approaches by up to 10% in F1 measure.

【Keywords】: dynamic keyword generation; event-related tweet identification

166. Distant Meta-Path Similarities for Text-Based Heterogeneous Information Networks.

【Paper Link】【Pages】:1629-1638

【Authors】: Chenguang Wang ; Yangqiu Song ; Haoran Li ; Yizhou Sun ; Ming Zhang ; Jiawei Han

【Abstract】: Measuring network similarity is a fundamental data mining problem. The mainstream similarity measures mainly leverage the structural information regarding the entities in the network without considering the network semantics. In the real world, heterogeneous information networks (HINs) with rich semantics are ubiquitous. However, existing network similarity measures do not generalize well to HINs because they fail to capture the HIN semantics. The meta-path has been proposed and demonstrated as an appropriate way to represent semantics in HINs. Therefore, original meta-path based similarities (e.g., PathSim and KnowSim) have been successful in computing the entity proximity in HINs. The intuition is that the more instances of meta-path(s) between entities, the more similar the entities are. Thus the original meta-path similarity only applies to computing the proximity of two neighboring (connected) entities. In this paper, we propose the distant meta-path similarity that is able to capture HIN semantics between two distant (isolated) entities to provide more meaningful entity proximity. The main idea is that even if two entities share no neighborhood entities (i.e., no meta-path instances connect them), the more similar their neighborhood entities are, the more similar the two entities should be. We then find the optimum distant meta-path similarity by exploring the similarity hypothesis space based on different theoretical foundations. We show the state-of-the-art similarity performance of distant meta-path similarity on two text-based HINs and make the datasets publicly available.

【Keywords】: distant similarity; heterogeneous information network; knowledge graph; text similarity
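
The limitation described in the abstract above, that original meta-path similarities only score connected entities, can be seen in a small PathSim computation. The sketch uses PathSim as it is commonly defined for a symmetric meta-path (twice the number of path instances between x and y, divided by the sum of their self-path counts). The toy author-paper matrix and the function names are hypothetical, and the code does not implement the paper's distant similarity.

```python
def commuting_matrix(adjacency):
    """Count meta-path instances for a symmetric path such as Author-Paper-Author.

    adjacency[i][k] is 1 if entity i is linked to intermediate node k.
    Entry [i][j] of the result is the number of path instances between i and j.
    """
    n = len(adjacency)
    return [
        [sum(a * b for a, b in zip(adjacency[i], adjacency[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(counts, x, y):
    """PathSim score from a commuting matrix; 0.0 if neither entity has any path."""
    denominator = counts[x][x] + counts[y][y]
    return 2.0 * counts[x][y] / denominator if denominator else 0.0


# Toy data: authors 0 and 1 co-wrote papers 0 and 1; author 2 wrote paper 2 alone.
author_paper = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
counts = commuting_matrix(author_paper)
```

Here authors 0 and 1 get a score of 1.0, while author 2 scores 0.0 against both because no path instance connects them, which is the isolated-entity case the abstract targets.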

Session 8F: Feature/Entity Selection 4

167. Unsupervised Feature Selection with Joint Clustering Analysis.

【Paper Link】【Pages】:1639-1648

【Authors】: Shuai An ; Jun Wang ; Jinmao Wei ; Zhenglu Yang

【Abstract】: Unsupervised feature selection has attracted considerable interest in the past decade, due to its remarkable performance in reducing dimensionality without any prior class information. Preserving reliable locality information and achieving excellent cluster separation are two critical issues for unsupervised feature selection. However, existing methods cannot tackle both issues simultaneously. To address these problems, we propose a novel unsupervised approach that integrates sparse feature selection and robust joint clustering analysis. The joint clustering analysis seamlessly unifies the spectral clustering and the orthogonal basis clustering. Specifically, a probabilistic neighborhood graph is utilized to preserve reliable locality information in the spectral clustering, and an orthogonal basis matrix is incorporated to achieve excellent cluster separation in the orthogonal basis clustering. A compact and effective iterative algorithm is designed to optimize the proposed selection framework. Extensive experiments on both synthetic data and real-world data validate the effectiveness of our approach under various evaluation indices.

【Keywords】: cluster separation; joint clustering analysis; locality preserving; unsupervised feature selection

168. Multi-Label Feature Selection using Correlation Information.

【Paper Link】【Pages】:1649-1656

【Authors】: Ali Braytee ; Wei Liu ; Daniel R. Catchpoole ; Paul J. Kennedy

【Abstract】: High-dimensional multi-labeled data contain instances, where each instance is associated with a set of class labels and has a large number of noisy and irrelevant features. Feature selection has been shown to have great benefits in improving the classification performance in machine learning. In multi-label learning, to select the discriminative features among multiple labels, several challenges should be considered: interdependent labels, different instances may share different label correlations, correlated features, and missing and flawed labels. This work is part of a project at The Children's Hospital at Westmead (TB-CHW), Australia to explore the genomics of childhood leukaemia. In this paper, we propose CMFS (a Correlated- and Multi-label Feature Selection method), based on non-negative matrix factorization (NMF), for simultaneously performing feature selection and addressing the aforementioned challenges. A major advantage of our research is that it exploits the correlation information contained in features, labels and instances to select the relevant features among multiple labels. Furthermore, $\ell_{2,1}$-norm regularization is incorporated in the objective function to undertake feature selection by imposing sparsity on the feature matrix rows. We employ CMFS to decompose the data and multi-label matrices into a low-dimensional space. To solve the objective function, an efficient iterative optimization algorithm is proposed with guaranteed convergence. Finally, extensive experiments are conducted on high-dimensional multi-labeled datasets. The experimental results demonstrate that our method significantly outperforms state-of-the-art multi-label feature selection methods.

【Keywords】: high dimensional data; multi-label classification; multi-label feature selection; new application
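
The row-sparsity regularizer mentioned in the abstract above is the l2,1 norm: the sum, over the rows of a matrix, of each row's Euclidean norm. The short sketch below only computes that quantity; it is not the CMFS objective, and the function name is a hypothetical choice.

```python
from math import sqrt


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows of a matrix (list of lists)."""
    return sum(sqrt(sum(value * value for value in row)) for row in matrix)


# Rows with norms 5, 0 and 10; the all-zero row contributes nothing.
example = [
    [3.0, 4.0],
    [0.0, 0.0],
    [6.0, 8.0],
]
total = l21_norm(example)
```

Because the penalty is paid per row rather than per entry, minimizing it tends to drive entire rows to zero, which is why it is used to discard whole features.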

169. Content Recommendation by Noise Contrastive Transfer Learning of Feature Representation.

【Paper Link】【Pages】:1657-1665

【Authors】: Yiyang Li ; Guanyu Tao ; Weinan Zhang ; Yong Yu ; Jun Wang

【Abstract】: Personalized recommendation has been proved effective as a content discovery tool for many online news publishers. As fresh news articles are frequently coming to the system while the old ones are fading away quickly, building a consistent and coherent feature representation over the ever-changing articles pool is fundamental to the performance of the recommendation. However, learning a good feature representation is challenging, especially for some small publishers that normally have fewer than 10,000 articles each year. In this paper, we consider transferring knowledge from a larger text corpus. In our proposed solution, an effective article recommendation engine can be established with a small number of target publisher articles by transferring knowledge from a large corpus of text with a different distribution. Specifically, we leverage noise contrastive estimation techniques to learn the word conditional distribution given the context words, where the noise conditional distribution is pre-trained from the large corpus. Our solution has been deployed in a commercial recommendation service. The large-scale online A/B testing on two commercial publishers demonstrates up to 9.97% relative overall performance gain of our proposed model on the recommendation click-through rate metric over the non-transfer learning baselines.

【Keywords】: article recommendation; noise contrastive estimation; text representation; transfer learning; word2vec

170. NeuPL: Attention-based Semantic Matching and Pair-Linking for Entity Disambiguation.

【Paper Link】【Pages】:1667-1676

【Authors】: Minh C. Phan ; Aixin Sun ; Yi Tay ; Jialong Han ; Chenliang Li

【Abstract】: Entity disambiguation, also known as entity linking, is the task of mapping mentions in text to the corresponding entities in a given knowledge base, e.g., Wikipedia. Two key challenges are making use of a mention's context to disambiguate (i.e., the local objective), and promoting coherence of all the linked entities (i.e., the global objective). In this paper, we propose a deep neural network model to effectively measure the semantic matching between a mention's context and a target entity. We are the first to employ the long short-term memory (LSTM) and attention mechanism for entity disambiguation. We also propose Pair-Linking, a simple but effective and significantly fast linking algorithm. Pair-Linking iteratively identifies and resolves pairs of mentions, starting from the most confident pair. It finishes linking all mentions in a document by scanning the pairs of mentions at most once. Our neural network model combined with Pair-Linking, named NeuPL, outperforms state-of-the-art systems over different types of documents including news, RSS, and tweets.

【Keywords】: entity disambiguation; pair-linking; semantic matching
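
The pair-at-a-time idea in the abstract above (resolve the most confident pair of mentions first, then continue down the list) can be sketched as a single greedy pass. This is a simplification under stated assumptions, not the authors' implementation: confidence scores for candidate pair assignments are taken as given, and a pair is skipped when it contradicts an entity already fixed for one of its mentions. All names and the toy data are hypothetical.

```python
def pair_linking(num_mentions, scored_pairs):
    """Greedy single pass over candidate pair assignments.

    scored_pairs holds tuples of the form
    (confidence, (mention_a, entity_a), (mention_b, entity_b)).
    Returns a dict mapping mention index to the chosen entity.
    """
    linked = {}
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    for _, (mention_a, entity_a), (mention_b, entity_b) in ranked:
        # Skip pairs that disagree with an assignment made earlier.
        if linked.get(mention_a, entity_a) != entity_a:
            continue
        if linked.get(mention_b, entity_b) != entity_b:
            continue
        linked[mention_a] = entity_a
        linked[mention_b] = entity_b
        if len(linked) == num_mentions:
            break
    return linked


candidates = [
    (0.9, (0, "Paris_France"), (1, "France")),
    (0.8, (0, "Paris_Hilton"), (2, "Hilton_Hotels")),
    (0.7, (1, "France"), (2, "Eiffel_Tower")),
]
result = pair_linking(3, candidates)
```

In the toy run the second candidate is skipped because mention 0 was already resolved by the more confident first pair, so mention 2 is linked through the third candidate instead.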

Session 8G: Graph Mining 2 4

171. Relaxing Graph Pattern Matching With Explanations.

【Paper Link】【Pages】:1677-1686

【Authors】: Jia Li ; Yang Cao ; Shuai Ma

【Abstract】: Traditional graph pattern matching is based on subgraph isomorphism, which is often too restrictive to identify meaningful matches. To handle this, taxonomy subgraph isomorphism has been proposed to relax the label constraints in the matching. Nonetheless, there are many cases that cannot be covered. In this study, we first formalize taxonomy simulation, a natural matching semantics combining graph simulation with taxonomy, and propose its pattern relaxation to enrich graph pattern matching results with taxonomy information. We also design topological ranking and diversified topological ranking for top-k relaxations. We then study the top-k pattern relaxation problems, by providing their static analyses, and developing algorithms and optimization for finding and evaluating top-k pattern relaxations. We further propose a notion of explanations for answers to the relaxations and develop algorithms to compute explanations. These together give us a framework for enriching the results of graph pattern matching. Using real-life datasets, we experimentally verify that our framework and techniques are effective and efficient for identifying meaningful matches in practice.

【Keywords】: pattern relaxation; query result explanation; taxonomy simulation

172. Active Network Alignment: A Matching-Based Approach.

【Paper Link】【Pages】:1687-1696

【Authors】: Eric Malmi ; Aristides Gionis ; Evimaria Terzi

【Abstract】: Network alignment is the problem of matching the nodes of two graphs, maximizing the similarity of the matched nodes and the edges between them. This problem is encountered in a wide array of applications---from biological networks to social networks to ontologies---where multiple networked data sources need to be integrated. Due to the difficulty of the task, an accurate alignment can rarely be found without human assistance. Thus, it is of great practical importance to develop network alignment algorithms that can optimally leverage experts who are able to provide the correct alignment for a small number of nodes. Yet, only a handful of existing works address this active network alignment setting. The majority of the existing active methods focus on absolute queries ("are nodes a and b the same or not?"), whereas we argue that it is generally easier for a human expert to answer relative queries ("which node in the set b1,...,bn is the most similar to node a?"). This paper introduces two novel relative-query strategies, TopMatchings and GibbsMatchings, which can be applied on top of any network alignment method that constructs and solves a bipartite matching problem. Our methods identify the most informative nodes to query by sampling the matchings of the bipartite graph associated to the network-alignment instance. We compare the proposed approaches to several commonly-used query strategies and perform experiments on both synthetic and real-world datasets. Our sampling-based strategies yield the highest overall performance, outperforming all the baseline methods by more than 15 percentage points in some cases. In terms of accuracy, TopMatchings and GibbsMatchings perform comparably. However, GibbsMatchings is significantly more scalable, but it also requires hyperparameter tuning for a temperature parameter.

【Keywords】: active learning; graph matching; network alignment

173. Discovering Graph Temporal Association Rules.

【Paper Link】【Pages】:1697-1706

【Authors】: Mohammad Hossein Namaki ; Yinghui Wu ; Qi Song ; Peng Lin ; Tingjian Ge

【Abstract】: Detecting regularities between complex events in temporal graphs is critical for emerging applications. This paper proposes graph temporal association rules (GTARs). A GTAR extends traditional association rules to discover temporal associations for complex events captured by a class of temporal pattern queries. We introduce notions of support and confidence for GTARs and formalize the discovery problem for GTARs. We show that despite the enhanced expressive power, GTAR discovery is feasible over large temporal graphs. We develop an effective rule discovery algorithm, which integrates event mining and rule discovery as a single process, and reduces the redundant computation by leveraging their interaction. Using real-life and synthetic data, we experimentally verify the effectiveness and scalability of the algorithms. Our case study also verifies that GTARs demonstrate highly interpretable associations in real-world networks.

【Keywords】: approximate pattern matching; large temporal graph; temporal association rules

174. Minimizing Tension in Teams.

【Paper Link】【Pages】:1707-1715

【Authors】: Behzad Golshan ; Evimaria Terzi

【Abstract】: In large organizations (e.g., companies, universities, etc.) individual experts with different work habits are asked to work together in order to complete projects or tasks. Oftentimes, the differences in the inherent work habits of these experts cause tension among them, which can prove detrimental for the organization's performance and functioning. The question we consider in this paper is the following: "can this tension be reduced by providing incentives to individuals to change their work habits?" We formalize this question in the definition of the k-AlterHabit problem. To the best of our knowledge we are the first to define this problem and analyze its properties. Although we show that k-AlterHabit is NP-hard, we devise polynomial-time algorithms for solving it in practice. Our algorithms are based on interesting connections that we draw between our problem and other combinatorial problems. Our experimental results demonstrate both the efficiency and the efficacy of our algorithmic techniques on a collection of real data.

【Keywords】: algorithms; collaboration networks; experimentation; teams; theory

Session 9A: Queries 4

175. Interactive Spatial Keyword Querying with Semantics.

【Paper Link】【Pages】:1727-1736

【Authors】: Jiabao Sun ; Jiajie Xu ; Kai Zheng ; Chengfei Liu

【Abstract】: Conventional spatial keyword queries have difficulty returning desired objects that are synonyms of, but morphologically different from, the query keywords. To overcome this flaw, this paper investigates interactive spatial keyword querying with semantics. It aims to enhance the conventional queries by not only making sense of the query keywords, but also refining the understanding of query semantics through interactions. On top of the probabilistic topic model, a novel interactive strategy is proposed to precisely infer the latent query semantics by learning from user feedback. In each interaction, the returned objects are carefully selected to ensure effective inference of user intended query semantics. Query processing is carried out on a small candidate object set at each round of interaction, and the whole querying process terminates when the latent query semantics learned from user feedback becomes explicit enough. The experimental results on a real check-in dataset demonstrate that the quality of results is significantly improved through a limited number of interactions.

【Keywords】: interactive query; spatial database; spatial keyword query

176. From Query-By-Keyword to Query-By-Example: LinkedIn Talent Search Approach.

【Paper Link】【Pages】:1737-1745

【Authors】: Viet Ha-Thuc ; Yan Yan ; Xianren Wu ; Vijay Dialani ; Abhishek Gupta ; Shakti Sinha

【Abstract】: One key challenge in talent search is to translate complex criteria of a hiring position into a search query, while it is relatively easy for a searcher to list examples of suitable candidates for a given position. To improve search efficiency, we propose the next generation of talent search at LinkedIn, also referred to as Search By Ideal Candidates. In this system, a searcher provides one or several ideal candidates as the input to hire for a given position. The system then generates a query based on the ideal candidates and uses it to retrieve and rank results. Shifting from the traditional Query-By-Keyword to this new Query-By-Example system poses a number of challenges: How to generate a query that best describes the candidates? When moving to a completely different paradigm, how does one leverage previous product logs to learn ranking models and/or evaluate the new system with no existing usage logs? Finally, given the different nature between the two search paradigms, the ranking features typically used for Query-By-Keyword systems might not be optimal for Query-By-Example. This paper describes our approach to solving these challenges. We present experimental results confirming the effectiveness of the proposed solution, particularly on query building and search ranking tasks. As of writing this paper, the new system has been available to all LinkedIn members.

【Keywords】: learning to rank; personalization; query-by-example

177. Learning to Attend, Copy, and Generate for Session-Based Query Suggestion.

【Paper Link】【Pages】:1747-1756

【Authors】: Mostafa Dehghani ; Sascha Rothe ; Enrique Alfonseca ; Pascal Fleury

【Abstract】: Users try to articulate their complex information needs during search sessions by reformulating their queries. To make this process more effective, search engines provide related queries to help users in specifying the information need in their search process. In this paper, we propose a customized sequence-to-sequence model for session-based query suggestion. In our model, we employ a query-aware attention mechanism to capture the structure of the session context. This enables us to control the scope of the session from which we infer the suggested next query, which helps not only handle the noisy data but also automatically detect session boundaries. Furthermore, we observe that, based on the user query reformulation behavior, within a single session a large portion of query terms is retained from the previously submitted queries and consists of mostly infrequent or unseen terms that are usually not included in the vocabulary. We therefore empower the decoder of our model to access the source words from the session context during decoding by incorporating a copy mechanism. Moreover, we propose evaluation metrics to assess the quality of the generative models for query suggestion. We conduct an extensive set of experiments and analysis. The results suggest that our model outperforms the baselines both in terms of the generating queries and scoring candidate queries for the task of query suggestion.

【Keywords】: copy mechanism; query suggestion; query-aware attention; sequence to sequence model

178. Deep Context Modeling for Web Query Entity Disambiguation.

【Paper Link】【Pages】:1757-1765

【Authors】: Zhen Liao ; Xinying Song ; Yelong Shen ; Saekoo Lee ; Jianfeng Gao ; Ciya Liao

【Abstract】: In this paper, we present a new study of Web query entity disambiguation (QED), which is the task of disambiguating different candidate entities in a knowledge base given their mentions in a query. QED is particularly challenging because queries are often too short to provide the rich contextual information that is required by traditional entity disambiguation methods. We propose several methods to tackle the problem of QED. First, we explore the use of a deep neural network (DNN) for capturing the character level textual information in queries. Our DNN approach maps queries and their candidate reference entities to feature vectors in a latent semantic space where the distance between a query and its correct reference entity is minimized. Second, we utilize the Web search result information of queries to help generate large amounts of weakly supervised training data for the DNN model. Third, we propose a two-stage training method to combine large-scale weakly supervised data with a small amount of human labeled data, which can significantly boost the performance of a DNN model. The effectiveness of our approach is demonstrated in the experiments using large-scale real-world datasets.

【Keywords】: CLSM; Query Entity Disambiguation; Two-Stage Training

Session 9B: Representation learning 4

179. An Attention-based Collaboration Framework for Multi-View Network Representation Learning.

【Paper Link】【Pages】:1767-1776

【Authors】: Meng Qu ; Jian Tang ; Jingbo Shang ; Xiang Ren ; Ming Zhang ; Jiawei Han

【Abstract】: Learning distributed node representations in networks has been attracting increasing attention recently due to its effectiveness in a variety of applications. Existing approaches usually study networks with a single type of proximity between nodes, which defines a single view of a network. However, in reality there usually exists multiple types of proximities between nodes, yielding networks with multiple views. This paper studies learning node representations for networks with multiple views, which aims to infer robust node representations across different views. We propose a multi-view representation learning approach, which promotes the collaboration of different views and lets them vote for the robust representations. During the voting process, an attention mechanism is introduced, which enables each node to focus on the most informative views. Experimental results on real-world networks show that the proposed approach outperforms existing state-of-the-art approaches for network representation learning with a single view and other competitive approaches with multiple views.

【Keywords】:

180. Representation Learning of Large-Scale Knowledge Graphs via Entity Feature Combinations.

【Paper Link】【Pages】:1777-1786

【Authors】: Zhen Tan ; Xiang Zhao ; Wei Wang

【Abstract】: Knowledge graphs are typical large-scale multi-relational structures, which comprise a large amount of fact triplets. Nonetheless, existing knowledge graphs are still sparse and far from being complete. To refine the knowledge graphs, representation learning is widely used to embed fact triplets into low-dimensional spaces. Many existing knowledge graph embedding models either focus on learning rich features from entities but fail to extract good features of relations, or employ sophisticated models that have rather high time and memory-space complexities. In this paper, we propose a novel knowledge graph embedding model, CombinE. It exploits entity features from two complementary perspectives via the plus and minus combinations. We start with the plus combination, where we use shared features of entity pairs participating in a relation to convey its relation features. To also allow differences of each pairs of entities participating in a relation, we also use the minus combination, where we concentrate on individual entity features, and regard relations as a channel to offset the divergence and preserve the prominence between head and tail entities. Compared with the state-of-the-art models, our experimental results demonstrate that CombinE outperforms existing ones and has low time and memory-space complexities.

【Keywords】: feature combinations; knowledge graphs; link prediction; representation learning

181. Learning Edge Representations via Low-Rank Asymmetric Projections.

【Paper Link】【Pages】:1787-1796

【Authors】: Sami Abu-El-Haija ; Bryan Perozzi ; Rami Al-Rfou'

【Abstract】: We propose a new method for embedding graphs while preserving directed edge information. Learning such continuous-space vector representations (or embeddings) of nodes in a graph is an important first step for using network information (from social networks, user-item graphs, knowledge bases, etc.) in many machine learning tasks. Unlike previous work, we (1) explicitly model an edge as a function of node embeddings, and we (2) propose a novel objective, the graph likelihood, which contrasts information from sampled random walks with non-existent edges. Individually, both of these contributions improve the learned representations, especially when there are memory constraints on the total size of the embeddings. When combined, our contributions enable us to significantly improve the state-of-the-art by learning more concise representations that better preserve the graph structure. We evaluate our method on a variety of link-prediction tasks including social networks, collaboration networks, and protein interactions, showing that our proposed method learns representations with error reductions of up to 76% and 55% on directed and undirected graphs, respectively. In addition, we show that the representations learned by our method are quite space efficient, producing embeddings which have higher structure-preserving accuracy but are 10 times smaller.

【Keywords】: edge learning; embedding; graph; link prediction; random walk; representation learning

182. HIN2Vec: Explore Meta-paths in Heterogeneous Information Networks for Representation Learning.

【Paper Link】【Pages】:1797-1806

【Authors】: Tao-Yang Fu ; Wang-Chien Lee ; Zhen Lei

【Abstract】: In this paper, we propose a novel representation learning framework, namely HIN2Vec, for heterogeneous information networks (HINs). The core of the proposed framework is a neural network model, also called HIN2Vec, designed to capture the rich semantics embedded in HINs by exploiting different types of relationships among nodes. Given a set of relationships specified in forms of meta-paths in an HIN, HIN2Vec carries out multiple prediction training tasks jointly based on a target set of relationships to learn latent vectors of nodes and meta-paths in the HIN. In addition to model design, several issues unique to HIN2Vec, including regularization of meta-path vectors, node type selection in negative sampling, and cycles in random walks, are examined. To validate our ideas, we learn latent vectors of nodes using four large-scale real HIN datasets, including Blogcatalog, Yelp, DBLP and U.S. Patents, and use them as features for multi-label node classification and link prediction applications on those networks. Empirical results show that HIN2Vec soundly outperforms the state-of-the-art representation learning models for network data, including DeepWalk, LINE, node2vec, PTE, HINE and ESim, by 6.6% to 23.8% of micro-F1 in multi-label node classification and 5% to 70.8% of MAP in link prediction.

【Keywords】: heterogeneous information network; representation learning

Session 9C: Graph Mining 3 4

183. Core Decomposition and Densest Subgraph in Multilayer Networks.

【Paper Link】【Pages】:1807-1816

【Authors】: Edoardo Galimberti ; Francesco Bonchi ; Francesco Gullo

【Abstract】: Multilayer networks are a powerful paradigm to model complex systems, where various relations might occur among the same set of entities. Despite the keen interest in a variety of problems, algorithms, and analysis methods in this type of network, the problem of extracting dense subgraphs has remained largely unexplored. As a first step in this direction, in this work we study the problem of core decomposition of a multilayer network. Unlike the single-layer counterpart in which cores are all nested into one another and can be computed in linear time, the multilayer context is much more challenging as no total order exists among multilayer cores; rather, they form a lattice whose size is exponential in the number of layers. In this setting we devise three algorithms which differ in the way they visit the core lattice and in their pruning techniques. We assess time and space efficiency of the three algorithms on a large variety of real-world multilayer networks. We then move a step forward and showcase an application of the multilayer core-decomposition tool to the problem of densest-subgraph extraction from multilayer networks. We introduce a definition of multilayer densest subgraph that trades-off between high density and number of layers in which the high density holds, and show how multilayer core decomposition can be exploited to approximate this problem with quality guarantees.

【Keywords】: core decomposition; densest subgraph; graph mining; multilayer networks
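
As a point of reference for the single-layer case that the abstract contrasts against, the following is a minimal sketch of classic k-core peeling on one undirected graph. It is not the paper's multilayer algorithm; the function name `core_numbers` and the edge-list input format are assumptions made for illustration. For brevity it picks the minimum-degree vertex with a linear scan, so it runs in quadratic time; a bucket queue is the usual way to reach linear time.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} for an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the vertex with the smallest degree among the remaining ones.
        u = min(remaining, key=degree.get)
        # The core number never decreases along the peeling order.
        k = max(k, degree[u])
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return core


# Triangle a-b-c with a pendant vertex d attached to c.
example = core_numbers([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

In this example the triangle vertices end up in the 2-core and the pendant vertex only in the 1-core, which illustrates the nesting of single-layer cores that the abstract says is lost in the multilayer setting.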

184. Fully Dynamic Algorithm for Top-k Densest Subgraphs.

【Paper Link】【Pages】:1817-1826

【Authors】: Muhammad Anis Uddin Nasir ; Aristides Gionis ; Gianmarco De Francisci Morales ; Sarunas Girdzijauskas

【Abstract】: Given a large graph, the densest-subgraph problem asks to find a subgraph with maximum average degree. When considering the top-k version of this problem, a naïve solution is to iteratively find the densest subgraph and remove it in each iteration. However, such a solution is impractical due to high processing cost. The problem is further complicated when dealing with dynamic graphs, since adding or removing an edge requires re-running the algorithm. In this paper, we study the top-k densest-subgraph problem in the sliding-window model and propose an efficient fully-dynamic algorithm. The input of our algorithm consists of an edge stream, and the goal is to find the node-disjoint subgraphs that maximize the sum of their densities. In contrast to existing state-of-the-art solutions that require iterating over the entire graph upon any update, our algorithm profits from the observation that updates only affect a limited region of the graph. Therefore, the top-k densest subgraphs are maintained by only applying local updates. We provide a theoretical analysis of the proposed algorithm and show empirically that the algorithm often generates denser subgraphs than state-of-the-art competitors. Experiments show an improvement in efficiency of up to five orders of magnitude compared to state-of-the-art solutions.

【Keywords】: community detection; fully-dynamic graph algorithm; sliding window; top-k densest subgraph
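
The naïve static baseline mentioned in the abstract (find a dense subgraph, remove it, repeat) can be sketched as follows. This is not the paper's fully-dynamic algorithm. The sketch swaps in greedy peeling, a well-known 1/2-approximation for the densest subgraph, in place of an exact solver, and the function names and edge-list input are assumptions made for illustration.

```python
from collections import defaultdict


def densest_subgraph_peeling(edges):
    """Greedy peeling: return (node set, average degree) of the densest prefix seen."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    num_edges = sum(degree.values()) // 2
    best_nodes, best_density = set(), 0.0
    while remaining:
        density = 2 * num_edges / len(remaining)  # average degree
        if density > best_density:
            best_nodes, best_density = set(remaining), density
        # Remove a minimum-degree vertex and the edges it still has.
        u = min(remaining, key=degree.get)
        remaining.remove(u)
        num_edges -= degree[u]
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return best_nodes, best_density


def naive_top_k(edges, k):
    """Repeat: find a dense subgraph, record it, delete its nodes from the graph."""
    edges = list(edges)
    found = []
    for _ in range(k):
        nodes, density = densest_subgraph_peeling(edges)
        if not nodes:
            break
        found.append((nodes, density))
        edges = [(u, v) for u, v in edges if u not in nodes and v not in nodes]
    return found


# A 4-clique on {1, 2, 3, 4}, a triangle on {5, 6, 7}, and a bridge 4-5.
clique = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
triangle = [(5, 6), (6, 7), (5, 7)]
top2 = naive_top_k(clique + triangle + [(4, 5)], k=2)
```

Each round rescans the whole remaining graph, which is the cost the abstract describes as impractical, and any edge insertion or deletion would force the whole procedure to run again.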

185. Minimizing Dependence between Graphs.

【Paper Link】【Pages】:1827-1836

【Authors】: Yu Rong ; Hong Cheng

【Abstract】: In recent years, modeling the relation between two graphs has received unprecedented attention from researchers due to its wide applications in many areas, such as social analysis and bioinformatics. The nature of relations between two graphs can be divided into two categories: the vertex relation and the link relation. Many studies focus on modeling the vertex relation between graphs and try to find the vertex correspondence between two graphs. However, the link relation between graphs has not been fully studied. Specifically, we model the cross-graph link relation as cross-graph dependence, which reflects the dependence of a vertex in one graph on a vertex in the other graph. A generic problem, called Graph Dependence Minimization (GDM), is defined as: given two graphs with cross-graph dependence, how to select a subset of vertexes from one graph and copy them to the other, so as to minimize the cross-graph dependence. Many real applications can benefit from the solution to GDM. Examples include reducing the cross-language links in online encyclopedias, optimizing the cross-platform communication cost between different cloud services, and so on. This problem is trivial if we can select as many vertexes as we want to copy. But what if we can only choose a limited number of vertexes to copy so as to make the two graphs as independent as possible? We formulate GDM with a budget constraint into a combinatorial optimization problem, which is proven to be NP-hard. We propose two algorithms to solve GDM. Firstly, we prove the submodularity of the objective function of GDM and adopt the size-constrained submodular minimization (SSM) algorithm to solve it. Since the SSM-based algorithm cannot scale to large graphs, we design a heuristic algorithm with a provable approximation guarantee. We prove that the error achieved by the heuristic algorithm is bounded by an additive factor which is proportional to the square of the given budget. Extensive experiments on both synthetic and real-world graphs show that the proposed algorithms consistently outperform the well-studied graph centrality measure based solutions. Furthermore, we conduct a case study on the Wikipedia graphs with millions of vertexes and links to demonstrate the potential of GDM to solve real-world problems.

【Keywords】: graph analytics; graph dependence minimization; submodular minimization

186. Exploiting Electronic Health Records to Mine Drug Effects on Laboratory Test Results.

【Paper Link】【Pages】:1837-1846

【Authors】: Mohamed Ghalwash ; Ying Li ; Ping Zhang ; Jianying Hu

【Abstract】: The proliferation of Electronic Health Records (EHRs) challenges data miners to discover potential and previously unknown patterns from a large collection of medical data. One of the tasks that we address in this paper is to reveal previously unknown effects of drugs on laboratory test results. We propose a method that leverages drug information to find a meaningful list of drugs that have an effect on the laboratory result. We formulate the problem as a convex non-smooth function and develop a proximal gradient method to optimize it. The model has been evaluated on two important use cases: lowering low-density lipoproteins and glycated hemoglobin test results. The experimental results provide evidence that the proposed method is more accurate than the state-of-the-art method, rediscovers drugs that are known to lower the levels of laboratory test results, and, most importantly, discovers additional potential drugs that may also lower these levels.

【Keywords】: drug effects; drug similarity; proximal methods
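
The abstract does not state the objective, so the following is only a generic sketch of a proximal gradient method on an assumed problem: least squares with an L1 penalty (the ISTA iteration). The L1 penalty, the function names, and the fixed step size are all assumptions made for illustration, not the paper's formulation. The smooth part is handled by a gradient step and the non-smooth part by its proximal operator, here soft-thresholding.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def proximal_gradient(A, b, lam, step, iters=500):
    """Minimize 0.5 * ||A w - b||^2 + lam * ||w||_1 with a fixed step size.

    A is a list of rows, b a list of targets. The step should not exceed
    1 / L, where L is the largest eigenvalue of A^T A.
    """
    n = len(A[0])
    w = [0.0] * n
    for _ in range(iters):
        # Gradient of the smooth term: A^T (A w - b).
        residual = [sum(a * wj for a, wj in zip(row, w)) - bi
                    for row, bi in zip(A, b)]
        grad = [sum(row[j] * r for row, r in zip(A, residual))
                for j in range(n)]
        # Gradient step followed by the proximal step on the L1 term.
        w = [soft_threshold(wj - step * g, step * lam)
             for wj, g in zip(w, grad)]
    return w


# With A = I the minimizer is soft-thresholding applied to b.
weights = proximal_gradient([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, step=1.0)
```

In this toy case the second coefficient is driven exactly to zero. That sparsity is why an L1-style penalty is a common choice when the goal is a short list of relevant features, although whether the paper uses it is not stated in the abstract.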

Session 9D: Relational Mining 4

187. Efficient Discovery of Ontology Functional Dependencies.

【Paper Link】【Pages】:1847-1856

【Authors】: Sridevi Baskaran ; Alexander Keller ; Fei Chiang ; Lukasz Golab ; Jaroslaw Szlichta

【Abstract】: Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We enhance dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. Our technical contributions are twofold: 1) theoretical foundations for OFDs, including a set of sound and complete axioms and a linear-time inference procedure, and 2) an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the exponential search space in the number of attributes. We demonstrate the efficiency of our techniques on real datasets, and we show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.

【Keywords】: data cleaning; dependency discovery; functional dependency; ontology functional dependency
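
The false-positive problem described in the abstract can be illustrated with a small check. This is a simplified sketch, not the paper's OFD definition or discovery algorithm. It assumes each value belongs to at most one synonym class, represented by a hypothetical `canonical` dictionary standing in for the ontology; the function name `violations` and the sample rows are also made up for illustration.

```python
def violations(rows, lhs, rhs, canonical=None):
    """Return the LHS values whose rows disagree on the RHS attribute.

    canonical: optional dict mapping a value to the representative of its
    synonym class (a hypothetical stand-in for an ontology lookup).
    """
    canonical = canonical or {}
    seen = {}
    bad = set()
    for row in rows:
        key = row[lhs]
        value = canonical.get(row[rhs], row[rhs])
        # Remember the first RHS value per key; flag any later mismatch.
        if seen.setdefault(key, value) != value:
            bad.add(key)
    return bad


rows = [
    {"code": "US", "country": "USA"},
    {"code": "US", "country": "United States"},
    {"code": "CA", "country": "Canada"},
    {"code": "CA", "country": "Chile"},
]
synonyms = {"USA": "United States"}

syntactic = violations(rows, "code", "country")
semantic = violations(rows, "code", "country", synonyms)
```

Under plain syntactic equality both `US` and `CA` are flagged for the dependency code → country. Once synonyms are taken into account only `CA` remains, which is the kind of false-positive reduction the abstract reports.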

188. Automatic Navbox Generation by Interpretable Clustering over Linked Entities.

【Paper Link】【Pages】:1857-1865

【Authors】: Chenhao Xie ; Lihan Chen ; Jiaqing Liang ; Kezun Zhang ; Yanghua Xiao ; Hanghang Tong ; Haixun Wang ; Wei Wang

【Abstract】: Little effort has been devoted to generating the structured Navigation Box (Navbox) for Wikipedia articles. A Navbox is a table in a Wikipedia article page that provides a consistent navigation system for related entities. The Navbox is critical for the readership and editing efficiency of Wikipedia. In this paper, we target the automatic generation of Navboxes for Wikipedia articles. Instead of performing information extraction over unstructured natural language text directly, an alternative avenue is explored by focusing on a rich set of semi-structured data in Wikipedia articles: linked entities. The core idea of this paper is as follows: if we cluster the linked entities and interpret them appropriately, we can construct a high-quality Navbox for the article entity. We propose a clustering-then-labeling algorithm to realize the idea. Experiments show that the proposed solutions are effective. Ultimately, our approach enriches Wikipedia with 1.95 million new Navboxes of high quality.

【Keywords】: clustering-then-labeling; interpretable clustering; knowledge extraction; navbox generation

189. A Two-Stage Framework for Computing Entity Relatedness in Wikipedia.

【Paper Link】【Pages】:1867-1876

【Authors】: Marco Ponza ; Paolo Ferragina ; Soumen Chakrabarti

【Abstract】: Introducing a new dataset with human judgments of entity relatedness, we present a thorough study of all entity relatedness measures in recent literature based on Wikipedia as the knowledge graph. No clear dominance is seen between measures based on textual similarity and graph proximity. Some of the better measures involve expensive global graph computations. We then propose a new, space-efficient, computationally lightweight, two-stage framework for relatedness computation. In the first stage, a small weighted subgraph is dynamically grown around the two query entities; in the second stage, relatedness is derived based on computations on this subgraph. Our system shows better agreement with human judgment than existing proposals both on the new dataset and on an established one. We also plug our relatedness algorithm into a state-of-the-art entity linker and observe an increase in its accuracy and robustness.

【Keywords】: algorithm; entity relatedness; wikipedia

190. Incorporating the Latent Link Categories in Relational Topic Modeling.

【Paper Link】【Pages】:1877-1886

【Authors】: Yuan He ; Cheng Wang ; Changjun Jiang

【Abstract】: The rapid growth of social media services has greatly propelled the prevalence of document networks. Rather than a set of plain texts, documents are nodes in graphs. An observable link connects the documents at its two ends, and thus implicitly reflects the semantic association between the document pair. Previous work assumes that only similar documents tend to be connected, which neglects the rich connective patterns in the topological structure. In this paper, we introduce a latent correlation factor to categorize the links into several categories, and each category corresponds to a unique kind of association. By fitting the data, the relational information (e.g., homophily and heterophily) can be comprehensively captured. By resorting to Canonical Correlation Analysis (CCA), we maximize the correlation between all pairs of linked documents. We propose a pure generative model and derive efficient learning algorithms based on the variational EM methods. Experiments on three different datasets demonstrate that the proposed model is competitive and usually better than the state-of-the-art baselines on both topic modeling and link prediction.

【Keywords】: canonical correlation analysis; link prediction; relational topic modeling

Session 9E: User characteristics 4

191. Tone Analyzer for Online Customer Service: An Unsupervised Model with Interfered Training.

【Paper Link】【Pages】:1887-1895

【Authors】: Peifeng Yin ; Zhe Liu ; Anbang Xu ; Taiga Nakamura

【Abstract】: Emotion analysis of online customer service conversations is important for good user experience and customer satisfaction. However, conventional metrics do not fit this application scenario. In this work, by collecting and labeling online conversations of customer service on Twitter, we identify 8 new metrics, named tones, to describe emotional information. To better interpret each tone, we extend the Latent Dirichlet Allocation (LDA) model to Tone LDA (T-LDA). In T-LDA, each latent topic is explicitly associated with one of three semantic categories, i.e., tone-related, domain-specific and auxiliary. By integrating tone labels into learning, T-LDA can interfere with the original unsupervised training process and thus is able to identify representative tone-related words. In evaluation, T-LDA shows better performance than baselines in predicting tone intensity. Also, a case study is conducted to analyze each tone via T-LDA output.

【Keywords】: emotion; online customer service; tone; topic modeling

192. Nationality Classification Using Name Embeddings.

【Paper Link】【Pages】:1897-1906

【Authors】: Junting Ye ; Shuchu Han ; Yifan Hu ; Baris Coskun ; Meizhu Liu ; Hong Qin ; Steven Skiena

【Abstract】: Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification. We exploit the phenomenon of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality and is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we are able to design a fine-grained nationality classifier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over 13 common classes, our F1 score (0.795) is substantially better than that of our closest competitor, Ethnea (0.580). To the best of our knowledge, this is the most accurate, fine-grained nationality classifier available. As a social media application, we apply our classifiers to the followers of major Twitter celebrities over six different domains. We demonstrate stark differences in the ethnicities of the followers of Trump and Obama, and in the sports and entertainment favored by different groups. Finally, we identify an anomalous political figure whose presumably inflated following appears largely incapable of reading the language he posts in.

【Keywords】: ethnicity classification; name embedding; nationality classification

【Paper Link】【Pages】:1907-1916

【Authors】: Shengmin Jin ; Reza Zafarani

【Abstract】: Understanding the role emotions play in social interactions has been a central research question in the social sciences. However, the challenge of obtaining large-scale data on human emotions has left the most fundamental questions on emotions less explored: How do emotions vary across individuals, how do they evolve over time, and how are they connected to social ties? We address these questions using a large-scale dataset of users that contains both their emotions and social ties. Using this dataset, we identify patterns of human emotions on five different network levels, starting from the user-level and moving up to the whole-network level. At the user-level, we identify how human emotions are distributed and vary over time. At the ego-network level, we find that assortativity is only observed with respect to positive moods. This observation allows us to introduce emotional balance, the "dual" of structural balance theory. We show that emotional balance has a natural connection to structural balance theory. At the community-level, we find that community members are emotionally similar and that this similarity is stronger in smaller communities. Structural properties of communities, such as their sparseness or isolatedness, are also connected to the emotions of their members. At the whole-network level, we show that there is a tight connection between the global structure of a network and the emotions of its members. As a result, we demonstrate how one can accurately predict the proportion of positive/negative users within a network by only looking at the network structure. Based on our observations, we propose the Emotional-Tie model, a network model that can simulate the formation of friendships based on emotions. This model generates graphs that exhibit both patterns of human emotions identified in this work and those observed in real-world social networks, such as having a high clustering coefficient. Our findings can help better understand the interplay between emotions and social ties.

【Keywords】: emotions; network models; sentiments; signed networks

194. Hike: A Hybrid Human-Machine Method for Entity Alignment in Large-Scale Knowledge Bases.

【Paper Link】【Pages】:1917-1926

【Authors】: Yan Zhuang ; Guoliang Li ; Zhuojian Zhong ; Jianhua Feng

【Abstract】: With the vigorous development of the World Wide Web, many large-scale knowledge bases (KBs) have been generated. To improve the coverage of KBs, an important task is to integrate the heterogeneous KBs. Several automatic alignment methods have been proposed which achieve considerable success. However, due to the inconsistency and uncertainty of large-scale KBs, automatic techniques for KB alignment achieve low quality (especially recall). Thanks to the open crowdsourcing platforms, we can harness the crowd to improve the alignment quality. To achieve this goal, in this paper we propose a novel hybrid human-machine framework for large-scale KB integration. We first partition the entities of different KBs into many smaller blocks based on their relations. We then construct a partial order on these partitions and develop an inference model which crowdsources a set of tasks to the crowd and infers the answers of other tasks based on the crowdsourced tasks. Next we formulate the question selection problem, which, given a monetary budget B, selects B crowdsourced tasks to maximize the number of inferred tasks. We prove that this problem is NP-hard and propose greedy algorithms to address this problem with an approximation ratio of 1-1/e. Our experiments on real-world datasets indicate that our method improves the quality and outperforms state-of-the-art approaches.

【Keywords】: crowdsourcing; entity alignment; knowledge base
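
The 1-1/e ratio quoted in the abstract is the classic guarantee of greedy selection for monotone submodular objectives under a cardinality constraint. The sketch below shows that greedy pattern on a coverage objective. It assumes, purely for illustration, that each candidate task comes with a fixed, known set of tasks whose answers it would let us infer (the hypothetical `infers` mapping). The paper's actual inference model, built on a partial order, is not reproduced here.

```python
def greedy_select(infers, budget):
    """Pick up to `budget` tasks, each time taking the largest marginal gain.

    infers: dict mapping a task to the set of tasks resolved by asking it
    (hypothetical input; assumed known in advance).
    """
    covered = set()
    chosen = []
    for _ in range(budget):
        best = max(
            (t for t in infers if t not in chosen),
            key=lambda t: len(infers[t] - covered),
            default=None,
        )
        # Stop early when nothing is left or no task adds new coverage.
        if best is None or not (infers[best] - covered):
            break
        chosen.append(best)
        covered |= infers[best]
    return chosen, covered


infers = {
    "q1": {"q1", "q2", "q3"},
    "q2": {"q2", "q3"},
    "q3": {"q3", "q4", "q5"},
    "q4": {"q4"},
}
chosen, covered = greedy_select(infers, budget=2)
```

With a budget of two, the sketch first asks `q1` (three tasks resolved) and then `q3` (two more), skipping `q2`, whose answers are already implied.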

Session 9F: Engagement 4

195. Returning is Believing: Optimizing Long-term User Engagement in Recommender Systems.

【Paper Link】【Pages】:1927-1936

【Authors】: Qingyun Wu ; Hongning Wang ; Liangjie Hong ; Yue Shi

【Abstract】: In this work, we propose to improve long-term user engagement in a recommender system from the perspective of sequential decision optimization, where users' click and return behaviors are directly modeled for online optimization. A bandit-based solution is formulated to balance three competing factors during online learning, including exploitation for immediate click, exploitation for expected future clicks, and exploration of unknowns for model estimation. We rigorously prove that with a high probability our proposed solution achieves a sublinear upper regret bound in maximizing cumulative clicks from a population of users in a given period of time, while a linear regret is inevitable if a user's temporal return behavior is not considered when making the recommendations. Extensive experimentation on both simulations and a large-scale real-world dataset collected from Yahoo frontpage news recommendation log verified the effectiveness and significant improvement of our proposed algorithm compared with several state-of-the-art online learning baselines for recommendation.

【Keywords】: content recommendation; contextual bandit algorithm; user long-term engagement modeling

【Paper Link】【Pages】:1937-1946

【Authors】: Qizhen Zhang ; Tengyuan Ye ; Meryem Essaidi ; Shivani Agarwal ; Vincent Liu ; Boon Thau Loo

【Abstract】: A key ingredient to a startup's success is its ability to raise funding at an early stage. Crowdfunding has emerged as an exciting new mechanism for connecting startups with potentially thousands of investors. Nonetheless, little is known about its effectiveness, nor the strategies that entrepreneurs should adopt in order to maximize their rate of success. In this paper, we perform a longitudinal data collection and analysis of AngelList - a popular crowdfunding social platform for connecting investors and entrepreneurs. Over a 7-10 month period, we track companies that are actively fund-raising on AngelList, and record their level of social engagement on AngelList, Twitter, and Facebook. Through a series of measures on social engagement (e.g. number of tweets, posts, new followers), our analysis shows that active engagement on social media is highly correlated to crowdfunding success. In some cases, the engagement level is an order of magnitude higher for successful companies. We further apply a range of machine learning techniques (e.g. decision tree, SVM, KNN, etc.) to predict the ability of a company to successfully raise funding based on its social engagement and other metrics. Since fund-raising is a rare event, we explore various techniques to deal with class imbalance issues. We observe that some metrics (e.g. AngelList followers and Facebook posts) are more significant than other metrics in predicting fund-raising success. Furthermore, despite the class imbalance, we are able to predict crowdfunding success with 84% accuracy.

【Keywords】: crowdfunding; machine learning; social networks

197. Optimizing Email Volume For Sitewide Engagement.

【Paper Link】【Pages】:1947-1955

【Authors】: Rupesh Gupta ; Guanfeng Liang ; Rómer Rosales

【Abstract】: In this paper we focus on the problem of optimizing email volume for maximizing sitewide engagement of an online social networking service. Email volume optimization approaches published in the past have proposed optimization of email volume for maximization of engagement metrics which are impacted exclusively by email; for example, the number of sessions that begin with clicks on links within emails. The impact of email on such downstream engagement metrics can be estimated easily because of the ease of attribution of such an engagement event to an email. However, this framework is limited in its view of the ecosystem of the networking service, which comprises several tools and utilities that contribute towards delivering value to members, with email being just one such utility. Thus, in this paper we depart from previous approaches by exploring and optimizing the contribution of email to this ecosystem. In particular, we present and contrast the differential impact of email on sitewide engagement metrics for various types of users. We propose a new email volume optimization approach which maximizes sitewide engagement metrics, such as the total number of active users. This is in sharp contrast to the previous approaches whose objective has been maximization of downstream engagement metrics. We present details of our prediction function for predicting the impact of emails on a user's activeness on the mobile or web application. We describe how certain approximations to this prediction function can be made for solving the volume optimization problem, and present results from online A/B tests.

【Keywords】: Machine learning; email; optimization

198. Understanding Engagement through Search Behaviour.

【Paper Link】【Pages】:1957-1966

【Authors】: Mengdie Zhuang ; Gianluca Demartini ; Elaine G. Toms

【Abstract】: Evaluating user engagement with search is a critical aspect of understanding how to assess and improve information retrieval systems. While standard techniques for measuring user engagement use questionnaires, these are obtrusive to user interaction, and can only be collected at acceptable intervals. The problem we address is whether there is a less obtrusive and more automatic way to assess how users perceive the search process and outcome. Log files collect behavioural signals (e.g., clicks, queries) from users on a large scale. In this paper, we investigate the potential to predict how users perceive engagement with search by modelling behavioural signals from log files using supervised learning methods. We focus on different engagement dimensions (Perceived Usability, Felt Involvement, Endurability and Novelty) and examine how 37 behavioural features can inform these dimensions. Our results, obtained from 377 in-lab participants undergoing goal-based search tasks, support the connection between perceived engagement and search behaviour. More specifically, we show that time- and query-related features are best suited for predicting user perceived engagement, and suggest that different behavioural features better reflect specific dimensions. We demonstrate the possibility of predicting user-perceived engagement using search behavioural features.

【Keywords】: engagement prediction; search behaviour; user engagement

Short Papers (alphabetical by lead authors' last names) 119

199. Citation Metadata Extraction via Deep Neural Network-based Segment Sequence Labeling.

【Paper Link】【Pages】:1967-1970

【Authors】: Dong An ; Liangcai Gao ; Zhuoren Jiang ; Runtao Liu ; Zhi Tang

【Abstract】: Citation metadata extraction plays an important role in academic information retrieval and knowledge management. Current works on this task generally use rule-based, template-based or learning-based approaches, but these methods usually either rely on handcrafted features or are limited to particular domains. Recently, neural networks have shown strong ability in addressing sequence labeling tasks. In this paper, we propose a sequence labeling model for citation metadata extraction, called segment sequence labeling. Instead of inferring at word level, the input sequence is first divided into segments, and then features of the segments are computed to infer the label sequence of the segments. We first run experiments to validate the effectiveness of different parts of the model by comparing it with a CRF-based model and a neural network-based model. Experimental results show our model beats both models on most fields. Besides, our model is evaluated on public datasets UMass and Cora and has achieved significant performance improvement. Our model was trained on the data which were generated from BibTeX files collected on the Web and annotated automatically.

【Keywords】: academic information extraction; citation metadata extraction; information retrieval; sequence labeling

200. A Novel Approach for Efficient Computation of Community Aware Ridesharing Groups.

【Paper Link】【Pages】:1971-1974

【Authors】: Samiul Anwar ; Shuha Nabila ; Tanzima Hashem

【Abstract】: The evolution of ridesharing services has reduced road traffic congestion in recent years. However, a major concern for ridesharing services is sharing rides with strangers. To address this issue, a few ridesharing approaches have considered social closeness of group members for identifying a ridesharing group. However, users do not feel comfortable disclosing such personal data (e.g., friendship information) to an untrusted service provider for privacy reasons. We propose a novel way to form ridesharing groups that reveals user social data only at the community level, and ensures that a group member shares at least k common communities with at least m other members in the ridesharing group, where k and m are personalized parameters of every group member. We formulate a Community aware Ridesharing Group (CaRG) query that satisfies the constraints of m and k, and returns a ridesharing group with the minimum cost in terms of the spatial proximity of riders from the driver. We show in experiments that our approach to process CaRG queries outperforms a baseline approach by a large margin.

【Keywords】: community; location-based services; query processing; ridesharing

201. Extracting Entities of Interest from Comparative Product Reviews.

【Paper Link】【Pages】:1975-1978

【Authors】: Jatin Arora ; Sumit Agrawal ; Pawan Goyal ; Sayan Pathak

【Abstract】: This paper presents a deep learning based approach to extract product comparison information from user reviews on various e-commerce websites. Any comparative product review has three major entities of information: the names of the products being compared, the user opinion (predicate) and the feature or aspect under comparison. All these entities depend on each other and are bound by the rules of the language used in the review. We observe that their inter-dependencies can be captured well using LSTMs. We evaluate our system on existing manually labeled datasets and observe that it outperforms the existing Semantic Role Labeling (SRL) framework popular for this task.

【Keywords】: comparison mining; deep learning; opinion extraction

202. A Neural Collaborative Filtering Model with Interaction-based Neighborhood.

【Paper Link】【Pages】:1979-1982

【Authors】: Ting Bai ; Ji-Rong Wen ; Jun Zhang ; Wayne Xin Zhao

【Abstract】: Recently, deep neural networks have been widely applied to recommender systems. A representative work is to utilize deep learning for modeling complex user-item interactions. However, similar to traditional latent factor models by factorizing user-item interactions, they tend to be ineffective to capture localized information. Localized information, such as neighborhood, is important to recommender systems in complementing the user-item interaction data. Based on this consideration, we propose a novel Neighborhood-based Neural Collaborative Filtering model (NNCF). To the best of our knowledge, it is the first time that the neighborhood information is integrated into the neural collaborative filtering methods. Extensive experiments on three real-world datasets demonstrate the effectiveness of our model for the implicit recommendation task.

【Keywords】: deep neural network; neighborhood information; recommender systems

203. Profiling DRDoS Attacks with Data Analytics Pipeline.

【Paper Link】【Pages】:1983-1986

【Authors】: Laure Berti-Équille ; Yury Zhauniarovich

【Abstract】: A large number of Distributed Reflective Denial-of-Service (DRDoS) attacks are launched every day, and our understanding of the modus operandi of their perpetrators is still very limited, as we are submerged with so much data to analyze and do not have reliable and complete ways to validate our findings. In this paper, we propose a first analytic pipeline that enables us to cluster and characterize attack campaigns into several main profiles that exhibit similarities. These similarities are due to common technical properties of the underlying infrastructures used to launch these attacks. Although we do not have access to the ground truth and we do not know how many perpetrators are acting behind the scenes, we can group their attacks based on relevant commonalities with cluster ensembling to estimate their number and capture their profiles over time. Specifically, our results show that we can repeatably identify and group together common profiles of attacks while considering domain experts' constraints in the cluster ensembles. From the obtained consensus clusters, we can generate comprehensive rules that characterize past campaigns and that can be used for classifying the next ones despite the evolving nature of the attacks. Such rules can be further used to filter out garbage traffic in Internet Service Provider networks.

【Keywords】: analytics; clustering; distributed reflective denial-of-service; ensembling; pipeline; profiling

204. A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection.

【Paper Link】【Pages】:1987-1990

【Authors】: Weijie Bian ; Si Li ; Zhao Yang ; Guang Chen ; Zhiqing Lin

【Abstract】: Answer selection for question answering is a challenging task, since it requires effective capture of the complex semantic relations between questions and answers. Previous notable approaches mainly adopt the general Compare-Aggregate framework, which performs word-level comparison and aggregation. In this paper, unlike previous Compare-Aggregate models which utilize the traditional attention mechanism to generate corresponding word-level vectors before comparison, we propose a novel attention mechanism named Dynamic-Clip Attention which is directly integrated into the Compare-Aggregate framework. Dynamic-Clip Attention focuses on filtering out noise in the attention matrix, in order to better mine the semantic relevance of word-level vectors. At the same time, unlike previous Compare-Aggregate works which treat the answer selection task as a pointwise classification problem, we propose a listwise ranking approach to model this task and learn the relative order of candidate answers. Experiments on the TrecQA and WikiQA datasets show that our proposed model achieves state-of-the-art performance.

【Keywords】: deep learning; dynamic-clip attention; listwise; question answering

205. Learning Biological Sequence Types Using the Literature.

【Paper Link】【Pages】:1991-1994

【Authors】: Mohamed Reda Bouadjenek ; Karin Verspoor ; Justin Zobel

【Abstract】: We explore in this paper automatic biological sequence type classification for records in biological sequence databases. The sequence type attribute provides important information about the nature of a sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and thus it is subject to many errors including typos, mis-assignment, and non-assignment. In GenBank, this problem concerns roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem of automatic sequence type classification, we propose the use of literature associated to sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles, depending on their type. The experiments we have conducted on the PubMed Central collection show that the literature is indeed an effective way to address this problem of sequence type classification. Our classification method reached an accuracy of 92.7%, and substantially outperformed two baseline approaches used for comparison.

【Keywords】: biological databases; data analysis; data cleansing; data quality

【Paper Link】【Pages】:1995-1998

【Authors】: Chiyu Cai ; Linjing Li ; Daniel Zeng

【Abstract】: Bots are regarded as the most common kind of malware in the era of Web 2.0. In recent years, the Internet has been populated by hundreds of millions of bots, especially on social media. Thus, the demand for effective and efficient bot detection algorithms is more urgent than ever. Existing works have partly satisfied this requirement by way of laborious feature engineering. In this paper, we propose a deep bot detection model aiming to learn an effective representation of social users and then detect social bots by jointly modeling social behavior and content information. The proposed model learns the representation of social behavior by encoding both endogenous and exogenous factors which affect user behavior. As to the representation of content, we regard the user content as temporal text data, instead of plain text as in existing works, to extract semantic information and latent temporal patterns. To the best of our knowledge, this is the first attempt to apply deep learning to modeling social users and detecting social bots. Experiments on a real-world dataset collected from Twitter demonstrate the effectiveness of the proposed model.

【Keywords】: behavior factors; bot detection; deep learning; temporal content

207. PMS: an Effective Approximation Approach for Distributed Large-scale Graph Data Processing and Mining.

【Paper Link】【Pages】:1999-2002

【Authors】: Yingjie Cao ; Yangyang Zhang ; Jianxin Li

【Abstract】: Recently, large-scale graph data processing and mining has drawn great attention, and many distributed graph processing systems have been proposed. However, large-scale graph processing remains a challenging problem because the computation time in some cases is still unacceptable, especially when time is limited. As illustrated in Table 1, nearly three hours are needed when running the Single-Source Shortest Path algorithm on the USA-road dataset using performant open-source distributed graph processing systems. In this paper, we propose an effective priority-based message sampling (PMS) approach to further improve the performance of distributed graph processing at the cost of some accuracy loss. Noticing that the passing and processing of messages dominates the computation time, our approach works by eliminating less useful messages directly without passing them, which can effectively reduce the computation overhead. We implement our approach based on Apache Giraph, a popular open-source implementation of Google's Pregel, and report the primary results of our prototype system. The experimental results show that our approach can achieve reasonable accuracy with much less computation time.

【Keywords】: approximate computation; distributed system; large-scale graph processing

208. Language Modeling by Clustering with Word Embeddings for Text Readability Assessment.

【Paper Link】【Pages】:2003-2006

【Authors】: Miriam Cha ; Youngjune Gwon ; H. T. Kung

【Abstract】: We present a clustering-based language model using word embeddings for text readability prediction. Presumably, a Euclidean semantic space hypothesis holds true for word embeddings whose training is done by observing word co-occurrences. We argue that clustering with word embeddings in the metric space should yield feature representations in a higher semantic space appropriate for text regression. Also, by representing features in terms of histograms, our approach can naturally address documents of varying lengths. An empirical evaluation using the Common Core Standards corpus reveals that the features formed on our clustering-based language model significantly improve the previously known results for the same corpus in readability prediction. We also evaluate the task of sentence matching based on semantic relatedness using the Wiki-SimpleWiki corpus and find that our features lead to superior matching performance.

【Keywords】: clustering-based language model; readability assessment
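
The histogram idea in the abstract above (fixed-length features regardless of document length) can be sketched as follows. This is a minimal illustration assuming that word vectors and cluster centroids are already available as NumPy arrays; the function name `cluster_histogram` is hypothetical, and the nearest-centroid assignment is an assumed simplification rather than the authors' exact model.

```python
import numpy as np


def cluster_histogram(word_vectors: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map a document (n_words x dim) to a normalized histogram over clusters."""
    n_clusters = centroids.shape[0]
    if word_vectors.shape[0] == 0:
        # An empty document has no words to count.
        return np.zeros(n_clusters)
    # Squared Euclidean distance from every word to every centroid.
    diffs = word_vectors[:, None, :] - centroids[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argmin(dists, axis=1)
    counts = np.bincount(nearest, minlength=n_clusters)
    # Normalizing makes documents of different lengths comparable.
    return counts / counts.sum()
```

Every document yields a vector of length `n_clusters`, so a standard regressor can consume documents of any length.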

209. Compact Multiple-Instance Learning.

【Paper Link】【Pages】:2007-2010

【Authors】: Jing Chai ; Weiwei Liu ; Ivor W. Tsang ; Xiao-Bo Shen

【Abstract】: The weakly supervised Multiple-Instance Learning (MIL) problem has been successfully applied in information retrieval tasks. Two related issues might affect the performance of MIL algorithms: how to cope with label ambiguities and how to deal with non-discriminative components, and we propose COmpact MultiPle-Instance LEarning (COMPILE) to consider them simultaneously. To treat label ambiguities, COMPILE seeks ground-truth positive instances in positive bags. By using weakly supervised information to learn data's short binary representations, COMPILE enhances discrimination via strengthening discriminative components and suppressing non-discriminative ones. We adapt block coordinate descent to optimize COMPILE efficiently. Experiments on text categorization empirically show: 1) COMPILE unifies disambiguation and data preprocessing successfully; 2) it generates short binary representations efficiently to enhance discrimination at significantly reduced storage cost.

【Keywords】: disambiguation; multiple-instance learning; storage cost; text categorization

210. Text Embedding for Sub-Entity Ranking from User Reviews.

【Paper Link】【Pages】:2011-2014

【Authors】: Chih-Yu Chao ; Yi-Fan Chu ; Hsiu-Wei Yang ; Chuan-Ju Wang ; Ming-Feng Tsai

【Abstract】: This paper attempts to conduct analysis for one certain type of user reviews; that is, the reviews on a super-entity (e.g., restaurant) involve descriptions for many sub-entities (e.g., dishes). To deal with such analysis, we propose a text embedding framework for ranking sub-entities from user reviews of a given super-entity. Experiments on two real-world datasets show that our method outperforms three baselines by a statistically significant amount. Intriguing cases from the experiments are discussed in the paper.

【Keywords】: co-occurrence network; ranking; text embedding; user reviews

211. Summarizing Significant Changes in Network Traffic Using Contrast Pattern Mining.

【Paper Link】【Pages】:2015-2018

【Authors】: Elaheh Alipour Chavary ; Sarah M. Erfani ; Christopher Leckie

【Abstract】: Extracting knowledge from the massive volumes of network traffic is an important challenge in network and security management. In particular, network managers require concise reports about significant changes in their network traffic. While most existing techniques focus on summarizing a single traffic dataset, the problem of finding significant differences between multiple datasets is an open challenge. In this paper, we focus on finding important differences between network traffic datasets, and preparing a summarized and interpretable report for security managers. We propose the use of contrast pattern mining, which finds patterns whose support differs significantly from one dataset to another. We show that contrast patterns are highly effective at extracting meaningful changes in traffic data. We also propose several evaluation metrics that reflect the interpretability of patterns for security managers. Our experimental results show that with the proposed unsupervised approach, the vast majority of extracted patterns are pure, i.e., most changes are either attack traffic or normal traffic, but not a mixture of both.

【Keywords】: closed patterns; contrast patterns; dataset summarization
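
The notion of a contrast pattern used in the abstract above (a pattern whose support differs significantly between two datasets) can be made concrete with a brute-force sketch. The names `support`, `contrast_patterns`, `min_diff` and `max_size` are assumptions for illustration, and exhaustive enumeration stands in for whatever mining algorithm the paper actually uses.

```python
from itertools import combinations


def support(pattern, dataset):
    """Fraction of records (sets of items) that contain every item of the pattern."""
    return sum(1 for record in dataset if pattern <= record) / len(dataset)


def contrast_patterns(before, after, min_diff, max_size=2):
    """Return (pattern, support change) pairs whose absolute change is >= min_diff."""
    items = sorted(set().union(*before, *after))
    results = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            diff = support(pattern, after) - support(pattern, before)
            if abs(diff) >= min_diff:
                results.append((pattern, diff))
    # Largest changes first.
    results.sort(key=lambda pair: -abs(pair[1]))
    return results
```

Each record here is a set of attribute values taken from one traffic flow, for example `{"udp", "port53"}`; a positive change means the pattern became more frequent in the second dataset.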

212. Modeling Opinion Influence with User Dual Identity.

【Paper Link】【Pages】:2019-2022

【Authors】: Chengyao Chen ; Zhitao Wang ; Wenjie Li

【Abstract】: Exploring the mechanism that explains how a user's opinion changes under the influence of his/her neighbors is of practical importance (e.g., for predicting the sentiment of his/her future opinion) and has attracted wide attention from both enterprises and academics. Though various opinion influence models have been proposed for opinion prediction, they only consider users' personal identities, but ignore their social identities with which people behave to fit the expectations of the others in the same group. In this work, we explore users' dual identities, including both personal identities and social identities, to build a more comprehensive opinion influence model for a better understanding of opinion behaviors. A novel joint learning framework is proposed to simultaneously model opinion dynamics and detect social identity in a unified model. The effectiveness of the proposed approach is demonstrated through experiments conducted on Twitter datasets.

【Keywords】: dual identity; joint learning; opinion influence modeling

213. An Empirical Analysis of Pruning Techniques: Performance, Retrievability and Bias.

【Paper Link】【Pages】:2023-2026

【Authors】: Ruey-Cheng Chen ; Leif Azzopardi ; Falk Scholer

【Abstract】: Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid out the foundations for investigating the relation between retrieval performance and retrieval bias. While various factors influencing retrievability have been examined, showing how the retrieval model may influence bias, no prior work has examined the impact of the index (and how it is optimized) on retrieval bias. Intuitively, how the documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the retrieval bias of a system changes as the inverted index is optimized for efficiency through static index pruning. In our analysis, we consider four pruning methods and examine how they affect performance and bias on the TREC GOV2 Collection. Our results show that the relationship between these factors is varied and complex - and very much dependent on the pruning algorithm. We find that more pruning results in relatively little change or a slight decrease in bias up to a point, and then a dramatic increase. The increase in bias corresponds to a sharp decrease in early precision and is also indicative of a large decrease in MAP. The findings suggest that the impact of pruning algorithms can be quite varied - but retrieval bias could be used to guide the pruning process. Further work is required to determine precisely which documents are most affected and how this impacts upon performance.

【Keywords】: indexing; pruning; retrievability

214. Text Coherence Analysis Based on Deep Neural Network.

【Paper Link】【Pages】:2027-2030

【Authors】: Baiyun Cui ; Yingming Li ; Yaqing Zhang ; Zhongfei Zhang

【Abstract】: In this paper, we propose a novel deep coherence model (DCM) using a convolutional neural network architecture to capture the text coherence. The text coherence problem is investigated with a new perspective of learning sentence distributional representation and text coherence modeling simultaneously. In particular, the model captures the interactions between sentences by computing the similarities of their distributional representations. Further, it can be easily trained in an end-to-end fashion. The proposed model is evaluated on a standard Sentence Ordering task. The experimental results demonstrate its effectiveness and promise in coherence assessment showing a significant improvement over the state-of-the-art by a wide margin.

【Keywords】: coherence analysis; deep coherence model; distributional representation

215. Unsupervised Matrix-valued Kernel Learning For One Class Classification.

【Paper Link】【Pages】:2031-2034

【Authors】: Shaobo Dang ; Xiongcai Cai ; Yang Wang ; Jianjia Zhang ; Fang Chen

【Abstract】: This paper is concerned with the one class classification (OCC) problem. By introducing a vector-valued function with regularization in a Y-valued Reproducing Kernel Hilbert Space (RKHS), we build an unsupervised classifier and discover the outliers and inliers simultaneously. Manifold regularization is employed to preserve the local similarity of data in the input space. Experimental results of the proposed and competing methods on OCC data sets demonstrate the performance of the proposed algorithm.

【Keywords】: kernel learning; one class classification; outlier detection; unsupervised learning

216. Analysis of Telegram, An Instant Messaging Service.

【Paper Link】【Pages】:2035-2038

【Authors】: Arash Dargahi Nobari ; Negar Reshadatmand ; Mahmood Neshati

【Abstract】: Telegram has become one of the most successful instant messaging services in recent years. In this paper, we developed a crawler to gather its public data. To the best of our knowledge, this paper is the first attempt to analyze the structural and topical aspects of messages published in the Telegram instant messaging service using crawled data. We also extracted the mention graph and PageRank of our data collection, which indicate important differences between the linking patterns of Telegram nodes and those of other common networks. We also classified messages to detect advertisement and spam messages.

【Keywords】: classification; instant messaging; pagerank; spam detection; telegram

217. Estimating Event Focus Time Using Neural Word Embeddings.

【Paper Link】【Pages】:2039-2042

【Authors】: Supratim Das ; Arunav Mishra ; Klaus Berberich ; Vinay Setty

【Abstract】: Time associated with news events has been leveraged as a complementary dimension to text in several applications such as temporal information retrieval, news event linking, etc. Short textual event descriptions (e.g., single sentences) are prevalent in web documents (also considered as inputs in the above applications) and often lack explicit temporal expressions for grounding them to a precise time period. For example, the event description, "France swears in Emmanuel Macron as the 25th President", lacks temporal cues to indicate that the event occurred in the year "2017". Thus, we address the problem of estimating event focus time defined as a time interval with maximum association thereby indicating its occurrence period. We propose several estimators that leverage distributional event and time representations learned from large external document collections by adapting the word2vec paradigm. Extensive experiments using two real-world datasets and 100 Wikipedia events show that our method outperforms several state-of-the-art baselines.

【Keywords】: event focus time; event vectors; neural word embeddings; pseudo relevance feedback; time vectors; word2vec

218. Personalized Image Aesthetics Assessment.

【Paper Link】【Pages】:2043-2046

【Authors】: Xiang Deng ; Chaoran Cui ; Huidi Fang ; Xiushan Nie ; Yilong Yin

【Abstract】: Automatically assessing image quality from an aesthetic perspective is of great interest to the high-level vision research community. Existing methods are typically non-personalized and quantify image aesthetics with a universal label. However, given the fact that aesthetics is a subjective perception, how to understand user aesthetic perceptions poses a formidable challenge to image aesthetics assessment. In this paper, we propose to model user aesthetic perceptions using a set of exemplar images from social media platforms, and realize personalized aesthetics assessment by transferring this knowledge to adapt the results of the trained generic model. In this way, image aesthetics is measured from both aspects of visual quality and user tastes. Extensive experiments on two benchmark datasets well verified the potential of our approach for personalized image aesthetics assessment.

【Keywords】: image aesthetics assessment; personalization; social media

219. Efficient Fault-Tolerant Group Recommendation Using alpha-beta-core.

【Paper Link】【Pages】:2047-2050

【Authors】: Danhao Ding ; Hui Li ; Zhipeng Huang ; Nikos Mamoulis

【Abstract】: Fault-tolerant group recommendation systems based on subspace clustering successfully alleviate high-dimensionality and sparsity problems. However, the cost of recommendation grows exponentially with the size of the dataset. To address this issue, we model the fault-tolerant subspace clustering problem as a search problem on graphs and present an algorithm, GraphRec, based on the concept of α-β-core. Moreover, we propose two variants of our approach that use indexes to improve query latency. Our experiments on different datasets demonstrate that our methods are extremely fast compared to the state-of-the-art.

【Keywords】: fault tolerance; group recommendation; subspace clustering
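
The abstract above names the α-β-core without defining it. Assuming the usual bipartite reading (the maximal subgraph of a user-item graph in which every remaining user has at least α items and every remaining item has at least β users), it can be computed by repeatedly peeling low-degree vertices. The sketch below illustrates that assumed definition only; it is not GraphRec, and the function name `alpha_beta_core` is hypothetical.

```python
def alpha_beta_core(edges, alpha, beta):
    """Return (users, items) of the alpha-beta-core of a bipartite edge list.

    edges is an iterable of (user, item) pairs.
    """
    users = {}
    items = {}
    for u, i in edges:
        users.setdefault(u, set()).add(i)
        items.setdefault(i, set()).add(u)

    changed = True
    while changed:
        changed = False
        # Peel users with fewer than alpha remaining items.
        for u in [u for u, nbrs in users.items() if len(nbrs) < alpha]:
            for i in users.pop(u):
                items[i].discard(u)
            changed = True
        # Peel items with fewer than beta remaining users.
        for i in [i for i, nbrs in items.items() if len(nbrs) < beta]:
            for u in items.pop(i):
                users[u].discard(i)
            changed = True
    return set(users), set(items)
```

Peeling stops once no vertex violates its degree threshold, so the result is empty whenever no such subgraph exists.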

220. On Discovering the Number of Document Topics via Conceptual Latent Space.

【Paper Link】【Pages】:2051-2054

【Authors】: Nghia Duong-Trung ; Lars Schmidt-Thieme

【Abstract】: Topic modeling is a widely used technique in knowledge discovery and data mining. However, finding the right number of topics in a given text source has remained a challenging issue. In this paper, we study the concept of conceptual stability via nonnegative matrix factorization. Based on this finding, we propose a method to identify the correct number of topics and offer empirical evidence in its favor in terms of classification accuracy and the number of topics that are naturally present in the text sources. Experiments on real-world text corpora demonstrate that the proposed method has outperformed state-of-the-art latent Dirichlet allocation and nonnegative matrix factorization models.

【Keywords】: nonnegative matrix factorization; stability analysis; topic modeling

221. Chinese Named Entity Recognition with Character-Word Mixed Embedding.

【Paper Link】【Pages】:2055-2058

【Authors】: Shijia E ; Yang Xiang

【Abstract】: Named Entity Recognition (NER) is an important basis for natural language processing tasks such as relation extraction and entity linking. Existing Chinese NER systems commonly use the character sequence as the input in order to avoid word segmentation. However, the character sequence cannot express enough semantic information, so the recognition accuracy of Chinese NER is not as good as that for Western languages such as English. To solve this issue, we propose a Chinese NER method based on Character-Word Mixed Embedding (CWME), and the method is in accord with the pipeline of Chinese natural language processing. Our experiments show that incorporating CWME can effectively improve the performance for the Chinese corpus with state-of-the-art neural architectures widely used in NER, and the proposed method yields nearly 9% absolute improvement over previous results.

【Keywords】: character embedding; named entity recognition; word embedding

222. An Empirical Study of Embedding Features in Learning to Rank.

【Paper Link】【Pages】:2059-2062

【Authors】: Faezeh Ensan ; Ebrahim Bagheri ; Amal Zouaq ; Alexandre Kouznetsov

【Abstract】: This paper explores the possibility of using neural embedding features for enhancing the effectiveness of ad hoc document ranking based on learning to rank models. We have extensively introduced and investigated the effectiveness of features learnt based on word and document embeddings to represent both queries and documents. We employ several learning to rank methods for document ranking using embedding-based features, keyword-based features as well as the interpolation of the embedding-based features with keyword-based features. The results show that embedding features have a synergistic impact on keyword based features and are able to provide statistically significant improvement on harder queries.

【Keywords】: ad hoc retrieval; learning to rank; neural embeddings

223. Privacy of Hidden Profiles: Utility-Preserving Profile Removal in Online Forums.

【Paper Link】【Pages】:2063-2066

【Authors】: Sedigheh Eslami ; Asia J. Biega ; Rishiraj Saha Roy ; Gerhard Weikum

【Abstract】: Users who wish to leave an online forum often do not have the freedom to erase their data completely from the service providers' (SP) system. The primary reason behind this is that analytics on such user data form a core component of many online providers' business models. On the other hand, if the profiles reside in the SP's system in an unchanged form, major privacy violations may occur if the infrastructure is compromised, or the SP is acquired by another organization. In this work, we investigate an alternative solution to standard profile removal, where posts of different users are split and merged into synthetic mediator profiles. The goal of our framework is to preserve the SP's data mining utility as far as possible, while minimizing users' privacy risks. We present several mechanisms of assigning user posts to such mediator accounts and show the effectiveness of our framework using data from StackExchange and various health forums.

【Keywords】: mediator accounts; privacy-utility tradeoff; profile removal; provider utility; split and merge; user privacy

224. QoS-Aware Scheduling of Heterogeneous Servers for Inference in Deep Neural Networks.

【Paper Link】【Pages】:2067-2070

【Authors】: Zhou Fang ; Tong Yu ; Ole J. Mengshoel ; Rajesh K. Gupta

【Abstract】: Deep neural networks (DNNs) are popular in diverse fields such as computer vision and natural language processing. DNN inference tasks are emerging as a service provided by cloud computing environments. However, cloud-hosted DNN inference faces new challenges in workload scheduling for the best Quality of Service (QoS), due to dependence on batch size, model complexity and resource allocation. This paper represents the QoS metric as a utility function of response delay and inference accuracy. We first propose a simple and effective heuristic approach that keeps response delay low and satisfies the requirement on processing throughput. Then we describe an advanced deep reinforcement learning (RL) approach that learns to schedule from experience. The RL scheduler is trained to maximize QoS, using a set of system statuses as the input to the RL policy model. Our approach performs scheduling actions only when there are free GPUs, thus reducing scheduling overhead compared with common RL schedulers that run at every continuous time step. We evaluate the schedulers on a simulation platform and demonstrate the advantages of RL over heuristics.

【Keywords】: deep neural networks inference; deep reinforcement learning; qos aware scheduling; reinforcement learning; web service

225. Geographic and Temporal Trends in Fake News Consumption During the 2016 US Presidential Election.

【Paper Link】【Pages】:2071-2074

【Authors】: Adam Fourney ; Miklós Z. Rácz ; Gireeja Ranade ; Markus Mobius ; Eric Horvitz

【Abstract】: We present an analysis of traffic to websites known for publishing fake news in the months preceding the 2016 US presidential election. The study is based on the combined instrumentation data from two popular desktop web browsers: Internet Explorer 11 and Edge. We find that social media was the primary outlet for the circulation of fake news stories and that aggregate voting patterns were strongly correlated with the average daily fraction of users visiting websites serving fake news. This correlation was observed both at the state level and at the county level, and remained stable throughout the main election season. We propose a simple model based on homophily in social networks to explain the linear association. Finally, we highlight examples of different types of fake news stories: while certain stories continue to circulate in the population, others are short-lived and die out in a few days.

【Keywords】: browsing data; elections; fake news; social media

226. Inferring Appliance Energy Usage from Smart Meters using Fully Convolutional Encoder Decoder Networks.

【Paper Link】【Pages】:2075-2078

【Authors】: Felan Carlo C. Garcia ; Erees Queen B. Macabebe

【Abstract】: Energy management presents one of the principal sustainability challenges within urban centers given that they account for 75% of the energy consumption worldwide. In the context of a smart city framework, the use of intelligent urban systems provides a key opportunity in addressing the energy sustainability issue as an informatics problem where the goal is to deliver energy usage feedback to the users as a means of enabling behavioral change towards energy sustainability. In this paper we present a method to provide appliance energy usage feedback from smart meters using energy disaggregation. We put energy disaggregation in the context of a source separation and signal reconstruction problem in which we train a fully convolutional encoder decoder network to separate appliance energy usage from aggregate whole house electricity consumption data. The results show that the proposed fully convolutional encoder decoder model can achieve competitive accuracy compared with several state-of-the-art methods.

【Keywords】: ambient intelligence; deep learning; energy disaggregation; energy management; smart city

227. Tracking the Impact of Fact Deletions on Knowledge Graph Queries using Provenance Polynomials.

【Paper Link】【Pages】:2079-2082

【Authors】: Garima Gaur ; Srikanta J. Bedathur ; Arnab Bhattacharya

【Abstract】: Critical business applications in domains ranging from technical support to healthcare increasingly rely on large-scale, automatically constructed knowledge graphs. These applications use the results of complex queries over knowledge graphs to help users make crucial decisions, such as which drug to administer or whether certain actions are compliant with all the regulatory requirements. However, these knowledge graphs constantly evolve, and the newer versions may adversely impact the results of queries that previously taken business decisions were based on. We propose a framework based on provenance polynomials to track the impact of knowledge graph changes on arbitrary SPARQL query results. Focusing on the deletion of facts, we show how to efficiently determine the queries impacted by the change, develop ways to incrementally maintain these polynomials, and present an efficient implementation on top of RDF graph databases. Our experimental evaluation over large-scale RDF/SPARQL benchmarks shows the effectiveness of our proposal.

【Keywords】: fact deletion; knowledge graph; provenance polynomial
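
As a rough illustration of the idea in the abstract above (not the authors' implementation), the sketch below stores each query result's provenance polynomial as a set of monomials, where a monomial is the set of facts jointly used in one derivation. Coefficients and exponents are ignored, and the fact identifiers, result names, and the `delete_fact` helper are all hypothetical.

```python
from typing import Dict, FrozenSet, Set

# A polynomial is modeled as a set of monomials; a monomial is the set of
# facts that together derive the result once (an illustrative simplification).
Polynomial = Set[FrozenSet[str]]


def delete_fact(provenance: Dict[str, Polynomial], fact: str) -> Dict[str, str]:
    """Drop every monomial that uses `fact` and report the impacted results."""
    impact: Dict[str, str] = {}
    for result, poly in provenance.items():
        surviving = {m for m in poly if fact not in m}
        if len(surviving) == len(poly):
            continue  # no derivation of this result used the deleted fact
        provenance[result] = surviving  # incremental maintenance of the polynomial
        impact[result] = "lost" if not surviving else "fewer derivations"
    return impact


# Hypothetical results r1..r3 derived from hypothetical facts f1..f5.
provenance = {
    "r1": {frozenset({"f1", "f2"}), frozenset({"f3"})},
    "r2": {frozenset({"f1", "f4"})},
    "r3": {frozenset({"f5"})},
}
impact = delete_fact(provenance, "f1")
```

In this toy setting a result disappears only when all of its monomials are removed, while a result with an alternative derivation survives with a smaller polynomial.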

【Paper Link】【Pages】:2083-2086

【Authors】: Lei Gu ; Liying Zhang ; Yang Zhao

【Abstract】: Numerical data clustering is a tractable task since well-defined numerical measures such as the traditional Euclidean distance can be used for it directly, but nominal data clustering is a very difficult problem because there is no natural relative ordering between nominal attribute values. This paper aims to make the Euclidean distance measure applicable to nominal data clustering, and the core idea is to transform each nominal attribute value into a numerical one. The transformation method consists of three steps. In the first step, the weighted self-information, which quantifies the amount of information in attribute values, is calculated for each value of each nominal attribute. In the second step, we find the k nearest neighbors of each object, because the k nearest neighbors of an object are closely similar to it. In the last step, the weighted self-information of each attribute value in each nominal object is modified according to the object's k nearest neighbors. To evaluate the effectiveness of our proposed method, experiments are conducted on 10 data sets. Experimental results demonstrate that our method not only enables the Euclidean distance to be used for nominal data clustering, but also achieves better clustering performance than several existing state-of-the-art approaches.

【Keywords】: euclidean distance; nominal data clustering; self-information
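
The first step described above can be sketched as follows. The abstract does not define the weighting scheme, so this minimal example uses plain (unweighted) self-information, I(v) = -log2 p(v), with p(v) taken as the relative frequency of value v within its attribute. The function name and the sample column are assumptions, and the k-nearest-neighbor adjustment of the later steps is omitted.

```python
import math
from collections import Counter
from typing import List


def self_information_encode(column: List[str]) -> List[float]:
    """Replace each nominal value by -log2 of its relative frequency."""
    counts = Counter(column)
    n = len(column)
    info = {value: -math.log2(count / n) for value, count in counts.items()}
    return [info[value] for value in column]


# Hypothetical nominal attribute with one frequent and two rare values.
colors = ["red", "red", "blue", "green"]
encoded = self_information_encode(colors)
```

Rare values receive larger numbers than frequent ones, which gives the Euclidean distance something to operate on; whether this matches the paper's weighted variant is not stated in the abstract.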

229. Interest Diffusion in Heterogeneous Information Network for Personalized Item Ranking.

【Paper Link】【Pages】:2087-2090

【Authors】: Mukul Gupta ; Pradeep Kumar ; Rajhans Mishra

【Abstract】: Personalized item ranking for recommending the top-N items of interest to a user is an interesting and challenging problem in e-commerce. Researchers and practitioners are continuously trying to devise new methodologies to improve the accuracy of recommendations. The recommendation problem becomes more challenging for sparse binary implicit feedback, due to the absence of explicit signals of interest and the sparseness of the data. In this paper, we deal with the problems of data sparseness and recommendation accuracy. To address them, we propose an interest diffusion methodology in a heterogeneous information network for items to be recommended, using the meta-information related to items. In this heterogeneous information network, graph-regularized interest diffusion is performed to generate personalized recommendations of top-N items. For interest diffusion, personalized weight learning is performed for the different meta-information object types in the network. The experimental evaluation and comparison of the proposed methodology with state-of-the-art techniques on real-world datasets show the effectiveness of the proposed approach.

【Keywords】: heterogeneous information network; implicit feedback; interest diffusion; meta-information

230. Source Retrieval for Web-Scale Text Reuse Detection.

【Paper Link】【Pages】:2091-2094

【Authors】: Matthias Hagen ; Martin Potthast ; Payam Adineh ; Ehsan Fatehifar ; Benno Stein

【Abstract】: The first step of text reuse detection addresses the source retrieval problem: given a suspicious document, a set of candidate sources from which text might have been reused has to be retrieved by querying a search engine. Afterwards, in a second step, the retrieved candidates run through a text alignment with the suspicious document in order to identify reused passages. Obviously, any true source of text reuse that is not retrieved during the source retrieval step reduces the overall recall of a reuse detector. Hence, source retrieval is a recall-oriented task, a fact ignored even by experts: only 3 of 20 teams participating in a respective task at PAN 2012-2016 managed to find more than half of the sources, the best one achieving a recall of only 0.59. We propose a new approach that reaches a recall of 0.89, a performance gain of 51%.

【Keywords】: pan; plagiarism detection; query formulation; recall-oriented retrieval; source retrieval; text reuse detection

231. Smart City Analytics: Ensemble-Learned Prediction of Citizen Home Care.

【Paper Link】【Pages】:2095-2098

【Authors】: Casper Hansen ; Christian Hansen ; Stephen Alstrup ; Christina Lioma

【Abstract】: We present an ensemble learning method that predicts large increases in the hours of home care received by citizens. The method is supervised, and uses different ensembles of either linear (logistic regression) or non-linear (random forests) classifiers. Experiments with data available from 2013 to 2017 for every citizen in Copenhagen receiving home care (27,775 citizens) show that prediction can achieve state of the art performance as reported in similar health related domains (AUC=0.715). We further find that competitive results can be obtained by using limited information for training, which is very useful when full records are not accessible or available. Smart city analytics does not necessarily require full city records. To our knowledge this preliminary study is the first to predict large increases in home care for smart city analytics.

【Keywords】: ensemble learning; home care; smart city analytics

232. Fast K-means for Large Scale Clustering.

【Paper Link】【Pages】:2099-2102

【Authors】: Qinghao Hu ; Jiaxiang Wu ; Lu Bai ; Yifan Zhang ; Jian Cheng

【Abstract】: The k-means algorithm has been widely used in machine learning and data mining due to its simplicity and good performance. However, the standard k-means algorithm is quite slow when clustering millions of data points into thousands or even tens of thousands of clusters. In this paper, we propose a fast k-means algorithm named multi-stage k-means (MKM), which uses a multi-stage filtering approach. The multi-stage filtering approach greatly accelerates the k-means algorithm via a coarse-to-fine search strategy. To further speed up the algorithm, hashing is introduced to accelerate the assignment step, which is the most time-consuming part of k-means. Extensive experiments on several massive datasets show that the proposed algorithm can obtain up to 600X speed-up over the k-means algorithm with comparable accuracy.

【Keywords】: clustering; hashing; k-means
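
For context, the assignment step that the abstract identifies as the bottleneck can be sketched as below. This is the standard brute-force step of k-means, not MKM's multi-stage filtering or hashing; the function name and toy data are assumptions. Its cost grows with the number of points times the number of centroids times the dimension, which is what makes very large k expensive.

```python
from typing import List, Sequence


def assign(points: Sequence[Sequence[float]],
           centroids: Sequence[Sequence[float]]) -> List[int]:
    """Assign each point to its nearest centroid (squared Euclidean distance)."""
    labels = []
    for p in points:
        best, best_dist = 0, float("inf")
        for j, c in enumerate(centroids):  # every centroid is checked for every point
            dist = sum((a - b) ** 2 for a, b in zip(p, c))
            if dist < best_dist:
                best, best_dist = j, dist
        labels.append(best)
    return labels


# Toy data: two well-separated groups and two centroids.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.8, 5.1)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
labels = assign(points, centroids)
```

A coarse-to-fine strategy like the one the abstract describes would aim to skip most of the inner loop by ruling out distant centroids early.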

233. Graph Ladder Networks for Network Classification.

【Paper Link】【Pages】:2103-2106

【Authors】: Ruiqi Hu ; Shirui Pan ; Jing Jiang ; Guodong Long

【Abstract】: Numerous network representation-based algorithms for network classification have emerged in recent years, but many suffer from two limitations. First, they separate network representation learning and node classification into two steps, which may lead to sub-optimal results because the node representation may not fit the classification model well, and vice versa. Second, they are mostly shallow methods that can only capture linear and simple relationships in the data. In this paper, we propose an effective deep learning model, Graph Ladder Networks (GLN), for node classification in networks. Our model learns a ladder network that unifies representation learning and network classification in a single framework by exploiting both labeled and unlabeled nodes in a network. To integrate both structure and node content information in the networks, the recently developed graph convolutional network is further employed. Experiments on the most popular academic network dataset, Citeseer, demonstrate that our approach reaches outstanding performance compared to other state-of-the-art algorithms.

【Keywords】: graph convolutional network; ladder network; network classification; network representation

234. A Communication Efficient Parallel DBSCAN Algorithm based on Parameter Server.

【Paper Link】【Pages】:2107-2110

【Authors】: Xu Hu ; Jun Huang ; Minghui Qiu

【Abstract】: Recent benchmark studies show that MPI-based distributed implementations of DBSCAN, e.g., PDSDBSCAN, outperform other implementations such as those based on Apache Spark. However, the communication cost of MPI DBSCAN increases drastically with the number of processors, which makes it inefficient for large-scale problems. In this paper, we propose PS-DBSCAN, a parallel DBSCAN algorithm that combines the disjoint-set data structure and the Parameter Server framework to minimize communication cost. Since data points within the same cluster may be distributed over different workers, which results in several disjoint-sets, merging them incurs large communication costs. In our algorithm, we employ a fast global union approach to union the disjoint-sets and alleviate the communication burden. Experiments on datasets of different scales demonstrate that PS-DBSCAN outperforms PDSDBSCAN with a 2-10 times speedup in communication efficiency. We have released PS-DBSCAN on an algorithm platform called Platform of AI (PAI) in Alibaba Cloud.

【Keywords】: density-based clustering; parallel dbscan; parameter server
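
The disjoint-set structure mentioned in the abstract can be sketched as below. This is a single-process illustration with path compression, not the paper's parameter-server global union; the class, the worker link lists, and the point ids are hypothetical. It shows only why a point shared between two workers' local clusters forces the clusters to merge.

```python
from typing import Dict, List, Tuple


class DisjointSet:
    """Union-find over integer point ids, with path compression."""

    def __init__(self) -> None:
        self.parent: Dict[int, int] = {}

    def find(self, x: int) -> int:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # compress the path just walked
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Use the smaller id as the root so labels are deterministic.
            self.parent[max(ra, rb)] = min(ra, rb)


# Hypothetical density-reachable pairs found locally by two workers.
worker_a_links: List[Tuple[int, int]] = [(1, 2), (2, 3)]
worker_b_links: List[Tuple[int, int]] = [(3, 4), (5, 6)]

ds = DisjointSet()
for a, b in worker_a_links + worker_b_links:
    ds.union(a, b)
cluster_of = {p: ds.find(p) for p in range(1, 7)}
```

Point 3 appears in both workers' links, so points 1 to 4 end up in one cluster; in a distributed setting each such merge requires communication, which is the cost the abstract targets.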

235. KIEM: A Knowledge Graph based Method to Identify Entity Morphs.

【Paper Link】【Pages】:2111-2114

【Authors】: Longtao Huang ; Lin Zhao ; Shangwen Lv ; Fangzhou Lu ; Yue Zhai ; Songlin Hu

【Abstract】: An entity on the web can be referred to by numerous morphs that are often ambiguous, implicit and informal, which makes it challenging to accurately identify all the morphs corresponding to a specific entity. In this paper, we introduce a novel method based on a knowledge graph, which takes advantage of both knowledge reasoning and statistical learning. First, we present a model to build a knowledge graph for the given entity. The knowledge graph integrates the fragmented knowledge on how humans create morphs. Then, the candidate morphs are generated based on the rules summarized from the knowledge graph. Finally, we use a classification method to filter out the useless candidates and identify the target morphs. Experiments conducted on a real-world dataset demonstrate the efficiency of our proposed method in terms of precision and recall.

【Keywords】: entity morphs; knowledge graph; language understanding; web mining

236. Ontology-based Graph Visualization for Summarized View.

【Paper Link】【Pages】:2115-2118

【Authors】: Xin Huang ; Byron Choi ; Jianliang Xu ; William K. Cheung ; Yanchun Zhang ; Jiming Liu

【Abstract】: Data summarization that presents a small subset of a dataset to users has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., the International Classification of Diseases-9, Medical Subject Headings, and Gene Ontology, to name a few. In this paper, we study the problem of selecting a diverse set of k elements to summarize an input dataset with hierarchical terminologies, and visualize the summary in an ontology structure. We propose an efficient greedy algorithm to solve the problem with a (1-1/e) ≈ 62% approximation guarantee. Preliminary experimental results on real-world datasets show the effectiveness and efficiency of the proposed algorithm for data summarization.

【Keywords】: approximation algorithm; data summarization; graph visualization; ontology structure; top-k diversification
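
A (1-1/e) guarantee is the classical bound for greedy maximization of a monotone submodular function under a cardinality constraint. The abstract does not spell out the paper's objective, so the sketch below uses maximum coverage as a stand-in objective; the term names, the covered record ids, and the `greedy_select` helper are assumptions.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[int]], k: int) -> List[str]:
    """Pick up to k candidates, each time taking the largest marginal coverage."""
    chosen: List[str] = []
    covered: Set[int] = set()
    for _ in range(min(k, len(candidates))):
        # Sorting makes tie-breaking deterministic (alphabetical).
        remaining = sorted(c for c in candidates if c not in chosen)
        best = max(remaining, key=lambda c: len(candidates[c] - covered))
        if not candidates[best] - covered:
            break  # nothing new can be covered
        chosen.append(best)
        covered |= candidates[best]
    return chosen


# Hypothetical terms, each covering a set of record ids.
terms = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1}}
summary = greedy_select(terms, k=2)
```

Each round re-scores the remaining candidates against what is already covered, which is what rewards diversity: a candidate overlapping heavily with earlier picks contributes little.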

237. An Ad CTR Prediction Method Based on Feature Learning of Deep and Shallow Layers.

【Paper Link】【Pages】:2119-2122

【Authors】: Zai Huang ; Zhen Pan ; Qi Liu ; Bai Long ; Haiping Ma ; Enhong Chen

【Abstract】: In online advertising, Click-Through Rate (CTR) prediction is a crucial task, as it may benefit the ranking and pricing of online ads. To the best of our knowledge, most of the existing CTR prediction methods are shallow layer models (e.g., Logistic Regression and Factorization Machines) or deep layer models (e.g., Neural Networks). Unfortunately, the shallow layer models cannot capture or utilize high-order nonlinear features in ad data. On the other hand, the deep layer models cannot meet the need to update CTR models online efficiently, due to their high computational complexity. To address these shortcomings, we propose a novel hybrid method based on feature learning of both Deep and Shallow Layers (DSL). In DSL, we use a Deep Neural Network as a deep layer model trained offline to learn high-order nonlinear features, and use Factorization Machines as a shallow layer model for CTR prediction. Furthermore, we develop an online learning implementation based on DSL, i.e., onlineDSL. Extensive experiments on large-scale real-world datasets clearly validate the effectiveness of our DSL method and onlineDSL algorithm compared with several state-of-the-art baselines.

【Keywords】: ctr prediction; feature learning; online advertising

238. A Framework for Estimating Execution Times of IO Traces on SSDs.

【Paper Link】【Pages】:2123-2126

【Authors】: Yoonsuk Kang ; Yong-Yeon Jo ; Jaehyuk Cha ; Wan D. Bae ; Sang-Wook Kim

【Abstract】: With the NAND flash memory technology of solid-state drives (SSDs), the usage of SSDs has expanded to various devices. Due to the cost and time limitations of measuring the actual execution time of each application on SSDs, it is difficult for users to determine the best SSD for their most commonly used applications. In this paper, we propose a framework for estimating the execution time of an application IO trace (i.e., a query IO trace) on a target SSD without actually executing it. Our framework is based on the observation that if two IO traces are similar in their IO behavior, their execution times tend to be similar when executed on the same SSD. The performance of the framework is evaluated through extensive experiments on real applications. The results show that our framework is accurate in estimating the execution time of an IO trace on SSDs.

【Keywords】: application io trace; execution time estimation; solid-state drive (ssd)

239. Ranking Rich Mobile Verticals based on Clicks and Abandonment.

【Paper Link】【Pages】:2127-2130

【Authors】: Mami Kawasaki ; Inho Kang ; Tetsuya Sakai

【Abstract】: We consider the problem of ranking rich verticals, which we call "cards," for a given mobile search query. Examples of card types include "SHOP" (showing access and contact information of a shop), "WEATHER" (showing a weather forecast for a particular location), and "TV" (showing information about a TV programme). These cards can be highly visual and/or concise, and may often satisfy the user's information need without making her click on them. While this "good abandonment" of the search engine result page is ideal especially for mobile environments where the interaction between the user and the search engine should be minimal, it poses a challenge for search engine companies whose ranking algorithms rely heavily on click data. In order to provide the right card types to the user for a given query, we propose a graph-based approach which extends a click-based automatic relevance estimation algorithm of Agrawal et al., by incorporating an abandonment-based preference rule. Using a real mobile query log from a commercial search engine, we constructed a data set containing 2,472 pairwise card type preferences covering 992 distinct queries, by hiring three independent assessors. Our proposed method outperforms a click-only baseline by 53-68% in terms of card type preference accuracy. The improvement is also statistically highly significant, with p ≈ 0.0000 according to the paired randomisation test.

【Keywords】: click data; good abandonment; mobile search; vertical ranking

240. Semantic Rules for Machine Diagnostics: Execution and Management.

【Paper Link】【Pages】:2131-2134

【Authors】: Evgeny Kharlamov ; Ognjen Savkovic ; Guohui Xiao ; Rafael Peñaloza ; Gulnar Mehdi ; Mikhail Roshchin ; Ian Horrocks

【Abstract】: Rule-based diagnostics of equipment is an important task in industry. In this paper we present how semantic technologies can enhance diagnostics. In particular, we present our semantic rule language sigRL, which is inspired by the real diagnostic languages used at Siemens. SigRL allows one to write compact yet powerful diagnostic programs by relying on a high-level, data-independent vocabulary, diagnostic ontologies, and queries over these ontologies. We study the computational complexity of SigRL: execution of diagnostic programs, provenance computation, as well as automatic verification of redundancy and inconsistency in diagnostic programs.

【Keywords】: complexity; diagnostic systems; ontologies; rules; sensor signals

241. Machine Learning based Performance Modeling of Flash SSDs.

【Paper Link】【Pages】:2135-2138

【Authors】: Jaehyung Kim ; Jinuk Park ; Sanghyun Park

【Abstract】: Flash memory based solid state drives (SSDs) have alleviated the I/O bottleneck by exploiting their data-parallel design. In enterprise environments, flash SSDs are used in hybrid storage architectures to achieve better performance at lower cost. In such an architecture, I/O load balancing is one of the important factors. However, the internal parallelism distorts the performance measures of flash SSDs. Despite the criticality of load balancing in I/O-intensive environments, this issue has rarely been studied. In this paper, we examine the effectiveness of applying machine learning classification methods to I/O saturation estimation, using Linux kernel I/O statistics instead of the utilization measure that is currently used for HDDs. We conclude that the machine learning techniques we employed (Support Vector Machine and LASSO Generalized Linear Model) perform well compared to the existing utilization measure, even though we cannot collect the internal information of the flash SSDs.

【Keywords】: flash ssd; load balancing; machine learning

242. A Robust Named-Entity Recognition System Using Syllable Bigram Embedding with Eojeol Prefix Information.

【Paper Link】【Pages】:2139-2142

【Authors】: Sunjae Kwon ; Youngjoong Ko ; Jungyun Seo

【Abstract】: Korean named-entity recognition (NER) systems have been developed mainly on the morphological-level, and they are commonly based on a pipeline framework that identifies named-entities (NEs) following the morphological analysis. However, this framework can mean that the performance of NER systems is degraded, because errors from the morphological analysis propagate into NER systems. This paper proposes a novel syllable-level NER system, which does not require a morphological analysis and can achieve a similar or better performance compared with the morphological-level NER systems. In addition, because the proposed system does not require a morphological analysis step, its processing speed is about 1.9 times faster than those of the previous morphological-level NER systems.

【Keywords】: eojeol prefix information; korean syllable-level named-entity recognition; syllable bigram embedding

243. IDAE: Imputation-boosted Denoising Autoencoder for Collaborative Filtering.

【Paper Link】【Pages】:2143-2146

【Authors】: Jae-woong Lee ; Jongwuk Lee

【Abstract】: In recent years, while deep neural networks have shown impressive performance on various recognition and classification problems, their use in collaborative filtering (CF) has received relatively little attention. Because of inherent data sparsity, CF remains a challenging problem for deep neural networks. In this paper, we propose a new CF model, namely the imputation-boosted denoising autoencoder (IDAE), for top-N recommendation. Specifically, IDAE consists of two steps: imputing positive values and learning with imputed values. First, it infers and imputes positive user feedback from missing values. Then, the correlation between items is learned by using the denoising autoencoder (DAE) with imputed values. Unlike the existing DAE that randomly corrupts the input, the key characteristic of IDAE is that original user values are taken as the input, and imputed values are reflected as the corrupted output. Our experimental results demonstrate that IDAE significantly outperforms state-of-the-art CF algorithms using autoencoders (by up to 5%) on the MovieLens datasets.

【Keywords】: collaborative filtering; data imputation; denoising autoencoders

244. Computing Betweenness Centrality in B-hypergraphs.

【Paper Link】【Pages】:2147-2150

【Authors】: Kwang Hee Lee ; Myoung-Ho Kim

【Abstract】: The directed hypergraph (especially B-hypergraph) has hyperedges that represent relations of a set of source nodes to a single target node. Author-cited networks and cellular signaling pathways can be modeled as a B-hypergraph. In this paper every source node of a hyperedge in the shortest path p in a B-hypergraph is considered a participant of p. We propose a betweenness centrality in the B-hypergraph that measures the number of shortest paths in which a node participates. The algorithm for computing the approximated betweenness centrality scores is also proposed. Through various performance experiments such as attack robustness and reachability tests, we show that our proposed betweenness centrality is a more appropriate measure in real-world B-hypergraph applications than ordinary betweenness centrality.

【Keywords】: b-hypergraph; betweenness centrality; directed hypergraph

245. Structural-fitting Word Vectors to Linguistic Ontology for Semantic Relatedness Measurement.

【Paper Link】【Pages】:2151-2154

【Authors】: Yang-Yin Lee ; Ting-Yu Yen ; Hen-Hsen Huang ; Hsin-Hsi Chen

【Abstract】: With the aid of recently proposed word embedding algorithms, the study of semantic relatedness has advanced rapidly. In this research, we propose a novel structural-fitting method that incorporates a linguistic ontology into vector space representations. The ontological information is applied in two ways. The fine2coarse approach refines the word vectors from fine-grained to coarse-grained terms (word types), while the coarse2fine approach refines the word vectors from coarse-grained to fine-grained terms. In the experiments, we show that our proposed methods outperform previous approaches on seven publicly available benchmark datasets.

【Keywords】: word embedding; linguistic ontology; retrofitting; semantic relatedness; structural-fitting

246. Alternating Pointwise-Pairwise Learning for Personalized Item Ranking.

【Paper Link】【Pages】:2155-2158

【Authors】: Yu Lei ; Wenjie Li ; Ziyu Lu ; Miao Zhao

【Abstract】: Pointwise and pairwise collaborative ranking are two major classes of algorithms for personalized item ranking. This paper proposes a novel joint learning method named alternating pointwise-pairwise learning (APPL) to improve ranking performance. APPL combines the ideas of both pointwise and pairwise learning, and is able to produce a more effective prediction model. The extensive experiments with both explicit and implicit feedback settings on four real-world datasets demonstrate that APPL performs significantly better than the state-of-the-art methods.

【Keywords】: collaborative ranking; item recommendation; personalized item ranking

247. Deep Multi-Similarity Hashing for Multi-label Image Retrieval.

【Paper Link】【Pages】:2159-2162

【Authors】: Tong Li ; Sheng Gao ; Yajing Xu

【Abstract】: Image retrieval based on deep hashing methods has attracted more and more attention from both academia and industry, due to the outstanding performance of deep neural networks in various computer vision tasks. However, most hashing methods are designed to learn simple similarity only for single-label image retrieval, and thus cannot work well for multi-label cases. In this paper, we propose a framework named Deep Multi-Similarity Hashing (DMSH) to learn semantic binary representations for the multi-label image retrieval task. In the proposed model, a convolutional architecture is combined with a hash function to learn compact binary representations from every pair of images with multiple labels. For the purpose of learning the semantic structure of multi-label images, we define a pairwise loss for multi-label image pairs, which is influenced by a zero-loss interval controlled by the number of common labels. The objective loss function consists of a hashing quantization loss and the pairwise loss for multi-label images, and pays more attention to high-level similarity than to low-level similarity during training. Furthermore, our proposed model is flexible enough to be implemented with various deep networks. Experiments on the large-scale dataset NUS-WIDE demonstrate the state-of-the-art performance of our proposed DMSH model in the task of multi-label image retrieval.

【Keywords】: common labels; content based image retrieval; deep hashing method; multilabel image retrieval

248. Learning Graph-based Embedding For Time-Aware Product Recommendation.

【Paper Link】【Pages】:2163-2166

【Authors】: Yuqi Li ; Weizheng Chen ; Hongfei Yan

【Abstract】: In this paper, we propose a novel Product Graph Embedding (PGE) model to investigate time-aware product recommendation by leveraging the network representation learning technique. Our model captures the sequential influences of products by transforming the historical purchase records into a product graph. Each product can then be transformed into a low-dimensional vector by the network embedding model. Once products are projected into the latent space, we present a novel method to compute a user's latest preferences, which projects users into the same latent space as products. This method is based on time-decay functions and the embeddings of the sequence of products that the user purchased. Thus, the relatedness between a product and a user can be measured by the similarity between the embedding vectors that represent the product and the user's preferences. The experimental results on purchase records crawled from JINGDONG show the superiority of our proposed framework for personalized product recommendation.

【Keywords】: dynamic user embedding; network embedding; product recommendation; time aware
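
The user-projection step described above can be sketched as follows. The abstract does not give the decay function or the similarity measure, so this example assumes exponential decay and cosine similarity; the decay rate, the two-dimensional toy embeddings, and the function names are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def user_embedding(purchases: List[Tuple[str, float]],
                   embeddings: Dict[str, List[float]],
                   now: float,
                   decay_rate: float = 0.1) -> List[float]:
    """Time-decayed weighted average of the purchased products' embeddings."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for product, timestamp in purchases:
        weight = math.exp(-decay_rate * (now - timestamp))  # older means lighter
        weight_sum += weight
        for i, x in enumerate(embeddings[product]):
            total[i] += weight * x
    return [x / weight_sum for x in total]


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy product embeddings; timestamps are in days.
embeddings = {"phone": [1.0, 0.0], "case": [0.9, 0.1], "novel": [0.0, 1.0]}
purchases = [("novel", 0.0), ("phone", 30.0)]
user = user_embedding(purchases, embeddings, now=30.0)
score_case = cosine(user, embeddings["case"])
score_novel = cosine(user, embeddings["novel"])
```

Because the phone purchase is recent and the novel purchase is 30 days old, the user vector lies close to the phone, so the phone case scores higher than the novel.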

249. An Enhanced Topic Modeling Approach to Multiple Stance Identification.

【Paper Link】【Pages】:2167-2170

【Authors】: Junjie Lin ; Wenji Mao ; Yuhao Zhang

【Abstract】: People often publish online texts to express their stances, which reflect the essential viewpoints they hold. Stance identification has been an important research topic in text analysis and facilitates many applications in business, public security and government decision making. Previous work on stance identification focuses solely on classifying a supportive or unsupportive attitude towards a certain topic or entity. The other important type of stance identification, multiple stance identification, which focuses on identifying the different standpoints of multiple parties involved in online texts, has been largely ignored in previous research. In this paper, we address the problem of recognizing distinct standpoints implied in textual data. As people are inclined to discuss topics favorable to their standpoints, topics can provide information that distinguishes different standpoints. We propose a topic-based method for standpoint identification. To acquire more distinguishable topics, we further enhance the topic model by adding constraints on document-topic distributions. Finally, we conduct experimental studies on two real datasets to verify the effectiveness of our approach to multiple stance identification.

【Keywords】: multiple stance identification; constrained nonnegative matrix factorization; topic modeling

250. TICC: Transparent Inter-Column Compression for Column-Oriented Database Systems.

【Paper Link】【Pages】:2171-2174

【Authors】: Hao Liu ; Yudian Ji ; Jiang Xiao ; Haoyu Tan ; Qiong Luo ; Lionel M. Ni

【Abstract】: In this paper, we present TICC, an automatic data compression component that can transparently eliminate data redundancies across columns in column-oriented database systems. We further propose two approaches to integrate inter-column compression into existing database systems. One approach is to use User Defined Functions (UDFs), and the other is native. We implement these two approaches on top of Hive based on the ORC file, a common data format in column stores, and evaluate the performance of TICC using real-world datasets. The experimental results demonstrate that TICC can significantly reduce the storage overhead and process a variety of queries over large-scale data with up to 20% performance improvement over the original Hive.

【Keywords】: column store; cross-column redundancy; data compression

251. Exploiting User Consuming Behavior for Effective Item Tagging.

【Paper Link】【Pages】:2175-2178

【Authors】: Shen Liu ; Hongyan Liu

【Abstract】: Automatic tagging techniques are important for many applications such as search and recommendation, and have attracted many researchers' attention in recent years. Existing methods mainly rely on users' tagging behavior or items' content information for tagging, yet users' consuming behavior is ignored. In this paper, we propose to leverage such information and introduce a probabilistic model called joint-tagging LDA to improve tagging accuracy. An effective algorithm based on Zero-Order Collapsed Variational Bayes is developed. Experiments conducted on a real dataset demonstrate that joint-tagging LDA outperforms existing competing methods.

【Keywords】: generative model; tag recommendation; user behavior modeling

252. SEQ: Example-based Query for Spatial Objects.

【Paper Link】【Pages】:2179-2182

【Authors】: Siqiang Luo ; Jiafeng Hu ; Reynold Cheng ; Jing Yan ; Ben Kao

【Abstract】: Spatial object search is prevalent in map services (e.g., Google Maps). To rent an apartment, for example, one will take into account its nearby facilities, such as supermarkets, hospitals, and subway stations. Traditional keyword search solutions, such as the nearby function in Google Maps, are insufficient for expressing the often complex attribute and spatial requirements of users. Those requirements, however, are essential to reflect the user's search intention. In this paper, we propose the Spatial Exemplar Query (SEQ), which allows the user to input a result example over an interface inside the map service. We then propose an effective similarity measure to evaluate the proximity between a candidate answer and the given example. We conduct a user study to validate the effectiveness of SEQ. Our results show that more than 88% of users would like to have example-assisted search in map services. Moreover, SEQ obtains a user satisfaction score of 4.3/5.0, which is more than 2 times higher than that of a baseline solution.

【Keywords】: exemplar query; query by example; spatial query

253. Truth Discovery by Claim and Source Embedding.

【Paper Link】【Pages】:2183-2186

【Authors】: Shanshan Lyu ; Wentao Ouyang ; Huawei Shen ; Xueqi Cheng

【Abstract】: Information gathered from multiple sources on the Web often exhibits conflicts. This phenomenon motivates the need for truth discovery, which aims to automatically find the true claim among multiple conflicting claims. Existing truth discovery methods are mainly based on iterative updates or probabilistic models. In particular, iterative methods specify rules that govern how credibility flows from sources to claims and then back to sources. However, these manually-defined rules tend to be ad hoc and are difficult to adapt and analyze. Probabilistic methods model a few latent factors that impact how sources make claims, such as randomly choosing, guessing, or mistaking. However, these manually-defined factors may not reflect the underlying data distributions well. Given these limitations, we propose a new, unsupervised model for truth discovery in this paper. Our model first constructs a heterogeneous network that exploits both source-claim and source-source relationships. It then embeds the network into a low-dimensional space through a principled algorithm such that trustworthy sources and true claims are close to each other (and likewise unreliable sources and false claims). In this way, truth discovery can be conveniently performed in the embedding space. Compared with existing methods, our model does not need manually-defined rules or factors. Rather, it learns the embeddings automatically from data. Experiments on two real-world datasets demonstrate that our model outperforms existing state-of-the-art methods for truth discovery.

【Keywords】: crowdsourcing; representation learning; truth discovery

254. Automatic Catchphrase Identification from Legal Court Case Documents.

【Paper Link】【Pages】:2187-2190

【Authors】: Arpan Mandal ; Kripabandhu Ghosh ; Arindam Pal ; Saptarshi Ghosh

【Abstract】: Automatically identifying catchphrases from legal court case documents is an important problem in Legal Information Retrieval, which has not been extensively studied. In this work, we propose an unsupervised approach for extraction and ranking of catchphrases from court case documents, by focusing on noun phrases. Using a dataset of gold standard catchphrases created by legal experts from real-life court documents, we compare the proposed approach with several unsupervised and supervised baselines. We show that the proposed methodology achieves statistically significantly better performance compared to all the baselines.

【Keywords】: catchphrase extraction; court cases; legal ir

255. Learning Temporal Ambiguity in Web Search Queries.

【Paper Link】【Pages】:2191-2194

【Authors】: Behrooz Mansouri ; Mohammad Sadegh Zahedi ; Maseud Rahgozar ; Farhad Oroumchian ; Ricardo Campos

【Abstract】: Time has a strong influence on web search. The temporal intent of the searcher adds an important dimension to the relevance judgments of web queries. However, a lack of understanding of searchers' temporal requirements increases the ambiguity of the queries, turning retrieval effectiveness improvements into a complex task. In this paper, we propose an approach to classify web queries into four different categories considering their temporal ambiguity. For each query, we develop features from its search volumes and related queries using Google Trends and its related top Wikipedia pages. Our experimental results show that these features can determine the temporal ambiguity of a given query with high accuracy. We have demonstrated that a Multilayer Perceptron network can achieve better results in classifying the temporal class of queries in comparison to other classifiers.

【Keywords】: Query Intent; Temporal IR; Temporal Query Classification

256. Online Expectation-Maximization for Click Models.

【Paper Link】【Pages】:2195-2198

【Authors】: Ilya Markov ; Alexey Borisov ; Maarten de Rijke

【Abstract】: Click models allow us to interpret user click behavior in search interactions and to remove various types of bias from user clicks. Existing studies on click models consider a static scenario where user click behavior does not change over time. We show empirically that click models deteriorate over time if retraining is avoided. We then adapt online expectation-maximization (EM) techniques to efficiently incorporate new click/skip observations into a trained click model. Our instantiation of Online EM for click models is orders of magnitude more efficient than retraining the model from scratch using standard EM, while losing little in quality. To deal with outdated click information, we propose a variant of online EM called EM with Forgetting, which surpasses the performance of complete retraining while being as efficient as Online EM.

【Keywords】:

257. Task Embeddings: Learning Query Embeddings using Task Context.

【Paper Link】【Pages】:2199-2202

【Authors】: Rishabh Mehrotra ; Emine Yilmaz

【Abstract】: Continuous space word embeddings have been shown to be highly effective in many information retrieval tasks. Embedding representation models make use of local information available in immediately surrounding words to project nearby context words closer in the embedding space. With the rising multi-tasking nature of web search sessions, users often try to accomplish different tasks in a single search session. Consequently, the search context gets polluted with queries from different unrelated tasks, which renders the context heterogeneous. In this work, we hypothesize that task information provides better context for IR systems to learn from. We propose a novel task context embedding architecture to learn representations of queries in low-dimensional space by leveraging their task context information from historical search logs using neural embedding models. In addition to qualitative analysis, we empirically demonstrate the benefit of leveraging task context to learn query representations.

【Keywords】: neural embeddings; query representations; search tasks

258. Hierarchical RNN with Static Sentence-Level Attention for Text-Based Speaker Change Detection.

【Paper Link】【Pages】:2203-2206

【Authors】: Zhao Meng ; Lili Mou ; Zhi Jin

【Abstract】: Speaker change detection (SCD) is an important task in dialog modeling. Our paper addresses the problem of text-based SCD, which differs from existing audio-based studies and is useful in various scenarios, for example, processing dialog transcripts where speaker identities are missing (e.g., OpenSubtitle), and enhancing audio SCD with textual information. We formulate text-based SCD as a matching problem of utterances before and after a certain decision point; we propose a hierarchical recurrent neural network (RNN) with static sentence-level attention. Experimental results show that neural networks consistently achieve better performance than feature-based approaches, and that our attention-based model significantly outperforms non-attention neural networks.

【Keywords】: hierarchical recurrent neural network; sentence-level attention; speaker change detection

259. Predicting Short-Term Public Transport Demand via Inhomogeneous Poisson Processes.

【Paper Link】【Pages】:2207-2210

【Authors】: Aditya Krishna Menon ; Young Lee

【Abstract】: Forecasting short term passenger demand for public transport is a core problem in urban mobility. Typically, this is addressed using Poisson regression or homogeneous Poisson processes. However, such approaches have several limitations, including susceptibility to noise at fine time granularities, and the inability to capture complex non-stationary trends. In this paper, we show how such short term demand can be accurately modelled with an inhomogeneous Poisson process, using a neural network as the underlying intensity. This choice of intensity subsumes existing models as special cases, and is powerful enough to capture certain stylised facts of real-world demand. Experiments on real-world bus arrival data from a large metropolitan area in Australia validate our approach.

【Keywords】: neural network; point process; urban mobility
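
The abstract above models demand as an inhomogeneous Poisson process. As a rough illustration, the log-likelihood of event times t_1..t_n on [0, T] under an intensity λ(t) is the sum of log λ(t_i) minus the integral of λ over [0, T]. The sketch below evaluates that quantity numerically. The function name, the trapezoidal integration, and the toy sinusoidal intensity are assumptions made for illustration; the paper uses a neural network as the intensity, which is not reproduced here.

```python
import math


def poisson_process_log_likelihood(event_times, intensity, horizon, steps=10_000):
    """Log-likelihood of events on [0, horizon] under a time-varying intensity.

    `intensity` is any callable returning a strictly positive rate at time t
    (a stand-in for the learned intensity; an assumption for illustration).
    """
    # Sum of log-intensities at the observed event times.
    log_term = sum(math.log(intensity(t)) for t in event_times)

    # Integral of the intensity over [0, horizon], via the trapezoidal rule.
    h = horizon / steps
    integral = 0.5 * (intensity(0.0) + intensity(horizon))
    integral += sum(intensity(i * h) for i in range(1, steps))
    integral *= h

    return log_term - integral


def toy_intensity(t):
    """Hypothetical daily-cycle intensity, always positive."""
    return 2.0 + math.sin(2.0 * math.pi * t / 24.0)


arrivals = [0.5, 3.2, 7.9, 8.1, 8.4, 17.0]
print(poisson_process_log_likelihood(arrivals, toy_intensity, horizon=24.0))
```

With a constant intensity this reduces to the homogeneous Poisson process likelihood, which is one sense in which a flexible intensity subsumes the simpler model.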

260. Analyzing Mathematical Content to Detect Academic Plagiarism.

【Paper Link】【Pages】:2211-2214

【Authors】: Norman Meuschke ; Moritz Schubotz ; Felix Hamborg ; Tomás Skopal ; Bela Gipp

【Abstract】: This paper presents, to our knowledge, the first study on analyzing mathematical expressions to detect academic plagiarism. We make the following contributions. First, we investigate confirmed cases of plagiarism to categorize the similarities of mathematical content commonly found in plagiarized publications. From this investigation, we derive possible feature selection and feature comparison strategies for developing math-based detection approaches and a ground truth for our experiments. Second, we create a test collection by embedding confirmed cases of plagiarism into the NTCIR-11 MathIR Task dataset, which contains approx. 60 million mathematical expressions in 105,120 documents from arXiv.org. Third, we develop a first math-based detection approach by implementing and evaluating different feature comparison approaches using an open source parallel data processing pipeline built using the Apache Flink framework. The best performing approach identifies all but two of our real-world test cases at the top rank and achieves a mean reciprocal rank of 0.86. The results show that mathematical expressions are promising text-independent features to identify academic plagiarism in large collections. To facilitate future research on math-based plagiarism detection, we make our source code and data available.

【Keywords】: mathematical information retrieval; plagiarism detection
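
The abstract above reports a mean reciprocal rank (MRR). For readers unfamiliar with the metric, a minimal sketch follows: each test case contributes the reciprocal of the rank at which its correct source document is retrieved, and the scores are averaged. The function name and the example ranks are hypothetical and are not the paper's data.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over test cases.

    `ranks` holds the 1-based rank of the correct item for each test case,
    or None when the correct item was not retrieved (contributing 0).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical example: three cases found at rank 1, one at rank 4, one missed.
print(mean_reciprocal_rank([1, 1, 1, 4, None]))
```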

261. Learning Entity Type Embeddings for Knowledge Graph Completion.

【Paper Link】【Pages】:2215-2218

【Authors】: Changsung Moon ; Paul Jones ; Nagiza F. Samatova

【Abstract】: Missing data is a severe problem for algorithms that operate over knowledge graphs (KGs). Most previous research in KG completion has focused on the problem of inferring missing entities and missing relation types between entities. However, in addition to these, many KGs also suffer from missing entity types (i.e. the category labels for entities, such as /music/artist). Entity types are a critical enabler for many NLP tasks that use KGs as a reference source, and inferring missing entity types remains an important outstanding obstacle in the field of KG completion. Inspired by recent work to build a contextual KG embedding model, we propose a novel approach to address the entity type prediction problem. We compare the performance of our method with several state-of-the-art KG embedding methods, and show that our approach gives higher prediction accuracy compared to baseline algorithms on two real-world datasets. Our approach also produces consistently high accuracy when inferring entities and relation types, as well as the primary task of inferring entity types. This is in contrast to many of the baseline methods that specialize in one prediction task or another. We achieve this while preserving linear scalability with the number of entity types. Source code and datasets from this paper can be found at (https://github.ncsu.edu/cmoon2/kg).

【Keywords】: entity type prediction; kg embedding method; knowledge graph completion; vector embedding

262. Identifying Top-K Influential Nodes in Networks.

【Paper Link】【Pages】:2219-2222

【Authors】: Sara Mumtaz ; Xiaoyang Wang

【Abstract】: Network Centrality is one of the core concepts in network analysis, which ranks the importance of a node in a network. A considerably extensive range of centrality measures exist that serve the purpose of quantifying the importance of a node according to its application and domain. One such measure is the Betweenness Centrality (BC), which computes the importance of a node in terms of the total number of shortest paths that pass through that node. However, these computations are very expensive and pose different challenges for large scale networks. In an attempt to deal with these challenges, our paper presents an approximate algorithm for the BC maximization problem, which tries to find a set of nodes with the largest BC. The core of our algorithm is the estimation technique, which is based on progressive sampling with early stopping conditions. The reduction in sample size not only results in small computational overhead, but also scales well with large networks. We experimentally evaluate our technique using different datasets to confirm the performance of the developed techniques.

【Keywords】: betweenness centrality; influential nodes; sampling
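
To make the sampling idea in the abstract above concrete, the sketch below estimates per-node betweenness centrality on an unweighted graph by running Brandes-style dependency accumulation from a random subset of source nodes and rescaling. This is a simplified stand-in, not the paper's algorithm: it scores single nodes rather than node sets, uses a fixed sample size instead of progressive sampling, and has no early stopping. The function name and the adjacency-dict input format are assumptions.

```python
import random
from collections import deque


def sampled_betweenness(adj, num_sources, seed=0):
    """Estimate betweenness from `num_sources` randomly chosen BFS sources.

    `adj` maps each node to a list of neighbours; every neighbour must also
    be a key of `adj`. Scores count ordered (source, target) pairs, so each
    undirected pair is counted twice.
    """
    nodes = list(adj)
    if not nodes or num_sources < 1:
        raise ValueError("need a non-empty graph and at least one source")
    rng = random.Random(seed)
    sources = rng.sample(nodes, min(num_sources, len(nodes)))
    bc = {v: 0.0 for v in nodes}

    for s in sources:
        sigma = {v: 0 for v in nodes}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}  # shortest-path predecessors
        order = []                      # nodes in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]

    # Rescale so the estimate is comparable to using every node as a source.
    scale = len(nodes) / len(sources)
    return {v: score * scale for v, score in bc.items()}


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(sampled_betweenness(path, num_sources=3))
```

When `num_sources` equals the number of nodes, the estimate coincides with the exact (ordered-pair) betweenness; smaller samples trade accuracy for speed.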

263. Paraphrastic Fusion for Abstractive Multi-Sentence Compression Generation.

【Paper Link】【Pages】:2223-2226

【Authors】: Mir Tafseer Nayeem ; Yllias Chali

【Abstract】: This paper presents a first attempt towards finding an abstractive compression generation system for a set of related sentences which jointly models sentence fusion and paraphrasing using continuous vector representations. Our paraphrastic fusion system improves the informativity and the grammaticality of the generated sentences. Our system can be applied to various real world applications such as text simplification, microblog, opinion and newswire summarization. We conduct our experiments on human generated multi-sentence compression datasets and evaluate our system on several newly proposed Machine Translation (MT) evaluation metrics. Our experiments demonstrate that our method brings significant improvements over the state of the art systems across different metrics.

【Keywords】: abstractive compression generation; lexical paraphrasing; multi-sentence compression; sentence fusion

264. J-REED: Joint Relation Extraction and Entity Disambiguation.

【Paper Link】【Pages】:2227-2230

【Authors】: Dat Ba Nguyen ; Martin Theobald ; Gerhard Weikum

【Abstract】: Information extraction (IE) from text sources can either be performed as Model-based IE (i.e., by using a pre-specified domain of target entities and relations) or as Open IE (i.e., with no particular assumptions about the target domain). While Model-based IE has limited coverage, Open IE merely yields triples of surface phrases which are usually not disambiguated into a canonical set of entities and relations. This paper presents J-REED: a joint approach for entity disambiguation and relation extraction that is based on probabilistic graphical models. J-REED merges ideas from both Model-based and Open IE by mapping surface names to a background knowledge base, and by making surface relations as crisp as possible.

【Keywords】: entity disambiguation; joint inference; open relation extraction

265. Collaborative Topic Regression with Denoising AutoEncoder for Content and Community Co-Representation.

【Paper Link】【Pages】:2231-2234

【Authors】: Trong T. Nguyen ; Hady W. Lauw

【Abstract】: Personalized recommendation of items frequently faces scenarios where we have sparse observations on users' adoption of items. In the literature, there are two promising directions. One is to connect sparse items through similarity in content. The other is to connect sparse users through similarity in social relations. We seek to integrate both types of information, in addition to the adoption information, within a single integrated model. Our proposed method models item content via a topic model, and user communities via an autoencoder model, while bridging a user's community-based preference to her topic-based preference. Experiments on public real-life data showcase the utility of the model, particularly when there is significant compatibility between communities and topics.

【Keywords】: autoencoder; cold-start recommendation; collaborative deep learning; social collaborative filtering; topic model

266. Accurate Sentence Matching with Hybrid Siamese Networks.

【Paper Link】【Pages】:2235-2238

【Authors】: Massimo Nicosia ; Alessandro Moschitti

【Abstract】: Recent neural network approaches to sentence matching compute the probability of two sentences being similar by minimizing a logistic loss. In this paper, we learn sentence representations by means of a siamese network, which: (i) uses encoders that share parameters; and (ii) enables the comparison between two sentences in terms of their Euclidean distance, by minimizing a contrastive loss. Moreover, we add a multilayer perceptron in the architecture to simultaneously optimize the contrastive and the logistic losses. This way, our network can exploit a more informative feedback, given by the logistic loss, which is also quantified by the distance that the two sentences have according to their representation in the Euclidean space. We show that jointly minimizing the two losses yields higher accuracy than minimizing them independently. We verify this finding by evaluating several baseline architectures in two sentence matching tasks: question paraphrasing and textual entailment recognition. Our network approaches the state of the art, while being much simpler and faster to train, and with fewer parameters than its competitors.

【Keywords】: joint loss; natural language processing; neural networks; question similarity; sentence matching; sentence similarity; siamese network
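
The abstract above combines a contrastive loss on the Euclidean distance between two sentence vectors with a logistic loss on a predicted match probability. The sketch below shows one common way to write both losses and add them for a single sentence pair. The exact formulation, the margin, and the weighting are not given in the abstract, so the margin-squared contrastive form, the `margin` and `weight` parameters, and all function names here are assumptions.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, label, margin=1.0):
    """Pull matching pairs (label=1) together; push others beyond `margin`."""
    d = euclidean_distance(u, v)
    if label == 1:
        return d ** 2
    return max(0.0, margin - d) ** 2


def logistic_loss(prob, label):
    """Binary cross-entropy for a predicted match probability."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def joint_loss(u, v, prob, label, margin=1.0, weight=1.0):
    """Sum of both losses; `weight` balances them (an assumed hyperparameter)."""
    return contrastive_loss(u, v, label, margin) + weight * logistic_loss(prob, label)


# Hypothetical encoder outputs for a matching sentence pair.
print(joint_loss([0.2, 0.1], [0.25, 0.05], prob=0.9, label=1))
```

In a real network both terms would be minimized by gradient descent over shared encoder parameters; this sketch only evaluates the objective.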

267. Collaborative Sequence Prediction for Sequential Recommender.

【Paper Link】【Pages】:2239-2242

【Authors】: Shuzi Niu ; Rongzhi Zhang

【Abstract】: With the surge of deep learning, more and more attention has been paid to the sequential recommender. It can be cast as a sequence prediction problem, where we predict the next item given the previous items. RNN approaches are able to capture the global sequential features from the data, compared with the local features derived in Markov Chain methods. However, both approaches rely on the independence of users' sequences, which does not hold in practice. We propose to formulate the sequential recommendation problem as a collaborative sequence prediction problem to take the dependency of users' sequences into account. In order to solve the collaborative sequence prediction problem, we define the dynamic neighborhood relationship between users and introduce manifold regularization to RNN on the basis of the multi-facets of collaborative filtering, referred to as MrRNN. Experimental results on benchmark datasets show that our approach outperforms the state-of-the-art baselines.

【Keywords】: collaborative sequence prediction; manifold regularization; recurrent networks; sequential recommender

268. Boolean Matrix Decomposition by Formal Concept Sampling.

【Paper Link】【Pages】:2243-2246

【Authors】: Petr Osicka ; Martin Trnecka

【Abstract】: Finding interesting patterns is a classical problem in data mining. Boolean matrix decomposition is nowadays a standard tool that can find a set of patterns (also called factors) in Boolean data that explain the data well. We describe and experimentally evaluate a probabilistic algorithm for the Boolean matrix decomposition problem. The algorithm is derived from the GreCon algorithm, which uses formal concepts (maximal rectangles or tiles) as factors in order to find a decomposition. We change the core of GreCon by substituting a sampling procedure for a deterministic computation of suitable formal concepts. This allows us to alleviate the greedy nature of GreCon, creates a possibility to bypass some of its pitfalls, and preserves its features, e.g. an ability to explain the entire data.

【Keywords】: boolean matrix decomposition; formal concept analysis; randomized algorithm
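
For context on the abstract above, a Boolean matrix decomposition approximates a 0/1 data matrix as the Boolean product of an object-factor matrix and a factor-attribute matrix, where an entry is 1 if at least one factor covers it. The sketch below computes that product and counts mismatched entries. It illustrates only the objective, not GreCon or the sampling procedure; the function names and the small example matrices are hypothetical.

```python
def boolean_product(a, b):
    """Boolean matrix product: entry (i, j) is 1 if some k has a[i][k] and b[k][j]."""
    inner = len(b)
    cols = len(b[0])
    return [
        [int(any(row[k] and b[k][j] for k in range(inner))) for j in range(cols)]
        for row in a
    ]


def reconstruction_error(data, a, b):
    """Number of entries where the Boolean product differs from the data."""
    approx = boolean_product(a, b)
    return sum(
        x != y
        for data_row, approx_row in zip(data, approx)
        for x, y in zip(data_row, approx_row)
    )


# Two overlapping rectangles (factors) explain this 3x3 matrix exactly.
data = [[1, 1, 0],
        [1, 1, 1],
        [0, 1, 1]]
object_factor = [[1, 0],
                 [1, 1],
                 [0, 1]]
factor_attribute = [[1, 1, 0],
                    [0, 1, 1]]
print(reconstruction_error(data, object_factor, factor_attribute))
```

Each row of `factor_attribute` paired with the matching column of `object_factor` describes one rectangle of ones, which is how formal concepts act as factors.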

269. Enhancing Knowledge Graph Completion By Embedding Correlations.

【Paper Link】【Pages】:2247-2250

【Authors】: Soumajit Pal ; Jacopo Urbani

【Abstract】: Despite their large sizes, modern Knowledge Graphs (KGs) are still highly incomplete. Statistical relational learning methods can detect missing links by "embedding" the nodes and relations into latent feature tensors. Unfortunately, these methods are unable to learn good embeddings if the nodes are not well-connected. Our proposal is to learn embeddings for correlations between subgraphs and add a post-prediction phase to counter the lack of training data. This technique, applied on top of methods like TransE or HolE, can significantly increase the predictions on realistic KGs.

【Keywords】: knowledge graphs; link prediction; rdf; statistical relational learning

270. Robust Heterogeneous Discriminative Analysis for Single Sample Per Person Face Recognition.

【Paper Link】【Pages】:2251-2254

【Authors】: Meng Pang ; Yiu-ming Cheung ; Binghui Wang ; Risheng Liu

【Abstract】: Single sample face recognition is one of the most challenging problems in face recognition (FR), where only one single sample per person (SSPP) is enrolled in the gallery set for training. Although patch-based methods have achieved great success in FR with SSPP, they still have significant limitations. In this work, we propose a new patch-based method, namely Robust Heterogeneous Discriminative Analysis (RHDA), to tackle FR with SSPP. Compared with the existing patch-based methods, RHDA can enhance the robustness against complex facial variations from two aspects. First, we develop a novel Fisher-like criterion, which incorporates two manifold embeddings, to learn heterogeneous discriminative representations of image patches. Specifically, for each patch, the Fisher-like criterion is able to preserve the reconstruction relationship of neighboring patches from the same person, while suppressing neighboring patches from different persons. Second, we present two distance metrics, i.e., patch-to-patch distance and patch-to-manifold distance, and develop a fusion strategy to combine the recognition outputs of above two distance metrics via joint majority voting for identification. Experimental results on the AR and FERET benchmark datasets demonstrate the efficacy of the proposed method.

【Keywords】: heterogeneous subspace analysis; joint majority voting; representation learning; single sample face recognition

271. Deep Neural Networks for News Recommendations.

【Paper Link】【Pages】:2255-2258

【Authors】: Keunchan Park ; Jisoo Lee ; Jaeho Choi

【Abstract】: A fundamental role of news websites is to recommend articles that are interesting to read. The key challenge of news recommendation is to recommend newly published articles. Unlike other domains, outdated items are considered to be irrelevant in the news recommendation task. Another challenge is that the recommendation candidates are not seen in the training phase. In this paper, we introduce deep neural network models to overcome these challenges. We propose a modified session-based Recurrent Neural Network (RNN) model tailored to news recommendation as well as a history-based RNN model that spans the whole user's past histories. Finally, we propose a Convolutional Neural Network (CNN) model to capture user preferences and to personalize recommendation results. Experimental results on a real-world news dataset show that our model outperforms competitive baselines.

【Keywords】: deep neural networks; recommender systems

272. TATHYA: A Multi-Classifier System for Detecting Check-Worthy Statements in Political Debates.

【Paper Link】【Pages】:2259-2262

【Authors】: Ayush Patwari ; Dan Goldwasser ; Saurabh Bagchi

【Abstract】: Fact-checking political discussions has become an essential cog in computational journalism. This task encompasses an important sub-task: identifying the set of statements with 'check-worthy' claims. Previous work has treated this as a simple text classification problem, discounting the nuances involved in determining what makes statements check-worthy. We introduce a dataset of political debates from the 2016 US Presidential election campaign annotated using all major fact-checking media outlets and show that there is a need to model conversation context, debate dynamics and implicit world knowledge. We design a multi-classifier system, TATHYA, that models latent groupings in data and improves state-of-art systems in detecting check-worthy statements by 19.5% in F1-score on a held-out test set, gaining primarily in Recall.

【Keywords】: clustering; computational journalism; natural language processing

273. A Collaborative Ranking Model for Cross-Domain Recommendations.

【Paper Link】【Pages】:2263-2266

【Authors】: Dimitrios Rafailidis ; Fabio Crestani

【Abstract】: With the advent of social media, generating high quality cross-domain recommendations has become more and more important for users of heterogeneous domains. In this study, we propose a collaborative ranking model to generate cross-domain recommendations. Given a target domain, we design an objective function aimed at pushing relevant items to the top of a recommendation list. Also, as users may have different behaviours in multiple domains, we propose a weighting strategy in our collaborative ranking model to control the influence of user preferences from auxiliary domains when producing the recommendation lists. Our experiments on ten cross-domain recommendation tasks show that the proposed approach achieves higher recommendation accuracy than other state-of-the-art methods.

【Keywords】: collaborative ranking; cross-domain recommendation; recommendation systems

274. Combining Local and Global Word Embeddings for Microblog Stemming.

【Paper Link】【Pages】:2267-2270

【Authors】: Anurag Roy ; Trishnendu Ghorai ; Kripabandhu Ghosh ; Saptarshi Ghosh

【Abstract】: Stemming is a vital step employed to improve retrieval performance through efficient unification of morphological variants of a word. We propose an unsupervised, context-specific stemming algorithm for microblogs, based on both local and global word embeddings, which is capable of handling the informal, noisy vocabulary of microblogs. Experiments on two standard microblog data collections (TREC 2016 and FIRE 2016) show that the proposed stemmer enables significantly better retrieval performance than several state-of-the-art stemming algorithms for the same queries.

【Keywords】: glove; microblog; stemming; word embedding; word2vec

275. An Improved Test Collection and Baselines for Bibliographic Citation Recommendation.

【Paper Link】【Pages】:2271-2274

【Authors】: Dwaipayan Roy

【Abstract】: The problem of recommending bibliographic citations to an author who is writing an article has been well-studied. However, different researchers have used different datasets to evaluate proposed techniques, and have sometimes reported contradictory findings regarding the relative effectiveness of various approaches. In addition, these datasets are problematic in one way or another (e.g., in terms of size or availability), precluding the possibility of adopting one (or some) of them as standard benchmarks. A recently created test collection that makes use of data from CiteSeerx is large, heterogeneous, and publicly available, but has certain other limitations. In this paper, we propose a way to modify this test collection to address these limitations. We also use the improved test collection to establish a set of baseline results using elementary content-based techniques, as well as reference directed indexing.

【Keywords】: bibliographic citations; recommender systems; test collections

276. A Way to Boost Semi-NMF for Document Clustering.

【Paper Link】【Pages】:2275-2278

【Authors】: Aghiles Salah ; Melissa Ailem ; Mohamed Nadif

【Abstract】: Semi-Non Negative Matrix Factorization (Semi-NMF) is one of the most popular extensions of NMF; it extends the applicable range of NMF models to data having mixed signs, and strengthens their relation to clustering. However, Semi-NMF has been found to perform somewhat worse than NMF, in terms of clustering, when applied to positive data such as text, which we focus on. Inspired by the recent success of neural word embedding models, e.g., word2vec, in learning high quality real valued vector representations of words, we propose to integrate a word embedding model into Semi-NMF. This allows Semi-NMF to capture more semantic relationships among words and, thereby, to infer document factors that are even better for clustering. The combination of Semi-NMF and word embedding noticeably improves the performance of NMF models, in terms of both clustering and embedding, as illustrated in our experiments.

【Keywords】: document clustering; semi-nonnegative matrix factorization; word embedding

277. Recipe Popularity Prediction with Deep Visual-Semantic Fusion.

【Paper Link】【Pages】:2279-2282

【Authors】: Satoshi Sanjo ; Marie Katsurai

【Abstract】: Predicting the popularity of user-created recipes has great potential to be adopted in several applications on recipe-sharing websites. To ensure timely prediction when a recipe is uploaded, a prediction model needs to be trained based on the recipe's content features (i.e., its visual and semantic features). This paper presents a novel approach to predicting recipe popularity using deep visual-semantic fusion. We first pre-train a deep model that predicts the popularity of recipes based on each single modality. We insert additional layers to the two models and concatenate their activations. Finally, we train a network comprising fully connected (FC) layers on the fused features to learn more powerful features, which are used for training a regressor. Based on experiments conducted on more than 150K recipes collected from the Cookpad website, we present a comprehensive comparison with several baselines to verify the effectiveness of our method. The best practice for the proposed method is also described.

【Keywords】: multi-modal fusion; recipe features; recipe popularity prediction

278. Revealing the Hidden Links in Content Networks: An Application to Event Discovery.

【Paper Link】【Pages】:2283-2286

【Authors】: Antonia Saravanou ; Ioannis Katakis ; George Valkanas ; Vana Kalogeraki ; Dimitrios Gunopulos

【Abstract】: Social networks have become the de facto online resource for people to share, comment on and be informed about events pertinent to their interests and livelihood, ranging from road traffic or an illness to concerts and earthquakes, to economics and politics. This has been the driving force behind research endeavors that analyse such data. In this paper, we focus on how Content Networks can help us identify events effectively. Content Networks incorporate both structural and content-related information of a social network in a unified way, at the same time, bringing together two disparate lines of research: graph-based and content-based event discovery in social media. We model interactions of two types of nodes, users and content, and introduce an algorithm that builds heterogeneous, dynamic graphs, in addition to revealing content links in the network's structure. By linking similar content nodes and tracking connected components over time, we can effectively identify different types of events. Our evaluation on social media streaming data suggests that our approach outperforms state-of-the-art techniques, while showcasing the significance of hidden links to the quality of the results.

【Keywords】: anomaly detection; event discovery; graph mining; social network analysis

279. When Labels Fall Short: Property Graph Simulation via Blending of Network Structure and Vertex Attributes.

【Paper Link】【Pages】:2287-2290

【Authors】: Arun V. Sathanur ; Sutanay Choudhury ; Cliff Joslyn ; Sumit Purohit

【Abstract】: Property graphs can be used to represent heterogeneous networks with labeled (attributed) vertices and edges. Given a property graph, simulating another graph with same or greater size with the same statistical properties with respect to the labels and connectivity is critical for privacy preservation and benchmarking purposes. In this work we tackle the problem of capturing the statistical dependence of the edge connectivity on the vertex labels and using the same distribution to regenerate property graphs of the same or expanded size in a scalable manner. However, accurate simulation becomes a challenge when the attributes do not completely explain the network structure. We propose the Property Graph Model (PGM) approach that uses a label augmentation strategy to mitigate the problem and preserve the vertex label and the edge connectivity distributions as well as their correlation, while also replicating the degree distribution. Our proposed algorithm is scalable with a linear complexity in the number of edges in the target graph. We illustrate the efficacy of the PGM approach in regenerating and expanding the datasets by leveraging two distinct illustrations. Our open-source implementation is available on GitHub.

【Keywords】: attributed graphs; graph generation; joint distribution; label augmentation; label-topology correlation; property graphs

280. Integrating the Framing of Clinical Questions via PICO into the Retrieval of Medical Literature for Systematic Reviews.

【Paper Link】【Pages】:2291-2294

【Authors】: Harrisen Scells ; Guido Zuccon ; Bevan Koopman ; Anthony Deacon ; Leif Azzopardi ; Shlomo Geva

【Abstract】: The PICO process is a technique used in evidence based practice to frame and answer clinical questions. It involves structuring the question around four types of clinical information: population, intervention, control or comparison and outcome. The PICO framework is used extensively in the compilation of systematic reviews as the means of framing research questions. However, when a search strategy (comprising a large Boolean query) is formulated to retrieve studies for inclusion in the review, PICO is often ignored. This paper evaluates how PICO annotations can be applied and integrated into retrieval to improve the screening of studies for inclusion in systematic reviews. The task is to increase precision while maintaining the high level of recall essential to ensure systematic reviews are representative and unbiased. Our results show that restricting the search strategies to match studies using PICO annotations improves precision, although recall is slightly reduced when compared to the non-PICO baseline. This can lead to both time and cost savings when compiling systematic reviews.

【Keywords】: information retrieval; pico framework; systematic reviews
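
The precision/recall trade-off described in this abstract can be made concrete with a small sketch. The document identifiers and the two result sets below are invented for illustration and are not taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four relevant studies, a broad Boolean query,
# and a narrower query restricted by (assumed) PICO annotations.
relevant = {"d1", "d2", "d3", "d4"}
baseline = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
pico_restricted = {"d1", "d2", "d3", "d5", "d6"}

print(precision_recall(baseline, relevant))
print(precision_recall(pico_restricted, relevant))
```

In this toy case the restricted query screens half as many studies and has higher precision, at the cost of missing one relevant study.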

281. pm-SCAN: an I/O Efficient Structural Clustering Algorithm for Large-scale Graphs.

【Paper Link】【Pages】:2295-2298

【Authors】: Jung Hyuk Seo ; Myoung Ho Kim

【Abstract】: Most existing algorithms for graph clustering, including SCAN, are not designed to cope with large volumes of data that cannot fit in main memory. When there is not enough memory, those algorithms will incur thrashing, i.e., huge I/O costs. We propose an I/O-efficient algorithm for structural clustering, pm-SCAN. The main idea of our scheme is to partition a large graph into several subgraphs that can fit into main memory. We first find clusters in each subgraph, and then merge them to produce the final clustering of the input graph. Experimental results show that while other existing algorithms do not scale with the graph size, our proposed method produces scalable performance for limited memory space.

【Keywords】: graph; i/o-efficient algorithm; structural graph clustering
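
SCAN, the algorithm that pm-SCAN builds on, scores an edge by the structural similarity of its endpoints' closed neighborhoods. The sketch below shows only that measure; the partitioning and merging steps of pm-SCAN are not reproduced, and the example graph is made up.

```python
import math


def structural_similarity(adj, u, v):
    """SCAN-style similarity: |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|),
    where N[x] is the neighborhood of x including x itself."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


# Hypothetical undirected graph stored as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(structural_similarity(adj, "a", "b"))
print(structural_similarity(adj, "c", "d"))
```

Vertices inside the triangle share their whole closed neighborhood, so their similarity is higher than that of the pendant edge between c and d.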

282. Knowledge Graph Embedding with Triple Context.

【Paper Link】【Pages】:2299-2302

【Authors】: Jun Shi ; Huan Gao ; Guilin Qi ; Zhangquan Zhou

【Abstract】: Knowledge graph embedding, which aims to represent entities and relations in vector spaces, has shown outstanding performance on a few knowledge graph completion tasks. Most existing methods are based on the assumption that a knowledge graph is a set of separate triples, ignoring rich graph features, i.e., structural information in the graph. In this paper, we take advantage of structures in knowledge graphs, especially local structures around a triple, which we refer to as triple context. We then propose a Triple-Context-based knowledge Embedding model (TCE). For each triple, two kinds of structure information are considered as its context in the graph: one is the outgoing relations and neighboring entities of an entity, and the other is relation paths between a pair of entities, both of which reflect various aspects of the triple. Triples along with their contexts are represented in a unified framework, so that the structural information in triple contexts can be embodied. The experimental results show that our model outperforms the state-of-the-art methods for link prediction.

【Keywords】: knowledge graph; representation learning; triple context

283. Hybrid MemNet for Extractive Summarization.

【Paper Link】【Pages】:2303-2306

【Authors】: Abhishek Kumar Singh ; Manish Gupta ; Vasudeva Varma

【Abstract】: Extractive text summarization has been an extensive research problem in the field of natural language understanding. While the conventional approaches rely mostly on manually compiled features to generate the summary, few attempts have been made at developing data-driven systems for extractive summarization. To this end, we present a fully data-driven end-to-end deep network, which we call Hybrid MemNet, for the single-document summarization task. The network learns the continuous unified representation of a document before generating its summary. It jointly captures local and global sentential information along with the notion of summary-worthy sentences. Experimental results on two different corpora confirm that our model shows significant performance gains compared with the state-of-the-art baselines.

【Keywords】: deep learning; natural language; summarization

284. Denoising Clinical Notes for Medical Literature Retrieval with Convolutional Neural Model.

【Paper Link】【Pages】:2307-2310

【Authors】: Luca Soldaini ; Andrew Yates ; Nazli Goharian

【Abstract】: The rapid increase of medical literature poses a significant challenge for physicians, who have repeatedly reported struggling to keep up to date with developments in research. This gap is one of the main challenges in integrating recent advances in clinical research with day-to-day practice. Thus, the need has emerged for clinical decision support (CDS) search systems that can retrieve highly relevant medical literature given a clinical note describing a patient. However, clinical notes are inherently noisy and thus not fit to be used as queries as-is. In this work, we present a convolutional neural model aimed at improving the representation of clinical notes, making them suitable for document retrieval. The system is designed to predict, for each clinical note term, its importance in relevant documents. The approach was evaluated on the 2016 TREC CDS dataset, where it achieved a 37% improvement in infNDCG over state-of-the-art query reduction methods and a 27% improvement over the best known method for the task.

【Keywords】: clinical decision support systems; convolutional neural networks; medical informatics; query reduction

285. SIMD-Based Multiple Sets Intersection with Dual-Scale Search Algorithm.

【Paper Link】【Pages】:2311-2314

【Authors】: Xingshen Song ; Yuexiang Yang ; Xiaoyong Li

【Abstract】: The conjunctive Boolean query is a fundamental operation for document retrieval in many information systems and databases. Various algorithms have been proposed to maximize query efficiency. In recent years, researchers began to exploit the parallelism of single-instruction-multiple-data (SIMD) instructions to accelerate the intersection procedure and achieved substantial gains over previous scalar algorithms. However, these works only focus on intersecting two sets at a time and ignore the scenario of multiple-set intersection. We present a flexible search algorithm which balances non-SIMD and SIMD comparisons in order to provide efficient and effective intersection.

【Keywords】: algorithm optimization; performance evaluation; set intersection; vectorized processing
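
For orientation, the multiple-set intersection problem can be sketched with a plain scalar baseline: process the lists from shortest to longest and look up each surviving candidate with binary search. This is a common textbook strategy and an assumption for illustration only; it is not the dual-scale search from the paper and uses no SIMD instructions.

```python
from bisect import bisect_left


def intersect_many(lists):
    """Intersect several sorted, duplicate-free integer lists."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)  # start from the shortest list
    result = ordered[0]
    for other in ordered[1:]:
        kept = []
        lo = 0  # candidates are ascending, so the search window only moves right
        for x in result:
            lo = bisect_left(other, x, lo)
            if lo == len(other):
                break
            if other[lo] == x:
                kept.append(x)
        result = kept
        if not result:
            break  # an empty intersection cannot grow again
    return result


# Hypothetical posting lists for a three-term conjunctive query.
postings = [[1, 3, 5, 7, 9, 11], [3, 4, 5, 9], [0, 3, 9, 12, 15]]
print(intersect_many(postings))
```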

286. Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval.

【Paper Link】【Pages】:2315-2318

【Authors】: Avikalp Srivastava ; Madhav Datt

【Abstract】: Semantic similarity based retrieval is playing an increasingly important role in many IR systems such as modern web search, question-answering, similar document retrieval etc. Improvements in retrieval of semantically similar content are very significant to applications like Quora, Stack Overflow, Siri etc. We propose a novel unsupervised model for semantic similarity based content retrieval, where we construct semantic flow graphs for each query, and introduce the concept of "soft seeding" in graph based semi-supervised learning (SSL) to convert this into an unsupervised model. We demonstrate the effectiveness of our model on an equivalent question retrieval problem on the Stack Exchange QA dataset, where our unsupervised approach significantly outperforms the state-of-the-art unsupervised models, and produces comparable results to the best supervised models. Our research provides a method to tackle semantic similarity based retrieval without any training data, and allows seamless extension to different domain QA communities, as well as to other semantic equivalence tasks.

【Keywords】: document representation; semantic similarity; similar question retrieval; soft seeded semi-supervised learning graphs; topic model application

287. How Safe is Your (Taxi) Driver?

【Paper Link】【Pages】:2319-2322

【Authors】: Rade Stanojevic

【Abstract】: For an auto insurer, understanding the risk of individual drivers is a critical factor in building a healthy and profitable portfolio. For decades, assessing the risk of drivers has relied on demographic information which allows the insurer to segment the market into several risk groups priced with an appropriate premium. In recent years, however, some insurers started experimenting with so-called Usage-Based Insurance (UBI), in which the insurer monitors a number of additional variables (mostly related to the location) and uses them to better assess the risk of the drivers. While several studies have reported results on the UBI trials, these studies keep the studied data confidential (for obvious privacy and business concerns), which inevitably limits their reproducibility and interest by the data-mining community. In this paper we discuss a methodology for studying driver risk assessment using a public dataset of 173M taxi rides in NYC with over 40K drivers. Our approach for risk assessment utilizes not only the location data (which is significantly sparser than what is normally exploited in UBI) but also the revenue, tips and overall activity of the drivers (as proxies of their behavioral traits), and obtains risk scoring accuracy on par with the reported results on non-professional driver cohorts in spite of sparser location data and no demographic information about the drivers.

【Keywords】: car insurance; data analytics; user modeling

288. Sentence Retrieval with Sentiment-specific Topical Anchoring for Review Summarization.

【Paper Link】【Pages】:2323-2326

【Authors】: Jiaxing Tan ; Alexander Kotov ; Rojiar Pir Mohammadiani ; Yumei Huo

【Abstract】: We propose Topic Anchoring-based Review Summarization (TARS), a two-step extractive summarization method, which creates review summaries from the sentences that represent the most important aspects of a review. In the first step, the proposed method utilizes Topic Aspect Sentiment Model (TASM), a novel sentiment-topic model, to identify aspects of sentiment-specific topics in a collection of reviews. The output of TASM is utilized in the second step of TARS to rank review sentences based on how representative of the most important review aspects their words are. Qualitative and quantitative evaluation of review summaries using two collections indicates the effectiveness of structuring review summaries around aspects of sentiment-specific topics.

【Keywords】: opinion mining; text summarization; topic models

289. Visualizing Deep Neural Networks with Interaction of Super-pixels.

【Paper Link】【Pages】:2327-2330

【Authors】: Shixin Tian ; Ying Cai

【Abstract】: An effective way to visualize the prediction of deep neural networks on an image is to decompose the prediction into the contribution of units (pixels or patches). In the existing works, these units are largely considered independently, thus limiting the performance of visualization. In this paper, we propose a new prediction visualization method that uses super-pixels as contribution units. Moreover, our method takes into consideration the interaction of adjacent super-pixels. We implement our technique and evaluate its performance with various images. Our results show its excellent performance.

【Keywords】: deep neural networks; image classification; visualization

290. Collecting Non-Geotagged Local Tweets via Bandit Algorithms.

【Paper Link】【Pages】:2331-2334

【Authors】: Saki Ueda ; Yuto Yamaguchi ; Hiroyuki Kitagawa

【Abstract】: How can we collect as many non-geotagged tweets as possible from users in a specific location within a limited time span? How can we find such users if we do not have much information about the specified location? Although there are a variety of methods to estimate the locations of users, these methods are not directly applicable to this problem because they require collecting a large number of random tweets and then filtering them to obtain a small number of tweets from such users. In this paper, we propose a framework that incrementally finds such users and continuously collects tweets from them. Our framework is based on the bandit algorithm that adjusts the trade-off between exploration and exploitation; in other words, it simultaneously finds new users in the specified location and collects tweets from already-found users. The experimental results show that the bandit algorithm works well on this problem and outperforms the carefully-designed baselines.

【Keywords】: bandit algorithm; focused crawling; location estimation; twitter
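
The exploration/exploitation trade-off mentioned in this abstract can be illustrated with epsilon-greedy, one of the simplest bandit strategies. The abstract does not say which bandit algorithm the authors use, so this is a stand-in; treating each arm as a candidate source of users and each reward as "the collected tweet was local" is an assumption, as are the reward probabilities.

```python
import random


def epsilon_greedy(reward_probs, rounds, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over Bernoulli arms.

    Returns (pull counts per arm, estimated mean reward per arm).
    """
    rng = random.Random(seed)
    n = len(reward_probs)
    counts = [0] * n
    values = [0.0] * n
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: try a random arm
        else:
            arm = max(range(n), key=lambda i: values[i])  # exploit best estimate
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return counts, values


# Hypothetical arms: one source that mostly yields local tweets, two that rarely do.
counts, values = epsilon_greedy([0.9, 0.1, 0.1], rounds=2000)
print(counts, values)
```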

291. A Temporal Attentional Model for Rumor Stance Classification.

【Paper Link】【Pages】:2335-2338

【Authors】: Amir Pouran Ben Veyseh ; Javid Ebrahimi ; Dejing Dou ; Daniel Lowd

【Abstract】: Rumor stance classification is the task of determining the stance towards a rumor in text. This is the first step in effective rumor tracking on social media which is an increasingly important task. In this work, we analyze Twitter users' stance toward a rumorous tweet, in which users could support, deny, query, or comment upon the rumor. We propose a deep attentional CNN-LSTM approach, which takes the sequence of tweets in a thread of conversation as the input. We use neighboring tweets in the timeline as context vectors to capture the temporal dynamism in users' stance evolution. In addition, we use extra features such as friendship, to leverage useful relational features that are readily available in social media. Our model achieves the state-of-the-art results on rumor stance classification on a recent SemEval dataset, improving accuracy and F1 score by 3.6% and 4.2% respectively.

【Keywords】: lstm; rumor stance classification; temporal attention; twitter

292. Improving the Gain of Visual Perceptual Behaviour on Topic Modeling for Text Recommendation.

【Paper Link】【Pages】:2339-2342

【Authors】: Cheng Wang ; Yujuan Fang ; Zheng Tan ; Yuan He

【Abstract】: Internet information services have greatly improved thanks to the growing performance of interest mining technology. Visual perceptual behaviours, a new focus in mining users' interests, have resulted in great gains in some typical Internet information services, e.g., information retrieval and recommendation. It has been validated that combining the subjective visual perceptual behaviours with the objective contents can significantly improve these services' performance. However, the existing methods usually treat the contents and visual perceptual behaviours as two independent parts in the calculating process. The gain of visual perceptual behaviours has not been fully exploited. In this paper, we mainly aim at improving the gain of visual perceptual behaviour for text recommendation, by integrating the objective contents with subjective visual perceptual behaviours. We investigate the correlation between users' reading interests and records of real-time interaction on texts, and then design a real-time visual perceptual behaviour based method for text recommendation, which is able to: (1) build a joint interest model, called ViP-LDA (Visual Perceptual LDA), by integrating the user's visual perceptual behaviours into a topic model; (2) make more accurate text recommendations based on ViP-LDA with feedback adjustment. Several experiments on a real data set are conducted to demonstrate the effectiveness of our method.

【Keywords】: eye tracking; interest model; lda; text recommendation; vip-lda; visual perceptual behaviour

293. Semantic Annotation for Places in LBSN through Graph Embedding.

【Paper Link】【Pages】:2343-2346

【Authors】: Yan Wang ; Zongxu Qin ; Jun Pang ; Yang Zhang ; Jin Xin

【Abstract】: With the prevalence of location-based social networks (LBSNs), automated semantic annotation for places plays a critical role in many LBSN-related applications. Although a line of research continues to enhance labeling accuracy, there is still a lot of room for improvement. The crucial problem is to find a high-quality representation for each place. In previous works, the representation is usually derived directly from observed patterns of places or indirectly from calculated proximity amongst places or their combination. In this paper, we also exploit the combination to represent places but present a novel semi-supervised learning framework based on graph embedding, called Predictive Place Embedding (PPE). For place proximity, PPE first learns user embeddings from a user-tag bipartite graph by minimizing supervised loss in order to preserve the similarity of users visiting analogous places. User similarity is then transformed into place proximity by optimizing each place embedding as the centroid of the vectors of its check-in users. Our underlying idea is that a place can be considered as a representative of all its visitors. For observed patterns, a place-temporal bipartite graph is used to further adjust place embeddings by reducing unsupervised loss. Extensive experiments on real large LBSNs show that PPE outperforms state-of-the-art methods significantly.

【Keywords】: deep learning; graph representation; semantic tag
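
The idea that "a place can be considered as a representative of all its visitors" amounts to placing each place at the centroid of its check-in users' vectors. The sketch below computes that centroid directly; PPE itself reaches it through optimization inside a larger semi-supervised framework, and the user vectors here are invented.

```python
def place_embedding(user_vectors, visitors):
    """Return the centroid of the embeddings of a place's check-in users."""
    if not visitors:
        raise ValueError("a place needs at least one visitor")
    dim = len(user_vectors[visitors[0]])
    total = [0.0] * dim
    for user in visitors:
        for i, x in enumerate(user_vectors[user]):
            total[i] += x
    return [t / len(visitors) for t in total]


# Hypothetical 2-dimensional user embeddings.
user_vectors = {
    "u1": [1.0, 0.0],
    "u2": [0.0, 1.0],
    "u3": [1.0, 1.0],
}

print(place_embedding(user_vectors, ["u1", "u2"]))
print(place_embedding(user_vectors, ["u3"]))
```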

294. A Study of Feature Construction for Text-based Forecasting of Time Series Variables.

【Paper Link】【Pages】:2347-2350

【Authors】: Yiren Wang ; Dominic Seyler ; Shubhra Kanti Karmaker Santu ; ChengXiang Zhai

【Abstract】: Time series are ubiquitous in the world since they are used to measure various phenomena (e.g., temperature, spread of a virus, sales, etc.). Forecasting of time series is highly beneficial (and necessary) for optimizing decisions, yet is a very challenging problem; using only the historical values of the time series is often insufficient. In this paper, we study how to construct effective additional features based on related text data for time series forecasting. Besides the commonly used n-gram features, we propose a general strategy for constructing multiple topical features based on the topics discovered by a topic model. We evaluate feature effectiveness using a data set for predicting stock price changes where we constructed additional features from news text articles for stock market prediction. We found that: 1) Text-based features outperform time series-based features, suggesting the great promise of leveraging text data for improving time series forecasting. 2) Topic-based features are not very effective stand-alone, but they can further improve performance when added on top of n-gram features. 3) The best topic-based feature appears to be a long-term aggregation of topics over time with high weights on recent topics.

【Keywords】:

295. Using Knowledge Graphs to Explain Entity Co-occurrence in Twitter.

【Paper Link】【Pages】:2351-2354

【Authors】: Yiwei Wang ; Mark James Carman ; Yuan-Fang Li

【Abstract】: Modern Knowledge Graphs such as DBpedia contain significant information regarding Named Entities and the logical relationships which exist between them. Twitter, on the other hand, contains important information on the popularity and frequency with which these entities are mentioned and discussed in combination with one another. In this paper we investigate whether these two sources of information can be used to complement and explain one another. In particular, we would like to know whether the logical relationships (a.k.a. semantic paths) which exist between pairs of known entities can help to explain the frequency with which those entities co-occur with one another in Twitter. To do this we train a ranking function over semantic paths between pairs of entities. The aim of the ranker is to identify the path that most likely explains why a particular pair of entities has appeared together in a particular tweet. We train the ranking model using a number of lexical, graph-embedding and popularity-based features over semantic paths containing a single intermediate entity and demonstrate the efficacy of the model for determining why pairs of entities occur together in tweets.

【Keywords】:

296. Integrating Side Information for Boosting Machine Comprehension.

【Paper Link】【Pages】:2355-2358

【Authors】: Yutong Wang ; Yixin Xu ; Min Yang ; Zhou Zhao ; Jun Xiao ; Yueting Zhuang

【Abstract】: Machine Reading and Comprehension has recently drawn a fair amount of attention in the field of natural language processing. In this paper, we consider integrating side information to improve machine comprehension so that cloze-style questions are answered more precisely. To leverage the external information, we present a novel attention-based architecture which feeds the side information representations into word-level embeddings to improve comprehension performance. Our experiments show consistent improvements of our model over various baselines.

【Keywords】: Machine comprehension; Machine reading; Question answering; Text understanding

297. Unsupervised Feature Selection with Heterogeneous Side Information.

【Paper Link】【Pages】:2359-2362

【Authors】: Xiaokai Wei ; Bokai Cao ; Philip S. Yu

【Abstract】: Compared to supervised feature selection, unsupervised feature selection tends to be more challenging due to the lack of guidance from class labels. Along with the increasing variety of data sources, many datasets are also equipped with certain side information of heterogeneous structure. Such side information can be critical for feature selection when class labels are unavailable. In this paper, we propose a new feature selection method, SideFS, to exploit such rich side information. We model the complex side information as a heterogeneous network and derive instance correlations to guide subsequent feature selection. Representations are learned from the side information network and the feature selection is performed in a unified framework. Experimental results show that the proposed method can effectively enhance the quality of selected features by incorporating heterogeneous side information.

【Keywords】: feature selection; heterogeneous information network; side information; unsupervised learning

298. An Empirical Study of Community Overlap: Ground-truth, Algorithmic Solutions, and Implications.

【Paper Link】【Pages】:2363-2366

【Authors】: Joyce Jiyoung Whang

【Abstract】: In real-world social networks, communities tend to overlap with each other because a vertex can belong to multiple communities. To identify these overlapping communities, a number of overlapping community detection methods have been proposed in recent years. However, there have been very few studies on the characteristics and the implications of the community overlap. In this paper, we investigate the properties of the nodes and the edges placed within the overlapped regions between the communities using the ground-truth communities as well as algorithmic communities derived from the state-of-the-art overlapping community detection methods. We find that the overlapped nodes and the overlapped edges play different roles from the ones that are not in the overlapped regions. Using real-world data, we empirically show that the highly overlapped nodes are involved in structural holes of a network. Also, we show that the overlapped nodes and edges play an important role in forming new links in evolving networks and diffusing information through a network.

【Keywords】: community detection; overlap; social network analysis

299. Non-Exhaustive, Overlapping Co-Clustering.

【Paper Link】【Pages】:2367-2370

【Authors】: Joyce Jiyoung Whang ; Inderjit S. Dhillon

【Abstract】: The goal of co-clustering is to simultaneously identify a clustering of the rows as well as the columns of a two-dimensional data matrix. Most existing co-clustering algorithms are designed to find pairwise disjoint and exhaustive co-clusters. However, many real-world datasets might contain not only a large overlap between co-clusters but also outliers which should not belong to any co-cluster. We formulate the problem of Non-Exhaustive, Overlapping Co-Clustering where both the row and column clusters are allowed to overlap with each other and the outliers for each dimension of the data matrix are not assigned to any cluster. To solve this problem, we propose an intuitive objective function, and develop an efficient iterative algorithm which we call the NEO-CC algorithm. We theoretically show that the NEO-CC algorithm monotonically decreases the proposed objective function. Experimental results show that the NEO-CC algorithm is able to effectively capture the underlying co-clustering structure of real-world data, and thus outperforms state-of-the-art clustering and co-clustering methods.

【Keywords】: clustering; co-clustering; k-means; outlier; overlap

300. Simulating Zero-Resource Spoken Term Discovery.

【Paper Link】【Pages】:2371-2374

【Authors】: Jerome White ; Douglas W. Oard

【Abstract】: If search engines are ever to index all of the spoken content in the world, they will need to handle hundreds of languages for which no automatic speech recognition systems exist. Zero-resource spoken term discovery, in which repeated content is detected in some acoustic representation, offers a potentially useful source of indexing features. This paper describes a text-based simulation of a zero-resource spoken term discovery system that allows any information retrieval test collection to be used as a basis for early development of information retrieval techniques. It is proposed that these techniques can be later applied to actual zero-resource spoken term discovery results.

【Keywords】: n-gram retrieval; simulation; zero resource term discovery

301. Algorithmic Bias: Do Good Systems Make Relevant Documents More Retrievable?

【Paper Link】【Pages】:2375-2378

【Authors】: Colin Wilkie ; Leif Azzopardi

【Abstract】: Algorithmic bias presents a difficult challenge within Information Retrieval. It has long been known that certain algorithms favour particular documents due to attributes of these documents that are not directly related to relevance. The evaluation of bias has recently been made possible through the use of retrievability, a quantifiable measure of bias. While evaluating bias is relatively novel, the evaluation of performance has been common since the dawn of the Cranfield approach and TREC. To evaluate performance, a pool of documents to be judged by human assessors is created from the collection. This pooling approach has faced accusations of bias because state-of-the-art algorithms were used to create it, and thus biases associated with these algorithms may be included in the pool. The introduction of retrievability has provided a mechanism to evaluate the bias of these pools. This work evaluates the varying degrees of bias present in the groups of relevant and non-relevant documents for topics. The differentiating power of a system is also evaluated by examining the documents from the pool that are retrieved for each topic. The analysis finds that the systems that perform better tend to have a higher chance of retrieving a relevant document rather than a non-relevant document for a topic prior to retrieval, indicating that retrieval systems which perform better at TREC are already predisposed to agree with the judgements regardless of the query posed.

【Keywords】: bias; performance; retrievability
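
Retrievability is commonly computed in its cumulative form: a document's score is the number of queries for which it is ranked within a cutoff, and the inequality of those scores across the collection is summarized with a Gini coefficient. The sketch below follows that common formulation; the abstract does not state the exact variant or cutoff used, and the rankings are made up.

```python
def retrievability(rankings, collection, cutoff):
    """Cumulative retrievability: how many queries rank each document within the cutoff."""
    scores = {doc: 0 for doc in collection}  # never-retrieved documents keep score 0
    for ranked_docs in rankings.values():
        for doc in ranked_docs[:cutoff]:
            scores[doc] += 1
    return scores


def gini(values):
    """Gini coefficient: 0 means all scores are equal, values near 1 mean high inequality."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical ranked lists returned by one system for three queries.
collection = ["d1", "d2", "d3", "d4"]
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d1", "d3", "d4"],
    "q3": ["d1", "d2", "d4"],
}
scores = retrievability(rankings, collection, cutoff=2)
print(scores, gini(scores.values()))
```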

302. Session-aware Information Embedding for E-commerce Product Recommendation.

【Paper Link】【Pages】:2379-2382

【Authors】: Chen Wu ; Ming Yan

【Abstract】: Most existing recommender systems assume that a user's visiting history can be constantly recorded. However, in recent online services, the user's identity is often unknown and only limited online user behaviors can be used. It is of great importance to model the temporal online user behaviors and conduct recommendation for anonymous users. In this paper, we propose a list-wise deep neural network based architecture to model the limited user behaviors within each session. To train the model efficiently, we first design a session embedding method to pre-train a session representation, which incorporates different kinds of user search behaviors such as clicks and views. Based on the learnt session representation, we further propose a list-wise ranking model to generate the recommendation result for each anonymous user session. We conduct quantitative experiments on a recently published dataset from an e-commerce company. The evaluation results validate the effectiveness of the proposed method, which can outperform the state-of-the-art.

【Keywords】: e-commerce recommendation; product embedding; session-aware

303. Conflict of Interest Declaration and Detection System in Heterogeneous Networks.

【Paper Link】【Pages】:2383-2386

【Authors】: Siyuan Wu ; Leong Hou U ; Sourav S. Bhowmick ; Wolfgang Gatterbauer

【Abstract】: Peer review is the most critical process in evaluating an article to be accepted for publication in an academic venue. When assigning a reviewer to evaluate an article, the assignment should be aware of conflicts of interest (COIs) such that the reviews are fair to everyone. However, existing conference management systems simply ask reviewers and authors to declare their explicit COIs through a plain search user interface guided by some simple conflict rules. We argue that such a declaration system is not enough to discover all latent COI cases. In this work, we study a graphical declaration system that visualizes the relationships of authors and reviewers based on a heterogeneous co-authorship network. With the help of the declarations, we attempt to detect the latent COIs automatically based on the meta-paths of a heterogeneous network.

【Keywords】: conflict of interest; heterogeneous network; peer review process

304. Common-Specific Multimodal Learning for Deep Belief Network.

【Paper Link】【Pages】:2387-2390

【Authors】: Changsheng Xiang ; Xiaoming Jin

【Abstract】: The Multimodal Deep Belief Network has been widely used to extract representations for multimodal data by fusing the high-level features of each data modality into common representations. Such a straightforward fusion strategy can benefit classification and information retrieval tasks. However, it may introduce noise when the high-level features are not naturally common and hence not fusable across modalities. Intuitively, each modality may have its own specific features and corresponding representation capabilities, which should not be simply fused. Therefore, it is more reasonable to fuse only the common features and represent the multimodal data by both the fused features and the modality-specific features. Distinguishing common features from modal-specific features is a challenging task for traditional DBN models, where all features are crudely mixed. This paper proposes the Common-Specific Multimodal Deep Belief Network (CS-DBN) to solve the problem. CS-DBN automatically separates common features from modal-specific features and fuses only the common ones for data representation. Experimental results demonstrate the superiority of CS-DBN for classification tasks compared with the baseline approaches.

【Keywords】: deep belief network; multimodal data; representation learning

305. JointSem: Combining Query Entity Linking and Entity based Document Ranking.

【Paper Link】【Pages】:2391-2394

【Authors】: Chenyan Xiong ; Zhengzhong Liu ; Jamie Callan ; Eduard H. Hovy

【Abstract】: Entity-based ranking systems often employ entity linking systems to align entities to queries and documents. Previously, entity linking systems were not designed specifically for search engines and were mostly used as a preprocessing step. This work presents JointSem, a joint semantic ranking system that combines query entity linking and entity-based document ranking. In JointSem, the spotting and linking signals are used to describe the importance of candidate entities in the query, and the linked entities are utilized to provide additional ranking features for the documents. The linking signals and the ranking signals are combined by a joint learning-to-rank model, and the whole system is fully optimized towards end-to-end ranking performance. Experiments on TREC Web Track datasets demonstrate the effectiveness of joint learning of entity linking and entity-based ranking.

【Keywords】: document ranking; entity linking; entity-based search

306. Learning to Rank with Query-level Semi-supervised Autoencoders.

【Paper Link】【Pages】:2395-2398

【Authors】: Bo Xu ; Hongfei Lin ; Yuan Lin ; Kan Xu

【Abstract】: Learning to rank utilizes machine learning methods to solve ranking problems by constructing ranking models in a supervised way, which needs fixed-length feature vectors of documents as inputs, and outputs the ranking models learned by iteratively reducing the pre-defined ranking loss. The document features are always extracted based on classic textual statistics, and different features contribute differently to ranking performance. Given that well-defined features would contribute more to the retrieval performance, we investigate the usage of autoencoders to enrich the feature representations of documents. Autoencoders, as basic building blocks of deep neural networks, have been successfully used in many text mining tasks for generating effective features. To enrich the feature space for learning to rank, we introduce supervision into the loss functions of autoencoders. Specifically, we first train a linear ranking model on the training data, and then incorporate the learned weights into the reconstruction costs of an autoencoder. Meanwhile, we accumulate the costs of documents for a given query with query-level constraints for producing more useful features. We evaluate the effectiveness of our model on three LETOR datasets, and show that our model can generate effective document features to improve the retrieval performance.

【Keywords】: autoencoders; learning to rank; semi-supervised learning
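The supervised reconstruction cost described in the abstract above can be illustrated with a small sketch. The abstract does not give the exact formula, so the form used here is an assumption: each feature's squared reconstruction error is scaled by the absolute value of the corresponding linear ranking weight, and the per-document costs are averaged over the documents of one query. The function names are hypothetical.

```python
from typing import Sequence


def weighted_reconstruction_cost(
    x: Sequence[float], x_hat: Sequence[float], rank_weights: Sequence[float]
) -> float:
    """Squared reconstruction error with feature j scaled by |w_j|.

    Assumption: features that matter more to the linear ranker (larger
    |w_j|) are penalized more when reconstructed poorly.
    """
    return sum(abs(w) * (a - b) ** 2 for a, b, w in zip(x, x_hat, rank_weights))


def query_level_cost(
    docs: Sequence[Sequence[float]],
    recons: Sequence[Sequence[float]],
    rank_weights: Sequence[float],
) -> float:
    """Average the per-document costs over all documents of one query."""
    costs = [
        weighted_reconstruction_cost(x, x_hat, rank_weights)
        for x, x_hat in zip(docs, recons)
    ]
    return sum(costs) / len(costs)
```

In a full system this cost would be minimized with respect to the autoencoder parameters; the sketch only evaluates it for given reconstructions.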

307. MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis.

【Paper Link】【Pages】:2399-2402

【Authors】: Nan Xu ; Wenji Mao

【Abstract】: With the prevalence of more diverse and multiform user-generated content in social networking sites, multimodal sentiment analysis has become an increasingly important research topic in recent years. Previous work on multimodal sentiment analysis directly extracts a feature representation of each modality and fuses these features for classification. Consequently, some detailed semantic information for sentiment analysis and the correlation between image and text have been ignored. In this paper, we propose a deep semantic network, namely MultiSentiNet, for multimodal sentiment analysis. We first identify object and scene as salient detectors to extract deep semantic features of images. We then propose a visual feature guided attention LSTM model to extract words that are important to understand the sentiment of the whole tweet and aggregate the representation of those informative words with visual semantic features, object and scene. The experiments on two publicly available sentiment datasets verify the effectiveness of our MultiSentiNet model and show that our extracted semantic features demonstrate high correlations with human sentiments.

【Keywords】: attentional mechanism; deep neural network; multimodal sentiment analysis; visual semantic features

308. Attentive Graph-based Recursive Neural Network for Collective Vertex Classification.

【Paper Link】【Pages】:2403-2406

【Authors】: Qiongkai Xu ; Qing Wang ; Chenchen Xu ; Lizhen Qu

【Abstract】: Vertex classification is a critical task in graph analysis, where both contents and linkage of vertices are incorporated during classification. Recently, researchers proposed using deep neural network to build an end-to-end framework, which can capture both local content and structure information. These approaches were proved effective in incorporating semantic meanings of neighbouring vertices, while the usefulness of this information was not properly considered. In this paper, we propose an Attentive Graph-based Recursive Neural Network (AGRNN), which exerts attention on neural network to make our model focus on vertices with more relevant semantic information. We evaluated our approach on three real-world datasets and also datasets with synthetic noise. Our experimental results show that AGRNN achieves the state-of-the-art performance, in terms of effectiveness and robustness. We have also illustrated some attention weight samples to demonstrate the rationality of our model.

【Keywords】: attention model; collective vertex classification; recursive neural network

309. Bayesian Heteroscedastic Matrix Factorization for Conversion Rate Prediction.

【Paper Link】【Pages】:2407-2410

【Authors】: Hongxia Yang

【Abstract】: Display Advertising has generated billions of revenue and originated hundreds of scientific papers and patents, yet the accuracy of prediction technologies leaves much to be desired. Conversion rates (CVR) predictions can often be formulated as a matrix or tensor completion problem where each dimension consists of thousands or even hundreds of thousands of levels. Observed entries are typically extremely sparse, comprising only 0.01% to 1% of the entire matrix or tensor with highly unevenly distributed conversion as well as impression sizes. To deal with these issues, we propose an extension of matrix factorization, namely Bayesian Heteroscedastic Matrix Factorization (BHMF), with three key features. First, BHMF accounts for the fact that each observed entry of a matrix has different magnitude of errors depending on the corresponding impression sizes. We extend the previous research on empirical instance-wise weighted matrix factorization with rigorous probabilistic modelling framework. Second, BHMF is amenable to an efficient Bayesian inference algorithm that is scalable to high dimensional matrices. Compared to the optimization based training, it is more robust to the choices of dimensions of the latent factors as well as regularization parameters. Last, the Bayesian approach provides predictive uncertainty estimations for unseen entries that is capable of dealing with cold-start problems. This can potentially affect a good amount of revenue in the real time bidding (RTB) environment. We focus on matrix CVR predictions in this paper but the proposed BHMF can be naturally extended and applied to higher dimensional tensors. We demonstrate the substantial improvement of our model in predictive capabilities on Yahoo! demand side platform (DSP) BrightRoll.

【Keywords】: bayesian matrix factorization; conversion rate prediction; heteroscedastic
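The idea that each observed entry carries a different error magnitude depending on its impression size can be sketched as a Gaussian negative log-likelihood with a per-entry variance. This is only an illustration, not the BHMF model itself: the assumption that the variance equals a constant `base_var` divided by the number of impressions, and the function name, are introduced here for the example.

```python
import math
from typing import Sequence


def heteroscedastic_nll(
    observed_cvr: Sequence[float],
    predicted_cvr: Sequence[float],
    impressions: Sequence[int],
    base_var: float = 1.0,
) -> float:
    """Gaussian negative log-likelihood with per-entry variance.

    Assumption: an entry backed by n impressions has variance base_var / n,
    so residuals on heavily observed entries are penalized more strongly.
    """
    total = 0.0
    for y, mu, n in zip(observed_cvr, predicted_cvr, impressions):
        var = base_var / n
        total += 0.5 * math.log(2 * math.pi * var) + (y - mu) ** 2 / (2 * var)
    return total
```

With this weighting, the same residual of 0.1 costs more on an entry with 1000 impressions than on one with 10, which is the behavior a homoscedastic squared loss cannot express.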

310. SERM: A Recurrent Model for Next Location Prediction in Semantic Trajectories.

【Paper Link】【Pages】:2411-2414

【Authors】: Di Yao ; Chao Zhang ; Jian-Hui Huang ; Jingping Bi

【Abstract】: Predicting the next location a user tends to visit is an important task for applications like location-based advertising, traffic planning, and tour recommendation. We consider the next location prediction problem for semantic trajectory data, wherein each GPS record is attached with a text message that describes the user's activity. In semantic trajectories, the confluence of spatiotemporal transitions and textual messages indicates user intents at a fine granularity and has great potential in improving location prediction accuracies. Nevertheless, existing methods designed for GPS trajectories fall short in capturing latent user intents for such semantics-enriched trajectory data. We propose a method named semantics-enriched recurrent model (SERM). SERM jointly learns the embeddings of multiple factors (user, location, time, keyword) and the transition parameters of a recurrent neural network in a unified framework. Therefore, it effectively captures semantics-aware spatiotemporal transition regularities to improve location prediction accuracies. Our experiments on two real-life semantic trajectory datasets show that SERM achieves significant improvements over state-of-the-art methods.

【Keywords】: location prediction; rnn; semantic trajectory

311. Low-Rank Matrix Completion over Finite Abelian Group Algebras for Context-Aware Recommendation.

【Paper Link】【Pages】:2415-2418

【Authors】: Chia-An Yu ; Tak-Shing Chan ; Yi-Hsuan Yang

【Abstract】: The incorporation of contextual information is an important part of context-aware recommendation. Many context-aware recommendation systems adopt tensor completion to include contextual information. However, the symmetries between dimensions of a tensor induce an unreasonable assumption that users, items and contexts should be treated equally in recommender systems. In this paper, we address this by using matrices over finite abelian group algebra (AGA) to model context-aware interactions between users and items. Specifically, we formulate context-aware recommendation as a low-rank matrix completion problem over AGA (MC-AGA) and derive a new algorithm using the inexact augmented Lagrange multiplier method. We then test MC-AGA on two real-world datasets: one containing implicit feedback and one with explicit feedback. Experiment results show that MC-AGA outperforms not only existing tensor completion algorithms but also recommendation systems with other context-aware representations.

【Keywords】: context-aware recommendation; group algebra; low-rank modeling; matrix/tensor completion

312. Spectrum-based Deep Neural Networks for Fraud Detection.

【Paper Link】【Pages】:2419-2422

【Authors】: Shuhan Yuan ; Xintao Wu ; Jun Li ; Aidong Lu

【Abstract】: In this paper, we focus on fraud detection on a signed graph with only a small set of labeled training data. We propose a novel framework that combines deep neural networks and spectral graph analysis. In particular, we use the node projection (called as spectral coordinate) in the low dimensional spectral space of the graph's adjacency matrix as the input of deep neural networks. Spectral coordinates in the spectral space capture the most useful topology information of the network. Due to the small dimension of spectral coordinates (compared with the dimension of the adjacency matrix derived from a graph), training deep neural networks becomes feasible. We develop and evaluate two neural networks, deep autoencoder and convolutional neural network, in our fraud detection framework. Experimental results on a real signed graph show that our spectrum based deep neural networks are effective in fraud detection.

【Keywords】: deep neural networks; fraud detection; spectrum
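The spectral coordinates mentioned in the abstract above (each node's projection onto a few eigenvectors of the adjacency matrix) can be sketched with the standard library alone. The sketch uses power iteration with deflation, which is a swapped-in textbook technique rather than the authors' procedure; it selects the eigenvectors whose eigenvalues are largest in magnitude and assumes a symmetric adjacency matrix with no ties in magnitude for the leading eigenvalue. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mat_vec(a: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Multiply matrix a by vector v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def power_iteration(
    a: Sequence[Sequence[float]], iters: int = 500
) -> Tuple[float, List[float]]:
    """Return (eigenvalue, unit eigenvector) of largest magnitude."""
    n = len(a)
    v = [1.0 + i for i in range(n)]  # deterministic, non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = mat_vec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigval = sum(x * y for x, y in zip(v, mat_vec(a, v)))  # Rayleigh quotient
    return eigval, v


def spectral_coordinates(adj: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Row i holds node i's coordinates on k dominant eigenvectors."""
    a = [[float(x) for x in row] for row in adj]
    n = len(a)
    vectors = []
    for _ in range(k):
        lam, v = power_iteration(a)
        vectors.append(v)
        # Deflate: remove the component just found (valid for symmetric a).
        for i in range(n):
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return [[vec[i] for vec in vectors] for i in range(n)]
```

For a signed graph, `adj` would contain +1 and -1 entries for positive and negative edges. The resulting k-dimensional rows are far smaller than full adjacency rows, which is the property the abstract cites as making neural network training feasible.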

313. RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation.

【Paper Link】【Pages】:2423-2426

【Authors】: Yu Zhang ; Wei Wei ; Binxuan Huang ; Kathleen M. Carley ; Yan Zhang

【Abstract】: Real-time location inference of social media users is fundamental to some spatial applications such as localized search and event detection. While tweet text is the most commonly used feature in location estimation, most of the prior works suffer from either the noise or the sparsity of textual features. In this paper, we aim to tackle these two problems. We use topic modeling as a building block to characterize the geographic topic variation and lexical variation so that "one-hot" encoding vectors will no longer be directly used. We also incorporate other features which can be extracted through the Twitter streaming API to overcome the noise problem. Experimental results show that our RATE algorithm outperforms several benchmark methods, both in the precision of region classification and the mean distance error of latitude and longitude regression.

【Keywords】: location inference; microblog; real-time; text mining

314. Missing Value Learning.

【Paper Link】【Pages】:2427-2430

【Authors】: Zhi-Lin Zhao ; Chang-Dong Wang ; Kun-Yu Lin ; Jian-Huang Lai

【Abstract】: Missing values are common in many machine learning problems, and much effort has been made to handle missing data to improve the performance of the learned model. Sometimes, our task is not to train a model using unlabeled/labeled data with missing values but to process examples according to the values of some specified features. So, there is an urgent need to develop a method to predict those missing values. In this paper, we focus on learning from the known values to recover each missing value as close as possible to the true one. Predicting missing values is difficult because we do not know the structure of the data matrix and some missing values may relate to other missing values. We solve the problem by recovering the complete data matrix under three reasonable constraints: feature relationship, upper recovery error bound and class relationship. The proposed algorithm can deal with both unlabeled and labeled data, and a generative adversarial idea is used on labeled data to transfer knowledge. Extensive experiments have been conducted to show the effectiveness of the proposed algorithms.

【Keywords】: generative adversarial; missing value; supervised learning; unsupervised learning

315. Local Ensemble across Multiple Sources for Collaborative Filtering.

【Paper Link】【Pages】:2431-2434

【Authors】: Jing Zheng ; Fuzhen Zhuang ; Chuan Shi

【Abstract】: Recently, Transfer Collaborative Filtering (TCF) methods across multiple source domains, which employ knowledge from different source domains to improve the recommendation performance in the target domain, have been applied in recommender systems. The existing multi-source TCF methods either require overlapping objects in different domains or simply re-weight domains to merge them together. In this paper, we propose a novel LOcal ENsemble framework across multiple source domains for collaborative filtering (called LOEN for short), where weights of multiple sources for each missing rating in the target domain are determined according to their corresponding local structures. Compared with the previous TCF methods, LOEN does not require overlapping data and considers the divergence of sources through exploiting the local structures of ratings, which allows LOEN to be more general and effective. Experiments conducted on real datasets validate the effectiveness of LOEN, especially for knowledge transfer across unrelated source domains.

【Keywords】: local ensemble; recommender system; transfer collaborative filtering

【Paper Link】【Pages】:2435-2438

【Authors】: Endong Zhu ; Yanghui Rao ; Haoran Xie ; Yuwei Liu ; Jian Yin ; Fu Lee Wang

【Abstract】: This paper addresses the task of cross-domain social emotion classification of online documents. The cross-domain task is formulated as using abundant labeled documents from a source domain and a small amount of labeled documents from a target domain, to predict the emotion of unlabeled documents in the target domain. Although several cross-domain emotion classification algorithms have been proposed, they require that feature distributions of different domains share a sufficient overlapping, which is hard to meet in practical applications. This paper proposes a novel framework, which uses the emotion distribution of training documents at the cluster level, to alleviate the aforementioned issue. Experimental results on two datasets show the effectiveness of our proposed model on cross-domain social emotion classification.

【Keywords】: clustering; cross-domain classification; emotion detection

317. Knowledge-based Question Answering by Jointly Generating, Copying and Paraphrasing.

【Paper Link】【Pages】:2439-2442

【Authors】: Shuguang Zhu ; Xiang Cheng ; Sen Su ; Shuang Lang

【Abstract】: With the development of large-scale knowledge bases, people are building systems which give simple answers to questions based on consolidated facts. In this paper, we focus on simple questions, which ask about only a subject and relation in the knowledge base. Observing that certain parts of a question usually overlap with names of its corresponding subject and relation in the knowledge base, we argue that a question is formed by a mixture of copying and generation. To model that, we propose a sequence-to-sequence (seq2seq) architecture which encodes a candidate subject-relation pair and decodes it into the given question, where the decoding probability is used to select the best candidate. In our decoder, the copying mode points to the subject or relation and duplicates its name, while the generating mode summarizes the meaning of the subject-relation pair and produces a word to smooth the question. Realizing that, although a subject or relation is sometimes pointed to, different names or keywords might be used, we also incorporate a paraphrasing mode to supplement the copying mode using an automatically mined lexicon. Extensive experiments on the largest dataset exhibit our better performance compared with the state-of-the-art methods.

【Keywords】: knowledge base; neural network; question answering

Demonstrations (alphabetical by lead authors' last names) 29

318. PODIUM: Procuring Opinions from Diverse Users in a Multi-Dimensional World.

【Paper Link】【Pages】:2443-2446

【Authors】: Yael Amsterdamer ; Oded Goldreich

【Abstract】: The procurement of opinions is an important task in many contexts. When selecting members of a certain population to ask for their opinions, diversity inside the selected subset is a central consideration. People with diverse profiles are assumed to provide a wider range of opinions and thus to better represent the opinions of the entire population. However, in platforms with a large user base such as crowdsourcing applications and social networks, defining and realizing notions of diversity are both nontrivial. The profiles of users typically contain information that is high-dimensional and semantically rich. We present PODIUM, a tool for opinion procurement that accounts for complex user profiles and enables customizable user selection. Beyond selecting a subset of users with diverse profiles, PODIUM produces explanation for the choice of each user and visual aids to compare the selected subset to the entire population on different dimensions. We demonstrate the use of PODIUM on the TripAdvisor user base, which further enables us to examine the ability of our system to predict diverse opinions in user reviews.

【Keywords】:

319. VizQ: A System for Scalable Processing of Visibility Queries in 3D Spatial Databases.

【Paper Link】【Pages】:2447-2450

【Authors】: Arif Arman ; Mohammed Eunus Ali ; Farhana Murtaza Choudhury ; Kaysar Abdullah

【Abstract】: In this demonstration, we present VizQ, an efficient, scalable, and interactive system to process and visualize a comprehensive collection of novel visibility queries in the presence of obstacles in 3D space. Specifically, we demonstrate four types of query processing: (i) k Maximum Visibility Query (kMVQ), which finds k locations with the maximum visibility of a target object; (ii) Visibility Color Map (VCM), where each point in the space is assigned a color value denoting the visibility measure of the target; (iii) Continuous Maximum Visibility (CMV), which continuously finds the location that provides the best view of a moving target; and (iv) Text Visibility Color Map (TVCM), where the VCM is generated considering the readability of text data displayed on a target. We are the first to propose efficient algorithms to run all four types of visibility queries over a database containing a large number of 3D obstacles. We exploit human visibility metrics to design our data structures and algorithms to efficiently process queries, and our approaches outperform baseline approaches by several orders of magnitude both in terms of I/Os and processing time. The link to our demonstration video is https://youtu.be/rcizJtFvQfU.

【Keywords】: location based services; spatial database; visibility query

320. CoreDB: a Data Lake Service.

【Paper Link】【Pages】:2451-2454

【Authors】: Amin Beheshti ; Boualem Benatallah ; Reza Nouri ; Van Munin Chhieng ; HuangTao Xiong ; Xu Zhao

【Abstract】: The continuous improvement in connectivity, storage and data processing capabilities allows access to a data deluge from sensors, social-media, news, user-generated, government and private data sources. Accordingly, in a modern data-oriented landscape, with the advent of various data capture and management technologies, organizations are rapidly shifting to datafication of their processes. In such an environment, analysts may need to deal with a collection of datasets, from relational to NoSQL, that holds a vast amount of data gathered from various private/open data islands, i.e., a Data Lake. Organizing, indexing and querying the growing volume of internal data and metadata in a data lake is challenging and requires various skills and experiences to deal with dozens of new databases and indexing technologies: How to store information items? What technology to use for persisting the data? How to deal with the large volume of streaming data? How to trace and persist information about data? What technology to use for indexing the data? How to query the data lake? To address the above-mentioned challenges, we present CoreDB - an open source data lake service - which offers researchers and developers a single REST API to organize, index and query their data and metadata. CoreDB manages multiple database technologies and offers a built-in design for security and tracing.

【Keywords】: data api; data lake; database service

321. SimMeme: Semantic-Based Meme Search.

【Paper Link】【Pages】:2455-2458

【Authors】: Maya Ekron ; Tova Milo ; Brit Youngmann

【Abstract】: With the proliferation of social image-sharing applications, image search becomes an increasingly common activity. In this work, we focus on a particular class of images that convey semantic meaning beyond the visual appearance, and whose search presents particular challenges. A prominent example is Memes, an emerging popular type of captioned pictures, which we will use in this demo to demonstrate our solution. Unlike in conventional image-search, visually similar Memes may reflect different concepts. The intent is sometimes captured by user annotations, but these too are often incomplete and ambiguous. Thus, a deeper analysis of the semantic relations among Memes is required for an accurate search. To address this problem, we present SimMeme, a semantic aware search engine for Memes. SimMeme uses a generic graph-based data model that aligns all the information available about the Memes with a semantic ontology. A novel similarity measure that interweaves common image, textual, structural and semantic similarities into one holistic measure is employed to effectively answer user queries. We will demonstrate the operation of SimMeme over a large repository of real-life annotated Memes which we have constructed by web crawling and crowd annotations, allowing users to appreciate the quality of the search results as well as the execution efficiency.

【Keywords】: image-search; information-network; semantic; similarity

322. SummIt: A Tool for Extractive Summarization, Discovery and Analysis.

【Paper Link】【Pages】:2459-2462

【Authors】: Guy Feigenblat ; Odellia Boni ; Haggai Roitman ; David Konopnicki

【Abstract】: We propose to demonstrate SummIt -- a tool for extractive summarization, discovery and analysis. The main goal of SummIt is to provide consumable summaries that are driven by users' information intents. To this end, SummIt discovers and analyzes potential intents that can be used for summarization. Given an intent, SummIt generates a summary based on a novel unsupervised, query-focused, extractive, multi-document summarization approach. Using visualization aids, SummIt further allows users to analyze a given summary and explore both its narrow and broader context.

【Keywords】: discovery; query-focused; search intents; summarization; unsupervised

323. Rapid Analysis of Network Connectivity.

【Paper Link】【Pages】:2463-2466

【Authors】: Scott Freitas ; Hanghang Tong ; Nan Cao ; Yinglong Xia

【Abstract】: This research focuses on accelerating the computational time of two base network algorithms (k-simple shortest paths and minimum spanning tree for a subset of nodes), cornerstones behind a variety of network connectivity mining tasks, with the goal of rapidly finding network pathways and trees using a set of user-specific query nodes. To facilitate this process we utilize: (1) multi-threaded algorithm variations, (2) network re-use for subsequent queries and (3) a novel algorithm, Key Neighboring Vertices (KNV), to reduce the network search space. The proposed KNV algorithm serves a dual purpose: (a) to reduce the computation time for algorithmic analysis and (b) to identify key vertices in the network. Empirical results indicate this combination of techniques significantly improves the baseline performance of both algorithms. We have also developed a web platform utilizing the proposed network algorithms to enable researchers and practitioners to both visualize and interact with their datasets (PathFinder: http://www.path-finder.io).

【Keywords】: k-simple shortest paths; mst; search space reduction; multi-threading; parallel processing; network visualization; seed nodes

324. HyPerInsight: Data Exploration Deep Inside HyPer.

【Paper Link】【Pages】:2467-2470

【Authors】: Nina Hubig ; Linnea Passing ; Maximilian E. Schüle ; Dimitri Vorona ; Alfons Kemper ; Thomas Neumann

【Abstract】: Nowadays we are drowning in data of many varieties. For all these mixed types and categories of data there exist even more different analysis approaches, often done in single hand-written solutions. We propose to extend HyPer, a main memory database system, to a uniform data agent platform following the one-system-fits-all approach for solving a wide variety of data analysis problems. We achieve this by applying a flexible operator concept to a set of various important data exploration algorithms. With that, HyPer solves analytical questions using clustering, classification, association rule mining and graph mining besides standard HTAP (Hybrid Transaction and Analytical Processing) workloads on the same database state. It makes it possible to approach the full variety and volume of HTAP extended for data exploration (HTAPx), and only needs knowledge of already introduced SQL extensions that are automatically optimized by the database's standard optimizer. In this demo we will focus on the benefits and flexibility we create by using the SQL extensions for several well-known mining workloads. In our interactive web interface for this project, named HyPerInsight, we demonstrate how HyPer outperforms the best open source competitor Apache Spark in common use cases in social media, geo-data, recommender systems and several others.

【Keywords】: apriori; database operators; dbscan; hyper; k-means; query processing; sql

325. Interactive System for Reasoning about Document Age.

【Paper Link】【Pages】:2471-2474

【Authors】: Adam Jatowt ; Ricardo Campos

【Abstract】: Recently, many historical texts have become digitized and made accessible for search and browsing. Professionals who work with collections of such texts often need to verify the correctness of documents' key metadata - their creation dates. In this paper, we demonstrate an interactive system for estimating the age of documents. It may be useful not only for tagging a large number of undated documents, but also for verifying already known timestamps. In order to infer probable dates, we rely on a large-scale lexical corpus, Google Books Ngrams. Besides estimating the document creation year, the system also outputs evidence to support the age detection and reasoning process and allows testing different hypotheses about a document's age.

【Keywords】: document metadata; document timestamping; historical texts
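One simple way to infer a probable date from per-year word statistics, as the abstract above describes at a high level, is to score every candidate year by how likely the document's words are under that year's word distribution. The sketch below is a generic smoothed unigram scorer, not the demonstrated system's method; the function name, the add-one smoothing, and the toy counts in the usage note are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def estimate_year(
    tokens: List[str],
    yearly_counts: Dict[int, Dict[str, int]],
    smoothing: float = 1.0,
) -> Tuple[int, Dict[int, float]]:
    """Return the best-scoring year and the log-likelihood of every year.

    Each year is scored by the sum of smoothed log relative frequencies
    of the document's tokens in that year's counts.
    """
    vocab = {w for counts in yearly_counts.values() for w in counts} | set(tokens)
    scores = {}
    for year, counts in yearly_counts.items():
        total = sum(counts.values()) + smoothing * len(vocab)
        scores[year] = sum(
            math.log((counts.get(w, 0) + smoothing) / total) for w in tokens
        )
    best = max(scores, key=scores.get)
    return best, scores
```

With hypothetical counts such as `{1900: {"telegraph": 50, "carriage": 40}, 2000: {"telegraph": 2, "carriage": 3, "internet": 85}}`, a document containing mostly "internet" scores higher for 2000. The per-year scores can double as the kind of supporting evidence a user might inspect when testing a hypothesis about a document's age.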

326. SemFacet: Making Hard Faceted Search Easier.

【Paper Link】【Pages】:2475-2478

【Authors】: Evgeny Kharlamov ; Luca Giacomelli ; Evgeny Sherkhonov ; Bernardo Cuenca Grau ; Egor V. Kostylev ; Ian Horrocks

【Abstract】: Faceted search is a prominent search paradigm that became the standard in many Web applications and has also been recently proposed as a suitable paradigm for exploring and querying RDF graphs. One of the main challenges that hamper the usability of faceted search systems, especially in the RDF context, is information overload, that is, when the size of faceted interfaces becomes comparable to the size of the data over which the search is performed. In this demo we present (an extension of) our faceted search system SemFacet and focus on features that address the information overload: ranking, aggregation, and reachability. The demo attendees will be able to try our system on an RDF graph that models online shopping over catalogs with up to millions of products.

【Keywords】: aggregation; faceted search; ranking; recursion

327. Metacrate: Organize and Analyze Millions of Data Profiles.

【Paper Link】【Pages】:2483-2486

【Authors】: Sebastian Kruse ; David Hahn ; Marius Walter ; Felix Naumann

【Abstract】: Databases are one of the great success stories in IT. However, they have been continuously increasing in complexity, hampering operation, maintenance, and upgrades. To face this complexity, sophisticated methods for schema summarization, data cleaning, information integration, and many more have been devised that usually rely on data profiles, such as data statistics, signatures, and integrity constraints. Such data profiles are often extracted by automatic algorithms, which entails various problems: The profiles can be unfiltered and huge in volume; different profile types require different complex data structures; and the various profile types are not integrated with each other. We introduce Metacrate, a system to store, organize, and analyze data profiles of relational databases, thereby following the proven design of databases. In particular, we (i) propose a logical and a physical data model to store all kinds of data profiles in a scalable fashion; (ii) describe an analytics layer to query, integrate, and analyze the profiles efficiently; and (iii) implement on top a library of established algorithms to serve use cases, such as schema discovery, database refactoring, and data cleaning.

【Keywords】: data profiling; database administration; metadata management

328. SemVis: Semantic Visualization for Interactive Topical Analysis.

【Paper Link】【Pages】:2487-2490

【Authors】: Tuan M. V. Le ; Hady W. Lauw

【Abstract】: Exploratory analysis of a text corpus is an important task that can be aided by informative visualization. One spatially-oriented form of document visualization is a scatterplot, whereby every document is associated with a coordinate, and relationships among documents can be perceived through their spatial distances. Semantic visualization further infuses the visualization space with latent semantics, by incorporating a topic model that has a representation in the visualization space, allowing users to also perceive relationships between documents and topics spatially. We illustrate how a semantic visualization system called SemVis could be used to navigate a text corpus interactively and topically via browsing and searching.

【Keywords】: interactive topical analysis; semantic visualization; topic model

329. Exploring the Veracity of Online Claims with BackDrop.

【Paper Link】【Pages】:2491-2494

【Authors】: Julien Leblay ; Weiling Chen ; Steven J. Lynden

【Abstract】: Using the Web to assess the validity of claims presents many challenges. Whether the data comes from social networks or established media outlets, individual or institutional data publishers, one has to deal with scale and heterogeneity, as well as with incomplete, imprecise and sometimes outright false information. All of these are closely studied issues. Yet in many situations, the claims under scrutiny, and the data itself, have some inherent context-dependency making them impossible to completely disprove, or evaluate through a simple (e.g. scalar) measure. While data models used on the Web typically deal with universal knowledge, we believe the time has come to put context, such as time or provenance, at the forefront and watch knowledge through multiple lenses. We present BackDrop, an application that enables annotating knowledge and ontologies found online to explore how the veracity of claims varies with context. BackDrop comes in the form of a Web interface, in which users can interactively populate and annotate knowledge bases, and explore under which circumstances certain claims are more or less credible.

【Keywords】: contextual reasoning; fact checking; web data

330. AliMe Assist : An Intelligent Assistant for Creating an Innovative E-commerce Experience.

【Paper Link】【Pages】:2495-2498

【Authors】: Feng-Lin Li ; Minghui Qiu ; Haiqing Chen ; Xiongwei Wang ; Xing Gao ; Jun Huang ; Juwei Ren ; Zhongzhou Zhao ; Weipeng Zhao ; Lei Wang ; Guwei Jin ; Wei Chu

【Abstract】: We present AliMe Assist, an intelligent assistant designed for creating an innovative online shopping experience in E-commerce. Based on question answering (QA), AliMe Assist offers assistance service, customer service, and chatting service. It is able to take voice and text input, incorporate context to QA, and support multi-round interaction. Currently, it serves millions of customer questions per day and is able to address 85% of them. In this paper, we demonstrate the system, present the underlying techniques, and share our experience in dealing with real-world QA in the E-commerce field.

【Keywords】: convolutional neural network; knowledge graph; question answering; rerank; semantic normalization; sequence-to-sequence

331. Public Transportation Mode Detection from Cellular Data.

【Paper Link】【Pages】:2499-2502

【Authors】: Guanyao Li ; Chun-Jie Chen ; Sheng-Yun Huang ; Ai-Jou Chou ; Xiaochuan Gou ; Wen-Chih Peng ; Chih-Wei Yi

【Abstract】: Public transportation is essential in people's daily life and it is crucial to understand how people move around the city. Some prior works have exploited GPS, Wi-Fi or Bluetooth to collect data, in which extra sensors or devices were needed. Other works utilized data from smart card systems. However, some public transportation systems have their own smart card system and the smart card data cannot include all kinds of transportation modes, which makes it unsuitable for our study. Nowadays, each user has his/her own mobile phone, and from the cellular data of mobile phone service providers, it is possible to know the users' transportation mode and the fine-grained crowd flows. As such, given a set of cellular data, we propose a system for public transportation mode detection, crowd density estimation, and crowd flow estimation. Note that we only have cellular data, with no extra sensor data collected from users' mobile phones. In this paper, we refer to some external data sources (e.g., the bus routing networks) to identify transportation modes. Users' cellular data sometimes have uncertainty about user location information. Thus, we propose two approaches for different transportation mode detection considering the cell tower properties, spatial and temporal factors. We demonstrate our system using the data from Chunghwa Telecom, which is the largest telecommunication company in Taiwan, to show the usefulness of our system.

【Keywords】: crowd density and flow estimation; smart cities; transportation mode detection; urban computing

332. Urbanity: A System for Interactive Exploration of Urban Dynamics from Streaming Human Sensing Data.

【Paper Link】【Pages】:2503-2506

【Authors】: Mengxiong Liu ; Zhengchao Liu ; Chao Zhang ; Keyang Zhang ; Quan Yuan ; Tim Hanratty ; Jiawei Han

【Abstract】: With the urbanization process worldwide, modeling the dynamics of people's activities in urban environments has become a crucial socioeconomic task. We present Urbanity, a novel system that leverages geo-tagged social media streams for modeling urban dynamics. Urbanity automatically discovers the spatial and temporal hotspots where people's activities concentrate; and captures the cross-modal correlations among location, time, and text by jointly mapping different units into the same latent space. With Urbanity, the end users are able to use flexible query schemes to retrieve different resources (e.g., POIs, hotspots, hours, activities) that meet their needs. Furthermore, Urbanity can handle continuous streams to update the learned model, thus revealing up-to-date patterns of urban activities.

【Keywords】: activity modeling; multimodal embedding; social media; spatiotemporal data; urban computing

333. SemDia: Semantic Rule-Based Equipment Diagnostics Tool.

【Paper Link】【Pages】:2507-2510

【Authors】: Gulnar Mehdi ; Evgeny Kharlamov ; Ognjen Savkovic ; Guohui Xiao ; Elem Güzel Kalayci ; Sebastian Brandt ; Ian Horrocks ; Mikhail Roshchin ; Thomas A. Runkler

【Abstract】: Rule-based diagnostics of power generating equipment is an important task in industry. In this demo we present how semantic technologies can enhance diagnostics. In particular, we present our semantic rule language sigRL that is inspired by the real diagnostic languages in Siemens. SigRL allows users to write compact yet powerful diagnostic programs by relying on a high level data independent vocabulary, diagnostic ontologies, and queries over these ontologies. We present our diagnostic system SemDia. The attendees will be able to write diagnostic programs in SemDia using sigRL over 50 Siemens turbines. We also present how such programs can be automatically verified for redundancy and inconsistency. Moreover, the attendees will see the provenance service that SemDia provides to trace the origin of diagnostic results.

【Keywords】: diagnostic systems; ontologies; rules; turbines

334. TaCLe: Learning Constraints in Tabular Data.

【Paper Link】【Pages】:2511-2514

【Authors】: Sergey Paramonov ; Samuel Kolb ; Tias Guns ; Luc De Raedt

【Abstract】: Spreadsheet data is widely used today by many different people and across industries. However, writing, maintaining and identifying good formulae for spreadsheets can be time consuming and error-prone. To address this issue we have introduced the TaCLe system (Tabular Constraint Learner). The system tackles an inverse learning problem: given a plain comma separated file, it reconstructs the spreadsheet formulae that hold in the tables. Two important considerations are the number of cells and constraints to check, and how to deal with multiple formulae for the same cell. Our system reasons over entire rows and columns and has an intuitive user interface for interacting with the learned constraints and data. It can be seen as an intelligent assistance tool for discovering formulae from data. As a result, the user obtains a spreadsheet that can automatically recompute dependent cells when updating or adding data.

【Keywords】: constraint learning; relational learning; spreadsheets

335. An Interactive Framework for Video Surveillance Event Detection and Modeling.

【Paper Link】【Pages】:2515-2518

【Authors】: Fabio Persia ; Fabio Bettini ; Sven Helmer

【Abstract】: We present a framework for high-level event detection in video streams based on a novel temporal extension of relational algebra. With the help of intuitive and interactive graphical user interfaces, a user can have a look at the different layers of our system to gain insights into the inner workings of the system, as well as create new events on the fly and track their processing through the system. As a proof-of-concept we have predefined events on three video surveillance data sets, but we also plan to run a demo with a live video stream generated by a local webcam.

【Keywords】: event query languages; high-level event detection; intervals; surveillance; video stream

336. Storyfinder: Personalized Knowledge Base Construction and Management by Browsing the Web.

【Paper Link】【Pages】:2519-2522

【Authors】: Steffen Remus ; Manuel Kaufmann ; Kathrin Ballweg ; Tatiana von Landesberger ; Chris Biemann

【Abstract】: This paper presents Storyfinder, an application which consists of a browser plugin and a web server backend with the goal to highlight and manage the information contained in web pages by combining techniques from natural language processing and visual analytics. Webpages are analyzed while visiting them by means of natural language processing components, and metadata in the form of named entities and keywords are extracted and stored for further reference. The extracted information is instantaneously highlighted in the web page and stored in a graph of entities and relations. The graph can be inspected and modified. The investigational scope can be set to a single web page, multiple web pages, or the complete set of analyzed web pages in a user's history. The graph view is designed to adhere to standards of visual analytics and information visualization. Storyfinder is available as an open source application. Its benefit for information access is evaluated in a small user study.

【Keywords】: information extraction; knowledge management; visual analytics

【Paper Link】【Pages】:2523-2526

【Authors】: Muhammad Aamir Saleem ; Rohit Kumar ; Toon Calders ; Xike Xie ; Torben Bach Pedersen

【Abstract】: Due to the popularity of social networks with geo-tagged activities, so-called location-based social networks (LBSN), a number of methods have been proposed for influence maximization for applications such as word-of-mouth marketing (WOMM), and out-of-home marketing (OOH). It is thus important to analyze and compare these different approaches. In this demonstration, we present a unified system IMaxer that both provides a complete pipeline of state-of-the-art and novel models and algorithms for influence maximization (IM) and allows users to evaluate and compare IM techniques for a particular scenario. IMaxer allows users to select and transform the required data from raw LBSN datasets. It further provides a unified model that utilizes interactions of nodes in an LBSN, i.e., users and locations, for capturing diverse types of information propagations. On the basis of these interactions, influential nodes can be found and their potential influence can be simulated and visualized using Google Maps and graph visualization APIs. Thus, IMaxer allows users to compare and pick the most suitable IM method in terms of effectiveness and cost.

【Keywords】: influence maximization; location-based social networks; unified system

338. StreamingCube: A Unified Framework for Stream Processing and OLAP Analysis.

【Paper Link】【Pages】:2527-2530

【Authors】: Salman Ahmed Shaikh ; Hiroyuki Kitagawa

【Abstract】: In most streaming applications, the data streams need to be analyzed continuously to make instant decisions exploiting the latest information. Often data streams are multidimensional and are at a low level of abstraction, whereas analysts are interested in multi-level interactive analysis of data streams across several dimensions. On-line analytical processing (OLAP) is a proven technique for such analysis of static data and has also been studied by some researchers for data streams. Traditionally this is achieved by coupling a stream processing engine with an OLAP engine. We believe that coupling multiple systems is not an efficient solution as it results in lower performance (due to the transfer of data between multiple systems), resource wastage (due to replication of data for each coupled system) and increased complexity and maintenance cost. To this end, we present StreamingCube, a unified framework for data stream processing and its interactive OLAP analysis. The proposed framework possesses all the essential operators to process data streams and introduces a new operator, cubify, to maintain OLAP lattice nodes (materialized views) incrementally. The novelty of the introduced cubify operator lies in the incremental maintenance of the materialized views. To demonstrate StreamingCube, a web-based GUI has been developed which enables users to register continuous queries (CQs). Once a CQ has been registered, users can perform different OLAP operations through the GUI for the interactive analysis. The results of the OLAP queries/operations are displayed in the form of tables and graphs.

【Keywords】: Incremental view maintenance; Online analytical processing; Stream processing; Unified framework

339. Product Exploration based on Latent Visual Attributes.

【Paper Link】【Pages】:2531-2534

【Authors】: Tomás Skopal ; Ladislav Peska ; Gregor Kovalcík ; Tomás Grosup ; Jakub Lokoc

【Abstract】: In this demo paper, we present a prototype web application of a product search engine of a fashion e-shop. Although e-shop products consist of full-text description, relational attributes (e.g., price, type, size, color, etc.) as well as visual information (product photo), traditional search engines in e-shops only provide full-text and relational attributes for product filtering. In our retrieval model, we incorporate also the visual information into the search by extracting visual-semantic features using deep convolutional neural networks. Furthermore, visual exploration of the product space using the visual-semantic features (multi-example queries) is used to dynamically discover latent visual attributes that could enhance the original relational schema by fuzzy attributes (e.g., a floral pattern in product). In the demo, we show how these latent attributes could be used to recommend the user preferred products and even outfits (e.g., shoes, bag, jacket) that fit a certain visual style.

【Keywords】: convolutional neural networks; latent visual attributes; outfit recommendation

340. Hierarchical Module Classification in Mixed-initiative Conversational Agent System.

【Paper Link】【Pages】:2535-2538

【Authors】: Sia Xin Yun Suzanna ; Li Lianjie Anthony

【Abstract】: Our operational context is a task-oriented dialog system where no single module satisfactorily addresses the range of conversational queries from humans. Such systems must be equipped with a range of technologies to address semantic, factual, task-oriented, open domain conversations using rule-based, semantic-web, traditional machine learning and deep learning. This raises two key challenges. First, the modules need to be managed and selected appropriately. Second, the complexity of troubleshooting on such systems is high. We address these challenges with a mixed-initiative model that controls conversational logic through hierarchical classification. We also developed an interface to increase interpretability for operators and to aggregate module performance.

【Keywords】: conversational agent; dialogue systems; language modeling; machine learning

341. Blockchain-based Data Management and Analytics for Micro-insurance Applications.

【Paper Link】【Pages】:2539-2542

【Authors】: Hoang Tam Vo ; Lenin Mehedy ; Mukesh K. Mohania ; Ermyas Abebe

【Abstract】: In this paper, we demonstrate a blockchain-based solution for transparently managing and analyzing data in a pay-as-you-go car insurance application. This application allows drivers who rarely use cars to pay insurance premiums only for the particular trips they would like to take. One of the key challenges from a database perspective is how to ensure all the data pertaining to the actual trip and premium payment made by the users are transparently recorded so that every party in the insurance contract including the driver, the insurance company, and the financial institution is confident that the data are tamper-proof and traceable. Another challenge from an information retrieval perspective is how to perform entity matching and pattern matching on customer data as well as their trip and claim history recorded on the blockchain for intelligent fraud detection. Last but not least, the drivers' trip history, once sufficiently collected, can be very valuable for the insurance company to do offline analysis and build statistics on past driving behaviour and past vehicle runtime. These statistics enable the insurance company to offer users transparent and individualized insurance quotes. Towards this end, we develop a blockchain-based solution for micro-insurance applications that transparently keeps records and executes smart contracts depending on runtime conditions while also connecting with off-chain analytic databases.

【Keywords】: blockchain; data analytics; data management; information management; information retrieval

342. CleanCloud: Cleaning Big Data on Cloud.

【Paper Link】【Pages】:2543-2546

【Authors】: Hongzhi Wang ; Xiaoou Ding ; Xiangying Chen ; Jianzhong Li ; Hong Gao

【Abstract】: We describe CleanCloud, a system for cleaning big data based on the Map-Reduce paradigm in the cloud. Using the Map-Reduce paradigm, the system detects and repairs various data quality problems in big data. We demonstrate the following features of CleanCloud: (a) the support for cleaning multiple data quality problems in big data; (b) a visual tool for watching the status of the big data cleaning process and tuning the parameters for data cleaning; (c) the friendly interface for data input and setting as well as cleaned data collection for big data. CleanCloud is a promising system that provides a scalable and effective data cleaning mechanism for big data in either files or databases.

【Keywords】: data cleaning; entity resolution; parallel computing

343. Interactive Analytics System for Exploring Outliers.

【Paper Link】【Pages】:2547-2550

【Authors】: Mingrui Wei ; Lei Cao ; Chris Cormier ; Hui Zheng ; Elke A. Rundensteiner

【Abstract】: ONION is the first system with rich interactive support for efficiently analyzing outliers. ONION features an innovative exploration model that offers an "outlier-centric panorama" into big datasets. The ONION system is composed of an offline preprocessing phase followed by an online exploration phase that supports rich classes of novel exploration operations. As our demonstration illustrates, this enables analysts to interactively explore outliers at near real-time speed even over large datasets. We demonstrate ONION's capabilities with urban planning use cases on the Open Street Maps dataset.

【Keywords】: interactive analytics; outlier detection; parameter exploration

344. Query and Animate Multi-attribute Trajectory Data.

【Paper Link】【Pages】:2551-2554

【Authors】: Jianqiu Xu ; Ralf Hartmut Güting

【Abstract】: The widespread use of GPS-enabled devices has led to huge amounts of trajectory data. In addition to location and time, trajectories are associated with descriptive attributes representing different aspects of real entities, called multi-attribute trajectories. This comes from the combination of several data sources and enables a range of new applications in which users can find interesting trajectories and discover potential relationships that cannot be determined solely based on GPS data. In this demo, we provide the motivation scenario and introduce a system that is developed to integrate standard trajectories (a sequence of timestamped locations) and attributes into one unified framework. The system is able to answer a range of interesting queries on multi-attribute trajectories that are not handled by standard trajectories. The system supports both standard trajectories and multi-attribute trajectories. We demonstrate how to form queries and animate multi-attribute trajectories in the system. To our knowledge, existing moving objects prototype systems do not support multi-attribute trajectories.

【Keywords】: animation; index structure; multi-attribute trajectories; queries

345. ClaimVerif: A Real-time Claim Verification System Using the Web and Fact Databases.

【Paper Link】【Pages】:2555-2558

【Authors】: Shi Zhi ; Yicheng Sun ; Jiayi Liu ; Chao Zhang ; Jiawei Han

【Abstract】: Our society is increasingly digitalized. Every day, a tremendous amount of information is being created, shared, and digested through all kinds of cyber channels. Although people can easily acquire information from various sources (social media, news articles, etc.), the truthfulness of most received information remains unverified. In many real-life scenarios, false information has become the de facto cause that leads to detrimental decision making, and techniques that can automatically filter false information are in high demand. However, verifying whether a piece of information is trustworthy is difficult because: (1) selecting candidate snippets for fact checking is nontrivial; and (2) detecting supporting evidence, i.e. stances, suffers from the difficulty of measuring the similarity between claims and related evidence. We build ClaimVerif, a claim verification system that not only provides credibility assessment for any user-given query claim, but also justifies the assessment results with supporting evidence. ClaimVerif can automatically select the stances from millions of documents and employs two-step training to justify the opinions of the stances. Furthermore, combined with the credibility of stance sources, ClaimVerif degrades the score of stances from untrustworthy sources and alleviates the negative effects from rumor spreaders. Our empirical evaluations show that ClaimVerif achieves both high accuracy and efficiency in different claim verification tasks. It can be highly useful in practical applications by providing multi-dimension analysis for the suspicious statements, including the stances, opinions, source credibility and estimated judgements.

【Keywords】: fact checking; rumor detection; source credibility analysis; text mining

346. POOLSIDE: An Online Probabilistic Knowledge Base for Shopping Decision Support.

【Paper Link】【Pages】:2559-2562

【Authors】: Ping Zhong ; Zhanhuai Li ; Qun Chen ; Yanyan Wang ; Lianping Wang ; Murtadha H. M. Ahmed ; Fengfeng Fan

【Abstract】: We present POOLSIDE, an online PrObabilistic knOwLedge base for ShoppIng DEcision support, that provides an on-target recommendation service based on explicit user requirements. With a natural language interface, POOLSIDE can answer questions in real time. We present how to construct the knowledge base and how to enable real-time response in POOLSIDE. Finally, we demonstrate that POOLSIDE can give high-quality product recommendations with high efficiency. (The demo video can be accessed via the link: https://www.youtube.com/watch?v=D8ALi11CUcc)

【Keywords】: decision support system; knowledge base; markov logic network

Workshops 4

347. Overview of the 4th HistoInformatics Workshop.

【Paper Link】【Pages】:2563-2564

【Authors】: Mohammed Hasanuzzaman ; Gaël Dias ; Adam Jatowt ; Marten Düring ; Antal van den Bosch

【Abstract】: In line with global trends, historical records are increasingly available in forms that computers can process. These ever expanding records (such as scanned books, large-scale corpora, academic papers, maps, photos, audios, videos), either digitally born or reconstructed through digitization pipelines, are too big to be read or viewed manually. Historians, like other humanities researchers, have a keen interest in computational approaches to process and study digitized historical information for research, writing, and dissemination of historical knowledge. In Computer Science, experimental tools and methods are challenged to be validated regarding their relevance for real-world questions and applications. The HistoInformatics workshop series is focused on the challenges and opportunities of data-driven humanities and brings together scientists and scholars at the forefront of this emerging field, at the interface between History, Anthropology, Archaeology, Computer Science and associated disciplines as well as the cultural heritage sector. The 4th HistoInformatics Workshop was a half-day workshop co-located with the 26th ACM International Conference on Information and Knowledge Management (CIKM 2017) in Singapore.

【Keywords】: computational history; cultural heritage; digital history

348. IDM 2017: Workshop on Interpretable Data Mining - Bridging the Gap between Shallow and Deep Models.

【Paper Link】【Pages】:2565-2566

【Authors】: Xia Hu ; Shuiwang Ji

【Abstract】: Intelligent systems built upon complex machine learning and data mining models (e.g., deep neural networks) have shown superior performances on various real-world applications. However, their effectiveness is limited by the difficulty in interpreting the resultant prediction mechanisms or how the results are obtained. In contrast, the results of many simple or shallow models, such as rule-based or tree-based methods, are explainable but not sufficiently accurate. Model interpretability enables the systems to be clearly understood, properly trusted, effectively managed and widely adopted by end users. Interpretations are necessary in applications such as medical diagnosis, fraud detection and object recognition where valid reasons would be significantly helpful, if not necessary, before taking actions based on predictions. This workshop is about interpreting the prediction mechanisms or results of the complex computational models for data mining by taking advantage of simple models which are easier to understand. We wish to exchange ideas on recent approaches to the challenges of model interpretability, identify emerging fields of applications for such techniques, and provide opportunities for relevant interdisciplinary research or projects.

【Keywords】: data mining; deep models; interpretability; machine learning; shallow models

【Paper Link】【Pages】:2567-2568

【Authors】: Manjira Sinha ; Xiangnan He ; Alessandro Bozzon ; Sandya Mannarswamy ; Pradeep Murukannaiah ; Tridib Mukherjee

【Abstract】: In an increasingly digital urban setting, connected and concerned citizens typically voice their opinions on various civic topics via social media. Efficient and scalable analysis of these citizen voices on social media to derive actionable insights is essential to the development of smart cities. The very nature of the data (heterogeneity and dynamism), the scarcity of gold standard annotated corpora, and the need for multi-dimensional analysis across space, time and semantics make urban social media analytics challenging. This workshop is dedicated to the theme of social media analytics for smart cities, with the aim of focusing the interest of the CIKM research community on the challenges in mining social media data for urban informatics. The workshop hopes to foster collaboration between researchers working in information retrieval, social media analytics, and linguistics, social scientists, and civic authorities, to develop scalable and practical systems for capturing and acting upon real world issues of cities as voiced by their citizens in social media. The aim of this workshop is to encourage researchers to develop techniques for urban analytics of social media data, with specific focus on applying these techniques to practical urban informatics applications for smart cities.

【Keywords】: social media analytics; text mining; urban informatics

350. Additional Workshops Co-located with CIKM 2017.

【Paper Link】【Pages】:2569-2570

【Authors】: Marianne Winslett

【Abstract】: Summary of three workshops co-located with CIKM 2017.

【Keywords】: cikm 2017 workshops

Keynote & Invited Talks 4

1. Machine Learning @ Amazon.

2. Deception Detection: When Computers Become Better than Humans.

3. When Deep Learning Meets Transfer Learning.

4. A Hyper-connected World.

Session 1A: Multimedia 4

5. Jointly Modeling Static Visual Appearance and Temporal Pattern for Unsupervised Video Hashing.

6. Construction of a National Scale ENF Map using Online Multimedia Data.

7. Dual Learning for Cross-domain Image Captioning.

8. A New Approach to Compute CNNs for Extremely Large Images.

Session 1B: IR evaluation 4

9. Active Sampling for Large-scale Information Retrieval Evaluation.

10. Intent Based Relevance Estimation from Click Logs.

11. A Comparison of Nuggets and Clusters for Evaluating Timeline Summaries.

12. Sensitive and Scalable Online Evaluation with Theoretical Guarantees.

Session 1C: Sentiment 4

13. Users Are Known by the Company They Keep: Topic Models for Viewpoint Discovery in Social Networks.

14. Aspect-level Sentiment Classification with HEAT (HiErarchical ATtention) Network.

15. Dyadic Memory Networks for Aspect-based Sentiment Analysis.

16. Modeling Language Discrepancy for Cross-Lingual Sentiment Analysis.

Session 1D: Network Embedding 1 4

17. Multi-view Clustering with Graph Embedding for Connectome Analysis.

18. Attributed Signed Network Embedding.

19. Enhancing the Network Embedding Quality with Structural Similarity.

20. On Embedding Uncertain Graphs.

Session 1E: Web/App data 4

21. A Large Scale Prediction Engine for App Install Clicks and Conversions.

22. Building Natural Language Interfaces to Web APIs.

23. UFeed: Refining Web Data Integration Based on User Feedback.

24. Extracting Records from the Web Using a Signal Processing Approach.

Session 1F: Graph data 4

25. A Scalable Graph-Coarsening Based Index for Dynamic Graph Databases.

26. Natural Language Question/Answering: Let Users Talk With The Knowledge Graph.

27. Keyword Search on RDF Graphs - A Query Graph Assembly Approach.

28. Region Representation Learning via Mobility Flow.

Session 2A: Ranking 4

29. Learning Visual Features from Snapshots for Web Search.

30. DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval.

31. Learning to Un-Rank: Quantifying Search Exposure for Users in Online Communities.

32. Balancing Speed and Quality in Online Learning to Rank for Information Retrieval.

Session 2B: Crowdsourcing 1 4

33. Crowd-enabled Pareto-Optimal Objects Finding Employing Multi-Pairwise-Comparison Questions.

34. Destination-aware Task Assignment in Spatial Crowdsourcing.

35. Crowdsourced Selection on Multi-Attribute Data.

36. Select Your Questions Wisely: For Entity Resolution With Crowd Errors.

Session 2C: Recommendation 1 4

37. Reply With: Proactive Recommendation of Email Attachments.

38. Learning and Transferring Social and Item Visibilities for Personalized Recommendation.

39. Joint Topic-Semantic-aware Social Recommendation for Online Voting.

40. Interactive Social Recommendation.

Session 2D: Network Embedding 2 4

41. From Properties to Links: Deep Network Embedding on Incomplete Graphs.

42. Learning Community Embedding with Community Detection and Node Embedding on Graphs.

43. Attributed Network Embedding for Learning in a Dynamic Environment.

44. Learning Node Embeddings in Interaction Graphs.

Session 2E: Skyline Queries 4

45. Efficient Computation of Subspace Skyline over Categorical Domains.

46. Fast Algorithms for Pareto Optimal Group-based Skyline.

47. Probabilistic Skyline on Incomplete Data.

48. Communication-Efficient Distributed Skyline Computation.

Session 2F: Social Media Analysis 4

49. Bringing Salary Transparency to the World: Computing Robust Compensation Insights via LinkedIn Salary.

50. Efficient Document Filtering Using Vector Space Topic Expansion and Pattern-Mining: The Case of Event Detection in Microposts.

51. LARM: A Lifetime Aware Regression Model for Predicting YouTube Video Popularity.

52. Modeling Affinity based Popularity Dynamics.

Session 3A: Spatiotemporal 4

53. Scenic Routes Now: Efficiently Solving the Time-Dependent Arc Orienteering Problem.

54. Modeling Temporal-Spatial Correlations for Crime Prediction.

55. Spatiotemporal Event Forecasting from Incomplete Hyper-local Price Data.

56. Exploiting Spatio-Temporal User Behaviors for User Linkage.

Session 3B: Short text retrieval 4

57. Similarity-based Distant Supervision for Definition Retrieval.

58. Hybrid BiLSTM-Siamese network for FAQ Assistance.

59. Regularized and Retrofitted models for Learning Sentence Representation with Context.

60. Talking to Your TV: Context-Aware Voice Search with Hierarchical Recurrent Neural Networks.

Session 3C: Community Detection 4

61. GPU-Accelerated Graph Clustering via Parallel Label Propagation.

62. Temporally Like-minded User Community Identification through Neural Embeddings.