25. CIKM 2016: Indianapolis, IN, USA

Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016. ACM 【DBLP Link】

Paper Num: 327 || Session Num: 56

Industry Track Demo Papers 19
Industry Track Short Papers 6
Keynote Address 1 1
Keynote Address 2 1
Keynote Address 3 1
Poster Session I: Short Papers 55
Poster Session II: Extended Short Papers 52
Session 1a: Recommendation 4
Session 1b: Deep Learning Applications 4
Session 1c: Document Classification and Labeling 4
Session 1d: Better Queries 4
Session 1e: Better Search 4
Session 1f: Industry Session I 3
Session 2a: Learning to Rank 4
Session 2b: Question Answering 4
Session 2c: Wikipedia 4
Session 2d: Clustering 4
Session 2e: Understanding Text 4
Session 2f: Industry Session II 3
Session 3a: Graph Analytics 4
Session 3b: Event Detection and Analytics 4
Session 3c: Crowdsourcing 4
Session 3d: Mobile 4
Session 3e: Social Networks---Links and Trust 4
Session 3f: Industry Session III 4
Session 4a: Information Retrieval 4
Session 4b: User Behavior and Interfaces 4
Session 4c: Documents 4
Session 4d: Knowledge Mining and Management 4
Session 4e: Truth Discovery 4
Session 4f: Industry Session IV 3
Session 5a: Sentiment and Opinion Mining 4
Session 5b: Time Series 4
Session 5c: Learning for Classification and Prediction 4
Session 5d: Social Media 4
Session 5e: Queries and Search 4
Session 5f: Industry Session V 2
Session 6a: Learning Algorithms 4
Session 6b: Databases and Data Processing 4
Session 6c: Large Graph Processing 4
Session 6d: Information Retrieval II 4
Session 6e: Entity Detection and Analysis 4
Session 6f: Industry Session VI 3
Session 7a: Advertising and Ranking 4
Session 7b: Query Analytics 4
Session 7c: Information Retrieval III 4
Session 7d: Data Mining 4
Session 7e: Network Analytics 4
Session 7f: Industry Session VII 4
Session 8a: Learning 4
Session 8b: Social Networks---Diffusion and Cascades 4
Session 8c: Applications 4
Session 8d: Algorithms 4
Session 8e: High Performance Big Data 4
Session 8f: Industry Session VIII 4
Workshops 6

Keynote Address 1 1

1. Toward Data-Driven Education: CIKM-2016 Keynote.

【Paper Link】【Pages】:3

【Authors】: Rakesh Agrawal

【Abstract】: A program of study can be viewed as a knowledge graph consisting of learning units and relationships between them. Such a knowledge graph provides the core data structure for organizing and navigating learning experiences. We address three issues in this talk. First, how can we synthesize the knowledge graph, given a set of concepts to be covered in the study program. Next, how can we use data mining to identify and correct deficiencies in a knowledge graph. Finally, how can we use data mining to form study groups with the goal of maximizing overall learning. We conclude by pointing out some open research problems.

【Keywords】: data mining; education; enriching educational content; knowledge graph; learning; study groups

Session 1a: Recommendation 4

2. Social Recommendation with Strong and Weak Ties.

【Paper Link】【Pages】:5-14

【Authors】: Xin Wang ; Wei Lu ; Martin Ester ; Can Wang ; Chun Chen

【Abstract】: With the explosive growth of online social networks, it is now well understood that social information is highly helpful to recommender systems. Social recommendation methods are capable of battling the critical cold-start issue, and thus can greatly improve prediction accuracy. The main intuition is that through trust and influence, users are more likely to develop affinity toward items consumed by their social ties. Despite considerable work in social recommendation, little attention has been paid to the important distinctions between strong and weak ties, two well-documented notions in social sciences. In this work, we study the effects of distinguishing strong and weak ties in social recommendation. We use neighbourhood overlap to approximate tie strength and extend the popular Bayesian Personalized Ranking (BPR) model to incorporate the distinction of strong and weak ties. We present an EM-based algorithm that simultaneously classifies strong and weak ties in a social network w.r.t. optimal recommendation accuracy and learns latent feature vectors for all users and all items. We conduct extensive empirical evaluation on four real-world datasets and demonstrate that our proposed method significantly outperforms state-of-the-art pairwise ranking methods in a variety of accuracy metrics.

【Keywords】: recommender systems; user behaviour modelling
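
The abstract above says neighbourhood overlap is used to approximate tie strength. A minimal sketch of one common definition follows: the Jaccard overlap of the two endpoints' neighbour sets, excluding the endpoints themselves. The function name, the toy graph, and this exact definition are assumptions for illustration, not details taken from the paper.

```python
def neighbourhood_overlap(adj, u, v):
    """Shared neighbours of u and v divided by all neighbours of either.

    The endpoints u and v are excluded from both neighbour sets.
    This Jaccard-style definition is an assumption for illustration.
    """
    nu = adj[u] - {v}
    nv = adj[v] - {u}
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)


# Hypothetical undirected friendship graph as adjacency sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}

strong_candidate = neighbourhood_overlap(adj, "a", "b")  # many shared friends
weak_candidate = neighbourhood_overlap(adj, "d", "e")    # no shared friends
```

A higher overlap suggests a stronger tie. According to the abstract, the paper learns the strong/weak classification with an EM-based algorithm rather than fixing a threshold by hand.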

3. Learning Graph-based POI Embedding for Location-based Recommendation.

【Paper Link】【Pages】:15-24

【Authors】: Min Xie ; Hongzhi Yin ; Hao Wang ; Fanjiang Xu ; Weitong Chen ; Sen Wang

【Abstract】: With the rapid prevalence of smart mobile devices and the dramatic proliferation of location-based social networks (LBSNs), location-based recommendation has become an important means to help people discover attractive and interesting points of interest (POIs). However, the extreme sparsity of user-POI matrix and cold-start issue create severe challenges, causing CF-based methods to degrade significantly in their recommendation performance. Moreover, location-based recommendation requires spatiotemporal context awareness and dynamic tracking of the user's latest preferences in a real-time manner. To address these challenges, we stand on recent advances in embedding learning techniques and propose a generic graph-based embedding model, called GE, in this paper. GE jointly captures the sequential effect, geographical influence, temporal cyclic effect and semantic effect in a unified way by embedding the four corresponding relational graphs (POI-POI, POI-Region, POI-Time and POI-Word) into a shared low dimensional space. Then, to support the real-time recommendation, we develop a novel time-decay method to dynamically compute the user's latest preferences based on the embedding of his/her checked-in POIs learnt in the latent space. We conduct extensive experiments to evaluate the performance of our model on two real large-scale datasets, and the experimental results show its superiority over other competitors, especially in recommending cold-start POIs. Besides, we study the contribution of each factor to improve location-based recommendation and find that both sequential effect and temporal cyclic effect play more important roles than geographical influence and semantic effect.

【Keywords】: cold start; dynamic user preference modeling; graph embedding; location-based social network; poi embedding; poi recommendation
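
The abstract above mentions a time-decay method that computes a user's latest preferences from the embeddings of checked-in POIs, but it does not give the formula. The sketch below assumes exponential decay and a weighted sum of embeddings; the decay form, `decay_rate`, the function name, and the toy data are all hypothetical and not taken from the paper.

```python
import math


def user_preference(checkins, poi_embeddings, now, decay_rate=0.1):
    """Weighted sum of POI embeddings, with older check-ins weighted less.

    checkins: list of (poi_id, timestamp) pairs.
    The exponential weight exp(-decay_rate * age) is an assumed form.
    """
    dim = len(next(iter(poi_embeddings.values())))
    pref = [0.0] * dim
    for poi, t in checkins:
        weight = math.exp(-decay_rate * (now - t))
        for k, x in enumerate(poi_embeddings[poi]):
            pref[k] += weight * x
    return pref


# Hypothetical 2-dimensional embeddings and check-in history.
poi_embeddings = {"cafe": [1.0, 0.0], "museum": [0.0, 1.0]}
checkins = [("cafe", 0.0), ("museum", 10.0)]

pref = user_preference(checkins, poi_embeddings, now=10.0)
```

With these inputs the recent museum visit dominates the preference vector, which is the qualitative behaviour a time-decay scheme is meant to produce.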

4. Improving Personalized Trip Recommendation by Avoiding Crowds.

【Paper Link】【Pages】:25-34

【Authors】: Xiaoting Wang ; Christopher Leckie ; Jeffrey Chan ; Kwan Hui Lim ; Tharshan Vaithianathan

【Abstract】: There has been a growing interest in recommending trips for tourists using location-based social networks. The challenge of trip recommendation not only lies in searching for relevant points-of-interest (POIs) to form a personalized trip, but also selecting the best time of day to visit the POIs. Popular POIs can be too crowded during peak times, resulting in long queues and delays. In this work, we propose the Personalized Crowd-aware Trip Recommendation (PersCT) algorithm to recommend personalized trips that also avoid the most crowded times of the POIs. We model the problem as an extension of the Orienteering Problem with multiple constraints. We extract user interests by collaborative filtering and we propose an extension of the Ant Colony Optimisation algorithm to merge user interests with POI popularity and crowdedness data to recommend trips. We evaluate our algorithm using foot traffic information obtained from a real-life pedestrian sensor dataset and user travel histories extracted from a Flickr photo dataset. We show that our algorithm out-performs several benchmarks in achieving a balance between conflicting objectives by satisfying user interests while reducing the crowdedness of the trips.

【Keywords】: location-based social network; orienteering problem; trip recommendation

5. Memory-based Recommendations of Entities for Web Search Users.

【Paper Link】【Pages】:35-44

【Authors】: Ignacio Fernández-Tobías ; Roi Blanco

【Abstract】: Modern search engines have evolved from mere document retrieval systems to platforms that assist the users in discovering new information. In this context, entity recommendation systems exploit query log data to proactively provide the users with suggestions of entities (people, movies, places, etc.) from knowledge bases that are relevant for their current information need. Previous works consider the problem of ranking facts and entities related to the user's current query, or focus on specific recommendation domains requiring supervised selection and extraction of features from knowledge bases. In this paper we propose a set of domain-agnostic methods based on nearest neighbors collaborative filtering that exploit query log data to generate entity suggestions, taking into account the user's full search session. Our experimental results on a large dataset from a commercial search engine show that the proposed methods are able to compute relevant entity recommendations outperforming a number of baselines. Finally, we perform an analysis on a cross-domain scenario using different entity types, and conclude that even if knowing the right target domain is important for providing effective recommendations, some inter-domain user interactions are helpful for the task at hand.

【Keywords】: entity recommendation; recommender systems; web search

Session 1b: Deep Learning Applications 4

6. LICON: A Linear Weighting Scheme for the Contribution of Input Variables in Deep Artificial Neural Networks.

【Paper Link】【Pages】:45-54

【Authors】: Gjergji Kasneci ; Thomas Gottron

【Abstract】: In recent years artificial neural networks have become the method of choice for many pattern recognition tasks. Despite their overwhelming success, a rigorous and easy to interpret mathematical explanation of the influence of input variables on an output produced by a neural network is still missing. We propose a generic framework as well as a concrete method for quantifying the influence of individual input signals on the output computed by a deep neural network. Inspired by the variable weighting scheme in the log-linear combination of variables in logistic regression, the proposed method provides linear models for specific observations of the input variables. This linear model locally approximates the behaviour of the neural network and can be used to quantify the influence of input variables in a principled way. We demonstrate the effectiveness of the proposed method in experiments on various synthetic and real-world datasets.

【Keywords】: artificial neural networks; contribution; explanation; input variables; linear weighting scheme

7. A Deep Relevance Matching Model for Ad-hoc Retrieval.

【Paper Link】【Pages】:55-64

【Authors】: Jiafeng Guo ; Yixing Fan ; Qingyao Ai ; W. Bruce Croft

【Abstract】: In recent years, deep neural networks have led to exciting breakthroughs in speech recognition, computer vision, and natural language processing (NLP) tasks. However, there have been few positive results of deep models on ad-hoc retrieval tasks. This is partially due to the fact that many important characteristics of the ad-hoc retrieval task have not been well addressed in deep models yet. Typically, the ad-hoc retrieval task is formalized as a matching problem between two pieces of text in existing work using deep models, and treated equivalent to many NLP tasks such as paraphrase identification, question answering and automatic conversation. However, we argue that the ad-hoc retrieval task is mainly about relevance matching while most NLP matching tasks concern semantic matching, and there are some fundamental differences between these two matching tasks. Successful relevance matching requires proper handling of the exact matching signals, query term importance, and diverse matching requirements. In this paper, we propose a novel deep relevance matching model (DRMM) for ad-hoc retrieval. Specifically, our model employs a joint deep architecture at the query term level for relevance matching. By using matching histogram mapping, a feed forward matching network, and a term gating network, we can effectively deal with the three relevance matching factors mentioned above. Experimental results on two representative benchmark collections show that our model can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.

【Keywords】: ad-hoc retrieval; neural models; ranking models; relevance matching; semantic matching
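
The abstract above names matching histogram mapping as one component of DRMM. A minimal sketch of the idea follows: for one query term, compute the cosine similarity to every document term and count how many fall into each fixed-width bin over [-1, 1), with one extra bin reserved for exact matches. The bin count, the exact-match tolerance, the function names, and the toy vectors are assumptions for illustration; the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def matching_histogram(query_vec, doc_vecs, num_bins=4):
    """Count document terms per similarity bin for a single query term.

    Bins 0..num_bins-1 split [-1, 1) evenly; the last bin holds exact
    matches (similarity of 1). The layout is an assumed one.
    """
    hist = [0] * (num_bins + 1)
    width = 2.0 / num_bins
    for d in doc_vecs:
        sim = cosine(query_vec, d)
        if sim >= 1.0 - 1e-9:
            hist[num_bins] += 1
        else:
            idx = int((sim + 1.0) / width)
            hist[min(max(idx, 0), num_bins - 1)] += 1
    return hist


# Hypothetical 2-dimensional term embeddings.
query_term = [1.0, 0.0]
doc_terms = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]

hist = matching_histogram(query_term, doc_terms)
```

The resulting fixed-length histogram does not depend on document length. Per the abstract, such histograms feed a feed forward matching network, and a term gating network handles query term importance.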

8. A Neural Network Approach to Quote Recommendation in Writings.

【Paper Link】【Pages】:65-74

【Authors】: Jiwei Tan ; Xiaojun Wan ; Jianguo Xiao

【Abstract】: Quote is a language phenomenon of transcribing the saying of someone else. Proper usage of quote can usually make the statement more elegant and convincing. However, the ability of quote usage is usually limited by the amount of quotes one remembers or knows. Quote recommendation is a task of exploiting abundant quote repositories to help people make better use of quotes while writing. The task is different from conventional recommendation tasks due to the characteristic of quote. A pilot study has explored this task by using a learning to rank framework and manually designed features. However, it is still hard to model the meaning of a quote, which is an interesting and challenging problem. In this paper, we propose a neural network approach based on LSTMs to the quote recommendation task. We directly learn the distributed meaning representations for the contexts and the quotes, and then measure the relevance based on the meaning representations. In particular, we try to represent the words in quotes with specific embeddings, according to the contexts, topics and even author preferences of the quotes. Experimental results on a large dataset show that our proposed approach achieves the state-of-the-art performance and it outperforms several strong baselines.

【Keywords】: deep learning; document recommendation; lstm; quote recommendation

9. Retweet Prediction with Attention-based Deep Neural Network.

【Paper Link】【Pages】:75-84

【Authors】: Qi Zhang ; Yeyun Gong ; Jindou Wu ; Haoran Huang ; Xuanjing Huang

【Abstract】: On Twitter-like social media sites, the re-posting statuses or tweets of other users are usually considered to be the key mechanism for spreading information. How to predict whether a tweet will be retweeted by a user has received increasing attention in recent years. Previous methods studied the problem using various linguistic features, personal information of users, and many other manually constructed features to achieve the task. Feature engineering is usually a laborious task, and it requires external sources that are difficult to obtain or not always available. Recently, deep learning methods have been used in the industry and research community for their ability to learn optimal features automatically, and they can achieve state-of-the-art performance in many tasks, such as natural language processing, computer vision, image classification and so on. In this work, we proposed a novel attention-based deep neural network to incorporate contextual and social information for this task. We used embeddings to represent the user, the user's attention interests, the author and tweet respectively. To train and evaluate the proposed methods, we also constructed a large dataset collected from Twitter. Experimental results showed that the proposed method could achieve better results than the previous state-of-the-art methods.

【Keywords】: attention mechanism; deep neural network; retweet prediction

Session 1c: Document Classification and Labeling 4

10. Effective Document Labeling with Very Few Seed Words: A Topic Model Approach.

【Paper Link】【Pages】:85-94

【Authors】: Chenliang Li ; Jian Xing ; Aixin Sun ; Zongyang Ma

【Abstract】: Developing text classifiers often requires a large number of labeled documents as training examples. However, manually labeling documents is costly and time-consuming. Recently, a few methods have been proposed to label documents by using a small set of relevant keywords for each category, known as dataless text classification. In this paper, we propose a Seed-Guided Topic Model (named STM) for the dataless text classification task. Given a collection of unlabeled documents, and for each category a small set of seed words that are relevant to the semantic meaning of the category, the STM predicts the category labels of the documents through topic influence. STM models two kinds of topics: category-topics and general-topics. Each category-topic is associated with one specific category, representing its semantic meaning. The general-topics capture the global semantic information underlying the whole document collection. STM assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that STM learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then labeled, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that STM consistently outperforms the state-of-the-art dataless text classifiers. In some tasks, STM can also achieve comparable or even better classification accuracy than the state-of-the-art supervised learning solutions. Our experimental results further show that STM is insensitive to the tuning parameters. Stable performance with little variation can be achieved in a broad range of parameter settings, making it a desired choice for real applications.

【Keywords】: dataless text classification; text analysis; topic modeling

11. Cross-lingual Text Classification via Model Translation with Limited Dictionaries.

【Paper Link】【Pages】:95-104

【Authors】: Ruochen Xu ; Yiming Yang ; Hanxiao Liu ; Andrew Hsi

【Abstract】: Cross-lingual text classification (CLTC) refers to the task of classifying documents in different languages into the same taxonomy of categories. An open challenge in CLTC is to classify documents for the languages where labeled training data are not available. Existing approaches rely on the availability of either high-quality machine translation of documents (to the languages where massive training data are available), or rich bilingual dictionaries for effective translation of trained classification models (to the languages where labeled training data are lacking). This paper studies the CLTC challenge under the assumption that neither condition is met. That is, we focus on the problem of translating classification models with highly incomplete bilingual dictionaries. Specifically, we propose two new approaches that combine unsupervised word embedding in different languages, supervised mapping of embedded words across languages, and probabilistic translation of classification models. The approaches show significant performance improvement in CLTC on a benchmark corpus of Reuters news stories (RCV1/RCV2) in English, Spanish, German, French and Chinese and an internal dataset in Uzbek, compared to representative baseline methods using conventional bilingual dictionaries or highly incomplete ones.

【Keywords】: cross-lingual text classification; multilingual text data; transfer learning

12. Semi-supervised Multi-Label Topic Models for Document Classification and Sentence Labeling.

【Paper Link】【Pages】:105-114

【Authors】: Hossein Soleimani ; David J. Miller

【Abstract】: Extracting parts of a text document relevant to a class label is a critical information retrieval task. We propose a semi-supervised multi-label topic model for jointly achieving document and sentence-level class inferences. Under our model, each sentence is associated with only a subset of the document's labels (including possibly none of them), with the label set of the document the union of the labels of all of its sentences. For training, we use both labeled documents, and, typically, a larger set of unlabeled documents. Our model, in a semisupervised fashion, discovers the topics present, learns associations between topics and class labels, predicts labels for new (or unlabeled) documents, and determines label associations for each sentence in every document. For learning, our model does not require any ground-truth labels on sentences. We develop a Hamiltonian Monte Carlo based algorithm for efficiently sampling from the joint label distribution over all sentences, a very high-dimensional discrete space. Our experiments show that our approach outperforms several benchmark methods with respect to both document and sentence-level classification, as well as test set log-likelihood. All code for replicating our experiments is available from https://github.com/hsoleimani/MLTM.

【Keywords】: credit assignment; multi-label classification; semi-supervised learning; topic models

13. Linked Document Embedding for Classification.

【Paper Link】【Pages】:115-124

【Authors】: Suhang Wang ; Jiliang Tang ; Charu C. Aggarwal ; Huan Liu

【Abstract】: Word and document embedding algorithms such as Skip-gram and Paragraph Vector have been proven to help various text analysis tasks such as document classification, document clustering and information retrieval. The vast majority of these algorithms are designed to work with independent and identically distributed documents. However, in many real-world applications, documents are inherently linked. For example, web documents such as blogs and online news often have hyperlinks to other web documents, and scientific articles usually cite other articles. Linked documents present new challenges to traditional document embedding algorithms. In addition, most existing document embedding algorithms are unsupervised and their learned representations may not be optimal for classification when labeling information is available. In this paper, we study the problem of linked document embedding for classification and propose a linked document embedding framework LDE, which combines link and label information with content information to learn document representations for classification. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework. Further experiments are conducted to understand the importance of link and label information in the proposed framework LDE.

【Keywords】: document embedding; linked data; word embedding

Session 1d: Better Queries 4

14. Detecting Promotion Campaigns in Query Auto Completion.

【Paper Link】【Pages】:125-134

【Authors】: Yuli Liu ; Yiqun Liu ; Ke Zhou ; Min Zhang ; Shaoping Ma ; Yue Yin ; Hengliang Luo

【Abstract】: Query Auto Completion (QAC) aims to provide possible suggestions to Web search users from the moment they start entering a query, which is thought to reduce their physical and cognitive efforts in query formulation. However, the QAC has been misused by malicious users, being transformed into a new form of promotion campaign. These malicious users attack the search engines to replace legitimate auto-completion candidate suggestions with manipulated contents. Through this way, they provide a new malicious advertising service to promote their customers' products or services in QAC. To our best knowledge, we are among the first to investigate this new type of Promotion Campaign in QAC (PCQ). Firstly, we look into the causes of PCQ based on practical commercial search query logs. We found that various queries containing certain promotion intents are submitted multiple times to search engines to promote their rankings in QAC. Secondly, an effective promotion query detection framework is proposed by promotion intent propagation on query-user bipartite graph, which takes into account the behavioral characteristics of promotion campaigns. Finally, we extend the query detection framework to promotion target detection to identify the consistent promotion target which is the inherent goal of the promotion campaign. Large-scale manual annotations on practical data set convey both the effectiveness of our proposed algorithm, and an in-depth understanding of PCQ.

【Keywords】: promotion campaign; query auto completion; spam detection

15. A Unified Index for Spatio-Temporal Keyword Queries.

【Paper Link】【Pages】:135-144

【Authors】: Tuan-Anh Hoang-Vu ; Huy T. Vo ; Juliana Freire

【Abstract】: From tweets to urban data sets, there has been an explosion in the volume of textual data that is associated with both temporal and spatial components. Efficiently evaluating queries over these data is challenging. Previous approaches have focused on the spatial aspect. Some used separate indices for space and text, thus incurring the overhead of storing separate indices and joining their results. Others proposed a combined index that either inserts terms into a spatial structure or adds a spatial structure to an inverted index. These benefit queries with highly-selective constraints that match the primary index structure but have limited effectiveness and pruning power otherwise. We propose a new indexing strategy that uniformly handles text, space and time in a single structure, and is thus able to efficiently evaluate queries that combine keywords with spatial and temporal constraints. We present a detailed experimental evaluation using real data sets which shows that our index not only attains substantially lower query processing times, but can also be constructed in a fraction of the time required by state-of-the-art approaches.

【Keywords】: kd-tree; spatial range query; spatial top-k query; spatio-temporal keyword index

16. Privacy-Preserving Reachability Query Services for Massive Networks.

【Paper Link】【Pages】:145-154

【Authors】: Jiaxin Jiang ; Peipei Yi ; Byron Choi ; Zhiwei Zhang ; Xiaohui Yu

【Abstract】: This paper studies privacy-preserving reachability query services under the paradigm of data outsourcing. Specifically, graph data have been outsourced to a third-party service provider (SP), query clients submit their queries to the SP, and the SP returns the query answers to the clients. However, the SP may not always be trustworthy. Hence, this paper investigates protecting the structural information of the graph data and the query answers from the SP. Existing techniques are either insecure or not scalable. This paper proposes a privacy-preserving labeling, called ppTopo. To our knowledge, ppTopo is the first work that can produce reachability index on massive networks and is secure against known plaintext attacks (KPA). Specifically, we propose a scalable index construction algorithm by employing the idea of topological folding, recently proposed by Cheng et al. We propose a novel asymmetric scalar product encryption in modulo 3 (ASPE3). It allows us to encrypt the index labels and transforms the queries into scalar products of encrypted labels. We perform an experimental study of the proposed technique on the SNAP networks. Compared with the existing methods, our results show that our technique is capable of producing the encrypted indexes at least 5 times faster for massive networks and the client's decryption time is 2-3 times smaller for most graphs.

【Keywords】: data and query privacies; graph databases; reachability queries

17. Sequential Query Expansion using Concept Graph.

【Paper Link】【Pages】:155-164

【Authors】: Saeid Balaneshinkordan ; Alexander Kotov

【Abstract】: Manually and automatically constructed concept graphs (or semantic networks), in which the nodes correspond to words or phrases and the typed edges designate semantic relationships between words and phrases, have been previously shown to be rich sources of effective latent concepts for query expansion. However, finding good expansion concepts for a given query in large and dense concept graphs is a challenging problem, since the number of candidate concepts that are related to query terms and phrases and need to be examined increases exponentially with the distance from the original query concepts. In this paper, we propose a two-stage feature-based method for sequential selection of the most effective concepts for query expansion from a concept graph. In the first stage, the proposed method weighs the concepts according to different types of computationally inexpensive features, including collection and concept graph statistics. In the second stage, a sequential concept selection algorithm utilizing more expensive features is applied to find the most effective expansion concepts at different distances from the original query concepts. Experiments on TREC datasets of different type indicate that the proposed method achieves significant improvement in retrieval accuracy over state-of-the-art methods for query expansion using concept graphs.

【Keywords】: feature-based IR models; query analysis; query expansion; semantic networks

Session 1e: Better Search 4

18. Learning Latent Vector Spaces for Product Search.

【Paper Link】【Pages】:165-174

【Authors】: Christophe Van Gysel ; Maarten de Rijke ; Evangelos Kanoulas

【Abstract】: We introduce a novel latent vector space model that jointly learns the latent representations of words, e-commerce products and a mapping between the two without the need for explicit annotations. The power of the model lies in its ability to directly model the discriminative relation between products and a particular word. We compare our method to existing latent vector space models (LSI, LDA and word2vec) and evaluate it as a feature in a learning to rank setting. Our latent vector space model achieves its enhanced performance as it learns better product representations. Furthermore, the mapping from words to products and the representations of words benefit directly from the errors propagated back from the product representations during parameter estimation. We provide an in-depth analysis of the performance of our model and analyze the structure of the learned representations.

【Keywords】: entity retrieval; latent space models; representation learning

19. Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model.

【Paper Link】【Pages】:175-184

【Authors】: Aleksandr Chuklin ; Maarten de Rijke

【Abstract】: Modern search engine result pages often provide immediate value to users and organize information in such a way that it is easy to navigate. The core ranking function contributes to this and so do result snippets, smart organization of result blocks and extensive use of one-box answers or side panels. While they are useful to the user and help search engines to stand out, such features present two big challenges for evaluation. First, the presence of such elements on a search engine result page (SERP) may lead to the absence of clicks, which is, however, not related to dissatisfaction, so-called 'good abandonments.' Second, the non-linear layout and visual difference of SERP items may lead to non-trivial patterns of user attention, which is not captured by existing evaluation metrics. In this paper we propose a model of user behavior on a SERP that jointly captures click behavior, user attention and satisfaction, the CAS model, and demonstrate that it gives more accurate predictions of user actions and self-reported satisfaction than existing models based on clicks alone. We use the CAS model to build a novel evaluation metric that can be applied to non-linear SERP layouts and that can account for the utility that users obtain directly on a SERP. We demonstrate that this metric shows better agreement with user-reported satisfaction than conventional evaluation metrics.

【Keywords】: click models; evaluation; good abandonment; mouse movement; user behavior

20. The Role of Relevance in Sponsored Search.

【Paper Link】【Pages】:185-194

【Authors】: Luca Maria Aiello ; Ioannis Arapakis ; Ricardo A. Baeza-Yates ; Xiao Bai ; Nicola Barbieri ; Amin Mantrach ; Fabrizio Silvestri

【Abstract】: Sponsored search aims at retrieving the advertisements that on the one hand meet users' intent reflected in their search queries, and on the other hand attract user clicks to generate revenue. Advertisements are typically ranked based on their expected revenue, which is computed as the product of their predicted probability of being clicked (i.e., clickability) and their advertiser-provided bid. The relevance of an advertisement to a user query is implicitly captured by the predicted clickability of the advertisement, assuming that relevant advertisements are more likely to attract user clicks. However, this approach easily biases the ranking toward advertisements having rich click history. This may incorrectly lead to showing irrelevant advertisements whose clickability is not accurately predicted due to lack of click history. Another side effect consists of never giving a chance to new advertisements that may be highly relevant to be printed due to their lack of click history. To address this problem, we explicitly measure the relevance between an advertisement and a query without relying on the advertisement's click history, and present different ways of leveraging this relevance to improve user search experience without reducing search engine revenue. Specifically, we propose a machine learning approach that solely relies on text-based features to measure the relevance between an advertisement and a query. We discuss how the introduced relevance can be used in four important use cases: pre-filtering of irrelevant advertisements, recovering advertisements with little history, improving clickability prediction, and re-ranking of the advertisements on the final search result page. Offline experiments using large-scale query logs and online A/B tests demonstrate the superiority of the proposed click-oblivious relevance model and the important roles that relevance plays in sponsored search.

【Keywords】: relevance in sponsored search; relevance model
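
The baseline ranking rule mentioned in the abstract above (expected revenue as click probability times bid) can be illustrated with a minimal sketch. The `Ad` type, its field names, and the sample values are hypothetical and chosen only for illustration; this is not the paper's relevance model.

```python
from typing import List, NamedTuple


class Ad(NamedTuple):
    ad_id: str      # hypothetical identifier
    p_click: float  # predicted probability of being clicked
    bid: float      # advertiser-provided bid


def rank_by_expected_revenue(ads: List[Ad]) -> List[Ad]:
    """Sort ads by expected revenue (p_click * bid), highest first."""
    return sorted(ads, key=lambda ad: ad.p_click * ad.bid, reverse=True)


# Made-up sample data: expected revenues are 0.10, 0.20 and 0.05.
ads = [Ad("a", 0.10, 1.00), Ad("b", 0.02, 10.00), Ad("c", 0.05, 1.00)]
ranked = rank_by_expected_revenue(ads)
print([ad.ad_id for ad in ranked])
```

In this toy data, ad "b" ranks first despite its low click probability because of its high bid, which shows why a ranking driven only by clickability and bid carries no explicit relevance signal.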

21. PowerWalk: Scalable Personalized PageRank via Random Walks with Vertex-Centric Decomposition.

【Paper Link】【Pages】:195-204

【Authors】: Qin Liu ; Zhenguo Li ; John C. S. Lui ; Jiefeng Cheng

【Abstract】: Most methods for Personalized PageRank (PPR) precompute and store all accurate PPR vectors, and at query time, return the ones of interest directly. However, the storage and computation of all accurate PPR vectors can be prohibitive for large graphs, especially in caching them in memory for real-time online querying. In this paper, we propose a distributed framework that strikes a better balance between offline indexing and online querying. The offline indexing attains a fingerprint of the PPR vector of each vertex by performing billions of "short" random walks in parallel across a cluster of machines. We prove that our indexing method has an exponential convergence, achieving the same precision as previous methods using a much smaller number of random walks. At query time, the new PPR vector is composed by a linear combination of related fingerprints, in a highly efficient vertex-centric decomposition manner. Interestingly, the resulting PPR vector is much more accurate than its offline counterpart because it actually uses more random walks in its estimation. More importantly, we show that such decomposition for a batch of queries can be very efficiently processed using a shared decomposition. Our implementation, PowerWalk, takes advantage of advanced distributed graph engines and it outperforms the state-of-the-art algorithms by orders of magnitude. Particularly, it responds to tens of thousands of queries on graphs with billions of edges in just a few seconds.

【Keywords】: personalized pagerank; random walks; vertex-centric decomposition
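
As background for the random-walk idea in the abstract above, the following is a minimal single-machine sketch of the standard Monte Carlo estimator for Personalized PageRank: each walk stops with probability `alpha` at every step, and the distribution of stopping vertices estimates the PPR vector of the source. It is not the PowerWalk algorithm (no fingerprint index, no vertex-centric decomposition, no distribution across machines); the function name, parameters, and toy graph are assumptions, and every vertex is assumed to have at least one out-neighbour.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List


def monte_carlo_ppr(graph: Dict[Hashable, List[Hashable]], source: Hashable,
                    alpha: float = 0.15, num_walks: int = 10000,
                    seed: int = 0) -> Dict[Hashable, float]:
    """Estimate the PPR vector of `source` from walk end points."""
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(num_walks):
        v = source
        # Stop with probability alpha; otherwise move to a random out-neighbour.
        while rng.random() >= alpha:
            v = rng.choice(graph[v])
        ends[v] += 1
    return {v: count / num_walks for v, count in ends.items()}


# Hypothetical toy graph given as adjacency lists.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ppr = monte_carlo_ppr(toy_graph, "a")
print(sorted(ppr.items()))
```

The estimate improves as `num_walks` grows, which is the cost that an offline index of precomputed walks is meant to amortize.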

Session 1f: Industry Session I 3

22. Building Industry-specific Knowledge Bases.

【Paper Link】【Pages】:205-206

【Authors】: Shivakumar Vaithyanathan

【Abstract】: Building industry-specific knowledge bases relies heavily on collecting and representing domain knowledge over time. Domain knowledge includes: (1) the logical schema, constraints and domain vocabulary of the application, (2) the models and algorithms to populate instances of that schema, and (3) the data necessary to build and maintain those models and algorithms. In IBM Watson we are using an ontology-driven approach for the creation and consumption of industry-specific knowledge bases. The creation of such knowledge bases involves well known building blocks: natural language processing, entity resolution, data transformation, etc. It is critical that the models and algorithms that implement these building blocks be transparent and optimizable for efficient execution. In this talk, I will describe the design of domain-specific languages (DSL) with specialized constructs that serve as target languages for learning these models and algorithms, and the generation of training data for scaling up the learning.

【Keywords】:

23. Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter.

【Paper Link】【Pages】:207-216

【Authors】: Xiaomo Liu ; Quanzhi Li ; Armineh Nourbakhsh ; Rui Fang ; Merine Thomas ; Kajsa Anderson ; Russ Kociuba ; Mark Vedder ; Steven Pomerville ; Ramdev Wudali ; Robert Martin ; John Duprey ; Arun Vachher ; William Keenan ; Sameena Shah

【Abstract】: News professionals are facing the challenge of discovering news from more diverse and unreliable information in the age of social media. More and more news events break on social media first and are picked up by news media subsequently. The recent Brussels attack is such an example. At Reuters, a global news agency, we have observed the necessity of providing a more effective tool that can help our journalists to quickly discover news on social media, verify it and then inform the public. In this paper, we describe Reuters Tracer, a system for sifting through all noise to detect news events on Twitter and assessing their veracity. We disclose the architecture of our system and discuss the various design strategies that facilitate the implementation of machine learning models for noise filtering and event detection. These techniques have been implemented at large scale and successfully discovered breaking news faster than traditional journalism.

【Keywords】: event detection & verification; noise filtering; twitter

24. Structural Clustering of Machine-Generated Mail.

【Paper Link】【Pages】:217-226

【Authors】: Noa Avigdor-Elgrabli ; Mark Cwalinski ; Dotan Di Castro ; Iftah Gamzu ; Irena Grabovitch-Zuyev ; Liane Lewin-Eytan ; Yoelle Maarek

【Abstract】: Several recent studies have presented different approaches for clustering and classifying machine-generated mail based on email headers. We propose to expand these approaches by considering email message bodies. We argue that our approach can help increase coverage and precision in several tasks, and is especially critical for mail extraction. Recall that mail extraction supports a variety of mail mining applications such as ad re-targeting, mail search, and mail summarization. We introduce new structural clustering methods that leverage the HTML structure that is common to messages generated by the same mass-sender script. We discuss how such structural clustering can be conducted at different levels of granularity, using either strict or flexible matching constraints, depending on the use cases. We present large scale experiments carried out over real Yahoo mail traffic. For our first use case of automatic mail extraction, we describe novel flexible-matching clustering methods that meet the key requirements of high intra-cluster similarity, adequate cluster size, and a relatively small overall number of clusters. We identify the precise level of flexibility that is needed in order to achieve extremely high extraction precision (close to 100%), while producing a relatively small number of clusters. For our second use case, namely, mail classification, we show that strict structural matching is more adequate, achieving precision and recall rates between 85%-90%, while converging to a stable classification after a short learning cycle. This represents an increase of 10%-20% compared to the sender-based method described in previous work, when run over the same period length. Our work has been deployed in production in the Yahoo mail backend.

【Keywords】: machine generated mail; mail classification; mail clustering; mail extraction; mail mining; mail structural clustering

Session 2a: Learning to Rank 4

25. LambdaFM: Learning Optimal Ranking with Factorization Machines Using Lambda Surrogates.

【Paper Link】【Pages】:227-236

【Authors】: Fajie Yuan ; Guibing Guo ; Joemon M. Jose ; Long Chen ; Haitao Yu ; Weinan Zhang

【Abstract】: State-of-the-art item recommendation algorithms, which apply Factorization Machines (FM) as a scoring function and pairwise ranking loss as a trainer (PRFM for short), have been recently investigated for the implicit feedback based context-aware recommendation problem (IFCAR). However, good recommenders particularly emphasize accuracy near the top of the ranked list, and typical pairwise loss functions might not match well with such a requirement. In this paper, we demonstrate, both theoretically and empirically, that PRFM models usually lead to non-optimal item recommendation results due to such a mismatch. Inspired by the success of LambdaRank, we introduce Lambda Factorization Machines (LambdaFM), which is particularly intended for optimizing ranking performance for IFCAR. We also point out that the original lambda function suffers from the issue of expensive computational complexity in such settings due to a large amount of unobserved feedback. Hence, instead of directly adopting the original lambda strategy, we create three effective lambda surrogates by conducting a theoretical analysis for lambda from the top-N optimization perspective. Further, we prove that the proposed lambda surrogates are generic and applicable to a large set of pairwise ranking loss functions. Experimental results demonstrate that LambdaFM significantly outperforms state-of-the-art algorithms on three real-world datasets in terms of four standard ranking measures.

【Keywords】: context-aware; factorization machines; lambdafm; pairwise ranking; prfm; top-n recommendation

26. Plackett-Luce Regression Mixture Model for Heterogeneous Rankings.

【Paper Link】【Pages】:237-246

【Authors】: Maksim Tkachenko ; Hady Wirawan Lauw

【Abstract】: Learning to rank is an important problem in many scenarios, such as information retrieval, natural language processing, recommender systems, etc. The objective is to learn a function that ranks a number of instances based on their features. In the vast majority of the learning to rank literature, there is an implicit assumption that the population of ranking instances is homogeneous, and thus can be modeled by a single central ranking function. In this work, we are concerned with learning to rank for a heterogeneous population, which may consist of a number of sub-populations, each of which may rank objects differently. Because these sub-populations are not known in advance, and are effectively latent, the problem turns into simultaneously learning both a set of ranking functions and the latent assignment of instances to functions. To address this problem in a joint manner, we develop a probabilistic graphical model called Plackett-Luce Regression Mixture or PLRM model, and describe its inference via the Expectation-Maximization algorithm. Comprehensive experiments on publicly-available real-life datasets showcase the effectiveness of PLRM, as opposed to a pipelined approach of clustering followed by learning to rank, as well as approaches that assume a single ranking function for a heterogeneous population.

【Keywords】: graphical model; heterogeneous ranking; learning to rank; mixture model; plackett-luce
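
For readers unfamiliar with the Plackett-Luce model named in the abstract above, here is a minimal sketch of its log-probability for one observed ranking: items are picked one at a time, each with probability proportional to the exponential of its score among the items not yet placed. This covers only the base model with given scores; the regression component, the mixture, and the EM inference of PLRM are not shown, and the function name is an assumption.

```python
import math
from typing import Sequence


def plackett_luce_log_prob(scores_in_rank_order: Sequence[float]) -> float:
    """Log-probability of a ranking; scores are listed from first to last place."""
    log_prob = 0.0
    for i, score in enumerate(scores_in_rank_order):
        remaining = scores_in_rank_order[i:]
        # Log-sum-exp over the items still available, shifted for stability.
        shift = max(remaining)
        log_norm = shift + math.log(sum(math.exp(s - shift) for s in remaining))
        log_prob += score - log_norm
    return log_prob


# With equal scores, every ordering of three items is equally likely.
print(math.exp(plackett_luce_log_prob([0.0, 0.0, 0.0])))
```

A mixture over several such models would weight these per-ranking probabilities by component membership, which is where an EM procedure like the one mentioned in the abstract comes in.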

27. Compression-Based Selective Sampling for Learning to Rank.

【Paper Link】【Pages】:247-256

【Authors】: Rodrigo M. Silva ; Guilherme de Castro Mendes Gomes ; Mário S. Alvim ; Marcos André Gonçalves

【Abstract】: Learning to rank (L2R) algorithms use a labeled training set to generate a ranking model that can be later used to rank new query results. These training sets are very costly and laborious to produce, requiring human annotators to assess the relevance or order of the documents in relation to a query. Active learning (AL) algorithms are able to reduce the labeling effort by actively sampling an unlabeled set and choosing data instances that maximize the effectiveness of a learning function. But AL methods require constant supervision, as documents have to be labeled at each round of the process. In this paper, we propose that certain characteristics of unlabeled L2R datasets allow for an unsupervised, compression-based selection process to be used to create small and yet highly informative and effective initial sets that can later be labeled and used to bootstrap a L2R system. We implement our ideas through a novel unsupervised selective sampling method, which we call Cover, that has several advantages over AL methods tailored to L2R. First, it does not need an initial labeled seed set and can select documents from scratch. Second, selected documents do not need to be labeled as the iterations of the method progress since it is unsupervised (i.e., no learning model needs to be updated). Thus, an arbitrarily sized training set can be selected without human intervention depending on the available budget. Third, the method is efficient and can be run on unlabeled collections containing millions of query-document instances. We run various experiments with two important L2R benchmarking collections to show that the proposed method allows for the creation of small, yet very effective training sets. It achieves full training-like performance with less than 10% of the original sets selected, outperforming the baselines in both effectiveness and scalability.

【Keywords】: active learning; compression; learning to rank; selective sampling

28. Incorporating Risk-Sensitiveness into Feature Selection for Learning to Rank.

【Paper Link】【Pages】:257-266

【Authors】: Daniel Xavier de Sousa ; Sérgio Daniel Canuto ; Thierson Couto Rosa ; Wellington Santos Martins ; Marcos André Gonçalves

【Abstract】: Learning to Rank (L2R) is currently an essential task in basically all types of information systems given the huge and ever increasing amount of data made available. While many solutions have been proposed to improve L2R functions, relatively little attention has been paid to the task of improving the quality of the feature space. L2R strategies usually rely on dense feature representations, which contain noisy or redundant features, increasing the cost of the learning process, without any benefits. Although feature selection (FS) strategies can be applied to reduce dimensionality and noise, side effects of such procedures have been neglected, such as the risk of getting very poor predictions in a few (but important) queries. In this paper we propose multi-objective FS strategies that optimize both aspects at the same time: ranking performance and risk-sensitive evaluation. For this, we approximate the Pareto-optimal set for multi-objective optimization in a new and original application to L2R. Our contributions include novel FS methods for L2R which optimize multiple, potentially conflicting, criteria. In particular, one of the objectives (risk-sensitive evaluation) has never been optimized in the context of FS for L2R before. Our experimental evaluation shows that our proposed methods select features that are more effective (ranking performance) and low-risk than those selected by other state-of-the-art FS methods.

【Keywords】: feature selection; learning to rank; risk-sensitiveness

Session 2b: Question Answering 4

【Paper Link】【Pages】:267-276

【Authors】: Laure Soulier ; Lynda Tamine ; Gia-Hung Nguyen

【Abstract】: In this paper, we specifically consider the challenging task of solving a question posted on Twitter. The latter generally remains unanswered and most of the replies, if any, are only from members of the questioner's neighborhood. As outlined in previous work related to community Q&A, we believe that question-answering is a collaborative process and that the relevant answer to a question post is an aggregation of answer nuggets posted by a group of relevant users. Thus, the problem of identifying the relevant answer turns into the problem of identifying the right group of users who would provide useful answers and would possibly be willing to collaborate together in the long-term. Accordingly, we present a novel method, called CRAQ, that is built on the collaboration paradigm and formulated as a group entropy optimization problem. To optimize the quality of the group, an information gain measure is used to select the most likely "informative" users according to topical and collaboration likelihood predictive features. Crowd-based experiments performed on two crisis-related Twitter datasets demonstrate the effectiveness of our collaborative-based answering approach.

【Keywords】: collaborative group recommendation; social information retrieval; social network question-answering

30. Learning to Extract Conditional Knowledge for Question Answering using Dialogue.

【Paper Link】【Pages】:277-286

【Authors】: Pengwei Wang ; Lei Ji ; Jun Yan ; Lianwen Jin ; Wei-Ying Ma

【Abstract】: Knowledge based question answering (KBQA) has attracted much attention from both academia and industry in the field of Artificial Intelligence. However, many existing knowledge bases (KBs) are built by static triples. It is hard to answer user questions with different conditions, which will lead to significant answer variances in questions with similar intent. In this work, we propose to extract conditional knowledge base (CKB) from user question-answer pairs for answering user questions with different conditions through dialogue. Given a subject, we first learn user question patterns and conditions. Then we propose an embedding based co-clustering algorithm to simultaneously group the patterns and conditions by leveraging the answers as supervisor information. After that, we extract the answers to questions conditioned on both question pattern clusters and condition clusters as a CKB. As a result, when users ask a question without clearly specifying the conditions, we use dialogues in natural language to chat with users for question specification and answer retrieval. Experiments on real question answering (QA) data show that the dialogue model using automatically extracted CKB can more accurately answer user questions and significantly improve user satisfaction for questions with missing conditions.

【Keywords】: conditional knowledge base; dialogue; knowledge based question answering

31. aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model.

【Paper Link】【Pages】:287-296

【Authors】: Liu Yang ; Qingyao Ai ; Jiafeng Guo ; W. Bruce Croft

【Abstract】: As an alternative to question answering methods based on feature engineering, deep learning approaches such as convolutional neural networks (CNNs) and Long Short-Term Memory Models (LSTMs) have recently been proposed for semantic matching of questions and answers. To achieve good results, however, these models have been combined with additional features such as word overlap or BM25 scores. Without this combination, these models perform significantly worse than methods based on linguistic feature engineering. In this paper, we propose an attention based neural matching model for ranking short answer text. We adopt value-shared weighting scheme instead of position-shared weighting scheme for combining different matching signals and incorporate question term importance learning using question attention network. Using the popular benchmark TREC QA data, we show that the relatively simple aNMM model can significantly outperform other neural network models that have been used for the question answering task, and is competitive with models that are combined with additional features. When aNMM is combined with additional features, it outperforms all baselines.

【Keywords】: deep learning; question answering; term importance learning; value-shared weights

32. Medical Question Answering for Clinical Decision Support.

【Paper Link】【Pages】:297-306

【Authors】: Travis R. Goodwin ; Sanda M. Harabagiu

【Abstract】: The goal of modern Clinical Decision Support (CDS) systems is to provide physicians with information relevant to their management of patient care. When faced with a medical case, a physician asks questions about the diagnosis, the tests, or treatments that should be administered. Recently, the TREC-CDS track has addressed this challenge by evaluating results of retrieving relevant scientific articles where the answers of medical questions in support of CDS can be found. Although retrieving relevant medical articles instead of identifying the answers was believed to be an easier task, state-of-the-art results are not yet sufficiently promising. In this paper, we present a novel framework for answering medical questions in the spirit of TREC-CDS by first discovering the answer and then selecting and ranking scientific articles that contain the answer. Answer discovery is the result of probabilistic inference which operates on a probabilistic knowledge graph, automatically generated by processing the medical language of large collections of electronic medical records (EMRs). The probabilistic inference of answers combines knowledge from medical practice (EMRs) with knowledge from medical research (scientific articles). It also takes into account the medical knowledge automatically discerned from the medical case description. We show that this novel form of medical question answering (Q/A) produces very promising results in (a) accurately identifying the answers and (b) improving medical article ranking by 40%.

【Keywords】: clinical decision support; medical information retrieval; question answering

Session 2c: Wikipedia 4

33. Error Link Detection and Correction in Wikipedia.

【Paper Link】【Pages】:307-316

【Authors】: Chengyu Wang ; Rong Zhang ; Xiaofeng He ; Aoying Zhou

【Abstract】: The hyperlink structure of Wikipedia forms a rich semantic network connecting entities and concepts, making it a valuable source for knowledge harvesting. Wikipedia, as crowd-sourced data, faces various data quality issues which significantly impact knowledge systems depending on it as the information source. One such issue occurs when an anchor text in a Wikipage links to a wrong Wikipage, causing the error link problem. While much of previous work has focused on leveraging Wikipedia for entity linking, little has been done to detect error links. In this paper, we address the error link problem, and propose algorithms to detect and correct error links. We introduce an efficient method to generate candidate error links based on iterative ranking in an Anchor Text Semantic Network. This greatly reduces the problem space. A more accurate pairwise learning model is used to detect error links from the reduced candidate error link set, while suggesting correct links at the same time. This approach is effective when data sparsity is a challenging issue. The experiments on both English and Chinese Wikipedia illustrate the effectiveness of our approach. We also provide a preliminary analysis on possible causes of error links in English and Chinese Wikipedia.

【Keywords】: error link; linkrank; pairwise learning; wikipedia

34. Using Prerequisites to Extract Concept Maps from Textbooks.

【Paper Link】【Pages】:317-326

【Authors】: Shuting Wang ; Alexander Ororbia ; Zhaohui Wu ; Kyle Williams ; Chen Liang ; Bart Pursel ; C. Lee Giles

【Abstract】: We present a framework for constructing a specific type of knowledge graph, a concept map from textbooks. Using Wikipedia, we derive prerequisite relations among these concepts. A traditional approach for concept map extraction consists of two sub-problems: key concept extraction and concept relationship identification. Previous work for the most part had considered these two sub-problems independently. We propose a framework that jointly optimizes these sub-problems and investigates methods that identify concept relationships. Experiments on concept maps that are manually extracted in six educational areas (computer networks, macroeconomics, precalculus, databases, physics, and geometry) show that our model outperforms supervised learning baselines that solve the two sub-problems separately. Moreover, we observe that incorporating textbook information helps with concept map extraction.

【Keywords】: concept maps; open education; textbooks; web knowledge

35. Vandalism Detection in Wikidata.

【Paper Link】【Pages】:327-336

【Authors】: Stefan Heindorf ; Martin Potthast ; Benno Stein ; Gregor Engels

【Abstract】: Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and various other kinds of information systems, imposing high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it frequently gets vandalized, exposing all information systems using it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015 and it achieves an area under curve value of the receiver operating characteristic, ROC-AUC, of 0.991. It significantly outperforms the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).

【Keywords】: data quality; knowledge base; trust; vandalism

36. Finding News Citations for Wikipedia.

【Paper Link】【Pages】:337-346

【Authors】: Besnik Fetahu ; Katja Markert ; Wolfgang Nejdl ; Avishek Anand

【Abstract】: An important editing policy in Wikipedia is to provide citations for added statements in Wikipedia pages, where statements can be arbitrary pieces of text, ranging from a sentence to a paragraph. In many cases citations are either outdated or missing altogether. In this work we address the problem of finding and updating news citations for statements in entity pages. We propose a two-stage supervised approach for this problem. In the first step, we construct a classifier to find out whether statements need a news citation or other kinds of citations (web, book, journal, etc.). In the second step, we develop a news citation algorithm for Wikipedia statements, which recommends appropriate citations from a given news collection. Apart from IR techniques that use the statement to query the news collection, we also formalize three properties of an appropriate citation, namely: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should be from an authoritative source. We perform an extensive evaluation of both steps, using 20 million articles from a real-world news collection. Our results are quite promising, and show that we can perform this task with high precision and at scale.

【Keywords】: missing citations; news citations; wikipedia; wikipedia enrichment

Session 2d: Clustering 4

37. SemiNMF-PCA framework for Sparse Data Co-clustering.

【Paper Link】【Pages】:347-356

【Authors】: Kais Allab ; Lazhar Labiod ; Mohamed Nadif

【Abstract】: Several studies have demonstrated the importance of co-clustering which aims to cluster simultaneously the sets of objects and features. The co-clustering is often more effective than one-side clustering, especially when considering sparse high dimensional data. In this paper, we propose a novel way to consider the co-clustering and the reduction of the dimension simultaneously. Our approach takes advantage of the mutual reinforcement between Principal Component Analysis (PCA) which provides a low-dimensional representation of data and Semi-Nonnegative Matrix Factorization (SemiNMF) that learns this low-dimensional representation and lends itself to a co-clustering interpretation. In other words, the proposed framework aims to find an optimal subspace of multi-dimensional variables for effectively identifying a partition of the set of objects. We show that by doing so, our model is able to learn low-dimensional representations that are better suited for co-clustering, outperforming not only spectral methods, but also co-clustering graph-regularized-based methods.

【Keywords】: co-clustering; dimensionality reduction; locality preserving; matrix factorization

38. Effective and Efficient Spectral Clustering on Text and Link Data.

【Paper Link】【Pages】:357-366

【Authors】: Zhiqiang Xu ; Yiping Ke

【Abstract】: Clustering text and link data, as an important task in text and link analysis, aims at finding communities of linked documents by leveraging the information from both domains. Due to its improved performance over the single domain counterpart, it has attracted increasing attention from practitioners in recent years. Despite its popularity, all existing algorithms on clustering text and link data overlook the existence of domain-specific distinctions and thus result in unsatisfactory clustering quality. In this paper, we address this limitation by explicitly modeling the domain-specific distinctions in the clustering process. Specifically, we extend the idea of consensus and domain-specific subspace decomposition from flat data to graph data. Such a modeling, when coupled with a regularization to further sharpen the information distinction, makes the consensus information between text and link more accurate for clustering with both domains. The final model is cast into the spectral clustering model by imposing the subspace orthogonality. To eschew the costly eigen-decomposition required for spectral clustering and further speed-up the optimization, we take advantage of the data sparsity and the low dimensionality of subspaces, and deploy a constraint-preserving gradient method to efficiently solve the model. The experimental study on three real datasets shows that our algorithm consistently and significantly outperforms the state-of-the-art relevant algorithms in terms of both quality and efficiency.

【Keywords】: efficiency; spectral clustering; text and link analysis

39. Robust Spectral Ensemble Clustering.

【Paper Link】【Pages】:367-376

【Authors】: Zhiqiang Tao ; Hongfu Liu ; Sheng Li ; Yun Fu

【Abstract】: Ensemble Clustering (EC) aims to integrate multiple Basic Partitions (BPs) of the same dataset into a consensus one. It could be transformed as a graph partition problem on the co-association matrix derived from BPs. However, existing EC methods usually directly use the co-association matrix, yet without considering various noises (e.g., the disagreement between different BPs or outliers) that may exist in it. These noises can impair the cluster structure of a co-association matrix and thus degrade the final clustering performance. In this paper, we propose a novel Robust Spectral Ensemble Clustering (RSEC) approach to address this challenge. First, RSEC learns a robust representation for the co-association matrix through low-rank constraint, which reveals the cluster structure of a co-association matrix and captures various noises in it. Second, RSEC finds the consensus partition by conducting spectral clustering. These two steps are iteratively performed in a unified optimization framework. Most importantly, during our optimization process, we utilize consensus partition to iteratively enhance the block-diagonal structure of the learned representation to further assist the clustering process. Experiments on numerous real-world datasets demonstrate the effectiveness of our method compared with the state-of-the-art. Moreover, several impact factors that may affect the clustering performance of our approach are also explored extensively.

【Keywords】: co-association matrix; ensemble clustering; low-rank representation; spectral clustering

40. Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval.

【Paper Link】【Pages】:377-386

【Authors】: Xin Jin ; Daniel Agun ; Tao Yang ; Qinghao Wu ; Yifan Shen ; Susen Zhao

【Abstract】: The previous two-phase method for searching versioned documents seeks a cost tradeoff by using non-positional information to rank document versions first. The second phase then re-ranks top document versions using positional information with fragment-based index compression. This paper proposes an alternative approach that uses cluster-based retrieval to quickly narrow the search scope guided by version representatives at Phase 1 and develops a hybrid index structure with adaptive runtime data traversal to speed up Phase 2 search. The hybrid scheme exploits the advantages of forward index and inverted index based on the term characteristics to minimize the time in extracting positional and other feature information during runtime search. This paper compares several indexing and data traversal options with different time and space tradeoffs and describes evaluation results to demonstrate their effectiveness. The experimental results show that the proposed scheme can be up to about 4x as fast as the previous work on solid state drives while retaining good relevance.

【Keywords】: positional inverted index; query processing; search in document archives; versioned data

Session 2e: Understanding Text 4

【Paper Link】【Pages】:387-396

【Authors】: Zhaochun Ren ; Oana Inel ; Lora Aroyo ; Maarten de Rijke

【Abstract】: A viewpoint is a triple consisting of an entity, a topic related to this entity and sentiment towards this topic. In time-aware multi-viewpoint summarization one monitors viewpoints for a running topic and selects a small set of informative documents. In this paper, we focus on time-aware multi-viewpoint summarization of multilingual social text streams. Viewpoint drift, ambiguous entities and multilingual text make this a challenging task. Our approach includes three core ingredients: dynamic viewpoint modeling, cross-language viewpoint alignment, and, finally, multi-viewpoint summarization. Specifically, we propose a dynamic latent factor model to explicitly characterize a set of viewpoints through which entities, topics and sentiment labels during a time interval are derived jointly; we connect viewpoints in different languages by using an entity-based semantic similarity measure; and we employ an update viewpoint summarization strategy to generate a time-aware summary to reflect viewpoints. Experiments conducted on a real-world dataset demonstrate the effectiveness of our proposed method for time-aware multi-viewpoint summarization of multilingual social text streams.

【Keywords】: dynamic viewpoint modeling; multi-viewpoint summarization; multilingual social text streams; topic modeling

【Paper Link】【Pages】:397-406

【Authors】: Hao Zhuang ; Rameez Rahman ; Xia Hu ; Tian Guo ; Pan Hui ; Karl Aberer

【Abstract】: While social data is being widely used in various applications such as sentiment analysis and trend prediction, its sheer size also presents great challenges for storing, sharing and processing such data. These challenges can be addressed by data summarization which transforms the original dataset into a smaller, yet still useful, subset. Existing methods find such subsets with objective functions based on data properties such as representativeness or informativeness but do not exploit social contexts, which are distinct characteristics of social data. Further, to date very little work has focused on topic-preserving data summarization, despite the abundant work on topic modeling. This is a challenging task for two reasons. First, since topic models are based on latent variables, existing methods are not well-suited to capture latent topics. Second, it is difficult to find social contexts that provide valuable information for building an effective topic-preserving summarization model. To tackle these challenges, in this paper, we focus on exploiting social contexts to summarize social data while preserving topics in the original dataset. We take Twitter data as a case study. Through analyzing Twitter data, we discover two social contexts which are important for topic generation and dissemination, namely (i) CrowdExp topic score that captures the influence of both the crowd and the expert users in Twitter and (ii) Retweet topic score that captures the influence of Twitter users' actions. We conduct extensive experiments on two real-world Twitter datasets using two applications. The experimental results show that, by leveraging social contexts, our proposed solution can enhance topic-preserving data summarization and improve application performance by up to 18%.

【Keywords】: data summarization; social context; submodular optimization; topic model

43. Understanding Sparse Topical Structure of Short Text via Stochastic Variational-Gibbs Inference.

【Paper Link】【Pages】:407-416

【Authors】: Tianyi Lin ; Siyuan Zhang ; Hong Cheng

【Abstract】: With the soaring popularity of online social media like Twitter, analyzing short text has emerged as an increasingly important task which is challenging to classical topic models, as topic sparsity exists in short text. Topic sparsity refers to the observation that an individual document usually concentrates on several salient topics, which may be rare in the entire corpus. Understanding this sparse topical structure of short text has been recognized as the key ingredient for mining user-generated Web content and social media, which are featured in the form of extremely short posts and discussions. However, the existing sparsity-enhanced topic models all assume an over-complicated generative process, which severely limits their scalability and makes them unable to automatically infer the number of topics from data. In this paper, we propose a probabilistic Bayesian topic model, namely Sparse Dirichlet mixture Topic Model (SparseDTM), based on an Indian Buffet Process (IBP) prior, and infer our model on large text corpora through a novel inference procedure called stochastic variational-Gibbs inference. Unlike prior work, the proposed approach is able to achieve exact sparse topical structure of large short text collections, and automatically identify the number of topics with a good balance between completeness and homogeneity of topic coherence. Experiments on different genres of large text corpora demonstrate that our approach outperforms various existing sparse topic models. The improvement is significant on large-scale collections of short text.

【Keywords】: indian buffet process; short text; sparse topical structure; stochastic variational-gibbs inference; topic modeling

44. Annotating Points of Interest with Geo-tagged Tweets.

【Paper Link】【Pages】:417-426

【Authors】: Kaiqi Zhao ; Gao Cong ; Aixin Sun

【Abstract】: Microblogging services like Twitter contain abundant user-generated content covering a wide range of topics. Many of the tweets can be associated with real-world entities to provide additional information about the latter. In this paper, we aim to associate tweets that are semantically related to real-world locations or Points of Interest (POIs). Tweets contain dynamic and real-time information while POIs contain relatively static information. The tweets associated with POIs provide complementary information for many applications like opinion mining and POI recommendation; the associated POIs can also be used as POI tags in Twitter. We define the research problem of annotating POIs with tweets and propose a novel supervised Bayesian Model (sBM). The model takes into account textual and spatial features and user behaviors together with the supervised information of whether a tweet is POI-related. It is able to capture user interests in latent regions for the prediction of whether a tweet is POI-related and the association between the tweet and its most semantically related POI. On tweets and POIs collected for two cities (New York City and Singapore), we demonstrate the effectiveness of our models against baseline methods.

【Keywords】: POI annotation; bayesian model; regression

Session 2f: Industry Session II 3

45. Duer: Intelligent Personal Assistant.

【Paper Link】【Pages】:427

【Authors】: Haifeng Wang

【Abstract】: The intelligent personal assistant is widely recognized as a more natural and efficient way of human-computer interaction, which has attracted extensive interest from both academia and industry. In this talk, I describe Duer, Baidu's intelligent personal assistant. In particular, I would like to focus on the following three features. Firstly, Duer comprehensively understands people's requirements via multiple channels, including not only explicit utterances, but also user models and rich contexts. Duer's user models are learnt from users' interaction history, and the rich contexts consist of temporal and geographical information, as well as the foregoing dialogues. Secondly, Duer meets diverse requirements with a range of instruments, such as chatting, information provision, reminder service, etc. These instruments are implemented based on mining the big data of web pages, applications, and user logs, which are then seamlessly integrated in the dialogue flow. Thirdly, Duer features multi-modal interaction, which allows people to interact with it by means of texts, speech, and images. We believe the above features will enable Duer to become a better and distinguished intelligent assistant for each of you.

【Keywords】: intelligent assistant

46. Measuring Metrics.

【Paper Link】【Pages】:429-437

【Authors】: Pavel Dmitriev ; Xian Wu

【Abstract】: You get what you measure, and you can't manage what you don't measure. Metrics are a powerful tool used in organizations to set goals, decide which new products and features should be released to customers, which new tests and experiments should be conducted, and how resources should be allocated. To a large extent, metrics drive the direction of an organization, and getting metrics 'right' is one of the most important and difficult problems an organization needs to solve. However, creating good metrics that capture long-term company goals is difficult. They try to capture abstract concepts such as success, delight, loyalty, engagement, life-time value, etc. How can one determine that a metric is a good one? Or, that one metric is better than another? In other words, how do we measure the quality of metrics? Can the evaluation process be automated so that anyone with an idea of a new metric can quickly evaluate it? In this paper we describe the metric evaluation system deployed at Bing, where we have been working on designing and improving metrics for over five years. We believe that by applying a data driven approach to metric evaluation we have been able to substantially improve our metrics and, as a result, ship better features and improve search experience for Bing's users.

【Keywords】: a/b testing; measurement; online experimentation; quality; search metrics

47. City-Scale Localization with Telco Big Data.

【Paper Link】【Pages】:439-448

【Authors】: Fangzhou Zhu ; Chen Luo ; Mingxuan Yuan ; Yijian Zhu ; Zhengqing Zhang ; Tao Gu ; Ke Deng ; Weixiong Rao ; Jia Zeng

【Abstract】: It is still challenging in the telecommunication (telco) industry to accurately locate mobile devices (MDs) at city-scale using the measurement report (MR) data, which measure parameters of radio signal strengths when MDs connect with base stations (BSs) in telco networks for making/receiving calls or mobile broadband (MBB) services. In this paper, we find that the widely-used location based services (LBSs) have accumulated lots of over-the-top (OTT) global positioning system (GPS) data in telco networks, which can be automatically used as training labels for learning accurate MR-based positioning systems. Benefiting from these telco big data, we deploy a context-aware coarse-to-fine regression (CCR) model in a Spark/Hadoop-based telco big data platform for city-scale localization of MDs with two novel contributions. First, we design map-matching and interpolation algorithms to encode contextual information of road networks. Second, we build a two-layer regression model to capture coarse-to-fine contextual features in a short time window for improved localization performance. In our experiments, we collect 108 GPS-associated MR records in the centroid of Shanghai city with 12 x 11 square kilometers for 30 days, and measure four important properties of real-world MR data related to localization errors: stability, sensitivity, uncertainty and missing values. The proposed CCR works well under different properties of MR data and achieves a mean error of 110m and a median error of 80m, outperforming the state-of-the-art range-based and fingerprinting localization methods.

【Keywords】: localization; regression models; telco big data

Session 3a: Graph Analytics 4

48. Approximating Graph Pattern Queries Using Views.

【Paper Link】【Pages】:449-458

【Authors】: Jia Li ; Yang Cao ; Xudong Liu

【Abstract】: This paper studies approximation of graph pattern queries using views. Given a pattern query Q and a set V of views, we propose to find a pair of queries Qu and Ql, referred to as the upper and lower approximations of Q w.r.t. V, such that (a) for any data graph G, answers to (part of) Q in G are contained in Qu(G) and contain Ql(G); and (b) both Qu and Ql can be answered by using views in V. We consider pattern queries based on both graph simulation and subgraph isomorphism. We study fundamental problems about approximation using views. Given Q and V, (1) do there exist upper and lower approximations of Q w.r.t. V? (2) How to find approximations that are closest to Q w.r.t. V, if they exist? (3) How to answer upper and lower approximations using views in V? We give characterizations of the problems, study their complexity and approximation-hardness, and develop algorithms with provable bounds. Using real-life datasets, we verify the effectiveness and efficiency of approximating simulation and subgraph queries using views.

【Keywords】: pattern matching; query approximation; views

49. Group-Aware Weighted Bipartite B-Matching.

【Paper Link】【Pages】:459-468

【Authors】: Cheng Chen ; Sean Chester ; Venkatesh Srinivasan ; Kui Wu ; Alex Thomo

【Abstract】: The weighted bipartite B-matching (WBM) problem models a host of data management applications, ranging from recommender systems to Internet advertising and e-commerce. Many of these applications, however, demand versatile assignment constraints, which WBM is weak at modelling. In this paper, we investigate powerful generalisations of WBM. We first show that a recent proposal for conflict-aware WBM by Chen et al. is hard to approximate by reducing their problem from Maximum Weight Independent Set. We then propose two related problems, collectively called group-aware WBM. For the first problem, which constrains the degree of groups of vertices, we show that a linear programming formulation produces a Totally Unimodular (TU) matrix and is thus polynomial-time solvable. Nonetheless, we also give a simple greedy algorithm subject to a 2-extendible system that scales to higher workloads. For the second problem, which instead limits the budget of groups of vertices, we prove its NP-hardness but again give a greedy algorithm with an approximation guarantee. Our experimental evaluation reveals that the greedy algorithms vastly outperform their theoretical guarantees and scale to bipartite graphs with more than eleven million edges.

【Keywords】: bipartite graphs; linear programming; matchings; np-hardness; submodular systems

50. Growing Graphs from Hyperedge Replacement Graph Grammars.

【Paper Link】【Pages】:469-478

【Authors】: Salvador Aguiñaga ; Rodrigo Palácios ; David Chiang ; Tim Weninger

【Abstract】: Discovering the underlying structures present in large real world graphs is a fundamental scientific problem. In this paper we show that a graph's clique tree can be used to extract a hyperedge replacement grammar. If we store an ordering from the extraction process, the extracted graph grammar is guaranteed to generate an isomorphic copy of the original graph. Or, a stochastic application of the graph grammar rules can be used to quickly create random graphs. In experiments on large real world networks, we show that random graphs, generated from extracted graph grammars, exhibit a wide range of properties that are very similar to the original graphs. In addition to graph properties like degree or eigenvector centrality, what a graph "looks like" ultimately depends on small details in local graph substructures that are difficult to define at a global level. We show that our generative graph model is able to preserve these local substructures when generating new graphs and performs well on new and difficult tests of model robustness.

【Keywords】: graph generation; graph mining; hyperedge replacement grammar

51. GiraphAsync: Supporting Online and Offline Graph Processing via Adaptive Asynchronous Message Processing.

【Paper Link】【Pages】:479-488

【Authors】: Yuqiong Liu ; Chang Zhou ; Jun Gao ; Zhiguo Fan

【Abstract】: It is highly desirable for existing distributed graph processing systems to support both offline analytics and online queries adaptively. Existing offline graph analytics systems are mostly based on the synchronous model. Although they achieve high throughput, they suffer relatively high latency in answering simple queries due to synchronization overhead and slow convergence. On the other hand, online graph query systems adopting the asynchronous model can respond at any time, but incur an overwhelming number of messages and network packets, making them unable to meet the high throughput demand of offline analytics. In this work, we propose an adaptive asynchronous message processing (AAMP) method, which improves the efficiency of network communication while maintaining low latency, to efficiently support offline analytics and online queries in one graph processing framework. We then design GiraphAsync, an implementation of AAMP on top of Apache Giraph, and evaluate it using several representative offline analytics and online queries on large graph datasets. Experimental results show that GiraphAsync gains an up to 10X improvement over synchronous model systems for graph analytics, while performing as well as specialized systems for online graph queries.

【Keywords】: adaptive message batching; asynchronous message processing; graph processing system; priority scheduler

Session 3b: Event Detection and Analytics 4

52. Graph Topic Scan Statistic for Spatial Event Detection.

【Paper Link】【Pages】:489-498

【Authors】: Yu Liu ; Baojian Zhou ; Feng Chen ; David W. Cheung

【Abstract】: Spatial event detection is an important and challenging problem. Unlike traditional event detection that focuses on the timing of global urgent events, the task of spatial event detection is to detect the spatial regions (e.g. clusters of neighboring cities) where urgent events occur. In this paper, we focus on the problem of spatial event detection using textual information in social media. We observe that, when a spatial event occurs, the topics relevant to the event are often discussed more coherently in cities near the event location than those far away. In order to capture this pattern, we propose a new method called Graph Topic Scan Statistic (Graph-TSS) that corresponds to a generalized log-likelihood ratio test based on topic modeling. We first demonstrate that the detection of spatial event regions under Graph-TSS is NP-hard due to a reduction from the classical node-weighted prize-collecting Steiner tree problem (NW-PCST). We then design an efficient algorithm that approximately maximizes the graph topic scan statistic over spatial regions of arbitrary form. As a case study, we consider three applications using Twitter data, including Argentina civil unrest event detection, Chile earthquake detection, and United States influenza disease outbreak detection. Empirical evidence demonstrates that the proposed Graph-TSS is superior to state-of-the-art methods in both running time and accuracy.

【Keywords】: large graph; scan statistic; spatial event detection; topic model

53. A Nonparametric Model for Event Discovery in the Geospatial-Temporal Space.

【Paper Link】【Pages】:499-508

【Authors】: Jinjin Guo ; Zhiguo Gong

【Abstract】: The availability of geographically and temporally tagged documents enables many location and time based mining tasks. Event discovery is one such task, which is to identify interesting happenings in the geographical and temporal space. In recent years, several techniques have been proposed. However, no existing work has provided a nonparametric algorithm for detecting events in the joint space crossing geographical and temporal dimensions. Furthermore, though some prior works proposed to capture the periodicities of topics in their solutions, some restrictions on the temporal patterns are often placed and they usually ignore the spatial patterns of the topics. To break through such limitations, in this paper we propose a novel nonparametric model to identify events in the geographical and temporal space, where any recurrent patterns of events can be automatically captured. In our approach, parameters are automatically determined by exploiting a Dirichlet Process. To reduce the influence of noisy terms in the detection, we distinguish a term's event role from its background role using a Bernoulli model in the solution. Experimental results on three real world datasets show the proposed algorithm outperforms previous state-of-the-art approaches.

【Keywords】: dirichlet process; event discovery; geographical-temporal space; probabilistic graphical model

54. A Multiple Instance Learning Framework for Identifying Key Sentences and Detecting Events.

【Paper Link】【Pages】:509-518

【Authors】: Wei Wang ; Yue Ning ; Huzefa Rangwala ; Naren Ramakrishnan

【Abstract】: State-of-the-art event encoding approaches rely on sentence or phrase level labeling, which are both time consuming and infeasible to extend to large scale text corpora and emerging domains. Using a multiple instance learning approach, we take advantage of the fact that while labels at the sentence level are difficult to obtain, they are relatively easy to gather at the document level. This enables us to view the problems of event detection and extraction in a unified manner. Using distributed representations of text, we develop a multiple instance formulation that simultaneously classifies news articles and extracts sentences indicative of events without any engineered features. We evaluate our model in its ability to detect news articles about civil unrest events (from Spanish text) across ten Latin American countries and identify the key sentences pertaining to these events. Our model, trained without annotated sentence labels, yields performance that is competitive with selected state-of-the-art models for event detection and sentence identification. Additionally, qualitative experimental results show that the extracted event-related sentences are informative and enhance various downstream applications such as article summarization, visualization, and event encoding.

【Keywords】: cnn; deep learning; event detection; information extraction; mil

55. PairFac: Event Analytics through Discriminant Tensor Factorization.

【Paper Link】【Pages】:519-528

【Authors】: Xidao Wen ; Yu-Ru Lin ; Konstantinos Pelechrinis

【Abstract】: The study of disaster events and their impact in the urban space has been traditionally conducted through manual collections and analysis of surveys, questionnaires and authority documents. While there have been increasingly rich troves of human behavioral data related to the events of interest, the ability to obtain hindsight following a disaster event has not been scaled up. In this paper, we propose a novel approach for analyzing events called PairFac. PairFac utilizes discriminant tensor analysis to automatically discover the impact of a major event from rich human behavioral data. Our method aims to (i) uncover the persistent patterns across multiple interrelated aspects of urban behavior (e.g., when, where and what citizens do in a city) and at the same time (ii) identify the salient changes following a potentially impactful event. We show the effectiveness of PairFac in comparison with previous methods through extensive experiments. We also demonstrate the advantages of our approach through case studies with real-world traffic sensor data and social media streams surrounding the 2015 terrorist attacks in Paris. Our work has both methodological contributions in studying the impact of an external stimulus on a system as well as practical implications in the area of disaster event analysis and assessment.

【Keywords】: event analytics; tensor factorization; urban computing

Session 3c: Crowdsourcing 4

56. Active Content-Based Crowdsourcing Task Selection.

【Paper Link】【Pages】:529-538

【Authors】: Piyush Bansal ; Carsten Eickhoff ; Thomas Hofmann

【Abstract】: Crowdsourcing has long established itself as a viable alternative to corpus annotation by domain experts for tasks such as document relevance assessment. The crowdsourcing process traditionally relies on high degrees of label redundancy in order to mitigate the detrimental effects of individually noisy worker submissions. Such redundancy comes at the cost of increased label volume, and, subsequently, monetary requirements. In practice, especially as the size of datasets increases, this is undesirable. In this paper, we focus on an alternate method that exploits document information instead, to infer relevance labels for unjudged documents. We present an active learning scheme for document selection that aims at maximising the overall relevance label prediction accuracy, for a given budget of available relevance judgements by exploiting system-wide estimates of label variance and mutual information. Our experiments are based on TREC 2011 Crowdsourcing Track data and show that our method is able to achieve state-of-the-art performance while requiring 17% - 25% less budget.

【Keywords】: active learning; crowdsourcing; relevance assessment

57. CrowdSelect: Increasing Accuracy of Crowdsourcing Tasks through Behavior Prediction and User Selection.

【Paper Link】【Pages】:539-548

【Authors】: Chenxi Qiu ; Anna Cinzia Squicciarini ; Barbara Carminati ; James Caverlee ; Dev Rishi Khare

【Abstract】: Crowdsourcing allows many people to complete tasks of various difficulty with minimal recruitment and administration costs. However, the lack of participant accountability may entice people to complete as many tasks as possible without fully engaging in them, jeopardizing the quality of responses. In this paper, we present a dynamic and time efficient solution to the task assignment problem in crowdsourcing platforms. Our proposed approach, CrowdSelect, offers a theoretically proven algorithm to assign workers to tasks in a cost efficient manner, while ensuring high accuracy of the overall task. In contrast to existing works, our approach makes minimal assumptions on the probability of error for workers, and completely removes the assumptions that such probability is known a priori and that it remains consistent over time. Through experiments over real Amazon Mechanical Turk traces and synthetic data, we find that CrowdSelect has a significant gain in terms of accuracy compared to state-of-the-art algorithms, and can provide a 17.5% gain in answers' accuracy compared to previous methods, even when there are over 50% malicious workers.

【Keywords】: crowdsourcing; malicious worker; task assignment

58. Attribute-based Crowd Entity Resolution.

【Paper Link】【Pages】:549-558

【Authors】: Asif R. Khan ; Hector Garcia-Molina

【Abstract】: We study the problem of using the crowd to perform entity resolution (ER) on a set of records. For many types of records, especially those involving images, such a task can be difficult for machines, but relatively easy for humans. Typical crowd-based ER approaches ask workers for pairwise judgments between records, which quickly becomes prohibitively expensive even for moderate numbers of records. In this paper, we reduce the cost of pairwise crowd ER approaches by soliciting the crowd for attribute labels on records, and then asking for pairwise judgments only between records with similar sets of attribute labels. However, due to errors induced by crowd-based attribute labeling, a naive attribute-based approach becomes extremely inaccurate even with few attributes. To combat these errors, we use error mitigation strategies which allow us to control the accuracy of our results while maintaining significant cost reductions. We develop a probabilistic model which allows us to determine the optimal, lowest-cost combination of error mitigation strategies needed to achieve a minimum desired accuracy. We test our approach with actual crowdworkers on a dataset of celebrity images, and find that our results yield crowd ER strategies which achieve high accuracy yet are significantly lower cost than pairwise-only approaches.

【Keywords】: crowd; crowd computation; crowdsourcing; deduplication; entity resolution; record linkage

59. Efficient Processing of Location-Aware Group Preference Queries.

【Paper Link】【Pages】:559-568

【Authors】: Miao Li ; Lisi Chen ; Gao Cong ; Yu Gu ; Ge Yu

【Abstract】: With the proliferation of geo-positioning techniques that enable users to acquire their geographical positions, there has been increasing popularity of online location-based services. This development has generated a large volume of points of interest labeled with category features (e.g., hotel, resort, stores, stations, and tourist attractions). It gives prominence to various types of spatial-keyword queries, which are employed to provide fundamental querying functionality for location-based services. We study the Location-aware Group Preference (LGP) query that aims to find a destination place for a group of users. The group of users want to go to a place labeled with a specified category feature (e.g., hotel) together, and each of them has a location and a set of additional preferences. It is expected that the result place of the query belongs to the specified category feature, and it is close to places satisfying the preferences of each user. We develop a novel framework for answering the LGP query, which can be used to compute both exact query result and approximate result with a proven approximation ratio. The efficiency and efficacy of the proposed algorithms for answering the LGP query are verified by extensive experiments on two real datasets.

【Keywords】: group; location; preference; query processing

Session 3d: Mobile 4

60. Mining Shopping Patterns for Divergent Urban Regions by Incorporating Mobility Data.

【Paper Link】【Pages】:569-578

【Authors】: Tianran Hu ; Ruihua Song ; Yingzi Wang ; Xing Xie ; Jiebo Luo

【Abstract】: What people buy is an important aspect or view of lifestyles. Studying people's shopping patterns in different urban regions can not only provide valuable information for various commercial opportunities, but also enable a better understanding about urban infrastructure and urban lifestyle. In this paper, we aim to predict citywide shopping patterns. This is a challenging task due to the sparsity of the available data -- over 60% of the city regions are unknown for their shopping records. To address this problem, we incorporate another important view of human lifestyles, namely mobility patterns. With information on "where people go", we infer "what people buy". Moreover, to model the relations between regions, we exploit spatial interactions in our method. To that end, Collective Matrix Factorization (CMF) with an interaction regularization model is applied to fuse the data from multiple views or sources. Our experimental results have shown that our model outperforms the baseline methods on two standard metrics. Our prediction results on multiple shopping patterns reveal the divergent demands in different urban regions, and thus reflect key functional characteristics of a city. Furthermore, we are able to extract the connection between the two views of lifestyles, and achieve a better or novel understanding of urban lifestyles.

【Keywords】: mobility patterns; multiview lifestyles; shopping patterns; urban computing

61. Large-Scale Analysis of Viewing Behavior: Towards Measuring Satisfaction with Mobile Proactive Systems.

【Paper Link】【Pages】:579-588

【Authors】: Qi Guo ; Yang Song

【Abstract】: Recently, proactive systems such as Google Now and Microsoft Cortana have become increasingly popular in reforming the way users access information on mobile devices. In these systems, relevant content is presented to users based on their context without a query in the form of information cards that do not require a click to satisfy the users. As a result, prior approaches based on clicks cannot provide reliable measurements of user satisfaction with such systems. It is also unclear how much of the previous findings regarding good abandonment with reactive Web searches can be applied to these proactive systems due to the intrinsic difference in user intent, the greater variety of content types and their presentations. In this paper, we present the first large-scale analysis of viewing behavior based on the viewport (the visible fraction of a Web page) of the mobile devices, towards measuring user satisfaction with the information cards of the mobile proactive systems. In particular, we identified and analyzed a variety of factors that may influence the viewing behavior, including biases from ranking positions, the types and attributes of the information cards, and the touch interactions with the mobile devices. We show that by modeling the various factors we can better measure user satisfaction with the mobile proactive systems, enabling stronger statistical power in large-scale online A/B testing.

【Keywords】: large-scale log analysis; mobile proactive systems; satisfaction measures; viewport modeling

62. Where Did You Go: Personalized Annotation of Mobility Records.

【Paper Link】【Pages】:589-598

【Authors】: Fei Wu ; Zhenhui Li

【Abstract】: Recent advances in positioning technology have generated massive volumes of human mobility data. At the same time, large amounts of spatial context data are available and provide us with rich context information. Combining the mobility data with surrounding spatial context enables us to understand the semantics of the mobility records, i.e., what a user is doing at a location (e.g., dining at a restaurant or attending a football game). In this paper, we aim to answer this question by annotating the mobility records with surrounding venues that were actually visited by the user. The problem is non-trivial due to high ambiguity of surrounding contexts. Unlike existing methods that annotate each location record independently, we propose to use all historical mobility records to capture user preferences, which results in more accurate annotations. Our method does not assume the availability of any training data on user preference because of the difficulty of obtaining such data in real-world settings. Instead, we design a Markov random field model to find the best annotations that maximize the consistency of annotated venues. Through extensive experiments on real datasets, we demonstrate that our method significantly outperforms the baseline methods.

【Keywords】: human mobility; recommendation; semantic annotation; social network

63. Understanding Mobile Searcher Attention with Rich Ad Formats.

【Paper Link】【Pages】:599-608

【Authors】: Dmitry Lagun ; Donal McMahon ; Vidhya Navalpakkam

【Abstract】: Mobile Search experiences have evolved significantly from a few blue links that require users to click. Recent search and ad units surface instant information to the user in a variety of visually rich formats that include images, horizontal swipes, and vertical scrolls. These innovative experiences call for new metrics and models to better understand searcher behavior on mobile phones. In this paper, we study how the presence of ads and their formats impacts searchers' gaze and satisfaction. We systematically vary the presentation format of the sponsored result, while controlling for other factors, such as position and quality of organic results. We experiment with several configurations of text ad and rich ad formats. Our findings indicate that showing rich ad formats improves search experience, by drawing more attention to the information-rich ad and allowing users to interact to view more offers, which increases user satisfaction with search. In addition, we extend prior work by comparing the performance of various models to infer users' gaze from viewport data. Our models improve accuracy of existing viewport-based gaze inference methods by 30% in Pearson's correlation. Together, our findings show that viewport data can be used for fast, accurate and scalable measurement of user attention on a per-element basis, for both ads as well as organic search results.

【Keywords】: ads; eye tracking; user attention; user study; viewport

Session 3e: Social Networks---Links and Trust 4

【Paper Link】【Pages】:609-617

【Authors】: Sumit Negi ; Santanu Chaudhury

【Abstract】: A heterogeneous social network is characterized by multiple link types, which makes the task of link prediction in such networks more involved. In the last few years collective link prediction methods have been proposed for the problem of link prediction in heterogeneous networks. These methods capture the correlation between different types of links and utilize this information in the link prediction task. In this paper we pose the problem of link prediction in heterogeneous networks as a multi-task, metric learning (MTML) problem. For each link-type (relation) we learn a corresponding distance measure, which utilizes both network and node features. These link-type specific distance measures are learnt in a coupled fashion by employing the Multi-Task Structure Preserving Metric Learning (MT-SPML) setup. We further extend the MT-SPML method to account for task correlations, robustness to non-informative features and non-stationary degree distribution across networks. Experiments on the Flickr and DBLP networks demonstrate the effectiveness of our proposed approach vis-à-vis competitive baselines.

【Keywords】: heterogeneous network; link prediction; multi-task metric learning

65. Who are My Familiar Strangers?: Revealing Hidden Friend Relations and Common Interests from Smart Card Data.

【Paper Link】【Pages】:619-628

【Authors】: Fusang Zhang ; Beihong Jin ; Tingjian Ge ; Qiang Ji ; Yanling Cui

【Abstract】: The newly emerging location-based social networks (LBSN) such as Tinder and Momo extend social interaction from friends to strangers, providing novel experiences of making new friends. Familiar strangers refer to the strangers who meet frequently in daily life and may share common interests; thus they may be good candidates for friend recommendation. In this paper, we study the problem of discovering familiar strangers, specifically, public transportation trip companions, and their common interests. We collect 5.7 million transaction records of smart cards from about 3.02 million people in the city of Beijing, China. We first analyze this dataset and reveal the temporal and spatial characteristics of passenger encounter behaviors. Then we propose a stability metric to measure hidden friend relations. This metric enables us to employ community detection techniques to capture the communities of trip companions. Further, we infer common interests of each community using a topic model, i.e., LDA4HFC (Latent Dirichlet Allocation for Hidden Friend Communities) model. Such topics for communities help to understand how hidden friend clusters are formed. We evaluate our method using large-scale and real-world datasets, consisting of two-week smart card records and 901,855 Points of Interest (POIs) in Beijing. The results show that our method outperforms three baseline methods with higher recommendation accuracy. Moreover, our case study demonstrates that the discovered topics interpret the communities very well.

【Keywords】: community detection; familiar strangers; friend recommendation; location based social network; topic model

66. PIN-TRUST: Fast Trust Propagation Exploiting Positive, Implicit, and Negative Information.

【Paper Link】【Pages】:629-638

【Authors】: Min-Hee Jang ; Christos Faloutsos ; Sang-Wook Kim ; U. Kang ; Jiwoon Ha

【Abstract】: Given "who-trusts/distrusts-whom" information, how can we propagate the trust and distrust? With the appearance of fraudsters in social network sites, the importance of trust prediction has increased. Most such methods use only explicit and implicit trust information (e.g., if Smith likes several of Johnson's reviews, then Smith implicitly trusts Johnson), but they do not consider distrust. In this paper, we propose PIN-TRUST, a novel method to handle all three types of interaction information: explicit trust, implicit trust, and explicit distrust. The novelties of our method are the following: (a) it is carefully designed to take into account positive, implicit, and negative information, (b) it is scalable (i.e., linear in the input size), (c) most importantly, it is effective and accurate. Our extensive experiments with a real dataset, Epinions.com data, of 100K nodes and 1M edges, confirm that PIN-TRUST is scalable and outperforms existing methods in terms of prediction accuracy, achieving up to 50.4% relative improvement.

【Keywords】: belief propagation; graph mining; trust prediction

67. Predicting Popularity of Twitter Accounts through the Discovery of Link-Propagating Early Adopters.

【Paper Link】【Pages】:639-648

【Authors】: Daichi Imamori ; Keishi Tajima

【Abstract】: In this paper, we propose a method of ranking recently created Twitter accounts according to their prospective popularity. Early detection of new promising accounts is useful for trend prediction, viral marketing, user recommendation, and so on. New accounts are, however, difficult to evaluate because they have not yet established the reputation they deserve, and we cannot apply existing link-based or other popularity-based account evaluation methods. Our method first finds early adopters, i.e., users who often find new good information sources earlier than others. Our method then regards new accounts followed by good early adopters as promising, even if they do not have many followers now. In order to find good early adopters, we estimate the frequency of link propagation from each account, i.e., how many times the follow links from the account have been copied by its followers. If the frequency is high, the account must be a good early adopter who often finds good information sources earlier than its followers. We develop a method of inferring which links are created by copying which links. One important advantage of our method is that it uses only information that can be easily obtained by crawling neighbors of the target accounts in the current Twitter graph. We evaluated our method by an experiment on Twitter data. We chose then-new accounts from an old snapshot of Twitter, computed their ranking by our method, and compared it with the ranking based on the number of followers the accounts currently have. The result shows that our method produces better rankings than various baseline methods, especially for very new accounts that have only a few followers.

【Keywords】: graph analysis; graph evolution; graph mining; hubs; influence; link prediction; link-propagation; micro-blogging

Session 3f: Industry Session III 4

68. "Shall I Be Your Chat Companion?": Towards an Online Human-Computer Conversation System.

【Paper Link】【Pages】:649-658

【Authors】: Rui Yan ; Yiping Song ; Xiangyang Zhou ; Hua Wu

【Abstract】: To establish an automatic conversation system between human and computer is regarded as one of the most hardcore problems in computer science. It requires interdisciplinary techniques in information retrieval, natural language processing, and data management, etc. The challenges lie in how to respond like a human, and to maintain a relevant, meaningful, and continuous conversation. The arrival of big data era reveals the feasibility to create such a system empowered by data-driven approaches. We can now organize the conversational data as a chat companion. In this paper, we introduce a chat companion system, which is a practical conversation system between human and computer as a real application. Given the human utterances as queries, our proposed system will respond with corresponding replies retrieved and highly ranked from a massive conversational data repository. Note that 'practical' here indicates effectiveness and efficiency: both issues are important for a real-time system based on a massive data repository. We have two scenarios of single-turn and multi-turn conversations. In our system, we have a base ranking without conversational context information (for single-turn) and a context-aware ranking (for multi-turn). Both rankings can be conducted either by a shallow learning or deep learning paradigm. We combine these two rankings together in optimization. In the experimental setups, we investigate the performance between effectiveness and efficiency for the proposed methods, and we also compare against a series of baselines to demonstrate the advantage of the proposed framework in terms of p@1, MAP, and nDCG. We present a new angle to launch a practical online conversation system between human and computer.

【Keywords】: big data; human-computer conversation; rank optimization

69. To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos.

【Paper Link】【Pages】:659-668

【Authors】: Yale Song ; Miriam Redi ; Jordi Vallmitjana ; Alejandro Jaimes

【Abstract】: Thumbnails play such an important role in online videos. As the most representative snapshot, they capture the essence of a video and provide the first impression to the viewers; ultimately, a great thumbnail makes a video more attractive to click and watch. We present an automatic thumbnail selection system that exploits two important characteristics commonly associated with meaningful and attractive thumbnails: high relevance to video content and superior visual aesthetic quality. Our system selects attractive thumbnails by analyzing various visual quality and aesthetic metrics of video frames, and performs a clustering analysis to determine the relevance to video content, thus making the resulting thumbnails more representative of the video. On the task of predicting thumbnails chosen by professional video editors, we demonstrate the effectiveness of our system against six baseline methods, using a real-world dataset of 1,118 videos collected from Yahoo Screen. In addition, we study what makes a frame a good thumbnail by analyzing the statistical relationship between thumbnail frames and non-thumbnail frames in terms of various image quality features. Our study suggests that the selection of a good thumbnail is highly correlated with objective visual quality metrics, such as the frame texture and sharpness, implying the possibility of building an automatic thumbnail selection system based on visual aesthetics.

【Keywords】: video thumbnail extraction

70. Separating-Plane Factorization Models: Scalable Recommendation from One-Class Implicit Feedback.

【Paper Link】【Pages】:669-678

【Authors】: Haolan Chen ; Di Niu ; Kunfeng Lai ; Yu Xu ; Masoud Ardakani

【Abstract】: We study the video recommendation problem based on a large amount of user viewing logs instead of explicit ratings. As viewing records implicitly suggest user preferences, existing matrix factorization methods fail to generate discriminative recommendations based on such one-class positive samples. We propose a scalable approach called separating-plane matrix factorization (SPMF) to make effective recommendations based on positive implicit feedback, with a learning complexity that is comparable to traditional matrix factorization. With extensive offline evaluation in Tencent Data Warehouse (TDW) based on a large amount of data, we show that our approach outperforms a wide range of state-of-the-art methods. We also deployed our system in the QQ Browser App of Tencent and performed online A/B testing with real users. Results suggest that our approach increased the video click through rate by 23% over implicit-feedback collaborative filtering (IFCF), a scheme available in Apache Spark's MLlib.

【Keywords】: big data; collaborative filtering; factorization machine; factorization model; implicit feedback; matrix factorization; one-class feedback; personalized recommendation; recommender systems; tencent; video recommendation; video watching habits

71. User Response Learning for Directly Optimizing Campaign Performance in Display Advertising.

【Paper Link】【Pages】:679-688

【Authors】: Kan Ren ; Weinan Zhang ; Yifei Rong ; Haifeng Zhang ; Yong Yu ; Jun Wang

【Abstract】: Learning and predicting user responses, such as clicks and conversions, are crucial for many Internet-based businesses including web search, e-commerce, and online advertising. Typically, a user response model is established by optimizing the prediction accuracy, e.g., minimizing the error between the prediction and the ground truth user response. However, in many practical cases, predicting user responses is only part of a rather larger predictive or optimization task, where on one hand, the accuracy of a user response prediction determines the final (expected) utility to be optimized, but on the other hand, its learning may also be influenced by the follow-up stochastic process. It is, thus, of great interest to optimize the entire process as a whole rather than treat them independently or sequentially. In this paper, we take real-time display advertising as an example, where the predicted user's ad click-through rate (CTR) is employed to calculate a bid for an ad impression in the second price auction. We reformulate a common logistic regression CTR model by putting it back into its subsequent bidding context: rather than minimizing the prediction error, the model parameters are learned directly by optimizing campaign profit. The gradient update resulting from our formulations naturally fine-tunes the cases where the market competition is high, leading to a more cost-effective bidding. Our experiments demonstrate that, while maintaining comparable CTR prediction accuracy, our proposed user response learning leads to campaign profit gains as much as 78.2% for offline test and 25.5% for online A/B test over strong baselines.

【Keywords】: bid optimization; ctr estimation; real-time bidding; user response learning
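
The bidding context described in the abstract above can be illustrated with a minimal sketch. It is not the authors' learning method: it only replays logged auctions to compute campaign profit for a fixed logistic-regression CTR model. The function names, the linear bid rule (bid = CTR × value per click), and the log format `(features, market_price, clicked)` are assumptions made for illustration.

```python
import math

def predict_ctr(weights, features):
    """Logistic-regression CTR estimate for one impression."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def linear_bid(ctr, value_per_click):
    """Assumed bid rule: bid the expected value of the impression."""
    return ctr * value_per_click

def campaign_profit(impressions, weights, value_per_click):
    """Replay logged auctions under a second-price rule.

    Each impression is (features, market_price, clicked). We win when
    our bid exceeds the market price, pay the market price, and earn
    value_per_click if the impression was clicked.
    """
    profit = 0.0
    for features, market_price, clicked in impressions:
        bid = linear_bid(predict_ctr(weights, features), value_per_click)
        if bid > market_price:
            profit += (value_per_click if clicked else 0.0) - market_price
    return profit
```

Profit computed this way depends on the CTR weights through the bid, which is why a model could in principle be tuned on profit rather than on prediction error alone.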

Keynote Address 2 1

72. Personalized Search: Potential and Pitfalls.

【Paper Link】【Pages】:689

【Authors】: Susan T. Dumais

【Abstract】: Traditionally search engines have returned the same results to everyone who asks the same question. However, using a single ranking for everyone in every context at every point in time limits how well a search engine can do in providing relevant information. In this talk I present a framework to quantify the "potential for personalization" which we use to characterize the extent to which different people have different intents for the same query. I describe several examples of how we represent and use different kinds of contextual features to improve search quality for individuals and groups. Finally, I conclude by highlighting important challenges in developing personalized systems at Web scale including privacy, transparency, serendipity, and evaluation.

【Keywords】: human computer interaction for information retrieval; personalized search; user modeling; web search

Session 4a: Information Retrieval 4

73. Query Variations and their Effect on Comparing Information Retrieval Systems.

【Paper Link】【Pages】:691-700

【Authors】: Guido Zuccon ; João R. M. Palotti ; Allan Hanbury

【Abstract】: We explore the implications of using query variations for evaluating information retrieval systems and how these variations should be exploited to compare system effectiveness. Current evaluation approaches consider the availability of a set of topics (information needs), and only one expression of each topic in the form of a query is used for evaluation and system comparison. While there is strong evidence that considering query variations better models the usage of retrieval systems and accounts for the important user aspect of user variability, it is unclear how to best exploit query variations for evaluating and comparing information retrieval systems. We propose a framework for evaluating retrieval systems that explicitly takes into account query variations. The framework considers both the system mean effectiveness and its variance over query variations and topics, as opposed to current approaches that only consider the mean across topics or perform a topic-focused analysis of variance across systems. Furthermore, the framework extends current evaluation practice by encoding: (1) user tolerance to effectiveness variations, (2) the popularity of different query variations, and (3) the relative importance of individual topics. These extensions and our findings make information retrieval comparisons more aligned with user behaviour.

【Keywords】: evaluation framework; information retrieval evaluation; mean variance analysis; mean variance evaluation; portfolio theory; query variations
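
As a rough illustration of the mean-variance idea in the abstract above, the sketch below scores a system by its mean effectiveness minus a penalty on its variance over all query variations and topics. The function name, the `risk_aversion` parameter, and the input layout (topic mapped to a list of per-variation scores) are assumptions; the paper's framework also models query popularity and topic importance, which this sketch omits.

```python
from statistics import mean, pvariance

def mean_variance_score(scores_by_topic, risk_aversion=1.0):
    """Mean effectiveness penalized by variance across query variations.

    scores_by_topic maps a topic id to the effectiveness scores obtained
    for each query variation of that topic. risk_aversion is a
    hypothetical knob for how strongly instability is penalized.
    """
    all_scores = [s for scores in scores_by_topic.values() for s in scores]
    return mean(all_scores) - risk_aversion * pvariance(all_scores)
```

With `risk_aversion=0` the score reduces to the plain mean, so a system with a higher mean but unstable behaviour across variations can win under the mean alone and lose once variance is penalized.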

74. Semantic Matching by Non-Linear Word Transportation for Information Retrieval.

【Paper Link】【Pages】:701-710

【Authors】: Jiafeng Guo ; Yixing Fan ; Qingyao Ai ; W. Bruce Croft

【Abstract】: A common limitation of many information retrieval (IR) models is that relevance scores are solely based on exact (i.e., syntactic) matching of words in queries and documents under the simple Bag-of-Words (BoW) representation. This not only leads to the well-known vocabulary mismatch problem, but also does not allow semantically related words to contribute to the relevance score. Recent advances in word embedding have shown that semantic representations for words can be efficiently learned by distributional models. A natural generalization is then to represent both queries and documents as Bag-of-Word-Embeddings (BoWE), which provides a better foundation for semantic matching than BoW. Based on this representation, we introduce a novel retrieval model by viewing the matching between queries and documents as a non-linear word transportation (NWT) problem. With this formulation, we define the capacity and profit of a transportation model designed for the IR task. We show that this transportation problem can be efficiently solved via pruning and indexing strategies. Experimental results on several representative benchmark datasets show that our model can outperform many state-of-the-art retrieval models as well as recently introduced word embedding-based models. We also conducted extensive experiments to analyze the effect of different settings on our semantic matching model.

【Keywords】: retrieval model; word embedding; word transportation
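
The sketch below is not the NWT model from the abstract above. It is a much simpler soft-matching score over Bag-of-Word-Embeddings, included only to show how semantically related words can contribute to relevance when exact matching would give zero. The function names and the toy embedding dictionary passed in by the caller are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def soft_match_score(query_terms, doc_terms, embeddings):
    """Sum, over query terms, of the best non-negative similarity to any document term."""
    score = 0.0
    for q in query_terms:
        best = max((cosine(embeddings[q], embeddings[d]) for d in doc_terms), default=0.0)
        score += max(best, 0.0)
    return score
```

Under exact matching, a query for one word scores zero against a document that contains only a synonym; here the synonym still contributes in proportion to its embedding similarity.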

75. Generalizing Translation Models in the Probabilistic Relevance Framework.

【Paper Link】【Pages】:711-720

【Authors】: Navid Rekabsaz ; Mihai Lupu ; Allan Hanbury ; Guido Zuccon

【Abstract】: A recurring question in information retrieval is whether term associations can be properly integrated in traditional information retrieval models while preserving their robustness and effectiveness. In this paper, we revisit a wide spectrum of existing models (Pivoted Document Normalization, BM25, BM25 Verboseness Aware, Multi-Aspect TF, and Language Modelling) by introducing a generalisation of the idea of the translation model. This generalisation is a de facto transformation of the translation models from Language Modelling to the probabilistic models. In doing so, we observe a potential limitation of these generalised translation models: they only affect the term frequency based components of all the models, ignoring changes in document and collection statistics. We correct this limitation by extending the translation models with the 15 statistics of term associations and provide extensive experimental results to demonstrate the benefit of the newly proposed methods. Additionally, we compare the translation models with query expansion methods based on the same term association resources, as well as based on Pseudo-Relevance Feedback (PRF). We observe that translation models always outperform the first, but provide complementary information with the second, such that by using PRF and our translation models together we observe results better than the current state of the art.

【Keywords】: IR models; related terms; translation model; word embeddings

76. Axiomatic Result Re-Ranking.

【Paper Link】【Pages】:721-730

【Authors】: Matthias Hagen ; Michael Völske ; Steve Göring ; Benno Stein

【Abstract】: We consider the problem of re-ranking the top-k documents returned by a retrieval system given some search query. This setting is common to learning-to-rank scenarios, and it is often solved with machine learning and feature weighting based on user preferences such as clicks, dwell times, etc. In this paper, we combine the learning-to-rank paradigm with the recent developments on axioms for information retrieval. In particular, we suggest to re-rank the top-k documents of a retrieval system using carefully chosen axiom combinations. In recent years, research on axioms for information retrieval has focused on identifying reasonable constraints that retrieval systems should fulfill. Researchers have analyzed a wide range of standard retrieval models for conformance to the proposed axioms and, at times, suggested certain adjustments to the models. We take up this axiomatic view---but, instead of adjusting the retrieval models themselves, we suggest the following innovation: to adopt the learning-to-rank idea and to re-rank the top-k results directly using promising axiom combinations. This way, we can turn every reasonable basic retrieval model into an axiom-based retrieval model. In large-scale experiments on the ClueWeb corpora, we identify promising axiom combinations for a variety of retrieval models. Our experiments show that for most of these models our axiom-based re-ranking significantly improves the original retrieval performance.

【Keywords】: axiom combinations; axiomatic re-ranking; axiomatic retrieval; axiomatic retrieval model; axioms for information retrieval; clueweb; learning to rank; rank aggregation; term proximity axioms; top-k retrieval; trec web track

Session 4b: User Behavior and Interfaces 4

77. Agents, Simulated Users and Humans: An Analysis of Performance and Behaviour.

【Paper Link】【Pages】:731-740

【Authors】: David Maxwell ; Leif Azzopardi

【Abstract】: Most of the current models that are used to simulate users in Interactive Information Retrieval (IIR) lack realism and agency. Such models generally make decisions in a stochastic manner, without recourse to the actual information encountered or the underlying information need. In this paper, we develop a more sophisticated model of the user that includes their cognitive state within the simulation. The cognitive state maintains data about what the simulated user knows, has done and has seen, along with representations of what it considers attractive and relevant. Decisions to inspect or judge are then made based upon the simulated user's current state, rather than stochastically. In the context of ad-hoc topic retrieval, we evaluate the quality of the simulated users and agents by comparing their behaviour and performance against 48 human subjects under the same conditions, topics, time constraints, costs and search engine. Our findings show that while naive configurations of simulated users and agents substantially outperform our human subjects, their search behaviour is notably different from actual searchers. However, more sophisticated search agents can be tuned to act more like actual searchers providing greater realism. This innovation advances the state of the art in simulation, from simulated users towards autonomous agents. It provides a much needed step forward enabling the creation of more realistic simulations, while also motivating the development of more advanced cognitive agents and tools to help support and augment human searchers. Future work will focus not only on the pragmatics of tuning and training such agents for topic retrieval, but will also look at developing agents for other tasks and contexts such as collaborative search and slow search.

【Keywords】: agents; autonomous agents; continuation strategies; interactive information retrieval; querying strategies; simulation; stopping strategies; user modeling

78. Inspiration or Preparation?: Explaining Creativity in Scientific Enterprise.

【Paper Link】【Pages】:741-750

【Authors】: Xinyang Zhang ; Dashun Wang ; Ting Wang

【Abstract】: Human creativity is the ultimate driving force behind scientific progress. While the building blocks of innovations are often embodied in existing knowledge, it is creativity that blends seemingly disparate ideas. Existing studies have made striding advances in quantifying creativity of scientific publications by investigating their citation relationships. Yet, little is known hitherto about the underlying mechanisms governing scientific creative processes, largely because a paper's references, at best, only partially reflect its authors' actual information consumption. This work represents an initial step towards fine-grained understanding of creative processes in scientific enterprise. Specifically, using two web-scale longitudinal datasets (120.1 million papers and 53.5 billion web requests spanning 4 years), we directly contrast authors' information consumption behaviors against their knowledge products. We find that, for 59.0% of papers across all scientific fields, 25.7% of their creativity can be readily explained by information consumed by their authors. Further, by leveraging these findings, we develop a predictive framework that accurately identifies the most critical knowledge to fostering target scientific innovations. We believe that our framework is of fundamental importance to the study of scientific creativity. It promotes strategies to stimulate and potentially automate creative processes, and provides insights towards more effective designs of information recommendation platforms.

【Keywords】: creative process; knowledge production; science of science

【Paper Link】【Pages】:751-760

【Authors】: Jaewon Kim ; Paul Thomas ; Ramesh Sankaranarayana ; Tom Gedeon ; Hwan-Jin Yoon

【Abstract】: Vertical scrolling is the standard method of exploring search results pages. For touch-enabled mobile devices that are not equipped with a mouse or keyboard, we adopt other methods of controlling the viewport with the aim of investigating user interaction. From the intuition that people are used to reading books by turning pages horizontally, we conducted a user experiment to investigate the effects of horizontal and vertical control types (pagination versus scrolling) on a touch-enabled mobile phone. Our findings suggest that participants using pagination were more likely to find relevant documents, especially those over the fold; spent more time attending to relevant results; and were faster to click while spending less time on the search result pages overall. We also found that the main reason for the difference in search speed is the time taken for the scroll itself. We conclude that search engines need to provide different viewport controls to allow better search experiences on touch-enabled mobile devices.

【Keywords】: mobile device; pagination; scroll effect; user study

80. Studying the Dark Triad of Personality through Twitter Behavior.

【Paper Link】【Pages】:761-770

【Authors】: Daniel Preotiuc-Pietro ; Jordan Carpenter ; Salvatore Giorgi ; Lyle H. Ungar

【Abstract】: Research into the darker traits of human nature is growing in interest especially in the context of increased social media usage. This allows users to express themselves to a wider online audience. We study the extent to which the standard model of dark personality -- the dark triad -- consisting of narcissism, psychopathy and Machiavellianism, is related to observable Twitter behavior such as platform usage, posted text and profile image choice. Our results show that we can map various behaviors to psychological theory and study new aspects related to social media usage. Finally, we build a machine learning algorithm that predicts the dark triad of personality in out-of-sample users with reliable accuracy.

【Keywords】: dark triad; personality; social media; twitter; user profiling

Session 4c: Documents 4

81. Document Filtering for Long-tail Entities.

【Paper Link】【Pages】:771-780

【Authors】: Ridho Reinanda ; Edgar Meij ; Maarten de Rijke

【Abstract】: Filtering relevant documents with respect to entities is an essential task in the context of knowledge base construction and maintenance. It entails processing a time-ordered stream of documents that might be relevant to an entity in order to select only those that contain vital information. State-of-the-art approaches to document filtering for popular entities are entity-dependent: they rely on, and are trained on, the differentiating features of each specific entity. Moreover, these approaches tend to use so-called extrinsic information, such as Wikipedia page views and related entities, which is typically available only for popular head entities. Entity-dependent approaches based on such signals are therefore ill-suited as filtering methods for long-tail entities. In this paper we propose a document filtering method for long-tail entities that is entity-independent and thus also generalizes to unseen or rarely seen entities. It is based on intrinsic features, i.e., features that are derived from the documents in which the entities are mentioned. We propose a set of features that capture informativeness, entity-saliency, and timeliness. In particular, we introduce features based on entity aspect similarities, relation patterns, and temporal expressions and combine these with standard features for document filtering. Experiments following the TREC KBA 2014 setup on a publicly available dataset show that our model is able to improve the filtering performance for long-tail entities over several baselines. Results of applying the model to unseen entities are promising, indicating that the model is able to learn the general characteristics of a vital document. The overall performance across all entities---i.e., not just long-tail entities---improves upon the state-of-the-art without depending on any entity-specific training data.

【Keywords】: document filtering; long-tail entities; semantic search

82. Estimating Time Models for News Article Excerpts.

【Paper Link】【Pages】:781-790

【Authors】: Arunav Mishra ; Klaus Berberich

【Abstract】: It is often difficult to ground text to precise time intervals due to the inherent uncertainty arising from either missing or multiple expressions at year, month, and day time granularities. We address the problem of estimating an excerpt-time model capturing the temporal scope of a given news article excerpt as a probability distribution over chronons. For this, we propose a semi-supervised distribution propagation framework that leverages redundancy in the data to improve the quality of estimated time models. Our method generates an event graph with excerpts as nodes and models various inter-excerpt relations as edges. It then propagates empirical excerpt-time models estimated for temporally annotated excerpts, to those that are strongly related but miss annotations. In our experiments, we first generate a test query set by randomly sampling 100 Wikipedia events as queries. For each query, making use of a standard text retrieval model, we then obtain top-10 documents with an average of 150 excerpts. From these, each temporally annotated excerpt is considered as gold standard. The evaluation measures are first computed for each gold standard excerpt for a single query, by comparing the estimated model with our method to the empirical model from the original expressions. Final scores are reported by averaging over all the test queries. Experiments on the English Gigaword corpus show that our method estimates significantly better time models than several baselines taken from the literature.

【Keywords】: distribution propagation; excerpt-time model; probabilistic models; sparsity reduction; temporal content analysis; temporal scoping
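
As a rough illustration of what "a probability distribution over chronons" can look like, the sketch below builds an empirical time model from annotated temporal expressions. It assumes day-level chronons, treats each expression as a closed date interval, and spreads mass uniformly within and across expressions; these choices and the function names are illustrative assumptions, not the estimation procedure of the paper.

```python
from collections import Counter
from datetime import date, timedelta


def interval_days(start, end):
    """Return every day (chronon) in the closed interval [start, end]."""
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]


def empirical_time_model(intervals):
    """Build a distribution over days from (start, end) date intervals.

    Assumption: each expression gets equal total mass, spread uniformly
    over the days it covers.
    """
    model = Counter()
    for start, end in intervals:
        days = interval_days(start, end)
        share = 1.0 / (len(days) * len(intervals))
        for day in days:
            model[day] += share
    return dict(model)


# Hypothetical excerpt mentioning "March 2016" and "March 15, 2016".
expressions = [
    (date(2016, 3, 1), date(2016, 3, 31)),
    (date(2016, 3, 15), date(2016, 3, 15)),
]
time_model = empirical_time_model(expressions)
```

Under these assumptions the day named by both expressions receives the most mass, while the remaining days of the month share the rest evenly.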

83. A Framework for Task-specific Short Document Expansion.

【Paper Link】【Pages】:791-800

【Authors】: Ramakrishna B. Bairi ; Raghavendra Udupa ; Ganesh Ramakrishnan

【Abstract】: Collections that contain a large number of short texts are becoming increasingly common (e.g., tweets, reviews, etc.). Analytical tasks (such as classification, clustering, etc.) involving short texts can be challenging due to the lack of context and owing to their sparseness. An often encountered problem is low accuracy on the task. A standard technique used in the handling of short texts is expanding them before subjecting them to the task. However, existing works on short text expansion suffer from certain limitations: (i) they depend on domain knowledge to expand the text; (ii) they employ task-specific heuristics; and (iii) the expansion procedure is tightly coupled to the task. This makes it hard to adapt a procedure designed for one task to another. We present an expansion technique -- TIDE (Task-specIfic short Document Expansion) -- that can be applied to several Machine Learning, NLP and Information Retrieval tasks on short texts (such as short text classification, clustering, entity disambiguation, and the like) without using task-specific heuristics or domain-specific knowledge for expansion. At the same time, our technique is capable of learning to expand short texts in a task-specific way. That is, the same technique applied to expand a short text in two different tasks is able to learn to produce different expansions depending upon what expansion benefits each task's performance. To speed up the learning process, we also introduce a technique called block learning. Our experiments with classification and clustering tasks show that our framework improves upon several baselines according to standard evaluation metrics, which include accuracy and normalized mutual information (NMI).

【Keywords】: document expansion; short text expansion

84. Beyond Clustering: Sub-DAG Discovery for Categorising Documents.

【Paper Link】【Pages】:801-810

【Authors】: Ramakrishna B. Bairi ; Mark James Carman ; Ganesh Ramakrishnan

【Abstract】: We study the problem of generating DAG-structured category hierarchies over a given set of documents associated with "importance" scores. An example application is automatically generating Wikipedia disambiguation pages for a set of articles having click counts associated with them. Unlike previous works, which focus on clustering the set of documents using the category hierarchy as features, we directly pose the problem as that of finding a DAG-structured generative model that has maximum likelihood of generating the observed "importance" scores for each document, where documents are modeled as the leaf nodes in the DAG structure. Desirable properties of the categories in the inferred DAG-structured hierarchy include document coverage and category relevance, each of which, we show, is naturally modeled by our generative model. We propose two different algorithms for estimating the model parameters: one models the DAG as a Bayesian Network and estimates its parameters via Gibbs Sampling; the other estimates the path probabilities using the Expectation Maximization algorithm. We empirically evaluate our method on the problem of automatically generating Wikipedia disambiguation pages using human-generated clusterings as the ground truth. We find that our framework improves upon the baselines according to the F1 score and Entropy, which are used as standard metrics to evaluate hierarchical clustering.

【Keywords】: gibbs sampling; hierarchical categorisation; topic model

Session 4d: Knowledge Mining and Management 4

85. On Transductive Classification in Heterogeneous Information Networks.

【Paper Link】【Pages】:811-820

【Authors】: Xiang Li ; Ben Kao ; Yudian Zheng ; Zhipeng Huang

【Abstract】: A heterogeneous information network (HIN) is used to model objects of different types and their relationships. Objects are often associated with properties such as labels. In many applications, such as curated knowledge bases for which object labels are manually given, only a small fraction of the objects are labeled. Studies have shown that transductive classification is an effective way to classify and to deduce labels of objects, and a number of transductive classifiers have been put forward to classify objects in an HIN. We study the performance of a few representative transductive classification algorithms on HINs. We identify two fundamental properties, namely, cohesiveness and connectedness, of an HIN that greatly influence the effectiveness of transductive classifiers. We define metrics that measure the two properties. Through experiments, we show that the two properties serve as very effective indicators that predict the accuracy of transductive classifiers. Based on cohesiveness and connectedness we derive (1) a black-box tester that evaluates whether transductive classifiers should be applied for a given classification task and (2) an active learning algorithm that identifies the objects in an HIN whose labels should be sought in order to improve classification accuracy.

【Keywords】: heterogeneous information network; knowledge base; transductive classification

86. Efficient Hidden Trajectory Reconstruction from Sparse Data.

【Paper Link】【Pages】:821-830

【Authors】: Ning Yang ; Philip S. Yu

【Abstract】: In this paper, we investigate the problem of reconstructing hidden trajectories from a collection of separate spatial-temporal points without ID information, given the number of hidden trajectories. The challenge is three-fold: lack of meaningful features, data sparsity, and missing trajectory links. We propose a novel approach called Hidden Trajectory Reconstruction (HTR). From an information-theoretic perspective, we devise five novel temporal features and combine them into a Latent Spatial-Temporal Feature Vector (LSTFV) to characterize the dynamics of a single spatial-temporal point. The proposed features have the potential of distinguishing spatial-temporal points between trajectories. To overcome the data sparsity, we assemble the LSTFVs into a sparse Temporal Feature Tensor (TF-Tensor) and propose an algorithm called Parallel Iterative Collaborative Approximation of Sparse Tensor (PICAST). PICAST approximates the TF-Tensor by decomposing it into a tensor product of a low-rank core identity tensor and three dense factor matrices with a divide-and-conquer strategy. To achieve a dense approximate tensor with good accuracy and efficiency, PICAST minimizes a sparsity-measure and fuses an additional matrix of static geographical region features. To recover the missing trajectory links, we propose a mapping, Cross-Temporal Connectivity Preserving Transformation (CTCPT), to map the LSTFVs of the separate spatial-temporal points to an intrinsic space called Cross-Temporal Connectivity Preserving Space (CTCPS). CTCPT uses Cross-Temporal Connectivity (CTC) to evaluate whether two spatial-temporal points belong to the same trajectory and, if they do, how strong the connectivity between them is. Thanks to CTCPT, the hidden trajectories can be reconstructed from clusters generated in CTCPS by a clustering algorithm. Finally, extensive experiments conducted on synthetic and real datasets verify the effectiveness and efficiency of our algorithms.

【Keywords】: cross-temporal connectivity; latent spatial-temporal feature; sparse tensor decomposition; trajectory reconstruction

87. Quark-X: An Efficient Top-K Processing Framework for RDF Quad Stores.

【Paper Link】【Pages】:831-840

【Authors】: Jyoti Leeka ; Srikanta Bedathur ; Debajyoti Bera ; Medha Atre

【Abstract】: There is a growing trend towards enriching the RDF content from its classical Subject-Predicate-Object triple form to an annotated representation which can model richer relationships, including fact provenance, fact confidence, higher-order relationships and so on. One of the recommended ways to achieve this is to use reification and represent it as N-Quads (or simply quads), where an additional identifier is associated with the entire RDF statement, which can then be used to add further annotations. A typical use of such annotations is to attach quantifiable confidence values to facts. In such settings, it is important to support efficient top-k queries, typically over user-defined ranking functions containing sentence-level confidence values in addition to other quantifiable values in the database. In this paper, we present Quark-X, an RDF-store and SPARQL processing system for reified RDF data represented in the form of quads. This paper presents the overall architecture of our system -- illustrating the modifications which need to be made to a native quad store for it to process top-k queries. In Quark-X, we propose indexing and query processing techniques for making top-k querying efficient. In addition, we present the results of a comprehensive empirical evaluation of our system over Yago2S and DBpedia datasets. Our performance study shows that the proposed method achieves one to two orders of magnitude speed-up over baseline solutions.

【Keywords】: rdf; sparql; top-k

88. Reenactment for Read-Committed Snapshot Isolation.

【Paper Link】【Pages】:841-850

【Authors】: Bahareh Sadat Arab ; Dieter Gawlick ; Vasudha Krishnaswamy ; Venkatesh Radhakrishnan ; Boris Glavic

【Abstract】: Provenance for transactional updates is critical for many applications such as auditing and debugging of transactions. Recently, we have introduced MV-semirings, an extension of the semiring provenance model that supports updates and transactions. Furthermore, we have proposed reenactment, a declarative form of replay with provenance capture, as an efficient and non-invasive method for computing this type of provenance. However, this approach is limited to the snapshot isolation (SI) concurrency control protocol, while many real-world applications apply the read committed version of snapshot isolation (RC-SI) to improve performance at the cost of consistency. We present non-trivial extensions of the model and reenactment approach to be able to compute provenance of RC-SI transactions efficiently. In addition, we develop techniques for applying reenactment across multiple RC-SI transactions. Our experiments demonstrate that our implementation in the GProM system supports efficient reconstruction and querying of provenance.

【Keywords】: provenance; read-committed snapshot isolation; reenactment; transaction

Session 4e: Truth Discovery 4

89. Influence-Aware Truth Discovery.

【Paper Link】【Pages】:851-860

【Authors】: Hengtong Zhang ; Qi Li ; Fenglong Ma ; Houping Xiao ; Yaliang Li ; Jing Gao ; Lu Su

【Abstract】: In the age of big data, information for the same entity can be obtained from different sources, which is inevitably conflicting. Therefore, aggregation methods are needed to identify the trustworthy information from such conflicting data. Truth discovery, which improves the aggregation results by estimating source trustworthiness and discovering truths simultaneously, has become an emerging field. Most truth discovery methods assume that sources make their claims independently, which may not be true in practice. As a matter of fact, influences among sources are ubiquitous and the claims made by one source may be influenced by others. Although there is some work that considers source correlation, those methods are designed to handle categorical claims, which is not general enough to represent the complicated real world applications. To tackle these challenges in truth discovery, we propose an unsupervised probabilistic model named IATD. The model takes source correlations as prior for influence derivation. To model influences among sources, we introduce "claim trustworthiness", which fuses the trustworthiness of the source which provides the claim and the trustworthiness of its influencers. Besides, the proposed model can handle different data types using different distributions in the probabilistic model. Experiments on real-world datasets show that IATD model can improve the aggregation performance compared with the state-of-the-art truth discovery approaches. The properties of IATD model are further illustrated using simulated datasets.

【Keywords】: probabilistic model; truth discovery; unsupervised learning
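
The general idea of estimating source trustworthiness and truths simultaneously can be illustrated with a minimal sketch. The code below is a generic iterative weighted-voting baseline for categorical claims, not the IATD model: it assumes independent sources, and the function name, the smoothing constants and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def discover_truths(claims, iterations=10):
    """Alternate between weighted voting and source-weight re-estimation.

    claims: list of (source, item, value) triples with categorical values.
    Returns (truths, weights).
    """
    sources = {source for source, _, _ in claims}
    weights = {source: 1.0 for source in sources}  # start with equal trust
    truths = {}
    for _ in range(iterations):
        # Step 1: each item takes the value with the largest weighted vote.
        votes = defaultdict(lambda: defaultdict(float))
        for source, item, value in claims:
            votes[item][value] += weights[source]
        # sorted() makes tie-breaking deterministic (alphabetical order).
        truths = {
            item: max(sorted(values), key=values.get)
            for item, values in votes.items()
        }
        # Step 2: a source's weight is its smoothed agreement with the truths.
        correct = defaultdict(int)
        total = defaultdict(int)
        for source, item, value in claims:
            total[source] += 1
            correct[source] += int(truths[item] == value)
        weights = {s: (correct[s] + 1) / (total[s] + 2) for s in sources}
    return truths, weights


# Hypothetical toy data: three sources making claims about three items.
claims = [
    ("s1", "capital", "Canberra"),
    ("s2", "capital", "Canberra"),
    ("s3", "capital", "Sydney"),
    ("s1", "population", "26M"),
    ("s2", "population", "26M"),
    ("s3", "population", "20M"),
    ("s1", "language", "English"),
    ("s3", "language", "French"),
]
truths, weights = discover_truths(claims)
```

In this toy run the source that agrees with the majority on the contested items ends up with a higher weight, which then decides the item on which only two sources disagree.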

90. Truth Discovery via Exploiting Implications from Multi-Source Data.

【Paper Link】【Pages】:861-870

【Authors】: Xianzhi Wang ; Quan Z. Sheng ; Lina Yao ; Xue Li ; Xiu Susie Fang ; Xiaofei Xu ; Boualem Benatallah

【Abstract】: Data veracity is a grand challenge for various tasks on the Web. Since the web data sources are inherently unreliable and may provide conflicting information about the same real-world entities, truth discovery is emerging as a countermeasure of resolving the conflicts by discovering the truth, which conforms to the reality, from the multi-source data. A major challenge related to truth discovery is that different data items may have varying numbers of true values (or multi-truth), which counters the assumption of existing truth discovery methods that each data item should have exactly one true value. In this paper, we address this challenge by exploiting and leveraging the implications from multi-source data. In particular, we exploit three types of implications, namely the implicit negative claims, the distribution of positive/negative claims, and the co-occurrence of values in sources' claims, to facilitate multi-truth discovery. We propose a probabilistic approach with improvement measures that incorporate the three implications in all stages of truth discovery process. In particular, incorporating the negative claims enables multi-truth discovery, considering the distribution of positive/negative claims relieves truth discovery from the impact of sources' behavioral features in the specific datasets, and considering values' co-occurrence relationship compensates the information lost from evaluating each value in the same claims individually. Experimental results on three real-world datasets demonstrate the effectiveness of our approach.

【Keywords】: imbalanced claims; multiple true values; probabilistic model; truth discovery

【Paper Link】【Pages】:871-880

【Authors】: Tarique Siddiqui ; Xiang Ren ; Aditya G. Parameswaran ; Jiawei Han

【Abstract】: Given the large volume of technical documents available, it is crucial to automatically organize and categorize these documents to be able to understand and extract value from them. Towards this end, we introduce a new research problem called Facet Extraction. Given a collection of technical documents, the goal of Facet Extraction is to automatically label each document with a set of concepts for the key facets (e.g., application, technique, evaluation metrics, and dataset) that people may be interested in. Facet Extraction has numerous applications, including document summarization, literature search, patent search and business intelligence. The major challenge in performing Facet Extraction arises from multiple sources: concept extraction, concept to facet matching, and facet disambiguation. To tackle these challenges, we develop FacetGist, a framework for facet extraction. Facet Extraction involves constructing a graph-based heterogeneous network to capture information available across multiple local sentence-level features, as well as global context features. We then formulate a joint optimization problem, and propose an efficient algorithm for graph-based label propagation to estimate the facet of each concept mention. Experimental results on technical corpora from two domains demonstrate that Facet Extraction can lead to an improvement of over 25% in both precision and recall over competing schemes.

【Keywords】: concept extraction; heterogeneous graphs; label propagation; scientific documents

92. Empowering Truth Discovery with Multi-Truth Prediction.

【Paper Link】【Pages】:881-890

【Authors】: Xianzhi Wang ; Quan Z. Sheng ; Lina Yao ; Xue Li ; Xiu Susie Fang ; Xiaofei Xu ; Boualem Benatallah

【Abstract】: Truth discovery is the problem of detecting true values from the conflicting data provided by multiple sources on the same data items. Since sources' reliability is unknown a priori, a truth discovery method usually estimates sources' reliability along with the truth discovery process. A major limitation of existing truth discovery methods is that they commonly assume exactly one true value on each data item and therefore cannot deal with the more general case that a data item may have multiple true values (or multi-truth). Since the number of true values may vary from data item to data item, this requires truth discovery methods being able to detect varying numbers of truth values from the multi-source data. In this paper, we propose a multi-truth discovery approach, which addresses the above challenges by providing a generic framework for enhancing existing truth discovery methods. In particular, we redeem the numbers of true values as an important clue for facilitating multi-truth discovery. We present the procedure and components of our approach, and propose three models, namely the byproduct model, the joint model, and the synthesis model to implement our approach. We further propose two extensions to enhance our approach, by leveraging the implications of similar numerical values and values' co-occurrence information in sources' claims to improve the truth discovery accuracy. Experimental studies on real-world datasets demonstrate the effectiveness of our approach.

【Keywords】: empowerment model; multiple truths; truth discovery; value co-occurrence

Session 4f: Industry Session IV 3

93. Using Machine Learning to Improve the Email Experience.

【Paper Link】【Pages】:891

【Authors】: Marc Najork

【Abstract】: Email is an essential communication medium for billions of people, with most users relying on web-based email services. Two recent trends are changing the email experience: smartphones have become the primary tool for accessing online services including email, and machine learning has come of age. Smartphones have a number of compelling properties (they are location-aware, usually with us, and allow us to record and share photos and videos), but they also have a few limitations, notably limited screen size and small and tedious virtual keyboards. Over the past few years, Google researchers and engineers have leveraged machine learning to ameliorate these weaknesses, and in the process created novel experiences. In this talk, I will give three examples of machine learning improving the email experience. The first example describes how we are improving email search. Displaying the most relevant results as the query is being typed is particularly useful on smartphones due to the aforementioned limitations. Combining hand-crafted and machine-learned rankers is powerful, but training learned rankers requires a relevance-labeled training set. User privacy prohibits us from employing raters to produce relevance labels. Instead, we leverage implicit feedback (namely clicks) provided by the users themselves. Using click logs as training data in a learning-to-rank setting is intriguing, since there is a vast and continuous supply of fresh training data. However, the click stream is biased towards queries that receive more clicks -- e.g. queries for which we already return the best result in the top-ranked position. I will summarize our work on neutralizing that bias. The second example describes how we extract key information from appointment and reservation emails and surface it at the appropriate time as a reminder on the user's smartphone. 
Our basic approach is to learn the templates that were used to generate these emails, use these templates to extract key information such as places, dates and times, store the extracted records in a personal information store, and surface them at the right time, taking contextual information such as estimated transit time into account. The third example describes Smart Reply, a system that offers a set of three short responses to those incoming emails for which a short response is appropriate, allowing users to respond quickly with just a few taps, without typing or involving voice-to-text transcription. The basic approach is to learn a model of likely short responses to original emails from the corpus, and then to apply the model whenever a new message arrives. Other considerations include offering a set of responses that are all appropriate and yet diverse, and triggering only when sufficiently confident that each response is of high quality and appropriate.

【Keywords】: email; information extraction; machine learning; ranking

94. Hashtag Recommendation for Enterprise Applications.

【Paper Link】【Pages】:893-902

【Authors】: Dhruv Mahajan ; Vishwajit Kolathur ; Chetan Bansal ; Suresh Parthasarathy ; Sundararajan Sellamanickam ; S. Sathiya Keerthi ; Johannes Gehrke

【Abstract】: Hashtags have been popularly used in several social cum consumer network settings such as Twitter and Facebook. In this paper, we consider the problem of recommending hashtags for enterprise applications. These applications include emails (e.g., Outlook), enterprise social networks (e.g., Yammer) and special interest group email lists. This problem arises in an organizational setting, and hashtags are enterprise domain specific. One important aspect of our recommendation system is that we recommend hashtags for the Inline hashtag scenario, where recommendations change as the user inserts hashtags while typing the message. This involves working with partial content information. Besides this, we consider the conventional Post hashtagging scenario, where hashtags are recommended for the full message. We also consider an important (sub)scenario, viz., Auto-complete, where hashtags are recommended with user-provided partial information such as a sub-string present in the hashtag. Auto-complete can be used with both Inline and Post scenarios. To the best of our knowledge, Inline and Auto-complete hashtag recommendations and hashtagging in enterprise applications have not been studied before. We propose to learn a joint model that uses features of three types, namely, temporal, structural and content. Our learning formulation handles all the hashtagging scenarios naturally. A comprehensive experimental study on five datasets of user email accounts collected by running an Outlook plugin (a key requirement for large scale industrial deployment), one dataset of a special interest group email list and one enterprise social network dataset shows that the proposed method performs significantly better than the state-of-the-art methods used in consumer applications such as Twitter. The primary reason is that different feature types play a dominant role in different scenarios and datasets. 
Since the joint model makes use of all feature types effectively, it performs better in almost all scenarios and datasets.

【Keywords】: auto-complete; enterprise application; hashtag recommendation; hashtags

95. Survival Analysis based Framework for Early Prediction of Student Dropouts.

【Paper Link】【Pages】:903-912

【Authors】: Sattar Ameri ; Mahtab Jahanbani Fard ; Ratna Babu Chinnam ; Chandan K. Reddy

【Abstract】: Retention of students at colleges and universities has been a concern among educators for many decades. The consequences of student attrition are significant for students, academic staff and the universities. Thus, increasing student retention is a long-term goal of any academic institution. The most vulnerable students are freshmen, who are at the highest risk of dropping out at the beginning of their study. Therefore, the early identification of "at-risk" students is a crucial task that needs to be effectively addressed. In this paper, we develop a survival analysis framework for early prediction of student dropout using the Cox proportional hazards regression model (Cox). We also apply time-dependent Cox (TD-Cox), which captures time-varying factors and can leverage that information to provide more accurate prediction of student dropout. For this prediction task, our model utilizes different groups of variables such as demographic, family background, financial, high school information, college enrollment and semester-wise credits. The proposed framework is able to predict both which students will drop out and the semester in which the dropout will occur. This study enables us to perform proactive interventions in a prioritized manner where limited academic resources are available. This is critical in the student retention problem because a focused intervention requires knowing not only whether a student is going to drop out but also when this is going to happen. We evaluate our method on real student data collected at Wayne State University. Results show that the proposed Cox-based framework can predict student dropouts and the semester of dropout with high accuracy and precision compared to other state-of-the-art methods.

【Keywords】: classification; event prediction; longitudinal data; regression; student retention; survival analysis
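
For reference, the standard textbook form of the Cox proportional hazards model and its time-dependent variant is given below. The symbols are assumptions not taken from the abstract: h_0(t) is the baseline hazard, x a vector of covariates (x(t) when they vary over time), and β the regression coefficients.

```latex
h(t \mid \mathbf{x}) = h_0(t)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right),
\qquad
h(t \mid \mathbf{x}(t)) = h_0(t)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}(t)\right)
```

In a dropout setting, t would plausibly index semesters and the hazard would describe the instantaneous risk of dropping out at t given survival up to that point; how the paper encodes its variables is not stated in the abstract.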

Session 5a: Sentiment and Opinion Mining 4

96. Generative Feature Language Models for Mining Implicit Features from Customer Reviews.

【Paper Link】【Pages】:929-938

【Authors】: Shubhra Kanti Karmaker Santu ; Parikshit Sondhi ; ChengXiang Zhai

【Abstract】: Online customer reviews are very useful both for helping consumers make buying decisions on products or services and for providing business intelligence. However, it is a challenge for people to manually digest all the opinions buried in large amounts of review data, raising the need for automatic opinion summarization and analysis. One fundamental challenge in automatic opinion summarization and analysis is to mine implicit features, i.e., recognizing the features implicitly mentioned (referred to) in a review sentence. Existing approaches require much ad hoc manual parameter tuning, and are thus hard to optimize or generalize; their evaluation has only been done with Chinese review data. In this paper, we propose a new approach based on generative feature language models that can mine the implicit features more effectively through unsupervised statistical learning. The parameters are optimized automatically using an Expectation-Maximization algorithm. We also created eight new data sets to facilitate evaluation of this task in English. Experimental results show that our proposed approach is very effective for assigning features to sentences that do not explicitly mention the features, and outperforms the existing algorithms by a large margin.

【Keywords】: implicit feature mining; language models; opinion analysis; review summarization

97. Data-Driven Contextual Valence Shifter Quantification for Multi-Theme Sentiment Analysis.

【Paper Link】【Pages】:939-948

【Authors】: Hongkun Yu ; Jingbo Shang ; Meichun Hsu ; Malú Castellanos ; Jiawei Han

【Abstract】: Users often write reviews on different themes involving linguistic structures with complex sentiments. The sentiment polarity of a word can be different across themes. Moreover, contextual valence shifters may change sentiment polarity depending on the contexts that they appear in. Both challenges cannot be modeled effectively and explicitly in traditional sentiment analysis. Studying both phenomena requires multi-theme sentiment analysis at the word level, which is very interesting but significantly more challenging than overall polarity classification. To simultaneously resolve the multi-theme and sentiment shifting problems, we propose a data-driven framework to enable both capabilities: (1) polarity predictions of the same word in reviews of different themes, and (2) discovery and quantification of contextual valence shifters. The framework formulates multi-theme sentiment by factorizing the review sentiments with theme/word embeddings and then derives the shifter effect learning problem as a logistic regression. The improvement of sentiment polarity classification accuracy demonstrates not only the importance of multi-theme and sentiment shifting, but also effectiveness of our framework. Human evaluations and case studies further show the success of multi-theme word sentiment predictions and automatic effect quantification of contextual valence shifters.

【Keywords】: multi-theme; sentiment analysis; sentiment shifting

98. Sentiment Domain Adaptation with Multi-Level Contextual Sentiment Knowledge.

【Paper Link】【Pages】:949-958

【Authors】: Fangzhao Wu ; Sixing Wu ; Yongfeng Huang ; Songfang Huang ; Yong Qin

【Abstract】: Sentiment domain adaptation is widely studied to tackle the domain-dependence problem in the sentiment analysis field. Existing domain adaptation methods usually train a sentiment classifier in a source domain and adapt it to the target domain using transfer learning techniques. However, when the sentiment feature distributions of the source and target domains are significantly different, the adaptation performance will heavily decline. In this paper, we propose a new sentiment domain adaptation approach by adapting the sentiment knowledge in general-purpose sentiment lexicons to a specific domain. Since the general sentiment words of general-purpose sentiment lexicons usually convey consistent sentiments in different domains, they have better generalization performance than the sentiment classifier trained in a source domain. In addition, we propose to extract various kinds of contextual sentiment knowledge from massive unlabeled samples in the target domain and formulate them as sentiment relations among sentiment expressions. It can propagate the sentiment information in general sentiment words to massive domain-specific sentiment expressions. Besides, we propose a unified framework to incorporate these different kinds of sentiment knowledge and learn an accurate domain-specific sentiment classifier for the target domain. Moreover, we propose an efficient optimization algorithm to solve the model of our approach. Extensive experiments on benchmark datasets validate the effectiveness and efficiency of our approach.

【Keywords】: domain adaptation; sentiment classification

【Paper Link】【Pages】:959-968

【Authors】: Dae Hoon Park ; Yi Fang ; Mengwen Liu ; ChengXiang Zhai

【Abstract】: People often implicitly or explicitly express their needs in social media in the form of "user status text". Such text can be very useful for service providers and product manufacturers to proactively provide relevant services or products that satisfy people's immediate needs. In this paper, we study how to infer a user's intent based on the user's "status text" and retrieve relevant mobile apps that may satisfy the user's needs. We address this problem by framing it as a new entity retrieval task where the query is a user's status text and the entities to be retrieved are mobile apps. We first propose a novel approach that generates a new representation for each query. Our key idea is to leverage social media to build parallel corpora that contain implicit intention text and the corresponding explicit intention text. Specifically, we model various user intentions in social media text using topic models, and we predict user intention in a query that contains implicit intention. Then, we retrieve relevant mobile apps with the predicted user intention. We evaluate the mobile app retrieval task using a new data set we create. Experiment results indicate that the proposed model is effective and outperforms the state-of-the-art retrieval models.

【Keywords】: app recommendation; implicit intent; mobile app retrieval

Session 5b: Time Series 4

100. Derivative Delay Embedding: Online Modeling of Streaming Time Series.

【Paper Link】【Pages】:969-978

【Authors】: Zhifei Zhang ; Yang Song ; Wei Wang ; Hairong Qi

【Abstract】: The staggering amount of streaming time series coming from the real world calls for more efficient and effective online modeling solutions. For time series modeling, most existing works make some unrealistic assumptions, such as that the input data is of fixed length or well aligned, which requires extra effort on segmentation or normalization of the raw streaming data. Although some works claim their approaches to be invariant to data length and misalignment, they are too time-consuming to model a streaming time series in an online manner. We propose a novel and more practical online modeling and classification scheme, DDE-MGM, which does not make any assumptions on the time series while maintaining high efficiency and state-of-the-art performance. The derivative delay embedding (DDE) is developed to incrementally transform time series to the embedding space, where the intrinsic characteristics of data are preserved as recursive patterns regardless of the stream length and misalignment. Then, a non-parametric Markov geographic model (MGM) is proposed to both model and classify the pattern in an online manner. Experimental results demonstrate the effectiveness and superior classification accuracy of the proposed DDE-MGM in an online setting as compared to the state-of-the-art.

【Keywords】: delay embedding; markov geographical model; online modeling and classification; streaming time series

101. PISA: An Index for Aggregating Big Time Series Data.

【Paper Link】【Pages】:979-988

【Authors】: Xiangdong Huang ; Jianmin Wang ; Raymond K. Wong ; Jinrui Zhang ; Chen Wang

【Abstract】: Aggregation operation plays an important role in time series database management. As the amount of data increases, current solutions such as summary table and MapReduce-based methods struggle to respond to such queries with low latency. Other approaches such as segment tree based methods have a poor insertion performance when the data size exceeds the available memory. This paper proposes a new segment tree based index called PISA, which has fast insertion performance and low latency for aggregation queries. PISA uses a forest to overcome the performance disadvantages of insertions in traditional segment trees. By defining two kinds of tags, namely code number and serial number, we propose an algorithm to accelerate queries by avoiding reading unnecessary data on disk. The index is stored on disk and only takes a few hundred bytes of memory for billions of data points. PISA can be easily implemented on both traditional databases and NoSQL systems, examples including MySQL and Cassandra. It handles aggregation queries within milliseconds on a commodity server for a time range that may contain tens of billions of data points.

【Keywords】: aggregation index; temporal data
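
As background for the abstract above, the following is a minimal sketch of a plain in-memory segment tree answering range-sum aggregation queries. It is not the PISA index: the forest structure, the code number and serial number tags, and the on-disk layout described in the abstract are not reproduced here, and the class and method names are hypothetical.

```python
class SegmentTree:
    """Plain in-memory segment tree over a fixed-length array (range sum)."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at positions n .. 2n-1; internal nodes at 1 .. n-1.
        self.tree = [0] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, pos, value):
        """Overwrite the data point at index pos and refresh its ancestors."""
        i = pos + self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def range_sum(self, lo, hi):
        """Sum of the data points with indices in the half-open range [lo, hi)."""
        total = 0
        left, right = lo + self.n, hi + self.n
        while left < right:
            if left & 1:        # left is a right child: take it and move past
                total += self.tree[left]
                left += 1
            if right & 1:       # right is exclusive: step back and take it
                right -= 1
                total += self.tree[right]
            left //= 2
            right //= 2
        return total


points = SegmentTree([3, 1, 4, 1, 5, 9, 2, 6])
print(points.range_sum(2, 6))
```

Each query touches O(log n) nodes instead of scanning the range, which is the property that segment tree based aggregation indexes build on.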

102. Multi-View Time Series Classification: A Discriminative Bilinear Projection Approach.

【Paper Link】【Pages】:989-998

【Authors】: Sheng Li ; Yaliang Li ; Yun Fu

【Abstract】: By virtue of the increasingly large amount of various sensors, information about the same object can be collected from multiple views. This mutually enriched information can help many real-world applications, such as daily activity recognition in which both video cameras and on-body sensors are continuously collecting information. Such multivariate time series (m.t.s.) data from multiple views can lead to a significant improvement of classification tasks. However, the existing methods for time series data classification only focus on single-view data, and the benefits of mutual-support multiple views are not taken into account. In light of this challenge, we propose a novel approach, named Multi-view Discriminative Bilinear Projections (MDBP), for extracting discriminative features from multi-view m.t.s. data. First, MDBP keeps the original temporal structure of m.t.s. data, and projects m.t.s. from different views onto a shared latent subspace. Second, MDBP incorporates discriminative information by minimizing the within-class separability and maximizing the between-class separability of m.t.s. in the shared latent subspace. Moreover, a Laplacian regularization term is designed to preserve the temporal smoothness within m.t.s. Extensive experiments on two real-world datasets demonstrate the effectiveness of our approach. Compared to the state-of-the-art multi-view learning and m.t.s. classification methods, our approach greatly improves the classification accuracy due to the full exploration of multi-view streaming data. Moreover, by using a feature fusion strategy, our approach further improves the classification accuracy by at least 10%.

【Keywords】: bilinear projections; discriminative regularization; multi-view learning; time series classification

103. Semi-Supervision Dramatically Improves Time Series Clustering under Dynamic Time Warping.

【Paper Link】【Pages】:999-1008

【Authors】: Hoang Anh Dau ; Nurjahan Begum ; Eamonn J. Keogh

【Abstract】: The research community seems to have converged in agreement that for time series classification problems, Dynamic Time Warping (DTW)-based nearest-neighbor classifiers are exceptionally hard to beat. Obtaining the best performance from DTW requires setting its only parameter, the warping window width (w). This is typically set by cross validation in the training stage. However, for clustering, by definition we do not have access to such labeled data. This issue seems to have been largely ignored in the literature, with many practitioners simply assuming that "the larger the better" for the value of w, and using as large a value of w as computational resources permit. In this work we show that this is a naive approach which in most circumstances produces inferior clusterings. To address this problem, we introduce a novel semi-supervised technique that allows us to set the best value of w. Unlike virtually all other semi-supervised techniques, our ideas are completely independent of the clustering algorithm used, and can be utilized to improve time series clustering under partitional, hierarchical, spectral or density-based clustering. Our approach requires very little human intervention; moreover, we show that in many cases, true human annotation efforts can be replaced with automatically-generated "pseudo" supervision information. We demonstrate our technique by testing with more than one hundred publicly available datasets.

【Keywords】: dynamic time warping; semi-supervised learning; time series
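
To make the role of the warping window width w in the abstract above concrete, here is a minimal sketch of DTW with a band constraint of w samples around the diagonal. It only shows how w changes the distance; the semi-supervised procedure for choosing w is the paper's contribution and is not reproduced. The function name and the squared-difference local cost are assumptions made for illustration.

```python
import math


def dtw_distance(a, b, w):
    """DTW distance between numeric sequences a and b, where the warping
    path may deviate from the diagonal by at most w samples."""
    n, m = len(a), len(b)
    w = max(w, abs(n - m))  # the band must at least cover the length difference
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])


a = [0, 0, 1, 0, 0]
b = [0, 1, 0, 0, 0]
print(dtw_distance(a, b, 0))  # w = 0 reduces to Euclidean distance
print(dtw_distance(a, b, 1))  # w = 1 lets the shifted peak align
```

With w = 0 the two sequences are compared point by point; with w = 1 the one-step shift of the peak is absorbed. A very large w can also align features that should not match, which is consistent with the abstract's warning against "the larger the better".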

Session 5c: Learning for Classification and Prediction 4

104. Model-Based Oversampling for Imbalanced Sequence Classification.

【Paper Link】【Pages】:1009-1018

【Authors】: Zhichen Gong ; Huanhuan Chen

【Abstract】: Sequence classification is critical in the data mining communities. It becomes more challenging when the class distribution is imbalanced, which occurs in many real-world applications. Oversampling algorithms try to re-balance the skewed class by generating synthetic data for minority classes, but most existing oversampling approaches cannot consider the temporal structure of sequences, or handle multivariate and long sequences. To address these problems, this paper proposes a novel oversampling algorithm based on the 'generative' models of sequences. In particular, a recurrent neural network was employed to learn the generative mechanics for sequences as representations for the corresponding sequences. These generative models are then utilized to form a kernel to capture the similarity between different sequences. Finally, oversampling is performed in the kernel feature space to generate synthetic data. The proposed approach can handle highly imbalanced sequential data and is robust to noise. The competitiveness of the proposed approach is demonstrated by experiments on both synthetic data and benchmark data, including univariate and multivariate sequences.

【Keywords】: imbalanced learning; model space; oversampling; sequence classification

105. CRISP: Consensus Regularized Selection based Prediction.

【Paper Link】【Pages】:1019-1028

【Authors】: Ping Wang ; Karthik K. Padthe ; Bhanukiran Vinzamuri ; Chandan K. Reddy

【Abstract】: Integrating regularization methods with standard loss functions such as the least squares, hinge loss, etc., within a regression framework has become a popular choice for researchers to learn predictive models with lower variance and better generalization ability. Regularizers also aid in building interpretable models with high-dimensional data which makes them very appealing. It is observed that each regularizer is uniquely formulated in order to capture data-specific properties such as correlation, structured sparsity and temporal smoothness. The problem of obtaining a consensus among such diverse regularizers while learning a predictive model is extremely important in order to determine the optimal regularizer for the problem. The advantage of such an approach is that it preserves the simplicity of the final model learned by selecting a single candidate model which is not the case with ensemble methods as they use multiple candidate models for prediction. This is called the consensus regularization problem which has not received much attention in the literature due to the inherent difficulty associated with learning and selecting a model from an integrated regularization framework. To solve this problem, in this paper, we propose a method to generate a committee of non-convex regularized linear regression models, and use a consensus criterion to determine the optimal model for prediction. Each corresponding non-convex optimization problem in the committee is solved efficiently using the cyclic-coordinate descent algorithm with the generalized thresholding operator. Our Consensus RegularIzation Selection based Prediction (CRISP) model is evaluated on electronic health records (EHRs) obtained from a large hospital for the congestive heart failure readmission prediction problem. We also evaluate our model on high-dimensional synthetic datasets to assess its performance. The results indicate that CRISP outperforms several state-of-the-art methods such as additive, interactions-based and other competing non-convex regularized linear regression methods.

【Keywords】: consensus prediction; regression; regularization

106. Regularizing Structured Classifier with Conditional Probabilistic Constraints for Semi-supervised Learning.

【Paper Link】【Pages】:1029-1038

【Authors】: Vincent Wenchen Zheng ; Kevin Chen-Chuan Chang

【Abstract】: Constraints have been shown to be an effective way to incorporate unlabeled data for semi-supervised structured classification. We recognize that constraints are often conditional and probabilistic; moreover, a constraint can have its condition depend on either just observations (which we call x-type constraint) or even hidden variables (which we call y-type constraint). We wish to design a constraint formulation that can flexibly model the constraint probability for both x-type and y-type constraints, and later use it to regularize general structured classifiers for semi-supervision. Surprisingly, none of the existing models have such a constraint formulation. Thus in this paper, we propose a new conditional probabilistic formulation for modeling both x-type and y-type constraints. We also recognize the inference complication for y-type constraint, and propose a systematic selective evaluation approach to efficiently realize the constraints. Finally, we evaluate our model in three applications, including named entity recognition, part-of-speech tagging and entity information extraction, with nine data sets in total. We show that our model is generally more accurate and efficient than the state-of-the-art baselines. Our code and data are available at https://bitbucket.org/vwz/cikm2016-cpf/.

【Keywords】: conditional probabilistic constraint; structured classifier

107. Scalability of Continuous Active Learning for Reliable High-Recall Text Classification.

【Paper Link】【Pages】:1039-1048

【Authors】: Gordon V. Cormack ; Maura R. Grossman

【Abstract】: For finite document collections, continuous active learning ('CAL') has been observed to achieve high recall with high probability, at a labeling cost asymptotically proportional to the number of relevant documents. As the size of the collection increases, the number of relevant documents typically increases as well, thereby limiting the applicability of CAL to low-prevalence high-stakes classes, such as evidence in legal proceedings, or security threats, where human effort proportional to the number of relevant documents is justified. We present a scalable version of CAL ('S-CAL') that requires O(log N) labeling effort and O(N log N) computational effort---where N is the number of unlabeled training examples---to construct a classifier whose effectiveness for a given labeling cost compares favorably with previously reported methods. At the same time, S-CAL offers calibrated estimates of class prevalence, recall, and precision, facilitating both threshold setting and determination of the adequacy of the classifier.

【Keywords】: cal; continuous active learning; ediscovery; electronic discovery; predictive coding; relevance feedback; tar; technology-assisted review; test collections; text categorization; volume estimation

Session 5d: Social Media 4

【Paper Link】【Pages】:1049-1058

【Authors】: Henry S. Vieira ; Altigran Soares da Silva ; Pável Calado ; Marco Cristo ; Edleno Silva de Moura

【Abstract】: Online social media has become an essential part of our life. This media is often characterized by its diverse content, which is produced by ordinary users. The potential to easily express ideas and opinions has made social media a source of valuable information on a variety of topics. In particular, information containing comments about consumer products has become prevalent. Here, we are interested in linking products mentioned in unstructured user-generated content, namely open discussion forums, to their respective entities in consumer product catalogs. Among the issues associated with this task, ambiguity is a particularly hard problem, as users typically refer to the same product using many different forms and different products may share the same form. We argue that this problem can be effectively solved using a set of evidences that can be easily extracted from social media content and product descriptions. To achieve this, we show which features should be used, how they can be extracted, and then how to combine them through machine learning techniques. Experiments in three different product categories and two different datasets demonstrate that all the sources of evidence here proposed are important, while contextual information is fundamental to achieve higher levels of precision. In fact, our method, although straightforward, was able to achieve an average improvement of 0.17 in precision and 0.13 in F1, when compared to the current state-of-the-art solution.

【Keywords】: product linking; social media; text mining

【Paper Link】【Pages】:1059-1068

【Authors】: Tuan-Anh Hoang ; Ee-Peng Lim

【Abstract】: In social media, the magnitude of information propagation hinges on the virality and susceptibility of users spreading and receiving the information respectively, as well as the virality of information items. These users' and items' behavioral factors evolve dynamically while interacting with one another. Previous works, however, measure the factors statically and independently in a restricted case: each user has only a single adoption on each item, and/or users' exposure to items is observable. In this work, we investigate the inter-relationship among the factors and users' multiple adoptions on items to propose both new static and temporal models for measuring the factors without requiring user-item exposure. These models are designed to cope with even more realistic propagation scenarios where an item may be propagated many times from the same user(s) to the same other user(s). We further propose an incremental model for measuring the factors in large data streams. We evaluated the proposed models and existing models through extensive experiments on a large Twitter dataset covering information propagation in one month. The experiments show that our proposed models can effectively mine the behavioral factors and outperform the existing ones in a propagation prediction task. The incremental model is shown to be more than 10 times faster than the temporal model while still obtaining very similar results.

【Keywords】: information propagation; susceptibility; user behavior; virality

110. Feature Driven and Point Process Approaches for Popularity Prediction.

【Paper Link】【Pages】:1069-1078

【Authors】: Swapnil Mishra ; Marian-Andrei Rizoiu ; Lexing Xie

【Abstract】: Predicting popularity, or the total volume of information outbreaks, is an important subproblem for understanding collective behavior in networks. Each of the two main types of recent approaches to the problem, feature-driven and generative models, has desired qualities and clear limitations. This paper bridges the gap between these solutions with a new hybrid approach and a new performance benchmark. We model each social cascade with a marked Hawkes self-exciting point process, and estimate the content virality, memory decay, and user influence. We then learn a predictive layer for popularity prediction using a collection of cascade history. To our surprise, the Hawkes process with a predictive overlay outperforms recent feature-driven and generative approaches on existing tweet data [44] and a new public benchmark on news tweets. We also found that a basic set of user features and event time summary statistics performs competitively in both classification and regression tasks, and that adding point process information to the feature set further improves predictions. From these observations, we argue that future work on popularity prediction should compare across feature-driven and generative modeling approaches in both classification and regression tasks.

【Keywords】: cascade prediction; information diffusion; self-exciting point process; social media

111. Adaptive Evolutionary Filtering in Real-Time Twitter Stream.

【Paper Link】【Pages】:1079-1088

【Authors】: Feifan Fan ; Yansong Feng ; Lili Yao ; Dongyan Zhao

【Abstract】: With the explosive growth of microblogging services, Twitter has become a leading platform for real-time, worldwide information. Users tend to explore breaking news or general topics on Twitter according to their interests. However, the enormous volume of incoming tweets leads to information overload. Therefore, filtering interesting tweets from the real-time stream based on users' interest profiles can help users easily access the relevant and key information hidden among the tweets. On the other hand, the real-time Twitter stream contains an enormous amount of noisy and redundant tweets. Hence, the filtering process should consider previously pushed interesting tweets to provide users with diverse tweets. Moreover, unlike traditional document summarization methods, which focus on static datasets, the Twitter stream is dynamic, fast-arriving and large-scale, which means we have to decide as early as possible whether to filter an incoming tweet from the real-time stream for users. In this paper, we propose a novel adaptive evolutionary filtering framework to push interesting tweets to users from the real-time Twitter stream. First, we propose an adaptive evolutionary filtering algorithm to filter interesting tweets from the Twitter stream with respect to user interest profiles. We then utilize the maximal marginal relevance model in a fixed time window to estimate the relevance and diversity of potential tweets. In addition, to overcome the enormous number of redundant tweets and characterize the diversity of potential tweets, we propose a hierarchical tweet representation learning model (HTM) to learn the tweet representations dynamically over time. Experiments on large-scale real-time Twitter stream datasets demonstrate the efficiency and effectiveness of our framework.

【Keywords】: evolutionary filtering; timeline; twitter stream
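
The abstract above mentions the maximal marginal relevance (MMR) model for balancing relevance and diversity. The sketch below shows the classic greedy MMR selection over plain vectors, assuming cosine similarity and a trade-off weight `lam`; these choices, the function names, and the toy vectors are illustrative assumptions, not the paper's windowed formulation or its learned tweet representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_select(profile, candidates, k, lam=0.7):
    """Greedily pick k candidate indices, scoring each candidate by
    lam * relevance to the profile - (1 - lam) * max similarity to picks so far."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(candidates[i], profile)
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


profile = [1.0, 0.3]                      # hypothetical user interest vector
tweets = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
print(mmr_select(profile, tweets, k=2, lam=0.5))
```

With `lam=1.0` the selection reduces to ranking by relevance alone, so near-duplicate items can both be pushed; lowering `lam` penalizes candidates that resemble items already selected.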

Session 5e: Queries and Search 4

112. Multiple Queries as Bandit Arms.

【Paper Link】【Pages】:1089-1098

【Authors】: Cheng Li ; Paul Resnick ; Qiaozhu Mei

【Abstract】: Existing retrieval systems rely on a single active query to pull documents from the index. Relevance feedback may be used to iteratively refine the query, but only one query is active at a time. If the user's information need has multiple aspects, the query must represent the union of these aspects. We consider a new paradigm of retrieval where multiple queries are kept "active" simultaneously. In the presence of rate limits, the active queries take turns accessing the index to retrieve another "page" of results. Turns are assigned by a multi-armed bandit based on user feedback. This allows the system to explore which queries return more relevant results and to exploit the best ones. In empirical tests, query pools outperform solo, combined queries. Significant improvement is observed both when the subtopic queries are known in advance and when the queries are generated in a user-interactive process.

【Keywords】: multi-armed bandits; query pooling
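
The turn-taking idea in the abstract above can be sketched with a standard UCB1 bandit, where each active query is an arm and the reward is the fraction of relevant results on the page it returns. The abstract does not say which bandit algorithm the authors use, so UCB1, the function names, and the simulated feedback below are all assumptions made for illustration.

```python
import math


def ucb1_schedule(queries, fetch_page, rounds):
    """Assign `rounds` index accesses among the active queries with UCB1.

    fetch_page(query, page_no) is a hypothetical callback that retrieves the
    next page for a query and returns the fraction of relevant results in [0, 1].
    """
    pulls = {q: 0 for q in queries}
    reward = {q: 0.0 for q in queries}
    order = []
    for t in range(1, rounds + 1):
        untried = [q for q in queries if pulls[q] == 0]
        if untried:
            chosen = untried[0]            # give every query one turn first
        else:
            chosen = max(
                queries,
                key=lambda arm: reward[arm] / pulls[arm]
                + math.sqrt(2 * math.log(t) / pulls[arm]),
            )
        reward[chosen] += fetch_page(chosen, pulls[chosen])
        pulls[chosen] += 1
        order.append(chosen)
    return order, pulls


# Simulated feedback: one subtopic query is far more productive than the other.
rates = {"solar panels cost": 0.9, "solar panels history": 0.1}
order, pulls = ucb1_schedule(list(rates), lambda q, page: rates[q], rounds=50)
print(pulls)
```

The exploration bonus shrinks as a query accumulates turns, so the less productive query is still revisited occasionally while most index accesses go to the query that keeps returning relevant pages.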

113. Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search.

【Paper Link】【Pages】:1099-1108

【Authors】: Leonid Boytsov ; David Novak ; Yury Malkov ; Eric Nyberg

【Abstract】: Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.

【Keywords】: ibm model 1; k-nn search; lsh; non-metric spaces

114. Scalability and Total Recall with Fast CoveringLSH.

【Paper Link】【Pages】:1109-1118

【Authors】: Ninh Pham ; Rasmus Pagh

【Abstract】: Locality-sensitive hashing (LSH) has emerged as the dominant algorithmic technique for similarity search with strong performance guarantees in high-dimensional spaces. A drawback of traditional LSH schemes is that they may have false negatives, i.e., the recall is less than 100%. This limits the applicability of LSH in settings requiring precise performance guarantees. Building on the recent theoretical "CoveringLSH" construction that eliminates false negatives, we propose a fast and practical covering LSH scheme for Hamming space called Fast CoveringLSH (fcLSH). Inheriting the design benefits of CoveringLSH, our method avoids false negatives and always reports all near neighbors. Compared to CoveringLSH, we achieve an asymptotic improvement to the hash function computation time from O(dL) to O(d + L log L), where d is the dimensionality of data and L is the number of hash tables. Our experiments on synthetic and real-world data sets demonstrate that fcLSH is comparable (and often superior) to traditional hashing-based approaches for search radius up to 20 in high-dimensional Hamming space.

【Keywords】: fast hadamard transform; lsh; near neighbor search; total recall

115. Query-Biased Partitioning for Selective Search.

【Paper Link】【Pages】:1119-1128

【Authors】: Zhuyun Dai ; Chenyan Xiong ; Jamie Callan

【Abstract】: Selective search is a cluster-based distributed retrieval architecture that reduces computational costs by partitioning a corpus into topical shards, and selectively searching them. Prior research formed topical shards by clustering the corpus based on the documents' contents. This content-based partitioning strategy reveals common topics in a corpus. However, the topic distribution produced by clustering may not match the distribution of topics in search traffic, which may reduce the effectiveness of selective search. This paper presents a query-biased partitioning strategy that aligns document partitions with topics from query logs. It focuses on two parts of the partitioning process: clustering initialization and document similarity calculation. A query-driven clustering initialization algorithm uses topics from query logs to form cluster seeds. A query-biased similarity metric favors terms that are important in query logs. Both methods boost retrieval effectiveness, reduce variance, and produce a more balanced distribution of shard sizes.

【Keywords】: distributed retrieval; selective search; shard partitioning

Session 5f: Industry Session V 2

116. Characterizing Diseases from Unstructured Text: A Vocabulary Driven Word2vec Approach.

【Paper Link】【Pages】:1129-1138

【Authors】: Saurav Ghosh ; Prithwish Chakraborty ; Emily Cohn ; John S. Brownstein ; Naren Ramakrishnan

【Abstract】: Traditional disease surveillance can be augmented with a wide variety of real-time sources such as news and social media. However, these sources are in general unstructured, and construction of surveillance tools such as taxonomical correlations and trace mapping involves considerable human supervision. In this paper, we motivate a disease vocabulary driven word2vec model (Dis2Vec) to model diseases and constituent attributes as word embeddings from the HealthMap news corpus. We use these word embeddings to automatically create disease taxonomies and evaluate our model against corresponding human annotated taxonomies. We compare our model accuracies against several state-of-the-art word2vec methods. Our results demonstrate that Dis2Vec outperforms traditional distributed vector representations in its ability to faithfully capture taxonomical attributes across different classes of diseases such as endemic, emerging and rare.

【Keywords】: application-specific word embeddings; disease characterization; emerging diseases; healthmap; rare diseases

117. Network-Efficient Distributed Word2vec Training System for Large Vocabularies.

【Paper Link】【Pages】:1139-1148

【Authors】: Erik Ordentlich ; Lee Yang ; Andy Feng ; Peter Cnudde ; Mihajlo Grbovic ; Nemanja Djuric ; Vladan Radosavljevic ; Gavin Owens

【Abstract】: Word2vec is a popular family of algorithms for unsupervised training of dense vector representations of words on large text corpuses. The resulting vectors have been shown to capture semantic relationships among their corresponding words, and have shown promise in reducing a number of natural language processing (NLP) tasks to mathematical operations on these vectors. While heretofore applications of word2vec have centered around vocabularies with a few million words, wherein the vocabulary is the set of words for which vectors are simultaneously trained, novel applications are emerging in areas outside of NLP with vocabularies comprising several 100 million words. Existing word2vec training systems are impractical for training such large vocabularies as they either require that the vectors of all vocabulary words be stored in the memory of a single server or suffer unacceptable training latency due to massive network data transfer. In this paper, we present a novel distributed, parallel training system that enables unprecedented practical training of vectors for vocabularies with several 100 million words on a shared cluster of commodity servers, using far less network traffic than the existing solutions. We evaluate the proposed system on a benchmark data set, showing that the quality of vectors does not degrade relative to non-distributed training. Finally, for several quarters, the system has been deployed for the purpose of matching queries to ads in Gemini, the sponsored search advertising platform at Yahoo, resulting in significant improvement of business metrics.

【Keywords】: distributed systems; distributed training; hadoop; parameter server; spark; word embeddings; word2vec

Keynote Address 3 1

118. A Personal Perspective and Retrospective on Web Search Technology.

【Paper Link】【Pages】:1149

【Authors】: Andrei Z. Broder

【Abstract】: This talk is a review of some Web research and predictions that I co-authored over the last two decades: both what turned out gratifyingly right and what turned out embarrassingly wrong. Topics will include near-duplicates, the Web graph, query intent, inverted indices efficiency, and others. While this seems a completely idiosyncratic collection there are in fact concealed connections that offer good clues to the big question: what will happen next?

【Keywords】: keynote

Session 6a: Learning Algorithms 4

119. Scalable Spectral k-Support Norm Regularization for Robust Low Rank Subspace Learning.

【Paper Link】【Pages】:1151-1160

【Authors】: Yiu-ming Cheung ; Jian Lou

【Abstract】: As a fundamental tool in the fields of data mining and computer vision, robust low rank subspace learning aims to recover a low rank matrix under gross corruptions that are often modeled by another sparse matrix. Within this setting, we investigate the spectral k-support norm, a more appealing convex relaxation than the popular nuclear norm, as a low rank penalty. Despite its better recovery performance, the spectral k-support norm makes the model difficult to optimize efficiently, which severely limits its scalability in practice. Therefore, this paper proposes a scalable and efficient algorithm that works on the dual objective of the original problem, which can take advantage of the more computationally efficient linear oracle of the spectral k-support norm. Further, by studying the sub-gradient of the loss of the dual objective, a line-search strategy is adopted in the algorithm to enable it to adapt to the Hölder smoothness. Experiments on various tasks demonstrate the superior prediction performance and computation efficiency of the proposed algorithm.

【Keywords】: conditional gradient; robust low rank subspace learning; spectral k-support norm

120. Online Adaptive Passive-Aggressive Methods for Non-Negative Matrix Factorization and Its Applications.

【Paper Link】【Pages】:1161-1170

【Authors】: Chenghao Liu ; Steven C. H. Hoi ; Peilin Zhao ; Jianling Sun ; Ee-Peng Lim

【Abstract】: This paper aims to investigate efficient and scalable machine learning algorithms for solving Non-negative Matrix Factorization (NMF), which is important for many real-world applications, particularly for collaborative filtering and recommender systems. Unlike traditional batch learning methods, a recently proposed online learning technique named "NN-PA" tackles NMF by applying the popular Passive-Aggressive (PA) online learning, and has shown promising results. Despite its simplicity and high efficiency, NN-PA suffers from at least two critical limitations: (i) it only exploits the first-order information and thus may converge slowly, especially at the beginning of online learning tasks; (ii) it is sensitive to some key parameters which are often difficult to tune manually, particularly in a practical online learning system. In this work, we present a novel family of online Adaptive Passive-Aggressive (APA) learning algorithms for NMF, named "NN-APA", which overcomes these two limitations of NN-PA by (i) exploiting second-order information to enhance PA in making more informative updates at each iteration; and (ii) achieving parameter auto-selection by exploring the idea of online learning with expert advice in deciding the optimal combination of the key parameters in NMF. We theoretically analyze the regret bounds of the proposed method and show its advantage over the state-of-the-art NN-PA method, and further validate the efficacy and scalability of the proposed technique through an extensive set of experiments on a variety of large-scale real recommender systems datasets.

【Keywords】: adaptive regularization; learning with expert advice; non-negative matrix factorization; online learning

121. aptMTVL: Nailing Interactions in Multi-Task Multi-View Multi-Label Learning using Adaptive-basis Multilinear Factor Analyzers.

【Paper Link】【Pages】:1171-1180

【Authors】: Xiaoli Li ; Jun Huan

【Abstract】: We investigate a new direction of multi-task multi-view learning where we have data sets with multiple tasks, multiple views and multiple labels. We call this problem a multi-task multi-view multi-label learning problem, or MTVL learning for short. MTVL learning has a wide range of applications, including Internet of Things, brain science, and document classification. In designing effective MTVL learning algorithms, we hypothesize that a key component is to "disentangle" interactions among tasks, views, and labels, or the task-view-label interactions. For that purpose we have developed adaptive-basis multilinear factor analyzers (aptMLFA) that utilize a loading tensor to modulate interactions among multiple latent factors. With aptMLFA we designed a new MTVL learning algorithm, aptMTVL, and evaluated its performance on 3 real-world data sets. The experimental results demonstrated the effectiveness of our proposed method as compared to the state-of-the-art MTVL learning algorithm.

【Keywords】: multi-task multi-view learning; tensor analyzers

122. An Adaptive Framework for Multistream Classification.

【Paper Link】【Pages】:1181-1190

【Authors】: Swarup Chandra ; Ahsanul Haque ; Latifur Khan ; Charu C. Aggarwal

【Abstract】: A typical data stream classification involves predicting the label of data instances generated from a non-stationary process. Studies in the past decade have focused on this problem setting to address various challenges such as concept drift and concept evolution. Most techniques assume availability of class labels associated with unlabeled data instances, soon after label prediction, for further training and drift detection. Moreover, training and test data distributions are assumed to be similar. These assumptions are not always true in practice. For instance, a semi-supervised setting that aims to utilize only a fraction of labels may induce bias during data selection. Consequently, the resulting data distribution of training and test instances may differ. In this paper, we present a novel stream classification problem setting involving two independent non-stationary data generating processes, relaxing the above assumptions. A source stream continuously generates labeled data instances whose distribution is biased compared to that of a target stream which generates unlabeled data instances from the same domain. The problem, which we call Multistream Classification, is to predict the class labels of data instances in the target stream, while utilizing labels available on the source stream. Since concept drift can occur asynchronously on these two streams, we design an adaptive framework that uses a technique for supervised concept drift detection in the biased source stream, and unsupervised concept drift detection in the target stream. A weighted ensemble of classifiers is updated after each drift detection on either stream, while utilizing a bias correction mechanism that leverages source information to predict labels of target instances whenever necessary. We empirically evaluate the multistream classifier's performance on both real-world and synthetic datasets, while comparing with various baseline methods and its variants.

【Keywords】: classification; concept drift; covariate shift; data stream

Session 6b: Databases and Data Processing 4

123. Optimizing Update Frequencies for Decaying Information.

【Paper Link】【Pages】:1191-1200

【Authors】: Simon Razniewski

【Abstract】: Many kinds of information, e.g., addresses, crawls of webpages, or academic affiliations, are prone to becoming outdated over time. Therefore, if data quality shall be maintained over time, often periodical refreshing is done. As refreshing data usually has a cost, for instance computation time, network bandwidth or human work time, a problem is to find the right update frequency depending on the benefit gained from the information and on the speed with which the information is expected to get outdated. This is especially important since often entities exhibit a different speed of getting outdated, e.g., addresses of students change more frequently than addresses of retirees, or news portals change more frequently than homepages. Consequently, there is no uniform best update frequency for all entities. Previous work on data freshness has investigated how to best distribute a fixed number of updates among entities, in order to maximize average freshness. For businesses that are able to adapt their resources, another question is to determine the number of updates that optimizes the income derived from the data. In this paper we present a model for describing the relationship between update frequency and income derived from data, present solutions for calculating the optimal update frequency for two common classes of functions for describing decay behaviour, and validate the benefits of our framework.

【Keywords】: data currency; data quality; information decay

【Paper Link】【Pages】:1201-1210

【Authors】: Paris Carbone ; Jonas Traub ; Asterios Katsifodimos ; Seif Haridi ; Volker Markl

【Abstract】: Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was extensively studied in the past through the use of aggregate sharing techniques such as Panes and Pairs, little to no work has been put into optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions, which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all. In this paper we present a technique to perform efficient aggregate sharing for data stream windows, which are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating orders of magnitude of reduction in aggregation costs compared to the state of the art.

【Keywords】: data stream aggregation; data stream optimisation; data stream processing; data stream windows; data streams; data structures; databases; functional programming; operator sharing; programming models; user-defined functions
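
【Editor's sketch】: The abstract above names Panes as a prior aggregate sharing technique for periodic windows. The following is a minimal illustration of that general idea only, not of Cutty or of any code from the paper. It assumes count-based sliding windows over an in-memory list and a sum aggregate; the function name `sliding_sums_with_panes` is hypothetical. Each value is summed once into a pane of size gcd(window, slide), and every window result is then assembled from whole panes.

```python
from math import gcd


def sliding_sums_with_panes(values, window, slide):
    """Sum over count-based sliding windows, sharing work through panes.

    Assumption: windows start at positions 0, slide, 2*slide, ... and only
    complete windows are reported.
    """
    pane = gcd(window, slide)  # largest unit that tiles both window and slide
    # One partial aggregate per complete pane; each input value is read once.
    pane_sums = [
        sum(values[i:i + pane])
        for i in range(0, len(values) - pane + 1, pane)
    ]
    panes_per_window = window // pane
    panes_per_slide = slide // pane
    # Each window is the combination of consecutive pane aggregates.
    return [
        sum(pane_sums[j:j + panes_per_window])
        for j in range(0, len(pane_sums) - panes_per_window + 1, panes_per_slide)
    ]


def sliding_sums_naive(values, window, slide):
    """Reference implementation that re-reads every value in every window."""
    return [
        sum(values[s:s + window])
        for s in range(0, len(values) - window + 1, slide)
    ]
```

Both functions return the same list; the pane version touches each input value once and then combines window/pane partial sums per window, which is the sharing that non-periodic windows cannot rely on directly.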

125. Relational Database Schema Design for Uncertain Data.

【Paper Link】【Pages】:1211-1220

【Authors】: Sebastian Link ; Henri Prade

【Abstract】: We investigate the impact of uncertainty on relational database schema design. Uncertainty is modeled qualitatively by assigning to tuples a degree of possibility with which they occur, and assigning to functional dependencies a degree of certainty which says to which tuples they apply. A design theory is developed for possibilistic functional dependencies, including efficient axiomatic and algorithmic characterizations of their implication problem. Naturally, the possibility degrees of tuples result in a scale of different degrees of data redundancy. Scaled versions of the classical syntactic Boyce-Codd and Third Normal Forms are established and semantically justified in terms of avoiding data redundancy of different degrees. Classical decomposition and synthesis techniques are scaled as well. Therefore, possibilistic functional dependencies do not just enable designers to control the levels of data integrity and losslessness targeted but also to balance the classical trade-off between query and update efficiency. Extensive experiments confirm the efficiency of our framework and provide original insight into relational schema design.

【Keywords】: axioms; boyce-codd normal form; data redundancy; implication problem; possibility theory; third normal form

【Paper Link】【Pages】:1221-1230

【Authors】: Shengyu Huang ; K. Selçuk Candan ; Maria Luisa Sapino

【Abstract】: With many applications relying on multi-dimensional datasets for decision making, tensors (or multi-dimensional arrays) are emerging as a popular data representation to support diverse types of data, such as sensor streams and social networks. Consequently, tensor decomposition forms the basis for many data analysis and knowledge discovery tasks, from clustering, trend detection, and anomaly detection to correlation analysis. In applications where data evolves over time and the tensor-based analysis results need to be continuously maintained, re-computation of the whole tensor decomposition with each update will cause high computational costs and incur large memory overheads. In this paper, we propose a two-phase block-incremental CP-based tensor decomposition technique, BICP, that efficiently and effectively maintains tensor decomposition results in the presence of dynamically evolving tensor data. In its first phase, instead of repeatedly conducting ALS on each sub-tensor, BICP only revises the decompositions of the tensors that contain updated data. Moreover, when updates are relatively small with respect to the block size, BICP relies on incremental factor tracking to avoid re-decomposing the updated sub-tensor. In its second phase, BICP limits the block-centric refinement process to only those blocks that are critical given the update. Experiment results show that the proposed method significantly reduces the execution time while assuring high accuracy.

【Keywords】: cp decomposition; incremental analysis; tensor decomposition; tensors

Session 6c: Large Graph Processing 4

127. Topological Graph Sketching for Incremental and Scalable Analytics.

【Paper Link】【Pages】:1231-1240

【Authors】: Bortik Bandyopadhyay ; David Fuhry ; Aniket Chakrabarti ; Srinivasan Parthasarathy

【Abstract】: We propose a novel, scalable, and principled graph sketching technique based on minwise hashing of local neighborhoods. For an n-node graph with e edges (e >> n), we incrementally maintain in real time a minwise neighbor sampled subgraph using k hash functions in O(n × k) memory, the limit being user-configurable through the parameter k. Symmetrization and similarity based techniques can recover from these data structures a significant portion of the original graph. We present theoretical analysis of the minwise sampling strategy and also derive unbiased estimators for important graph properties such as triangle count and neighborhood overlap. We perform an extensive empirical evaluation of our graph sketch and its derivatives on a wide variety of real-world graph data sets drawn from different application domains using important large network analysis algorithms: local and global clustering coefficient, PageRank, and local graph sparsification. With bounded memory, the quality of results using the sketch representation is competitive against baselines which use the full graph, and the computational performance is often better. Our framework is flexible and configurable to be leveraged by numerous other graph analytics algorithms, potentially reducing the information mining time on large streamed graphs for a variety of applications.

【Keywords】: graph sketching; minwise hashing; streaming graph
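
【Editor's sketch】: To make the O(n × k) bound in the abstract above concrete, here is a minimal illustration of minwise neighbor sampling over an edge stream. It is a sketch under stated assumptions, not the authors' implementation: the class name `MinwiseNeighborSketch`, the seeded BLAKE2b hash standing in for the k hash functions, and the `sampled_edges` helper are all hypothetical choices made for this example. Each node keeps, for each of the k hash functions, only the neighbor with the smallest hash value seen so far.

```python
import hashlib


def _hash(seed, node):
    """Deterministic 64-bit hash of a node id under a given seed (assumed scheme)."""
    data = f"{seed}:{node}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class MinwiseNeighborSketch:
    """Keeps at most k sampled neighbors per node, so memory is O(n * k)."""

    def __init__(self, k):
        self.k = k
        self.slots = {}  # node -> list of k entries, each (hash value, neighbor) or None

    def _offer(self, u, v):
        slots = self.slots.setdefault(u, [None] * self.k)
        for i in range(self.k):
            hv = _hash(i, v)
            # Keep the neighbor with the minimum hash under hash function i.
            if slots[i] is None or hv < slots[i][0]:
                slots[i] = (hv, v)

    def add_edge(self, u, v):
        """Process one undirected edge from the stream."""
        self._offer(u, v)
        self._offer(v, u)

    def sampled_edges(self):
        """Union of the per-node samples, viewed as undirected edges."""
        edges = set()
        for u, slots in self.slots.items():
            for entry in slots:
                if entry is not None:
                    edges.add(frozenset((u, entry[1])))
        return edges
```

Because a minimum does not depend on arrival order, the sketch is the same for any ordering of the edge stream. Taking the union of both endpoints' samples is a simple form of the symmetrization the abstract mentions: an edge dropped by a high-degree node can still be retained by its low-degree neighbor.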

128. Querying Minimal Steiner Maximum-Connected Subgraphs in Large Graphs.

【Paper Link】【Pages】:1241-1250

【Authors】: Jiafeng Hu ; Xiaowei Wu ; Reynold Cheng ; Siqiang Luo ; Yixiang Fang

【Abstract】: Given a graph G and a set Q of query nodes, we examine the Steiner Maximum-Connected Subgraph (SMCS). The SMCS, or G's induced subgraph that contains Q with the largest connectivity, can be useful for customer prediction, product promotion, and team assembling. Despite its importance, the SMCS problem has only been recently studied. Existing solutions evaluate the maximum SMCS, whose number of nodes is the largest among all the SMCSs of Q. However, the maximum SMCS, which may contain a lot of nodes, can be difficult to interpret. In this paper, we investigate the minimal SMCS, which is the minimal subgraph of G with the maximum connectivity containing Q. The minimal SMCS contains far fewer nodes than its maximum counterpart, and is thus easier to understand. However, the minimal SMCS can be costly to evaluate. We thus propose efficient Expand-Refine algorithms, as well as their approximate versions with accuracy guarantees. Extensive experiments on six large real graph datasets validate the effectiveness and efficiency of our approaches.

【Keywords】: community search; edge connectivity; graph query; steiner maximum-connected subgraph search

129. Efficient Estimation of Triangles in Very Large Graphs.

【Paper Link】【Pages】:1251-1260

【Authors】: Roohollah Etemadi ; Jianguo Lu ; Yung H. Tsin

【Abstract】: The number of triangles in a graph is an important metric for understanding the graph. It is also directly related to the clustering coefficient of a graph, which is one of the most important indicators for social networks. Counting the number of triangles is computationally expensive for very large graphs. Hence, estimation is necessary for large graphs, particularly for graphs that are hidden behind searchable interfaces where the graphs in their entirety are not available. For instance, user networks in Twitter and Facebook are not available for third parties to explore their properties directly. This paper proposes a new method to estimate the number of triangles based on random edge sampling. It improves the traditional random edge sampling by probing the edges that have a higher probability of forming triangles. The method outperforms the traditional method consistently, and can be better by orders of magnitude when the graph is very large. The result is demonstrated on 20 graphs, including the largest graphs we can find. More importantly, we proved the improvement ratio, and verified our result on all the datasets. The analytical results are achieved by simplifying the variances of the estimators based on the assumption that the graph is very large. We believe that such a big data assumption can lead to interesting results not only in triangle estimation, but also in other sampling problems.

【Keywords】: clustering coefficient; estimation; graph algorithms; graph sampling; triangles
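
【Editor's sketch】: For orientation, the "traditional random edge sampling" that the abstract above takes as its baseline can be illustrated as follows. This is a textbook-style sketch, not the improved method proposed in the paper. It assumes a simple undirected graph given as a list of distinct edges with comparable node ids; the function names are hypothetical. Each edge is kept independently with probability p, so a triangle survives with probability p^3, and dividing the sampled count by p^3 gives an unbiased estimate.

```python
import random


def count_triangles(edges):
    """Exact triangle count of a simple undirected graph given as an edge list."""
    adj = {}
    unique = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        unique.add((u, v) if u <= v else (v, u))
    # Every triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in unique) // 3


def estimate_triangles(edges, p, seed=0):
    """Estimate the triangle count from a uniform edge sample (assumes 0 < p <= 1)."""
    rng = random.Random(seed)
    sample = [e for e in edges if rng.random() < p]
    return count_triangles(sample) / p ** 3
```

With p = 1 the estimate equals the exact count. For small p the variance grows quickly because few sampled triples close into triangles, which is the weakness that probing edges more likely to form triangles is meant to address.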

130. Efficient Batch Processing for Multiple Keyword Queries on Graph Data.

【Paper Link】【Pages】:1261-1270

【Authors】: Lu Chen ; Chengfei Liu ; Xiaochun Yang ; Bin Wang ; Jianxin Li ; Rui Zhou

【Abstract】: Recently, answering keyword queries on graph data has drawn a great deal of attention from database communities. However, most graph keyword search solutions proposed so far primarily focus on a single query setting. We observe that for a popular keyword query system, the number of keyword queries received could be substantially large even in a short time interval, and the chance that these queries share common keywords is quite high. Therefore, answering keyword queries in batches would significantly enhance the performance of the system. Motivated by this, this paper studies efficient batch processing for multiple keyword queries on graph data. Realizing that finding both the optimal query plan for multiple queries and the optimal query plan for a single keyword query on graph data are computationally hard, we first propose two heuristic approaches which target maximizing keyword overlap and give preference to processing keywords with short sizes. Then we devise a cardinality based cost estimation model that takes both graph data statistics and search semantics into account. Based on the model, we design an A* based algorithm to find the global optimal execution plan for multiple queries. We evaluate the proposed model and algorithms on two real datasets and the experimental results demonstrate their efficacy.

【Keywords】: batch processing; graph; keyword query

Session 6d: Information Retrieval II 4

131. Supervised Robust Discrete Multimodal Hashing for Cross-Media Retrieval.

【Paper Link】【Pages】:1271-1280

【Authors】: Ting-Kun Yan ; Xin-Shun Xu ; Shanqing Guo ; Zi Huang ; Xiaolin Wang

【Abstract】: Recently, multimodal hashing techniques have received considerable attention due to their low storage cost and fast query speed for multimodal data retrieval. Many methods have been proposed; however, there are still some problems that need to be further considered. For example, some of these methods just use a similarity matrix for learning hash functions, which will discard some useful information contained in the original data; some of them relax binary constraints or separate the process of learning hash functions and binary codes into two independent stages to bypass the obstacle of handling the discrete constraints on binary codes for optimization, which may generate large quantization error; some of them are not robust to noise. All these problems may degrade the performance of a model. To address these problems, in this paper, we propose a novel supervised hashing framework for cross-modal retrieval, i.e., Supervised Robust Discrete Multimodal Hashing (SRDMH). Specifically, SRDMH tries to make the final binary codes preserve the same label information as that in the original data so that it can leverage more label information to supervise the binary codes learning. In addition, it learns hashing functions and binary codes directly instead of relaxing the binary constraints so as to avoid the large quantization error problem. Moreover, to make it robust and easy to solve, we further integrate a flexible l2,p loss with nonlinear kernel embedding and an intermediate presentation of each instance. Finally, an alternating algorithm is proposed to solve the optimization problem in SRDMH. Extensive experiments are conducted on three benchmark data sets. The results demonstrate that the proposed method (SRDMH) outperforms or is comparable to several state-of-the-art methods for cross-modal retrieval tasks.

【Keywords】: approximate nearest neighbor search; cross-media retrieval; discrete hashing; learning to hash; multimodal hashing

132. Word Vector Compositionality based Relevance Feedback using Kernel Density Estimation.

【Paper Link】【Pages】:1281-1290

【Authors】: Dwaipayan Roy ; Debasis Ganguly ; Mandar Mitra ; Gareth J. F. Jones

【Abstract】: A limitation of standard information retrieval (IR) models is that the notion of term compositionality is restricted to pre-defined phrases and term proximity. Standard text based IR models provide no easy way of representing semantic relations between terms that are not necessarily phrases, such as the equivalence relationship between 'osteoporosis' and the terms 'bone' and 'decay'. To alleviate this limitation, we introduce a relevance feedback (RF) method which makes use of word embedded vectors. We leverage the fact that the vector addition of word embeddings leads to a semantic composition of the corresponding terms, e.g. addition of the vectors for 'bone' and 'decay' yields a vector that is likely to be close to the vector for the word 'osteoporosis'. Our proposed RF model enables incorporation of semantic relations by exploiting term compositionality with embedded word vectors. We develop our model for RF as a generalization of the relevance model (RLM). Our experiments demonstrate that our word embedding based RF model significantly outperforms the RLM model on standard TREC test collections, namely the TREC 6, 7, 8 and Robust ad-hoc and the TREC 9 and 10 WT10G test collections.

【Keywords】: kernel density estimation; relevance feedback; word compositionality; word vector embedding

133. Q+Tree: An Efficient Quad Tree based Data Indexing for Parallelizing Dynamic and Reverse Skylines.

【Paper Link】【Pages】:1291-1300

【Authors】: Md. Saiful Islam ; Chengfei Liu ; J. Wenny Rahayu ; Tarique Anwar

【Abstract】: Skyline queries play an important role in multi-criteria decision making applications in many areas. Given a dataset of objects, a skyline query retrieves data objects that are not dominated by any other data object in the dataset. Unlike standard skyline queries where the different aspects of data objects are compared directly, dynamic and reverse skyline queries adhere to the around-by semantics, which is realized by comparing the relative distances of the data objects w.r.t. a given query. Though there are a number of works on parallelizing the standard skyline queries, only a few works are devoted to the parallel computation of dynamic and reverse skyline queries. This paper presents an efficient quad-tree based data indexing scheme, called Q+Tree, for parallelizing the computations of the dynamic and reverse skyline queries. We compare the performance of Q+Tree with an existing quad-tree based indexing scheme. We also present several optimization heuristics to further improve the performance of both indexing schemes. Experimentation with both real and synthetic datasets verifies the efficiency of the proposed indexing scheme and optimization heuristics.

【Keywords】: aggressive partitioning; dynamic skyline; load balancing; parallel computation; quad tree; reverse skyline
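
【Editor's sketch】: The difference between a standard skyline and a dynamic skyline described in the abstract above can be shown with a brute-force example. This is a quadratic-time illustration of the query semantics only; it does not use Q+Tree or any parallelization, and the function names are hypothetical. It assumes that smaller values are preferred in every dimension and that points are equal-length tuples of numbers. The dynamic skyline compares points by their per-dimension absolute distance to the query rather than by their raw coordinates.

```python
def dominates(a, b):
    """True if a is no worse than b in every dimension and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Standard skyline: points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def dynamic_skyline(points, query):
    """Dynamic skyline: skyline after mapping each point to |point - query| per dimension."""
    transformed = [tuple(abs(x - q) for x, q in zip(p, query)) for p in points]
    return [
        p
        for p, t in zip(points, transformed)
        if not any(dominates(s, t) for s in transformed)
    ]
```

For the points (1, 1), (2, 2), (3, 3), (5, 1), the standard skyline is [(1, 1)], while the dynamic skyline with respect to the query (3, 3) is [(3, 3)], since that point has distance zero in both dimensions. A reverse skyline query, which asks for the points whose own dynamic skyline contains the query, is not covered by this sketch.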

134. Luhn Revisited: Significant Words Language Models.

【Paper Link】【Pages】:1301-1310

【Authors】: Mostafa Dehghani ; Hosein Azarbonyad ; Jaap Kamps ; Djoerd Hiemstra ; Maarten Marx

【Abstract】: Users tend to articulate their complex information needs in only a few keywords, making underspecified statements of request the main bottleneck for retrieval effectiveness. Taking advantage of feedback information is one of the best ways to enrich the query representation, but it can also lead to loss of query focus and harm performance, in particular when the initial query retrieves only little relevant information, by overfitting to accidental features of the particular observed feedback documents. Inspired by the early work of Luhn [23], we propose significant words language models of feedback documents that capture all, and only, the significant shared terms from feedback documents. We adjust the weights of common terms that are already well explained by the document collection as well as the weight of rare terms that are only explained by specific feedback documents, which eventually results in having only the significant terms left in the feedback model. Our main contributions are the following. First, we present significant words language models as the effective models capturing the essential terms and their probabilities. Second, we apply the resulting models to the relevance feedback task, and see a better performance over the state-of-the-art methods. Third, we see that the estimation method is remarkably robust, making the models insensitive to noisy non-relevant terms in feedback documents. Our general observation is that the significant words language models more accurately capture relevance by excluding general terms and feedback document specific terms.

【Keywords】: pseudo relevance feedback; relevance feedback; significant words language models

Session 6e: Entity Detection and Analysis 4

135. ESPRESSO: Explaining Relationships between Entity Sets.

【Paper Link】【Pages】:1311-1320

【Authors】: Stephan Seufert ; Klaus Berberich ; Srikanta J. Bedathur ; Sarath Kumar Kondreddi ; Patrick Ernst ; Gerhard Weikum

【Abstract】: Analyzing and explaining relationships between entities in a knowledge graph is a fundamental problem with many applications. Prior work has been limited to extracting the most informative subgraph connecting two entities of interest. This paper extends and generalizes the state of the art by considering the relationships between two sets of entities given at query time. Our method, coined ESPRESSO, explains the connection between these sets in terms of a small number of relatedness cores: dense sub-graphs that have strong relations with both query sets. The intuition for this model is that the cores correspond to key events in which entities from both sets play a major role. For example, to explain the relationships between US politicians and European politicians, our method identifies events like the PRISM scandal and the Syrian Civil War as relatedness cores. Computing cores of bounded size is NP-hard. This paper presents efficient approximation algorithms. Our experiments with real-life knowledge graphs demonstrate the practical viability of our approach and, through user studies, the superior output quality compared to state-of-the-art baselines.

【Keywords】: knowledge graphs; relationship analysis; social networks

136. Geotagging Named Entities in News and Online Documents.

【Paper Link】【Pages】:1321-1330

【Authors】: Jiangwei Yu ; Davood Rafiei

【Abstract】: News sources generate constant streams of text with many references to real world entities; understanding the content from such sources often requires effectively detecting the geographic foci of the entities. We study the problem of associating geography to named entities in online documents. More specifically, given a named entity and a page (or a set of pages) where the entity is mentioned, the problem being studied is how the geographic focus of the name can be resolved at a location granularity (e.g. city or country), assuming that the name has a geographic focus. We further study dispersion, and show that the dispersion of a name can be estimated with a good accuracy, allowing a geo-centre to be detected at an exact dispersion level. Two key features of our approach are: (i) minimal assumption is made on the structure of the mentions hence the approach can be applied to a diverse and heterogeneous set of web pages, and (ii) the approach is unsupervised, leveraging shallow English linguistic features and the large volume of location data in public domain. We evaluate our methods under different task settings and with different categories of named entities. Our evaluation reveals that the geo-centre of a name can be estimated with a good accuracy based on some simple statistics of the mentions, and that the accuracy of the estimation varies with the categories of the names.

【Keywords】: content analysis; disambiguation; geographic tagging; geotagging; location tagging; named entities; web mining

137. Discovering Entities with Just a Little Help from You.

【Paper Link】【Pages】:1331-1340

【Authors】: Jaspreet Singh ; Johannes Hoffart ; Avishek Anand

【Abstract】: Linking entities like people, organizations, books, music groups and their songs in text to knowledge bases (KBs) is a fundamental task for many downstream search and mining applications. Achieving high disambiguation accuracy crucially depends on a rich and holistic representation of the entities in the KB. For popular entities, such a representation can be easily mined from Wikipedia, and many current entity disambiguation and linking methods make use of this fact. However, Wikipedia does not contain long-tail entities that only few people are interested in, and also at times lags behind until newly emerging entities are added. For such entities, mining a suitable representation in a fully automated fashion is very difficult, resulting in poor linking accuracy. What can automatically be mined, though, is a high-quality representation given the context of a new entity occurring in any text. Due to the lack of knowledge about the entity, no method can retrieve these occurrences automatically with high precision, resulting in a chicken-egg problem. To address this, our approach automatically generates candidate occurrences of entities, prompting the user for feedback to decide if the occurrence refers to the actual entity in question. This feedback gradually improves the knowledge and allows our methods to provide better candidate suggestions to keep the user engaged. We propose novel human-in-the-loop retrieval methods for generating candidates based on gradient interleaving of diversification and textual relevance approaches. We conducted extensive experiments on the FACC dataset, showing that our approaches convincingly outperform carefully selected baselines in both intrinsic and extrinsic measures while keeping users engaged.

【Keywords】: diversity; human-in-the-loop; knowledge base acceleration; named entity disambiguation; relevance feedback; retrieval model; user simulation

138. Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams.

【Paper Link】【Pages】:1341-1350

【Authors】: Baichuan Zhang ; Murat Dundar ; Mohammad Al Hasan

【Abstract】: The name entity disambiguation task aims to partition the records of multiple real-life persons so that each partition contains records pertaining to a unique person. Most of the existing solutions for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task be performed in an online fashion, and that the method be able to identify records of new ambiguous entities having no preexisting records. In this work, we propose a Bayesian non-exhaustive classification framework for solving the online name disambiguation task. Our proposed method uses a Dirichlet process prior with a Normal x Normal x Inverse Wishart data model, which enables identification of new ambiguous entities who have no records in the training data. For online classification, we use a one-sweep Gibbs sampler, which is very efficient and effective. As a case study we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method is better than existing methods for performing the online name disambiguation task.

【Keywords】: bayesian non-exhaustive classification; emerging class; online name disambiguation; temporal record stream
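
As a rough illustration of why a Dirichlet process prior leaves room for previously unseen entities, the sketch below computes only the prior assignment probabilities in the Chinese restaurant process view of a Dirichlet process. The function name, the concentration value `alpha`, and the example cluster sizes are assumptions made for illustration; the paper's Normal x Normal x Inverse Wishart likelihood and its one-sweep Gibbs sampler are not reproduced here.

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Prior probabilities for the next record under a Chinese restaurant process.

    cluster_sizes: number of records already assigned to each known entity.
    alpha: concentration parameter (hypothetical value chosen by the caller).
    Returns (probabilities of joining each existing cluster, probability of a new cluster).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n = sum(cluster_sizes)
    denom = n + alpha
    # An existing cluster attracts a new record in proportion to its size.
    existing = [size / denom for size in cluster_sizes]
    # A brand-new cluster (an unseen entity) always keeps non-zero prior mass.
    new = alpha / denom
    return existing, new


# Two known authors with 3 and 1 records, and alpha = 1.0 (illustrative values).
existing, new = crp_assignment_probs([3, 1], alpha=1.0)
```

In a full classifier these prior weights would be multiplied by a data likelihood for each candidate cluster before normalizing; that step depends on the chosen data model and is left out of this sketch.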

Session 6f: Industry Session VI 3

139. Large-scale Robust Online Matching and Its Application in E-commerce.

【Paper Link】【Pages】:1351

【Authors】: Rong Jin

【Abstract】: This talk will focus on the large-scale matching problem, which aims to find the optimal assignment of tasks to different agents under linear constraints. Large-scale matching has found numerous applications in e-commerce. A well-known example is budget-aware online advertisement. A common practice in online advertisement is to find, for each opportunity or user, the advertisements that fit best with his/her interests. The main shortcoming of this greedy approach is that it does not take into account the budget limits set by advertisers. Our studies, as well as others, have shown that by carefully taking into account the budget limits of individual advertisers, we could significantly improve the performance of the advertisement system. Despite the rich literature, two important issues are often overlooked in previous studies of the matching/assignment problem. The first issue arises from the fact that most quantities used by optimization are estimated based on historical data and therefore are likely to be inaccurate and unreliable. The second challenge is how to perform online matching, as in many e-commerce problems tasks are created in an online fashion and the algorithm has to make an assignment decision immediately when each task emerges. We refer to these two issues as the challenges of "robust matching" and "online matching". To address the first challenge, I will introduce two different techniques for robust matching. The first approach is based on the theory of robust optimization, which takes into account the uncertainties of estimated quantities when performing optimization. The second approach is based on the theory of two-sided matching, whose result only depends on the partial preference of estimated quantities. To deal with the challenge of online matching, I will discuss two online optimization techniques, one based on the theory of primal-dual online optimization and one based on minimizing dynamic regret under long-term constraints.
We verify the effectiveness of all these approaches by applying them to real-world projects developed in Alibaba.

【Keywords】: assignment problem; online optimization; robust optimization; two-sided matching

140. A Distributed Graph Algorithm for Discovering Unique Behavioral Groups from Large-Scale Telco Data.

【Paper Link】【Pages】:1353-1362

【Authors】: Qirong Ho ; Wenqing Lin ; Eran Shaham ; Shonali Krishnaswamy ; The Anh Dang ; Jingxuan Wang ; Isabel Choo Zhongyan ; Amy She-Nash

【Abstract】: It is critical for a large telecommunications company such as Singtel to truly understand the behavior and preference of its customers, in order to win their loyalty in a highly fragmented and competitive market. In this paper we propose a novel graph edge-clustering algorithm (DGEC) that can discover unique behavioral groups from rich usage data sets (such as CDRs and beyond). A behavioral group is a set of nodes that share similar edge properties reflecting customer behavior, but are not necessarily connected to each other and therefore different from the usual notion of graph communities. DGEC is an optimization-based model that uses the stochastic proximal gradient method, implemented as a distributed algorithm that scales to tens of millions of nodes and edges. The performance of DGEC is satisfactory for deployment, with an execution time of 2.4 hours over a graph of 5 million nodes and 27 million edges in an 8-machine environment (32 cores and 64GB memory per machine). We evaluate the behavioral groups discovered by DGEC by combining other information such as demographics and customer profiles, and demonstrate that these behavioral groups are objective, consistent and insightful. DGEC has now been deployed in production, and also shows promising potential to extract new usage behavioral features from other data sources such as web browsing, app usage and TV consumption.

【Keywords】: behavioral groups; cdr; edge-clustering; graph; large scale distributed implementation; telecommunications

141. Urban Traffic Prediction through the Second Use of Inexpensive Big Data from Buildings.

【Paper Link】【Pages】:1363-1372

【Authors】: Zimu Zheng ; Dan Wang ; Jian Pei ; Yi Yuan ; Cheng Fan ; Linda Fu Xiao

【Abstract】: Traffic prediction, particularly in urban regions, is an important application of tremendous practical value. In this paper, we report a novel and interesting case study of urban traffic prediction in Central, Hong Kong, one of the densest urban areas in the world. The novelty of our study is that we make good second use of inexpensive big data collected from the Hong Kong International Commerce Centre (ICC), a 118-story building in Hong Kong where more than 10,000 people work. As building environment data are much cheaper to obtain than traffic data, we demonstrate that it is highly effective to estimate building occupancy information using building environment data, and then to further use the information on occupancy to provide traffic predictions in the proximate area. Scientifically, we investigate how and to what extent building data can complement traffic data in predicting traffic. In general, this study sheds new light on the development of accurate data mining applications through the second use of inexpensive big data.

【Keywords】: building occupancy; traffic prediction

Session 7a: Advertising and Ranking 4

142. A Probabilistic Multi-Touch Attribution Model for Online Advertising.

【Paper Link】【Pages】:1373-1382

【Authors】: Wendi Ji ; Xiaoling Wang ; Dell Zhang

【Abstract】: It is an important problem in computational advertising to study the effects of different advertising channels upon user conversions, as advertisers can use the discoveries to plan or optimize advertising campaigns. In this paper, we propose a novel Probabilistic Multi-Touch Attribution (PMTA) model which takes into account not only which ads have been viewed or clicked by the user but also when each such interaction occurred. Borrowing the techniques from survival analysis, we use the Weibull distribution to describe the observed conversion delay and use the hazard rate of conversion to measure the influence of an ad exposure. It has been shown by extensive experiments on a large real-world dataset that our proposed model is superior to state-of-the-art methods in both conversion prediction and attribution analysis. Furthermore, a surprising research finding obtained from this dataset is that search ads are often not the root cause of final conversions but just the consequence of previously viewed ads.

【Keywords】: computational advertising; multi-touch attribution; survival analysis
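
The abstract above relies on the Weibull distribution and its hazard rate. As a minimal sketch of those two standard quantities only (not the PMTA model itself), the snippet below evaluates the Weibull hazard h(t) = (k/λ)(t/λ)^(k-1) and survival function S(t) = exp(-(t/λ)^k). The function and parameter names are assumptions made for illustration, and t is assumed to be strictly positive.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous conversion rate at delay t > 0, given no conversion before t."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """Probability that the conversion delay exceeds t."""
    return math.exp(-((t / scale) ** shape))


def weibull_pdf(t, shape, scale):
    """Density of the conversion delay: hazard times survival."""
    return weibull_hazard(t, shape, scale) * weibull_survival(t, shape, scale)


# With shape = 1 the Weibull reduces to an exponential delay,
# so the hazard is constant at 1 / scale regardless of t.
constant_hazard = weibull_hazard(2.0, shape=1.0, scale=4.0)
```

A shape below 1 gives a hazard that decays with time since the ad exposure, while a shape above 1 gives a hazard that grows; which regime fits a given channel is an empirical question that this sketch does not address.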

【Paper Link】【Pages】:1383-1392

【Authors】: Shaojie Tang ; Jing Yuan

【Abstract】: Social advertising (or social promotion) is an effective approach that produces a significant cascade of adoption through influence in online social networks. The goal of this work is to optimize the ad allocation from the platform's perspective. On the one hand, the platform would like to maximize revenue earned from each advertiser by exposing their ads to as many people as possible; on the other hand, the platform wants to reduce free-riding to ensure the truthfulness of the advertiser. To this end, we introduce a utility function that can assess the above tradeoff. Based on this utility function, we define and study two social advertising problems: the budgeted social advertising problem and the unconstrained social advertising problem. In the first problem, we aim at selecting a set of seeds for each advertiser that maximizes the utility while setting budget constraints on the attention cost; in the second problem, we propose to optimize a linear combination of the utility and attention costs. We prove that both problems are NP-hard, and then develop constant factor approximation algorithms for both problems.

【Keywords】: approximation algorithm; influence maximization; matroid; social advertising; submodular

【Paper Link】【Pages】:1393-1402

【Authors】: Dimitrios Rafailidis ; Fabio Crestani

【Abstract】: With the advent of learning to rank methods, relevant studies showed that Collaborative Ranking (CR) models can produce accurate ranked lists in the top-N recommendation problem. However, in practice several real-world problems decrease their ranking performance, such as the sparsity and cold-start problems, which often occur in recommendation systems for inactive or new users. In this study, to account for the fact that the selections of social friends can improve the recommendation accuracy, we propose a joint CR model based on the users' social relationships. We propose two different CR strategies based on the notions of Social Reverse Height and Social Height, which consider how well the relevant and irrelevant items of users and their social friends have been ranked at the top of the list, respectively. We focus on the top of the list mainly because users see the top-N recommendations in real-world applications, and not the whole ranked list. Furthermore, we formulate a joint objective function to consider both CR strategies, and propose an alternating minimization algorithm to learn our joint CR model. Our experiments on benchmark datasets show that our proposed joint CR model outperforms other state-of-the-art models that either consider social relationships or focus on the ranking performance at the top of the list.

【Keywords】: collaborative ranking; learning to rank; recommendation systems; social relationship

145. Modeling Customer Engagement from Partial Observations.

【Paper Link】【Pages】:1403-1412

【Authors】: Jelena Stojanovic ; Djordje Gligorijevic ; Zoran Obradovic

【Abstract】: It is of high interest for a company to identify customers expected to bring the largest profit in the upcoming period. Knowing as much as possible about each customer is crucial for such predictions. However, their demographic data, preferences, and other information that might be useful for building loyalty programs is often missing. Additionally, modeling relations among different customers as a network can be beneficial for predictions at an individual level, as similar customers tend to have similar purchasing patterns. We address this problem by proposing a robust framework for structured regression on deficient data in evolving networks with supervised representation learning based on neural feature embedding. The new method is compared to several unstructured and structured alternatives for predicting customer behavior (e.g. purchasing frequency and customer ticket) on user networks generated from customer databases of two companies from different industries. The obtained results show 4% to 130% improvement in accuracy over alternatives when all customer information is known. Additionally, the robustness of our method is demonstrated when up to 80% of demographic information is missing, where it is up to several times more accurate than alternatives that either ignore cases with missing values or learn their feature representation in an unsupervised manner.

【Keywords】: deficient data; feature learning; loyalty programs; structured learning; user networks

Session 7b: Query Analytics 4

146. On the Effectiveness of Query Weighting for Adapting Rank Learners to New Unlabelled Collections.

【Paper Link】【Pages】:1413-1422

【Authors】: Pengfei Li ; Mark Sanderson ; Mark James Carman ; Falk Scholer

【Abstract】: Query-level instance weighting is a technique for unsupervised transfer ranking, which aims to train a ranker on a source collection so that it also performs effectively on a target collection, even if no judgement information exists for the latter. Past work has shown that this approach can be used to significantly improve effectiveness; in this work, the approach is re-examined on a wide set of publicly available L2R test collections with more advanced learning to rank algorithms. Different query-level weighting strategies are examined against two transfer ranking frameworks: AdaRank and a new weighted LambdaMART algorithm. Our experimental results show that the effectiveness of different weighting strategies, including those shown in past work, varies under different transferring environments. In particular, (i) Kullback-Leibler based density-ratio estimation tends to outperform a classification-based approach and (ii) aggregating document-level weights into query-level weights is likely superior to direct estimation using a query-level representation. The Nemenyi statistical test, applied across multiple datasets, indicates that most weighting transfer learning methods do not significantly outperform baselines, although there is potential for the further development of such techniques.

【Keywords】: information retrieval; learning to rank; ranking adaptation

147. One Query, Many Clicks: Analysis of Queries with Multiple Clicks by the Same User.

【Paper Link】【Pages】:1423-1432

【Authors】: Elad Kravi ; Ido Guy ; Avihai Mejer ; David Carmel ; Yoelle Maarek ; Dan Pelleg ; Gilad Tsur

【Abstract】: In this paper, we study multi-click queries - queries for which more than one click is performed by the same user within the same query session. Such queries may reflect a more complex information need, which leads the user to examine a variety of results. We present a comprehensive analysis that reveals unique characteristics of multi-click queries, in terms of their syntax, lexical domains, contextual properties, and returned search results page. We also show that a basic classifier for predicting multi-click queries can reach an accuracy of 75% over a balanced dataset. We discuss the implications of our findings for the design of Web search tools.

【Keywords】: exploratory search; multiple click queries; query log analysis; query session

【Paper Link】【Pages】:1433-1442

【Authors】: Weize Kong ; James Allan

【Abstract】: Faceted search has been used successfully for many vertical applications such as e-commerce and digital libraries. However, it remains challenging to extend faceted search to the open-domain web due to the large and heterogeneous nature of the web. Recent work proposed an alternative solution that extracts facets for queries from their web search results, but neglected the precision-oriented perspective of the task -- users are likely to care more about precision of presented facets than recall. We improve query facet extraction performance under a precision-oriented scenario from two perspectives. First, we propose an empirical utility maximization approach to learn a probabilistic model by maximizing the expected performance measure instead of likelihood as used in previous approaches. We show that the empirical utility maximization approach can significantly improve over the previous approach under the precision-oriented scenario. Second, instead of showing facets for all queries, we propose a selective method that predicts the extraction performance for each query and selectively shows facets for some of them. We show the selective method can significantly improve the average performance with fair coverage over the whole query set.

【Keywords】: empirical utility maximization; performance prediction; query facet extraction; selective query faceting

149. Learning to Rewrite Queries.

【Paper Link】【Pages】:1443-1452

【Authors】: Yunlong He ; Jiliang Tang ; Hua Ouyang ; Changsung Kang ; Dawei Yin ; Yi Chang

【Abstract】: It is widely known that there exists a semantic gap between web documents and user queries and bridging this gap is crucial to advance information retrieval systems. The task of query rewriting, aiming to alter a given query to a rewrite query that can close the gap and improve information retrieval performance, has attracted increasing attention in recent years. However, the majority of existing query rewriters are not designed to boost search performance and consequently their rewrite queries could be sub-optimal. In this paper, we propose a learning to rewrite framework that consists of a candidate generating phase and a candidate ranking phase. The candidate generating phase provides us the flexibility to reuse most of existing query rewriters; while the candidate ranking phase allows us to explicitly optimize search relevance. Experimental results on a commercial search engine demonstrate the effectiveness of the proposed framework. Further experiments are conducted to understand the important components of the proposed framework.

【Keywords】: learning to rewrite; query rewriting; relevance

Session 7c: Information Retrieval III 4

150. When is the Time Ripe for Natural Language Processing for Patent Passage Retrieval?

【Paper Link】【Pages】:1453-1462

【Authors】: Linda Andersson ; Mihai Lupu ; João R. M. Palotti ; Allan Hanbury ; Andreas Rauber

【Abstract】: Patent text is a mixture of legal terms and domain specific terms. In technical English text, a multi-word unit method is often deployed as a word formation strategy in order to expand the working vocabulary, i.e. introducing a new concept without the invention of an entirely new word. In this paper we explore query generation using natural language processing technologies in order to capture domain specific concepts represented as multi-word units. We examine a range of query generation methods using both linguistic and statistical information. We also propose a new method to distinguish domain specific terms from other more general phrases. We apply a machine learning approach using domain knowledge and corpus linguistic information in order to learn domain specific terms in relation to phrases' Termhood values. The experiments are conducted on the English part of the CLEF-IP 2013 test collection. The outcome of the experiments shows that the favoured method in terms of PRES and recall is when a language model is used and search terms are extracted with a part-of-speech tagger and a noun phrase chunker. With our proposed methods we improve each evaluation metric significantly compared to the existing state-of-the-art for the CLEF-IP 2013 test collection: for PRES@100 by 26% (0.544 from 0.433), for recall@100 by 17% (0.631 from 0.540) and on document MAP by 57% (0.300 from 0.191).

【Keywords】: information extraction; natural language processing; patent retrieval; text mining

151. A Probabilistic Fusion Framework.

【Paper Link】【Pages】:1463-1472

【Authors】: Yael Anava ; Anna Shtok ; Oren Kurland ; Ella Rabinovich

【Abstract】: There are numerous methods for fusing document lists retrieved from the same corpus in response to a query. Many of these methods are based on seemingly unrelated techniques and heuristics. Herein we present a probabilistic framework for the fusion task. The framework provides a formal basis for deriving and explaining many fusion approaches and the connections between them. Instantiating the framework using various estimates yields novel fusion methods, some of which significantly outperform state-of-the-art approaches.

【Keywords】: fusion

152. Selective Cluster-Based Document Retrieval.

【Paper Link】【Pages】:1473-1482

【Authors】: Or Levi ; Fiana Raiber ; Oren Kurland ; Ido Guy

【Abstract】: We address the long standing challenge of selective cluster-based retrieval; namely, deciding on a per-query basis whether to apply cluster-based document retrieval or standard document retrieval. To address this classification task, we propose a few sets of features based on those utilized by the cluster-based ranker, query-performance predictors, and properties of the clustering structure. Empirical evaluation shows that our method outperforms state-of-the-art retrieval approaches, including cluster-based, query expansion, and term proximity methods.

【Keywords】: ad hoc retrieval; cluster-based retrieval

153. Pseudo-Relevance Feedback Based on Matrix Factorization.

【Paper Link】【Pages】:1483-1492

【Authors】: Hamed Zamani ; Javid Dadashkarimi ; Azadeh Shakery ; W. Bruce Croft

【Abstract】: In information retrieval, pseudo-relevance feedback (PRF) refers to a strategy for updating the query model using the top retrieved documents. PRF has been proven to be highly effective in improving the retrieval performance. In this paper, we look at the PRF task as a recommendation problem: the goal is to recommend a number of terms for a given query along with weights, such that the final weights of terms in the updated query model better reflect the terms' contributions in the query. To do so, we propose RFMF, a PRF framework based on matrix factorization which is a state-of-the-art technique in collaborative recommender systems. Our purpose is to predict the weight of terms that have not appeared in the query and matrix factorization techniques are used to predict these weights. In RFMF, we first create a matrix whose elements are computed using a weight function that shows how much a term discriminates the query or the top retrieved documents from the collection. Then, we re-estimate the created matrix using a matrix factorization technique. Finally, the query model is updated using the re-estimated matrix. RFMF is a general framework that can be employed with any retrieval model. In this paper, we implement this framework for two widely used document retrieval frameworks: language modeling and the vector space model. Extensive experiments over several TREC collections demonstrate that the RFMF framework significantly outperforms competitive baselines. These results indicate the potential of using other recommendation techniques in this task.

【Keywords】: language model; matrix factorization; pseudo-relevance feedback; query expansion; term recommendation

Session 7d: Data Mining 4

154. Uncovering the Spatio-Temporal Dynamics of Memes in the Presence of Incomplete Information.

【Paper Link】【Pages】:1493-1502

【Authors】: Hancheng Ge ; James Caverlee ; Nan Zhang ; Anna Cinzia Squicciarini

【Abstract】: Modeling, understanding, and predicting the spatio-temporal dynamics of online memes are important tasks, with ramifications on location-based services, social media search, targeted advertising and content delivery networks. However, the raw data revealing these dynamics are often incomplete and error-prone; for example, API limitations and data sampling policies can lead to an incomplete (and often biased) perspective on these dynamics. Hence, in this paper, we investigate new methods for uncovering the full (underlying) distribution through a novel spatio-temporal dynamics recovery framework which models the latent relationships among locations, memes, and times. By integrating these hidden relationships into a tensor-based recovery framework -- called AirCP -- we find that high-quality models of meme spread can be built with access to only a fraction of the full data. Experimental results on both synthetic and real-world Twitter hashtag data demonstrate the promising performance of the proposed framework: an average improvement of over 27% in recovering the spatio-temporal dynamics of hashtags versus five state-of-the-art alternatives.

【Keywords】: spatio-temporal dynamics; tensor completion; twitter hashtags

155. From Recommendation to Profile Inference (Rec2PI): A Value-added Service to Wi-Fi Data Mining.

【Paper Link】【Pages】:1503-1512

【Authors】: Cheng Chen ; Fang Dong ; Kui Wu ; Venkatesh Srinivasan ; Alex Thomo

【Abstract】: Portable smart devices have become prevalent and are used for ubiquitous access to the Internet in our daily life. Taking advantage of this trend, brick-and-mortar retailers have been increasingly deploying free Wi-Fi hotspots to provide easy Internet access for their customers. This opens the opportunity for retailers to collect customer information and perform data mining to improve the quality of their service. In this paper, we propose a novel value-added service to Wi-Fi data mining, Rec2PI, which can infer users' preference profiles based on recommendations pushed by third-party apps. Such profiles can be used to improve users' online experience and enable a brick-and-mortar retailer to participate in the global advertising business. Since the goal and technical difficulties of Rec2PI significantly differ from those of traditional recommender systems, we present a general framework of Rec2PI to illustrate its process. To tackle the technical challenges in profile inference, we propose novel algorithms built using copulas, a statistical tool suitable for capturing complex dependence structure beyond the scope of linear dependence. In the context of rating-based recommendations, we evaluate the proposed algorithms using an open dataset and a real-world recommender system. The evaluation results show that Rec2PI creates consistent and accurate inference results.

【Keywords】: copula modelling; profile inference; reverse engineering of recommendations; wi-fi data mining

156. On Backup Battery Data in Base Stations of Mobile Networks: Measurement, Analysis, and Optimization.

【Paper Link】【Pages】:1513-1522

【Authors】: Xiaoyi Fan ; Feng Wang ; Jiangchuan Liu

【Abstract】: Base stations have been massively deployed to meet the explosive demand for infrastructure-based mobile networking services, including both cellular networks and commercial WiFi access points. To maintain high service availability, backup battery groups are usually installed on base stations and serve as the only power source during power outages, which can be prevalent in rural areas or during severe weather conditions such as hurricanes or snow storms. Therefore, being able to understand and predict the battery group working condition is of immense technical and commercial importance as the first step towards cost-effective battery maintenance that minimizes service interruptions. In this paper, we conduct a systematic analysis of a real-world dataset collected from the battery groups installed on the base stations of China Mobile, with a total of 1,550,032,984 records from July 28th, 2014 to February 17th, 2016. We find that the working condition degradation of a battery group may be accelerated under various situations and can cause premature failures of batteries in the group, which can hardly be captured by current maintenance procedures and can easily lead to a power-outage-triggered service interruption at a base station. To this end, we propose BatPro, a battery profiling framework, to precisely extract the features that cause the working condition degradation of the battery group. We formulate the prediction models for both battery voltage and lifetime and develop a series of solutions to yield accurate outputs. Through real-world trace-driven evaluations, we demonstrate that our BatPro approach can precisely predict the battery voltage and lifetime with an RMS error of less than 0.01 V.

【Keywords】: backup power system; battery aging profiling; multi-instance multi-label learning; remaining lifetime prediction

157. Automatic Generation and Validation of Road Maps from GPS Trajectory Data Sets.

【Paper Link】【Pages】:1523-1532

【Authors】: Hengfeng Li ; Lars Kulik ; Kotagiri Ramamohanarao

【Abstract】: With the popularity of mobile GPS devices such as on-board navigation systems and smart phones, users can contribute their GPS trajectory data for creating geo-volunteered road maps. However, the quality of these road maps cannot be guaranteed due to the lack of expertise among contributing users. Therefore, important challenges are (i) to automatically generate accurate roads from GPS traces and (ii) to validate the correctness of existing road maps. To address these challenges, we propose a novel Spatial-Linear Clustering (SLC) technique to infer road segments from GPS traces. In our algorithm, we propose the use of spatial-linear clusters to appropriately represent the linear nature of GPS points collected from the same road segment. By inferring road segments, our algorithm can detect missing roads and check the correctness of the existing road network. For our evaluation, we conduct extensive experiments that compare our method to the state-of-the-art methods on two real data sets. The experimental results show that the F1 score of our algorithm is on average 10.7% higher than the best state-of-the-art method.

【Keywords】: gps traces; map inference; spatial-linear clustering

Session 7e: Network Analytics 4

158. Fully Dynamic Shortest-Path Distance Query Acceleration on Massive Networks.

【Paper Link】【Pages】:1533-1542

【Authors】: Takanori Hayashi ; Takuya Akiba ; Ken-ichi Kawarabayashi

【Abstract】: The distance between vertices is one of the most fundamental measures for representing relations between them, and it is the basis of other classic measures of vertices, such as similarity, centrality, and influence. The 2-hop labeling methods are known as the fastest exact point-to-point distance algorithms on million-scale networks. However, they cannot handle billion-scale networks because of the large space requirement and long preprocessing time. In this paper, we present the first algorithm that can process exact distance queries on fully dynamic billion-scale networks besides trivial non-indexing algorithms, which combines an online bidirectional breadth-first search (BFS) and an offline indexing method for handling billion-scale networks in memory. First, we accelerate bidirectional BFSs by using heuristics that exploit the small-world property of complex networks. Then, we construct bit-parallel shortest-path trees to maintain sets of shortest paths passing through high-degree vertices of networks in compact form, the information of which enables us to avoid visiting vertices with high degrees during bidirectional BFSs. Thus, the searches achieve considerable speedup. In addition, our index size reduction technique enables us to handle billion-scale networks in memory. Furthermore, we introduce dynamic update procedures of our data structure to handle fully dynamic networks. We evaluated the performance of the proposed method on real-world networks. In particular, on large-scale social networks with over 1B edges, the proposed method enables us to answer distance queries in around 1 ms, on average.

【Keywords】: dynamic updates; graphs; query processing; shortest-path
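
The abstract above builds on online bidirectional BFS. As a rough illustration only, here is a minimal sketch of a plain bidirectional BFS distance query on an undirected, unweighted graph. It does not include the paper's bit-parallel shortest-path trees, small-world heuristics, or dynamic updates; the function name `bidirectional_bfs_distance` and the adjacency-dictionary input are assumptions made for this example.

```python
def bidirectional_bfs_distance(adj, source, target):
    """Hop distance between source and target, or None if disconnected.

    adj is assumed to be {vertex: iterable of neighbours} for an
    undirected, unweighted graph (a hypothetical input format).
    """
    if source == target:
        return 0
    dist_s = {source: 0}   # distances discovered from the source side
    dist_t = {target: 0}   # distances discovered from the target side
    frontier_s = [source]
    frontier_t = [target]
    while frontier_s and frontier_t:
        # Expand the smaller frontier by one full BFS level.
        expand_s = len(frontier_s) <= len(frontier_t)
        if expand_s:
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        best = None
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v in other:
                    # The two searches meet across the edge (u, v).
                    cand = dist[u] + 1 + other[v]
                    best = cand if best is None else min(best, cand)
                elif v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        if best is not None:
            return best
        if expand_s:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None
```

Expanding the smaller frontier first is a common way to keep the number of visited vertices low; avoiding high-degree vertices, as the abstract describes, would require the additional index that this sketch omits.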

159. Hierarchical and Dynamic k-Path Covers.

【Paper Link】【Pages】:1543-1552

【Authors】: Takuya Akiba ; Yosuke Yano ; Naoto Mizuno

【Abstract】: A metric-independent data structure for spatial networks called k-all-path cover (k-APC) has recently been proposed. It involves a set of vertices that covers all paths of size k, and is a general indexing technique that can accelerate various path-related processes on spatial networks, such as route planning and path subsampling to name a few. Although it is a promising tool, it currently has drawbacks pertaining to its construction and maintenance. First, k-APCs, especially for large values of k, are computationally too expensive. Second, an important factor related to quality is ignored by a prevalent construction algorithm. Third, an existing algorithm only focuses on static networks. To address these issues, we propose novel k-APC construction and maintenance algorithms. Our algorithms recursively construct the layers of APCs, which we call the k-all-path cover hierarchy, by using vertex cover heuristics. This allows us to extract k-APCs for various values of k from the hierarchy. We also devise an algorithm to maintain k-APC hierarchies on dynamic networks. Our experiments showed that our construction algorithm can yield high solution quality, and has a short running time for large values of k. They also verified that our dynamic algorithm can handle an edge weight change within 40 ms.

【Keywords】: graphs; indexing; road networks; route planning; spatial networks

160. Efficient Computation of Importance Based Communities in Web-Scale Networks Using a Single Machine.

【Paper Link】【Pages】:1553-1562

【Authors】: Shu Chen ; Ran Wei ; Diana Popova ; Alex Thomo

【Abstract】: Finding decompositions of a graph into a family of communities is crucial to understanding its underlying structure. Algorithms for finding communities in networks often rely only on structural information and search for cohesive subsets of nodes. In practice however, we would like to find communities that are not only cohesive, but also influential or important. In order to capture such communities, Li, Qin, Yu, and Mao introduced a novel community model called "k-influential community" based on the concept of $k$-core, with numerical values representing "influence" assigned to the nodes. They formulate the problem of finding the top-r most important communities as finding r connected k-core subgraphs ordered by the lower-bound of their importance. In this paper, our goal is to scale-up the computation of top-r, k-core communities to web-scale graphs of tens of billions of edges. We feature several fast new algorithms for this problem. With our implementations, we show that we can efficiently handle massive networks using a single consumer-level machine within a reasonable amount of time.

【Keywords】: algorithm scalability; big graphs; community discovery; graph compression
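
The community model in the abstract above is based on the k-core. As a hedged illustration of that building block only, the sketch below computes the k-core of a small in-memory graph by repeatedly peeling vertices of degree less than k. It ignores node importance values, top-r ranking, and the web-scale techniques of the paper; the name `k_core` and the input format (a dictionary mapping each vertex to a set of neighbours in a simple undirected graph) are assumptions for this example.

```python
def k_core(adj, k):
    """Return the set of vertices in the k-core of a simple undirected graph.

    adj is assumed to map every vertex to the set of its neighbours.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    # Start with every vertex whose degree is already below k.
    stack = [v for v, d in degree.items() if d < k]
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        # Removing v lowers the degree of each remaining neighbour.
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return set(adj) - removed
```

Connected components of the returned vertex set would be the candidate cohesive communities; ordering them by an importance score is the part specific to the k-influential community model and is not shown here.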

161. Collective Classification via Discriminative Matrix Factorization on Sparsely Labeled Networks.

【Paper Link】【Pages】:1563-1572

【Authors】: Daokun Zhang ; Jie Yin ; Xingquan Zhu ; Chengqi Zhang

【Abstract】: We address the problem of classifying sparsely labeled networks, where labeled nodes in the network are extremely scarce. Existing algorithms, such as collective classification, have been shown to be effective for jointly deriving labels of related nodes, by exploiting class label dependencies among neighboring nodes. However, when the underlying network is sparsely labeled, most nodes have too few or even no connections to labeled nodes. This makes it very difficult to leverage supervised knowledge from labeled nodes to accurately estimate label dependencies, thereby largely degrading the classification accuracy. In this paper, we propose a novel discriminative matrix factorization (DMF) based algorithm that effectively learns a latent network representation by exploiting topological paths between labeled and unlabeled nodes, in addition to nodes' content information. The main idea is to use matrix factorization to obtain a compact representation of the network that fully encodes nodes' content information and network structure, and unleash discriminative power inferred from labeled nodes to directly benefit collective classification. To achieve this, we formulate a new matrix factorization objective function that integrates network representation learning with an empirical loss minimization for classifying node labels. An efficient optimization algorithm based on conjugate gradient methods is proposed to solve the new objective function. Experimental results on real-world networks show that DMF yields superior performance gain over the state-of-the-art baselines on sparsely labeled networks.

【Keywords】: collective classification; matrix factorization; network representation learning; sparsely labeled networks

Session 7f: Industry Session VII 4

162. LogMine: Fast Pattern Recognition for Log Analytics.

【Paper Link】【Pages】:1573-1582

【Authors】: Hossein Hamooni ; Biplob Debnath ; Jianwu Xu ; Hui Zhang ; Guofei Jiang ; Abdullah Mueen

【Abstract】: Modern engineering incorporates smart technologies in all aspects of our lives. Smart technologies are generating terabytes of log messages every day to report their status. It is crucial to analyze these log messages and present usable information (e.g. patterns) to administrators, so that they can manage and monitor these technologies. Patterns minimally represent large groups of log messages and enable the administrators to do further analysis, such as anomaly detection and event prediction. Although patterns exist commonly in automated log messages, recognizing them in a massive set of log messages from heterogeneous sources without any prior information is a significant undertaking. We propose a method, named LogMine, that extracts high quality patterns for a given set of log messages. Our method is fast, memory efficient, accurate, and scalable. LogMine is implemented in the map-reduce framework for distributed platforms to process millions of log messages in seconds. LogMine is a robust method that works for heterogeneous log messages generated in a wide variety of systems. Our method exploits algorithmic techniques to minimize the computational overhead based on the fact that log messages are always automatically generated. We evaluate the performance of LogMine on massive sets of log messages generated in industrial applications. LogMine has successfully generated patterns which are as good as the patterns generated by an exact but unscalable method, while achieving a 500× speedup. Finally, we describe three applications of the patterns generated by LogMine in monitoring large scale industrial systems.

【Keywords】: log analysis; map-reduce; pattern recognition

163. Scaling Factorization Machines with Parameter Server.

【Paper Link】【Pages】:1583-1592

【Authors】: Erheng Zhong ; Yue Shi ; Nathan Liu ; Suju Rajan

【Abstract】: Factorization Machines (FM) have been recognized as an effective learning paradigm for incorporating complex relations to improve item recommendation in recommender systems. However, one open issue of FM lies in its factorized representation (latent factors) for each feature in the observed feature space, a characteristic often resulting in a large parameter space. Therefore, training FM (in other words, learning a large number of parameters in FM) is a computationally expensive task. Our work targets to improve the scalability of FM by building it in a distributed environment. We propose a new system framework that integrates Parameter Server (PS) with the Map/Reduce (MR) framework. In addition to the data parallelism achieved via MR, our framework particularly benefits from PS for model parallelism, a critical characteristic for learning with a large number of parameters in FM. We further address two specific challenges in our system, namely, communication cost and parameter update collision. Through both offline and online experiments on recommendation tasks, we demonstrate that the proposed system framework succeeds in scaling up FM for very large datasets, while it also maintains competitive performance on recommendation quality compared to alternative baselines.

【Keywords】: collaborative filtering; factorization machines; large scale recommender systems; latent factor models; parameter server

164. DI-DAP: An Efficient Disaster Information Delivery and Analysis Platform in Disaster Management.

【Paper Link】【Pages】:1593-1602

【Authors】: Tao Li ; Wubai Zhou ; Chunqiu Zeng ; Qing Wang ; Qifeng Zhou ; Dingding Wang ; Jia Xu ; Yue Huang ; Wentao Wang ; Minjing Zhang ; Steven Luis ; Shu-Ching Chen ; Naphtali Rishe

【Abstract】: In disaster management, people are interested in the development and the evolution of the disasters. If they intend to track the information of the disaster, they will be overwhelmed by the large number of disaster-related documents, microblogs, news, etc. To support disaster management and minimize the loss during the disaster, it is necessary to efficiently and effectively collect, deliver, summarize, and analyze the disaster information, letting people in the affected area quickly gain an overview of the disaster situation and improve their situational awareness. To present an integrated solution to address the information explosion problem during the disaster period, we designed and implemented DI-DAP, an efficient and effective disaster information delivery and analysis platform. DI-DAP is an information-centric platform aiming to provide convenient, interactive, and timely disaster information to the users in need. It is composed of three separate but complementary services: Disaster Vertical Search Engine, Disaster Storyline Generation, and Geo-Spatial Data Analysis Portal. These services provide a specific set of functionalities to enable users to consume highly summarized information and allow them to conduct ad-hoc geospatial information retrieval tasks. To support these services, DI-DAP adopts FIU-Miner, a fast, integrated, and user-friendly data analysis platform, which encapsulates all the computation and analysis workflow as well-defined tasks. Moreover, to enable ad-hoc geospatial information retrieval, an advanced query language, MapQL, is used and the query template engine is integrated. DI-DAP is designed and implemented as a disaster management tool and is currently being exercised as the disaster information platform by more than 100 companies and institutions in the South Florida area.

【Keywords】: disaster management; sequential query pattern; storyline generation; vertical search engine

165. Approximate Aggregates in Oracle 12C.

【Paper Link】【Pages】:1603-1612

【Authors】: Hong Su ; Mohamed Zaït ; Vladimir Barrière ; Joseph Torres ; Andre Cavalheiro Menck

【Abstract】: A new generation of analytic applications has emerged to process data generated from non-conventional sources. The challenge for the traditional database systems is that the data sets are very large and keep increasing at a very high rate while the application users have higher performance expectations. The most straightforward response to this challenge is to deploy larger hardware configurations, making the solution very expensive and not acceptable for most cases. Alternative solutions fall into two categories: reduce the data set using sampling techniques or reduce the computational complexity of expensive database operations by using alternative algorithms. Alternative algorithms considered in this paper are approximate aggregates that perform a lot better at the cost of reduced but tolerable accuracy. In Oracle 12C we introduced approximate aggregates of expensive aggregate functions that are very common in analytic applications, that is, approximate count distinct and approximate percentile. The performance is improved in two ways. First, the approximate aggregates use bounded memory, often eliminating the need to use temporary storage, which results in significant performance improvement over the exact aggregates. Second, we provide materialized view support that allows users to store pre-computed results of approximate aggregates. These results can be rolled up to answer queries on different dimensions (such rollup is not possible for exact aggregates).

【Keywords】: approximate query processing; bigdata
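
To illustrate why approximate count distinct can be rolled up while an exact distinct count cannot, here is a minimal sketch of a K-minimum-values (KMV) summary. This is a generic textbook technique chosen for illustration, not a description of Oracle's implementation; the function names and the parameter `k` are assumptions made for this example.

```python
import hashlib


def _hash01(value):
    """Map a value to a pseudo-random float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(values, k=256):
    """Summary of at most k floats: the k smallest distinct hash values.

    For brevity this hashes everything and then truncates; a streaming
    version would keep only k values in memory at any time.
    """
    return sorted({_hash01(v) for v in values})[:k]


def kmv_merge(a, b, k=256):
    """Roll up two summaries into the summary of the combined data."""
    return sorted(set(a) | set(b))[:k]


def kmv_estimate(sketch, k=256):
    """Estimate the number of distinct values from a summary."""
    if len(sketch) < k:
        return len(sketch)          # fewer than k distinct hashes seen
    return int((k - 1) / sketch[k - 1])
```

Merging the summaries of two partitions yields the same summary as sketching the combined data, which is the property that makes pre-computed results reusable across dimensions. Two exact distinct counts, by contrast, cannot be added because the partitions may share values.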

Session 8a: Learning 4

166. Supervised Feature Selection by Preserving Class Correlation.

【Paper Link】【Pages】:1613-1622

【Authors】: Jun Wang ; Jinmao Wei ; Zhenglu Yang

【Abstract】: Feature selection is an effective technique for dimension reduction, which assesses the importance of features and constructs an optimal feature subspace suitable for the recognition task. Two recognition scenarios, i.e., single-label learning and multi-label learning, pose different challenges for feature selection. For the single-label task, how to accurately measure and reduce feature redundancy is crucial. For the multi-label task, how to effectively exploit class correlation information during selection is critical. However, no existing selection method resolves both issues simultaneously. In this paper, we propose effective supervised feature selection techniques to address the problems. The original class correlation information in the reduced feature space is preserved, and meanwhile the feature redundancy for classification is alleviated. To the best of our knowledge, this study is the first attempt to accomplish both recognition tasks in a unified framework. Comprehensive experimental evaluations on artificial, single-label, and multi-label data sets demonstrate the effectiveness of the new approach.

【Keywords】: class correlation preservation; feature redundancy reduction; feature selection; multi-label learning; multi-task learning

167. CGMOS: Certainty Guided Minority OverSampling.

【Paper Link】【Pages】:1623-1631

【Authors】: Xi Zhang ; Di Ma ; Lin Gan ; Shanshan Jiang ; Gady Agam

【Abstract】: Handling imbalanced datasets is a challenging problem that, if not treated correctly, results in reduced classification performance. Imbalanced datasets are commonly handled using minority oversampling, and the SMOTE algorithm is a successful oversampling algorithm with numerous extensions. SMOTE extensions do not have a theoretical guarantee during training to work better than SMOTE, and in many instances their performance is data dependent. In this paper we propose a novel extension to the SMOTE algorithm with a theoretical guarantee for improved classification performance. The proposed approach considers the classification performance of both the majority and minority classes. In the proposed approach, CGMOS (Certainty Guided Minority OverSampling), new data points are added by considering certainty changes in the dataset. The paper provides a proof that the proposed algorithm is guaranteed to work better than SMOTE for training data. Further, experimental results on 30 real-world datasets show that CGMOS works better than existing algorithms when using 6 different classifiers.

【Keywords】: imbalanced learning; oversampling; smote; synthetic data
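
For readers unfamiliar with the baseline that CGMOS extends, the following is a minimal sketch of SMOTE-style oversampling: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The certainty-guided selection that distinguishes CGMOS is not reproduced here; the function name, the parameters `k` and `seed`, and the tuple-based data format are assumptions for this example.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points by linear interpolation.

    minority is assumed to be a list of at least two equal-length tuples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

In plain SMOTE the seed sample is chosen uniformly, as above. According to the abstract, CGMOS instead adds points by considering certainty changes in the dataset, which would replace the uniform choice with a weighted one.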

168. Learning Hidden Features for Contextual Bandits.

【Paper Link】【Pages】:1633-1642

【Authors】: Huazheng Wang ; Qingyun Wu ; Hongning Wang

【Abstract】: Contextual bandit algorithms provide principled online learning solutions to find optimal trade-offs between exploration and exploitation with companion side-information. Most contextual bandit algorithms simply assume the learner would have access to the entire set of features, which govern the generation of payoffs from a user to an item. However, in practice it is challenging to exhaust all relevant features ahead of time, and oftentimes due to privacy or sampling constraints many factors are unobservable to the algorithm. Failing to model such hidden factors leads a system to make constantly suboptimal predictions. In this paper, we propose to learn the hidden features for contextual bandit algorithms. Hidden features are explicitly introduced in our reward generation assumption, in addition to the observable contextual features. A scalable bandit algorithm is achieved via coordinate descent, in which closed form solutions exist at each iteration for both hidden features and bandit parameters. Most importantly, we rigorously prove that the developed contextual bandit algorithm achieves a sublinear upper regret bound with high probability, and a linear regret is inevitable if one fails to model such hidden features. Extensive experimentation on both simulations and large-scale real-world datasets verified the advantages of the proposed algorithm compared with several state-of-the-art contextual bandit algorithms and existing ad-hoc combinations between bandit algorithms and matrix factorization methods.

【Keywords】: contextual bandits; latent feature learning; online recommendations; regret analysis

169. Constructing Reliable Gradient Exploration for Online Learning to Rank.

【Paper Link】【Pages】:1643-1652

【Authors】: Tong Zhao ; Irwin King

【Abstract】: With the rapid development of information retrieval (IR) systems, online learning to rank (OLR) approaches, which allow retrieval systems to automatically learn the best parameters from user interactions, have attracted great research interest in recent years. In OLR, the algorithms usually need to explore some uncertain retrieval results to update the current parameters while still guaranteeing quality retrieval results by exploiting what has already been learned, and the final retrieval result is an interleaved list from both exploratory and exploitative results. However, existing OLR algorithms perform exploration based on either only one stochastic direction or multiple randomly selected stochastic directions, which introduce large variance and uncertainty into the exploration, and may further harm the retrieval quality. Moreover, little historical exploration knowledge is considered when conducting current exploration. In this paper, we propose two OLR algorithms that improve the reliability of the exploration by constructing robust exploratory directions. First, we describe a Dual-Point Dueling Bandit Gradient Descent (DP-DBGD) approach with a Contextual Interleaving (CI) method. In particular, the exploration of DP-DBGD is carefully conducted via two opposite stochastic directions and the proposed CI method constructs a qualified interleaved retrieval result list by taking historical explorations into account. Second, we introduce a Multi-Point Deterministic Gradient Descent (MP-DGD) method that constructs a set of deterministic standard unit basis vectors for exploration. In MP-DGD, each basis direction will be explored and the parameter updating is performed by walking along the combination of exploratory winners from the basis vectors. We conduct experiments on several datasets and show that both DP-DBGD and MP-DGD improve the online learning to rank performance by over 10% compared with baseline methods.

【Keywords】: dual-point dueling bandit gradient descent; interleaved comparison; learning to rank; multi-point deterministic gradient descent; online learning to rank

Session 8b: Social Networks---Diffusion and Cascades 4

170. A Model-Free Approach to Infer the Diffusion Network from Event Cascade.

【Paper Link】【Pages】:1653-1662

【Authors】: Yu Rong ; Qiankun Zhu ; Hong Cheng

【Abstract】: Information diffusion through various types of networks, such as social networks and media networks, is a very common phenomenon on the Internet nowadays. In many scenarios, we can track only the time when the information reaches a node. However, the source infecting this node is usually unobserved. Inferring the underlying diffusion network based on cascade data (observed sequence of infected nodes with timestamp) without additional information is an essential and challenging task in information diffusion. Many studies have focused on constructing complex models to infer the underlying diffusion network in a parametric way. However, the diffusion process in the real world is very complex and hard to be captured by a parametric model. Even worse, inferring the parameters of a complex model is impractical under a large data volume. Different from previous works focusing on building models, we propose to interpret the diffusion process from the cascade data directly in a non-parametric way, and design a novel and efficient algorithm named Non-Parametric Distributional Clustering (NPDC). Our algorithm infers the diffusion network according to the statistical difference of the infection time intervals between nodes connected with diffusion edges versus those with no diffusion edges. NPDC is a model-free approach since we do not define any transmission models between nodes in advance. We conduct experiments on synthetic data sets and two large real-world data sets with millions of cascades. Our algorithm achieves substantially higher accuracy of network inference and is orders of magnitude faster compared with the state-of-the-art solutions.

【Keywords】: clustering; information diffusion; network inference; non-parametric statistics

171. Multiple Infection Sources Identification with Provable Guarantees.

【Paper Link】【Pages】:1663-1672

【Authors】: Hung T. Nguyen ; Preetam Ghosh ; Michael L. Mayo ; Thang N. Dinh

【Abstract】: Given an aftermath of a cascade in the network, i.e. a set $V_I$ of "infected" nodes after an epidemic outbreak or a propagation of rumors/worms/viruses, how can we infer the sources of the cascade? Answering this challenging question is critical for computer forensics, vulnerability analysis, and risk management. Despite recent interest in this problem, most existing works focus only on single source detection or simple network topologies, e.g. trees or grids. In this paper, we propose a new approach to identify infection sources by searching for a seed set S that minimizes the symmetric difference between the cascade from S and $V_I$, the given set of infected nodes. Our major result is an approximation algorithm, called SISI, to identify infection sources without prior knowledge of the number of source nodes. SISI, to the best of our knowledge, is the first algorithm with a provable guarantee for the problem in general graphs. It returns a $\frac{2}{(1-\epsilon)^2}\Delta$-approximate solution with high probability, where Δ denotes the maximum number of nodes in $V_I$ that may infect a single node in the network. Our experiments on real-world networks show the superiority of our approach and SISI in detecting true source(s), boosting the F1-measure from a few percent, for the state-of-the-art NETSLEUTH, to approximately 50%.

【Keywords】: approximation algorithm; infection source identification

172. Information Diffusion at Workplace.

【Paper Link】【Pages】:1673-1682

【Authors】: Jiawei Zhang ; Philip S. Yu ; Yuanhua Lv ; Qianyi Zhan

【Abstract】: People nowadays need to spend a large amount of time on their work every day, and the workplace has become an important social occasion for effective communication and information exchange among employees. Besides traditional offline contacts (e.g., face-to-face meetings and telephone calls), to facilitate the communication and cooperation among employees, a new type of online social network has been launched inside the firewalls of many companies, named the "enterprise social network" (ESN). In this paper, we study the information diffusion among employees at the workplace via both online ESNs and offline contacts. This is formally defined as the IDE (Information Diffusion in Enterprise) problem. Several challenges need to be addressed in solving the IDE problem: (1) diffusion channel extraction from online ESNs and offline contacts; (2) effective aggregation of the information delivered via different diffusion channels; and (3) communication channel weighting and selection. A novel information diffusion model, Muse (Multi-source Multi-channel Multi-topic diffUsion SElection), is introduced in this paper to resolve these challenges. Extensive experiments conducted on real-world ESN and organizational chart datasets demonstrate the outstanding performance of Muse in addressing the IDE problem.

【Keywords】: data mining; diffusion channel selection; enterprise social networks

【Paper Link】【Pages】:1683-1692

【Authors】: Chonggang Song ; Wynne Hsu ; Mong-Li Lee

【Abstract】: The influence maximization (IM) problem asks for a set of k nodes in a given graph G, such that it can reach the largest expected number of remaining nodes in G. Existing methods have either required that the influence meet a certain deadline constraint, or be restricted to a specific geographical region. However, if an event organizer wants to disseminate some event information on a social platform, s/he would want to select a set of users who can influence the largest number of people within the neighborhood of the event location, and this influence should occur before the event takes place. Considering the location and deadline independently may lead to a less than optimal set of users. In this paper, we formalize the problem of targeted influence maximization in social networks. We adopt a login model where each user is associated with a login probability and can be influenced by his or her neighbors only when online. We develop a sampling based algorithm that returns a (1-1/e-ε)-approximate solution, as well as an efficient heuristic algorithm that focuses on nodes close to the target location. Experiments on real-world social network datasets demonstrate the effectiveness and efficiency of our proposed method.

【Keywords】: influence maximization

Session 8c: Applications 4

【Paper Link】【Pages】:1693-1702

【Authors】: Norases Vesdapunt ; Hector Garcia-Molina

【Abstract】: We study the problem of graph tracking with limited information. In this paper, we focus on updating a social graph snapshot. Say we have an existing partial snapshot, G1, of the social graph stored at some system. Over time G1 becomes out of date. We want to update G1 through a public API to the actual graph, restricted by the number of API calls allowed. Periodically recrawling every node in the snapshot is prohibitively expensive. We propose a scheme where we exploit indegrees and outdegrees to discover changes to the actual graph. When there is ambiguity, we probe the graph and verify edges. We propose a novel strategy designed for limited information that can be adapted to different levels of staleness. We evaluate our strategy against recrawling on real datasets and show that it saves an order of magnitude of API calls while introducing minimal errors.

【Keywords】: api call; competing graph; crawling; google+; graph update; limited information; snapshot; social graph; social graph snapshot; social network; social network crawling

175. Making Sense of Entities and Quantities in Web Tables.

【Paper Link】【Pages】:1703-1712

【Authors】: Yusra Ibrahim ; Mirek Riedewald ; Gerhard Weikum

【Abstract】: HTML tables and spreadsheets on the Internet or in enterprise intranets often contain valuable information, but are created ad-hoc. As a result, they usually lack systematic names for column headers and clear vocabulary for cell values. This limits the re-use of such tables and creates a huge heterogeneity problem when comparing or aggregating multiple tables. This paper aims to overcome this problem by automatically canonicalizing header names and cell values onto concepts, classes, entities and uniquely represented quantities registered in a knowledge base. To this end, we devise a probabilistic graphical model that captures coherence dependencies between cells in tables and candidate items in the space of concepts, entities and quantities. We give specific consideration to quantities which are mapped into a "measure, value, unit" triple over a taxonomy of physical (e.g. power consumption), monetary (e.g. revenue), temporal (e.g. date) and dimensionless (e.g. counts) measures. Our experiments with Web tables from diverse domains demonstrate the viability of our method and its benefits over baselines.

【Keywords】: information extraction; quantity annotation; semantic annotation; web tables

176. Influence Maximization for Complementary Goods: Why Parties Fail to Cooperate?

【Paper Link】【Pages】:1713-1722

【Authors】: Han-Ching Ou ; Chung-Kuang Chou ; Ming-Syan Chen

【Abstract】: We consider the problem where companies provide different types of products and want to promote their products through viral marketing simultaneously. Most previous works assume products are purely competitive. In contrast, our work considers that each pair of products has a relationship that can range from strongly competitive to strongly complementary. The problem is to maximize the spread size in the presence of different opponents with different relationships on the network. We propose the Interacting Influence Maximization (IIM) game to model such problems by extending the model of the Competitive Influence Maximization (CIM) game studied by previous works, which considers a purely competitive relationship. As for the theoretical approach, we prove that the Nash equilibrium of highly complementary products of different companies may still be very inefficient due to the selfishness of companies. We do so by introducing a well-known concept in game theory, called Price of Stability (PoS) of the extensive-form game. We prove that in any symmetric complementary IIM game with k selfish players, the overall spread of the products can be reduced to as little as 1/k of the optimal spread. Since companies may fail to cooperate with one another, we propose different competitive objective functions that companies may consider and deal with separately. We propose a scalable strategy for maximizing influence differences, called TOPBOSS, that is guaranteed to beat the first player in a single-round two-player second-move game. In the experiment, we first propose a learning method to learn the ILT model, which we propose for the IIM game, from both synthetic and real data to validate the effectiveness of ILT. We then show that the performance of several heuristic strategies in the traditional influence maximization problem can be improved by acquiring the knowledge of the existence of competitive/complementary products in the network. Finally, we compare TOPBOSS with different heuristic algorithms on real data and demonstrate the merits of TOPBOSS.

【Keywords】: diffusion model; game theory; influence maximization

177. Effective Spelling Correction for Eye-based Typing using domain-specific Information about Error Distribution.

【Paper Link】【Pages】:1723-1732

【Authors】: Raíza Hanada ; Maria da Graça Campos Pimentel ; Marco Cristo ; Fernando Anglada Lores

【Abstract】: Spelling correction methods, widely used and researched, usually assume a low error probability and a small number of errors per word. These assumptions do not hold in very noisy input scenarios such as eye-based typing systems. In particular for eye typing, insertion errors are much more common than in traditional input systems, due to specific sources of noise such as the eye tracker device, particular user behaviors, and intrinsic characteristics of eye movements. The large number of common errors in such a scenario makes the use of traditional approaches unfeasible. Moreover, the lack of a large corpus of errors makes it hard to adopt probabilistic approaches based on information extracted from real world data. We address these problems by combining estimates extracted from general error corpora with domain-specific knowledge about eye-based input. Further, by relaxing restrictions on edit distance specifically related to insertion errors, we propose an algorithm that is able to find dictionary word candidates in an attainable time. We show that our method achieves good results to rank the correct word, given the input stream and similar space and time restrictions, when compared to the state-of-the-art baselines.

【Keywords】: eye-based typing; minimum message length; mor-fraenkel; noisy-channel model; spelling correction

Session 8d: Algorithms 4

178. Computing and Summarizing the Negative Skycube.

【Paper Link】【Pages】:1733-1742

【Authors】: Nicolas Hanusse ; Patrick Kamnang Wanko ; Sofian Maabout

【Abstract】: Given a table T with a set of dimensions D, the skycube of T is the union of all skylines obtained by considering each of the subsets of D (subspaces). The number of these skylines is exponential w.r.t. D. To make the skycube practically useful, two lines of research have been pursued so far: the first aims to propose efficient algorithms for computing it, and the second considers that the skycube is either too large to be computed in a reasonable time or requires too much memory space to be stored. The latter therefore proposes skycube summarization techniques to reduce time and space consumption. Intuitively, previous efforts have been devoted to computing or summarizing the following information: "for every tuple t, list the skylines to which t belongs". In this paper, we consider the complementary statement, i.e., "for every tuple t, list the skylines to which t does not belong". This is what we call the negative skycube. Despite the apparent equivalence between these two statements, our analysis and extensive experiments show that these two points of view do not lead to the same behavior of the related algorithms. More specifically, our proposal shows that (i) the negative summary can be obtained much faster than state-of-the-art techniques for positive summaries, (ii) in general, it consumes less space, (iii) skyline query evaluation using this summary is much faster, (iv) the positive skycube can be obtained much more rapidly than with state-of-the-art algorithms, and (v) it can be used for a larger class of queries, namely k-domination skylines.

【Keywords】: optimization; query; skyline; algorithms
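
To make the definitions in the abstract above concrete, the following is a minimal brute-force sketch of a skyline, and of the "negative" view that lists, for every tuple, the subspaces whose skyline does not contain it. It is not the authors' algorithm or summary structure; the function names, the smaller-is-better convention, and the toy table are assumptions made for illustration.

```python
from itertools import combinations


def dominates(a, b, dims):
    """True if a is no worse than b on every dimension in dims and
    strictly better on at least one (assumption: smaller is better)."""
    return all(a[d] <= b[d] for d in dims) and any(a[d] < b[d] for d in dims)


def skyline(table, dims):
    """Indices of the tuples not dominated by any other tuple on dims."""
    return {
        i
        for i, t in enumerate(table)
        if not any(dominates(u, t, dims) for j, u in enumerate(table) if j != i)
    }


def negative_skycube(table, n_dims):
    """For each tuple index, list the subspaces whose skyline excludes it.

    Brute force: enumerates all 2^n - 1 non-empty subspaces.
    """
    neg = {i: [] for i in range(len(table))}
    for size in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), size):
            sky = skyline(table, dims)
            for i in neg:
                if i not in sky:
                    neg[i].append(dims)
    return neg


# Hypothetical toy table with two dimensions.
table = [(1, 3), (2, 2), (3, 1), (3, 3)]
neg = negative_skycube(table, 2)
```

In this toy table the last tuple is excluded from every skyline, while the first is excluded only from the skyline on the second dimension.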

179. Efficient Orthogonal Non-negative Matrix Factorization over Stiefel Manifold.

【Paper Link】【Pages】:1743-1752

【Authors】: Wei Emma Zhang ; Mingkui Tan ; Quan Z. Sheng ; Lina Yao ; Qinfeng Shi

【Abstract】: Orthogonal Non-negative Matrix Factorization (ONMF) approximates a data matrix X by the product of two lower dimensional factor matrices: X ≈ UV^T, with one of them orthogonal. ONMF has been widely applied for clustering, but it often suffers from high computational cost due to the orthogonality constraint. In this paper, we propose a method, called Nonlinear Riemannian Conjugate Gradient ONMF (NRCG-ONMF), which updates U and V alternately and preserves the orthogonality of U while achieving fast convergence speed. Specifically, in order to update U, we develop a Nonlinear Riemannian Conjugate Gradient (NRCG) method on the Stiefel manifold using Barzilai-Borwein (BB) step size. For updating V, we use a closed-form solution under non-negativity constraint. Extensive experiments on both synthetic and real-world data sets show consistent superiority of our method over other approaches in terms of orthogonality preservation, convergence speed and clustering performance.

【Keywords】: clustering; orthogonal nmf; stiefel manifold

180. Paired Restricted Boltzmann Machine for Linked Data.

【Paper Link】【Pages】:1753-1762

【Authors】: Suhang Wang ; Jiliang Tang ; Fred Morstatter ; Huan Liu

【Abstract】: Restricted Boltzmann Machines (RBMs) are widely adopted unsupervised representation learning methods and have powered many data mining tasks such as collaborative filtering and document representation. Recently, linked data that contains both attribute and link information has become ubiquitous in various domains. For example, social media data is inherently linked via social relations and web data is networked via hyperlinks. It is evident from recent work that link information can enhance a number of real-world applications such as clustering and recommendations. Therefore, link information has the potential to advance RBMs for better representation learning. However, the majority of existing RBMs have been designed for independent and identically distributed data and are unequipped for linked data. In this paper, we aim to design a new type of Restricted Boltzmann Machines that takes advantage of linked data. In particular, we propose a paired Restricted Boltzmann Machine (pRBM), which is able to leverage the attribute and link information of linked data for representation learning. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework pRBM.

【Keywords】: linked data; restricted boltzmann machine; unsupervised representation learning

181. LDA Revisited: Entropy, Prior and Convergence.

【Paper Link】【Pages】:1763-1772

【Authors】: Jianwei Zhang ; Jia Zeng ; Mingxuan Yuan ; Weixiong Rao ; Jianfeng Yan

【Abstract】: Inference algorithms of latent Dirichlet allocation (LDA), either for small or big data, can be broadly categorized into expectation-maximization (EM), variational Bayes (VB) and collapsed Gibbs sampling (GS). Looking for a unified understanding of these different inference algorithms is currently an important open problem. In this paper, we revisit these three algorithms from the entropy perspective, and show that EM can achieve the best predictive perplexity (a standard performance metric for LDA accuracy) by minimizing directly the cross entropy between the observed word distribution and LDA's predictive distribution. Moreover, EM can change the entropy of LDA's predictive distribution through tuning priors of LDA, such as the Dirichlet hyperparameters and the number of topics, to minimize the cross entropy with the observed word distribution. Finally, we propose the adaptive EM (AEM) algorithm that converges faster and is more accurate than the current state-of-the-art SparseLDA [20] and AliasLDA [12] from small to big data and LDA models. The core idea is that the number of active topics, measured by the residuals between E-steps at successive iterations, decreases significantly, leading to the amortized O(1) time complexity in terms of the number of topics. The open source code of AEM is available at GitHub.

【Keywords】: adaptive em algorithms; big data; convergence; entropy; latent dirichlet allocation; prior

Session 8e: High Performance Big Data 4

182. Cost-Effective Stream Join Algorithm on Cloud System.

【Paper Link】【Pages】:1773-1782

【Authors】: Junhua Fang ; Rong Zhang ; Xiaotong Wang ; Tom Z. J. Fu ; Zhenjie Zhang ; Aoying Zhou

【Abstract】: The matrix-based scheme (Join-Matrix) can perfectly support distributed stream joins, especially for arbitrary join predicates, because it guarantees that any tuples from two streams meet each other. However, the dynamic and unpredictable nature of streams requires quick action on scheme changing. Otherwise, it may lead to degradation of system throughput and an increase in processing latency, with a waste of system resources such as CPU and memory. Since the Join-Matrix model has a fixed processing architecture with replicated data, these adverse effects are magnified. Therefore, it is urgent to find a solution that preserves the advantages of the Join-Matrix model and makes good use of computation resources under scheme changing. In this paper, we propose a cost-effective stream join algorithm, which ensures the adaptability of Join-Matrix but with lower resource consumption. Specifically, a varietal matrix generation algorithm is proposed to generate an irregular matrix scheme for assigning the minimal number of tasks; a lightweight migration algorithm is designed to ensure state migration at a low cost; a complete load balance process framework is described to guarantee correctness during scheme changing. We conduct extensive experiments to compare our method with baseline systems on both benchmarks and real workloads, and explain the results in detail.

【Keywords】: cost effective; distributed stream join; matrix model; theta-join

183. Leveraging Multiple GPUs and CPUs for Graphlet Counting in Large Networks.

【Paper Link】【Pages】:1783-1792

【Authors】: Ryan A. Rossi ; Rong Zhou

【Abstract】: Massively parallel architectures such as the GPU are becoming increasingly important due to the recent proliferation of data. In this paper, we propose a key class of hybrid parallel graphlet algorithms that leverages multiple CPUs and GPUs simultaneously for computing k-vertex induced subgraph statistics (called graphlets). In addition to the hybrid multi-core CPU-GPU framework, we also investigate single GPU methods (using multiple cores) and multi-GPU methods that leverage all available GPUs simultaneously for computing induced subgraph statistics. Both methods leverage GPU devices only, whereas the hybrid multi-core CPU-GPU framework leverages all available multi-core CPUs and multiple GPUs for computing graphlets in large networks. Compared to recent approaches, our methods are orders of magnitude faster, while also more cost effective enjoying superior performance per capita and per watt. In particular, the methods are up to 300 times faster than a recent state-of-the-art method. To the best of our knowledge, this is the first work to leverage multiple CPUs and GPUs simultaneously for computing induced subgraph statistics.

【Keywords】: GPU computing; feature learning; graph classification; graph kernels; graph mining; graph representation learning; graphlet decomposition; graphlets; heterogeneous computing; induced subgraphs; multi-GPU; network motifs; orbits; parallel algorithms; relational learning; role discovery

184. Scalable Local-Recoding Anonymization using Locality Sensitive Hashing for Big Data Privacy Preservation.

【Paper Link】【Pages】:1793-1802

【Authors】: Xuyun Zhang ; Christopher Leckie ; Wanchun Dou ; Jinjun Chen ; Kotagiri Ramamohanarao ; Zoran Salcic

【Abstract】: While cloud computing has become an attractive platform for supporting data intensive applications, a major obstacle to the adoption of cloud computing in sectors such as health and defense is the privacy risk associated with releasing datasets to third-parties in the cloud for analysis. A widely-adopted technique for data privacy preservation is to anonymize data via local recoding. However, most existing local-recoding techniques are either serial or distributed without directly optimizing scalability, thus rendering them unsuitable for big data applications. In this paper, we propose a highly scalable approach to local-recoding anonymization in cloud computing, based on Locality Sensitive Hashing (LSH). Specifically, a novel semantic distance metric is presented for use with LSH to measure the similarity between two data records. Then, LSH with the MinHash function family can be employed to divide datasets into multiple partitions for use with MapReduce to parallelize computation while preserving similarity. By using our efficient LSH-based scheme, we can anonymize each partition through the use of a recursive agglomerative k-member clustering algorithm. Extensive experiments on real-life datasets show that our approach significantly improves the scalability and time-efficiency of local-recoding anonymization by orders of magnitude over existing approaches.

【Keywords】: LSH; big data; cloud; mapreduce; privacy preservation
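
The partitioning step mentioned in the abstract above, MinHash-based LSH that places similar records in the same partition, can be sketched as follows. This is an illustrative sketch only: it uses plain set similarity rather than the paper's semantic distance metric, a single band instead of a full banding scheme, and every name and parameter here (`minhash_signature`, `lsh_partition`, `num_hashes`, `band_size`) is an assumption.

```python
import hashlib


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=16):
    """MinHash signature of a non-empty set of tokens."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_partition(records, num_hashes=16, band_size=4):
    """Group record ids whose first signature band is identical.

    records maps a record id to a non-empty set of attribute tokens.
    Records with high Jaccard similarity are likely to share a band.
    """
    buckets = {}
    for rid, tokens in records.items():
        sig = minhash_signature(tokens, num_hashes)
        buckets.setdefault(sig[:band_size], []).append(rid)
    return list(buckets.values())


# Hypothetical records: "a" and "b" have identical token sets.
records = {
    "a": {"age:30-39", "zip:476**"},
    "b": {"zip:476**", "age:30-39"},
    "c": {"age:60-69"},
}
partitions = lsh_partition(records)
```

Each resulting partition could then be processed independently, for example by one MapReduce task per partition.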

185. Approximate Discovery of Functional Dependencies for Large Datasets.

【Paper Link】【Pages】:1803-1812

【Authors】: Tobias Bleifuß ; Susanne Bülow ; Johannes Frohnhofen ; Julian Risch ; Georg Wiese ; Sebastian Kruse ; Thorsten Papenbrock ; Felix Naumann

【Abstract】: Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution space, so that even today's fastest FD discovery algorithms are limited to small datasets only, due to long runtimes and high memory consumptions. To overcome this situation, we propose an approximate discovery strategy that sacrifices possibly little result correctness in return for large performance improvements. In particular, we introduce AID-FD, an algorithm that approximately discovers FDs within runtimes up to orders of magnitude faster than state-of-the-art FD discovery algorithms. We evaluate and compare our performance results with a focus on scalability in runtime and memory, and with measures for completeness, correctness, and minimality.

【Keywords】: aid-fd; approximate discovery; dependency discovery; functional dependencies; negative cover inversion
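
As background for the abstract above, a functional dependency X -> Y holds on a table when any two rows that agree on X also agree on Y. The sketch below shows an exact check and a sampled check that trades correctness for speed. It is not AID-FD; the sampling strategy, function names, and example rows are assumptions, and a sampled check can report a dependency that is violated by rows outside the sample.

```python
import random


def fd_holds(rows, lhs, rhs):
    """Exact check: True if lhs -> rhs holds on rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first value seen for a key must match all later ones.
        if seen.setdefault(key, val) != val:
            return False
    return True


def fd_holds_on_sample(rows, lhs, rhs, sample_size, seed=0):
    """Approximate check on a random sample (may yield false positives)."""
    rng = random.Random(seed)
    sample = rows if sample_size >= len(rows) else rng.sample(rows, sample_size)
    return fd_holds(sample, lhs, rhs)


# Hypothetical rows: zip determines city, but city does not determine zip.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10117", "city": "Berlin"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
]
```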

Session 8f: Industry Session VIII 4

186. On Structural Health Monitoring Using Tensor Analysis and Support Vector Machine with Artificial Negative Data.

【Paper Link】【Pages】:1813-1822

【Authors】: Prasad Cheema ; Nguyen Lu Dang Khoa ; Mehrisadat Makki Alamdari ; Wei Liu ; Yang Wang ; Fang Chen ; Peter Runcie

【Abstract】: Structural health monitoring is a condition-based technology to monitor infrastructure using sensing systems. Since we usually only have data associated with the healthy state of a structure, one-class approaches are more practical. However, tuning the parameters for one-class techniques (like one-class Support Vector Machines) still remains a relatively open and difficult problem. Moreover, in structural health monitoring, data are usually multi-way, highly redundant and correlated, so a matrix-based two-way approach cannot capture all these relationships and correlations together. Tensor analysis allows us to analyse the multi-way vibration data at the same time. In our approach, we propose the use of tensor learning and support vector machines with artificial negative data generated by density estimation techniques for damage detection, localization and estimation in a one-class manner. The artificial negative data can help in tuning SVM parameters and calibrating probabilistic outputs, which is not possible with one-class SVM. The proposed method shows promising results using data from laboratory-based structures and also with data collected from the Sydney Harbour Bridge, one of the most iconic structures in Australia. The method works better than the one-class approach and the approach without using tensor analysis.

【Keywords】: artificial negative data; damage identification; density estimation; support vector machine; tensor analysis

187. A Self-Learning and Online Algorithm for Time Series Anomaly Detection, with Application in CPU Manufacturing.

【Paper Link】【Pages】:1823-1832

【Authors】: Xing Wang ; Jessica Lin ; Nital Patel ; Martin Braun

【Abstract】: The problem of anomaly detection in time series has received a lot of attention in the past two decades. However, existing techniques cannot locate where the anomalies are within anomalous time series, or they require users to provide the length of potential anomalies. To address these limitations, we propose a self-learning online anomaly detection algorithm that automatically identifies anomalous time series, as well as the exact locations where the anomalies occur in the detected time series. We evaluate our approach on several real datasets, including two CPU manufacturing datasets from Intel. We demonstrate that our approach can successfully detect the correct anomalies without requiring any prior knowledge about the data.

【Keywords】: anomaly detection; self-learning; time series

188. Deep Match between Geology Reports and Well Logs Using Spatial Information.

【Paper Link】【Pages】:1833-1842

【Authors】: Bin Tong ; Martin Klinkigt ; Makoto Iwayama ; Yoshiyuki Kobayashi ; Anshuman Sahu ; Ravigopal Vennelakanti

【Abstract】: In the shale oil & gas industry, operators are looking toward big data and new analytics tools and techniques to optimize operations and reduce cost. Formation evaluation is one of the most crucial steps before the fracturing operation. To assist engineers in understanding the subsurface and in turn make optimal operations, we focus on learning semantic relations between geology reports and well logs, which are collected during down-hole drilling. The challenges are how to represent the features of the geology reports and the well logs collected at measured depths and how to effectively embed them into a common feature space. We propose both linear and nonlinear (artificial neural network) models to achieve such an embedding. Extensive validations are conducted on public well data of North Dakota in the United States. We empirically discover that both geology reports and well logs follow a neighborhood property measured by geological distance. We show that this spatial information is highly effective in both the linear and nonlinear models and our nonlinear model with the spatial information performs the best among the state-of-the-art methods.

【Keywords】: geology report; learning to match; neural network; well log

189. MIST: Missing Person Intelligence Synthesis Toolkit.

【Paper Link】【Pages】:1843-1867

【Authors】: Elham Shaabani ; Hamidreza Alvari ; Paulo Shakarian ; J. E. Kelly Snyder

【Abstract】: Each day, approximately 500 missing persons cases occur that go unsolved/unresolved in the United States. The non-profit organization known as the Find Me Group (FMG), led by former law enforcement professionals, is dedicated to solving or resolving these cases. This paper introduces the Missing Person Intelligence Synthesis Toolkit (MIST), which leverages a data-driven variant of geospatial abductive inference. This system takes search locations provided by a group of experts and rank-orders them based on the probability assigned to areas according to the prior performance of the experts taken as a group. We evaluate our approach against the current practices employed by the Find Me Group and find that it significantly reduces the search area, leading to a reduction of 31 square miles over the 24 cases we examined in our experiments. Currently, we are using MIST to aid the Find Me Group in an active missing person case.

【Keywords】: abductive inference; geospatial abduction; law enforcement; missing person

Poster Session I: Short Papers 55

190. Skipping Word: A Character-Sequential Representation based Framework for Question Answering.

【Paper Link】【Pages】:1869-1872

【Authors】: Lingxun Meng ; Yan Li ; Mengyi Liu ; Peng Shu

【Abstract】: Recent works using artificial neural networks based on distributed word representations greatly boost the performance of various natural language learning tasks, especially question answering. However, they also bring some attendant problems, such as corpus selection for embedding learning, dictionary transformation for different learning tasks, etc. In this paper, we propose to straightforwardly model sentences by means of character sequences, and then utilize convolutional neural networks to integrate character embedding learning together with point-wise answer selection training. Compared with deep models pre-trained with a word embedding (WE) strategy, our character-sequential representation (CSR) based method shows a much simpler procedure and more stable performance across different benchmarks. Extensive experiments on two benchmark answer selection datasets exhibit competitive performance compared with the state-of-the-art methods.

【Keywords】: convolutional neural networks; deep learning; semantic matching; word embeddings

191. Towards Time-Discounted Influence Maximization.

【Paper Link】【Pages】:1873-1876

【Authors】: Arijit Khan

【Abstract】: The classical influence maximization (IM) problem in social networks does not distinguish between whether a campaign goes viral in a week or in a year. From the practical standpoint, however, campaigns for a new technology or an upcoming movie must be spread as quickly as possible, otherwise they will be obsolete. To this end, we formulate and investigate the novel problem of maximizing the time-discounted influence spread in a social network, that is, the campaigner is interested in both "when" and "how likely" a user would be influenced. In particular, we assume that the campaigner has a utility function which monotonically decreases with the time required for a user to get influenced, measured from the activation of the seed nodes. The problem that we solve in this paper is to maximize the expected aggregated value of this utility function over all network users. This is a novel and relevant problem that, surprisingly, has not been studied before. Time-discounted influence maximization (TDIM), being a generalization of the classical IM, still remains NP-hard. However, our main contribution is to prove the sub-modularity of the objective function for any monotonically decreasing function of time, under a variety of influence cascading models, e.g., the independent cascade, linear threshold, and maximum influence arborescence models, thereby designing approximate algorithms with theoretical performance guarantees. We also illustrate that the existing optimization techniques (e.g., CELF) for influence maximization are more efficient over TDIM. Our experimental results demonstrate the effectiveness of our solutions over several baselines including the classical influence maximization algorithms.

【Keywords】: influence maximization; information diffusion time; social networks
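
The objective described in the abstract above, the expected sum over users of a utility that decreases with activation time, can be estimated by Monte Carlo simulation. The sketch below does this under the independent cascade model and picks seeds greedily. It is an illustration, not the author's implementation: the graph encoding, the convention that seeds activate at time 0, the function names, and the example utility `0.5 ** t` are all assumptions.

```python
import random


def simulate_ic(graph, seeds, rng):
    """One independent-cascade run; returns {node: activation time step}.

    graph maps a node to a list of (neighbor, activation probability).
    """
    activated = {s: 0 for s in seeds}
    frontier = list(seeds)
    t = 0
    while frontier:
        t += 1
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                if v not in activated and rng.random() < p:
                    activated[v] = t
                    next_frontier.append(v)
        frontier = next_frontier
    return activated


def discounted_spread(graph, seeds, utility, runs=1000, seed=0):
    """Monte Carlo estimate of the expected time-discounted spread."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        times = simulate_ic(graph, seeds, rng)
        total += sum(utility(t) for t in times.values())
    return total / runs


def greedy_seeds(graph, k, utility, runs=200, seed=0):
    """Greedy selection: repeatedly add the node with the best estimate."""
    nodes = set(graph) | {v for nbrs in graph.values() for v, _ in nbrs}
    chosen = []
    for _ in range(k):
        candidates = sorted(nodes - set(chosen))
        best = max(
            candidates,
            key=lambda n: discounted_spread(graph, chosen + [n], utility, runs, seed),
        )
        chosen.append(best)
    return chosen


# Hypothetical chain a -> b -> c with certain activation.
graph = {"a": [("b", 1.0)], "b": [("c", 1.0)]}
halving = lambda t: 0.5 ** t
```

Because the abstract states that the objective is submodular for any monotonically decreasing utility, a greedy procedure of this shape is the natural candidate for an approximation guarantee.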

192. Quantifying Query Ambiguity with Topic Distributions.

【Paper Link】【Pages】:1877-1880

【Authors】: Yuki Yano ; Yukihiro Tagami ; Akira Tajima

【Abstract】: Query ambiguity is a useful metric for search engines to understand users' intents. Existing methods quantify query ambiguity by calculating an entropy of clicks. These methods assign each click to a one-hot vector corresponding to some mutually exclusive groups. However, they cannot incorporate non-obvious structures such as similarity among documents. In this paper, we propose a new approach for quantifying query ambiguity using topic distributions. We show that it is a natural extension of an existing entropy-based method. Further, we use our approach to achieve topic-based extensions of major existing entropy-based methods. Through an evaluation using e-commerce search logs combined with human judgments, our approach successfully extended existing entropy-based methods and improved the quality of query ambiguity measurements.

【Keywords】: query ambiguity; search intent; topic distribution
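
The abstract above contrasts click entropy over mutually exclusive groups with an entropy computed from topic distributions. The sketch below shows one plausible reading: mix the per-document topic distributions by click share and take the entropy of the mixture. The exact formulation in the paper may differ; the function names and the mixture construction are assumptions.

```python
import math


def click_entropy(click_counts):
    """Entropy (in bits) of the click distribution over documents."""
    total = sum(click_counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in click_counts.values() if c > 0
    )


def topic_entropy(click_counts, doc_topics):
    """Entropy (in bits) of the click-weighted mixture of topic distributions.

    doc_topics maps each document to a {topic: probability} dict.
    """
    total = sum(click_counts.values())
    mixture = {}
    for doc, c in click_counts.items():
        for topic, p in doc_topics[doc].items():
            mixture[topic] = mixture.get(topic, 0.0) + (c / total) * p
    return -sum(p * math.log2(p) for p in mixture.values() if p > 0)


# Hypothetical query with clicks split evenly over two documents.
clicks = {"d1": 2, "d2": 2}
same_topic = {"d1": {"shoes": 1.0}, "d2": {"shoes": 1.0}}
one_hot = {"d1": {"shoes": 1.0}, "d2": {"phones": 1.0}}
```

Under this reading, one-hot topic distributions with a distinct topic per document reproduce click entropy, while two clicked documents about the same topic yield zero ambiguity.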

193. ASNets: A Benchmark Dataset of Aligned Social Networks for Cross-Platform User Modeling.

【Paper Link】【Pages】:1881-1884

【Authors】: Xuezhi Cao ; Yong Yu

【Abstract】: Aligning heterogeneous online social networks is a highly beneficial task proposed in recent years. It aims at automatically aligning accounts from multiple networks according to whether they are held by the same natural person. Aligning the networks can improve personalized services through cross-platform user modeling, and is a prerequisite for cross-network analysis. However, there is currently no public benchmark dataset available because the task is so recent. As performance on this task depends highly on the dataset, experiments using different private datasets are not directly comparable. Therefore, in this paper we propose ASNets, a benchmark dataset with two sets of aligned social networks. With this dataset, we can now properly evaluate different approaches and compare them fairly. The two sets of aligned networks have 328,224 and 141,614 aligned users respectively, covering multilingual usage (Chinese and English) and various types of social networks including general-purpose networks, review sites and microblogging sites. We describe the collection methodology and statistics in detail, and evaluate several state-of-the-art network aligning approaches. Besides introducing the dataset, we further propose several potential research directions that benefit from ASNets.

【Keywords】: benchmark dataset; network alignment; user modeling

194. Data Locality in Graph Engines: Implications and Preliminary Experimental Results.

【Paper Link】【Pages】:1885-1888

【Authors】: Yong-Yeon Jo ; Jiwon Hong ; Myung-Hwan Jang ; Jae-Geun Bang ; Sang-Wook Kim

【Abstract】: The size of graphs has dramatically increased. Graph engines for a single machine have emerged to process these graphs efficiently. However, existing engines in the previous literature have overlooked data locality, which is an imperative factor in improving their performance. In this paper, we show the importance of data locality by running graph algorithms on single-machine graph engines.

【Keywords】: data layout; data locality; graph engines; single machine

195. Active Zero-Shot Learning.

【Paper Link】【Pages】:1889-1892

【Authors】: Sihong Xie ; Shaoxiong Wang ; Philip S. Yu

【Abstract】: In multi-label classification in the big data age, the number of classes can be in the thousands, and obtaining sufficient training data for each class is infeasible. Zero-shot learning aims at predicting a large number of unseen classes using only labeled data from a small set of classes and external knowledge about class relations. However, previous zero-shot learning models passively accept labeled data collected beforehand, relinquishing the opportunity to select the proper set of classes for which to request labeled data and to optimize the performance of unseen class prediction. To resolve this issue, we propose an active class selection strategy to intelligently query labeled data for a parsimonious set of informative classes. We demonstrate two desirable probabilistic properties of the proposed method that can facilitate unseen class prediction. Experiments on 4 text datasets demonstrate that the active zero-shot learning algorithm is superior to a wide spectrum of baselines. We indicate promising future directions at the end of this paper.

【Keywords】: machine learning

196. Learning to Account for Good Abandonment in Search Success Metrics.

【Paper Link】【Pages】:1893-1896

【Authors】: Madian Khabsa ; Aidan C. Crook ; Ahmed Hassan Awadallah ; Imed Zitouni ; Tasos Anastasakos ; Kyle Williams

【Abstract】: Abandonment in web search has been widely used as a proxy to measure user satisfaction. Initially it was considered a signal of dissatisfaction; however, with search engines moving towards providing answer-like results, a new category of abandonment was introduced and referred to as Good Abandonment. Predicting good abandonment is a hard problem and has been the subject of several previous studies. All those studies have focused, though, on predicting good abandonment in offline settings using manually labeled data. Thus, how to build an online metric that accounts for good abandonment has remained a challenge. In this work we describe how a search success metric can be augmented to account for good abandonment sessions using a machine-learned metric that depends on the user's viewport information. We use real user traffic from millions of users to evaluate the proposed metric in an A/B experiment. We show that taking good abandonment into consideration has a significant effect on the overall performance of the online metric.

【Keywords】: good abandonment; success metric

197. Modeling and Predicting Popularity Dynamics via an Influence-based Self-Excited Hawkes Process.

【Paper Link】【Pages】:1897-1900

【Authors】: Peng Bao

【Abstract】: Modeling and predicting the popularity dynamics of individual user-generated items on online social networks has important implications in a wide range of areas. The challenge of this problem comes from the inequality of the popularity of content and the numerous complex factors. Existing works mainly focus on exploring relevant factors for prediction and fitting the time series of popularity dynamics to a certain class of functions, while ignoring the underlying arrival process of attention. Also, the exogenous effect of user activity variation on the platform has been neglected. In this paper, we propose a probabilistic model using an influence-based self-excited Hawkes process (ISEHP) to characterize the process through which individual microblogs gain their popularity. This model explicitly captures three ingredients: the intrinsic attractiveness of a microblog with exponential time decay, the user-specific triggering effect of each forwarding based on the endogenous influence among users, and the exogenous effect from the platform. We validate the ISEHP model by applying it to Sina Weibo, the most popular microblogging network in China. Experimental results demonstrate that our proposed model consistently outperforms existing prediction models.

【Keywords】: influence; microblogs; periodicity; popularity dynamics; self-excited hawkes process; survival theory
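
For readers unfamiliar with self-exciting processes, the sketch below evaluates the conditional intensity of a generic Hawkes process with an exponential kernel, where each past event (for example, a forwarding) raises the rate of future events by a per-event weight that decays over time. This is the textbook form, not the ISEHP model itself; the constant base rate, the per-event weights standing in for user influence, and all names are assumptions.

```python
import math


def hawkes_intensity(t, events, base_rate=0.1, decay=1.0):
    """Intensity at time t of a Hawkes process with exponential kernel.

    events is a list of (event_time, weight) pairs; only events strictly
    before t contribute: base_rate + sum(weight * exp(-decay * (t - time))).
    """
    return base_rate + sum(
        w * math.exp(-decay * (t - ti)) for ti, w in events if ti < t
    )


# Hypothetical forwardings: an influential user at t=0, a minor one at t=2.
events = [(0.0, 2.0), (2.0, 0.5)]
rate_at_1 = hawkes_intensity(1.0, events)
rate_at_3 = hawkes_intensity(3.0, events)
```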

198. Incorporate Group Information to Enhance Network Embedding.

【Paper Link】【Pages】:1901-1904

【Authors】: Jifan Chen ; Qi Zhang ; Xuanjing Huang

【Abstract】: The problem of representing large-scale networks with low-dimensional vectors has received considerable attention in recent years. Beyond vertices and edges, a variety of networks also contain information about groups or communities. For example, on Facebook, in addition to users and the follower-followee relations between them, users can also create and join groups. However, previous studies have rarely utilized this valuable information to generate embeddings of vertices. In this paper, we investigate a novel method for learning network embeddings with valuable group information for large-scale networks. The proposed methods take both the inner structures of the groups and the information across groups into consideration. Experimental results demonstrate that the embeddings generated by the proposed methods significantly outperform state-of-the-art network embedding methods on two real-world networks of different scales.

【Keywords】: group information; large scale; network embedding

199. Exploiting Cluster-based Meta Paths for Link Prediction in Signed Networks.

【Paper Link】【Pages】:1905-1908

【Authors】: Jiangfeng Zeng ; Ke Zhou ; Xiao Ma ; Fuhao Zou ; Hua Wang

【Abstract】: Many online social networks can be described by signed networks, where positive links signify friendship, trust and liking, while negative links indicate enmity, distrust and dislike. Predicting the sign of the links in these networks has attracted a great deal of attention in the areas of friendship recommendation and trust relationship prediction. Existing methods for sign prediction tend to rely on path-based features, which are limited by the sparsity of the network. To solve this issue, we introduce a novel sign prediction model that exploits cluster-based meta paths, which can take advantage of both local and global information of the input networks. First, features based on cluster-based meta paths are constructed by incorporating the newly generated clusters obtained by hierarchically clustering the input networks. Then, a logistic regression classifier is employed to train the model and predict the hidden signs of the links. Extensive experiments on Epinions and Slashdot datasets demonstrate the efficiency of our proposed method in terms of Accuracy and Coverage.

【Keywords】: cluster-based meta path; link prediction; signed networks

200. Predicting Importance of Historical Persons using Wikipedia.

【Paper Link】【Pages】:1909-1912

【Authors】: Adam Jatowt ; Daisuke Kawai ; Katsumi Tanaka

【Abstract】: Wikipedia contains a lot of contemporary as well as history-related information, and given its vast coverage and richness, it can be used to rank entities in a variety of different ways. In this work, we are interested in utilizing Wikipedia for judging a historical person's importance. Based on two well-known lists of the most important people in the last millennium, we look closely into factors that determine the significance of historical persons. We predict a person's importance using six classifiers equipped with features derived from link structure, visit logs and article content.

【Keywords】: historical analysis; person influence; wikipedia

201. Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks.

【Paper Link】【Pages】:1913-1916

【Authors】: Jinfeng Rao ; Hua He ; Jimmy J. Lin

【Abstract】: We study answer selection for question answering, in which given a question and a set of candidate answer sentences, the goal is to identify the subset that contains the answer. Unlike previous work which treats this task as a straightforward pointwise classification problem, we model this problem as a ranking task and propose a pairwise ranking approach that can directly exploit existing pointwise neural network models as base components. We extend the Noise-Contrastive Estimation approach with a triplet ranking loss function to exploit interactions in triplet inputs over the question paired with positive and negative examples. Experiments on TrecQA and WikiQA datasets show that our approach achieves state-of-the-art effectiveness without the need for external knowledge sources or feature engineering.

【Keywords】: answer selection; learning to rank; neural networks; noise-contrastive estimation; question answering
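The abstract above does not give the exact loss, so the following is only a minimal sketch of a common max-margin triplet ranking loss, assuming each (question, answer) pair has already been scored by some pointwise model. The function names and the default `margin` are illustrative assumptions, not the authors' implementation.

```python
def triplet_hinge_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss for one (question, positive answer, negative answer) triplet.

    The loss is zero once the positive answer outscores the negative one
    by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, margin - pos_score + neg_score)


def mean_triplet_loss(score_pairs, margin=1.0):
    """Average the hinge loss over (pos_score, neg_score) pairs."""
    losses = [triplet_hinge_loss(p, n, margin) for p, n in score_pairs]
    return sum(losses) / len(losses)
```

A pairwise objective of this kind only constrains the relative order of answers to the same question, which is why an existing pointwise scorer can be reused as the base component.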

【Paper Link】【Pages】:1917-1920

【Authors】: Qinzhe Zhang ; Jia Wu ; Hong Yang ; Weixue Lu ; Guodong Long ; Chengqi Zhang

【Abstract】: Social recommendation has been widely studied in recent years. Existing social recommendation models use various explicit pieces of social information as regularization terms in recommendation; for instance, social links are treated as new constraints. However, social influence, an implicit source of information in social networks, is seldom considered, even though it often drives recommendations in social networks. In this paper, we introduce a new global and local influence-based social recommendation model. Based on the observation that user purchase behaviour is influenced by both globally influential nodes and the locally influential nodes of the user, we formulate the global and local influence as regularization terms and incorporate them into a matrix factorization-based recommendation model. Experimental results on large data sets demonstrate the performance of the proposed method.

【Keywords】: dual social influence; social recommendation

203. Tag-Aware Personalized Recommendation Using a Deep-Semantic Similarity Model with Negative Sampling.

【Paper Link】【Pages】:1921-1924

【Authors】: Zhenghua Xu ; Cheng Chen ; Thomas Lukasiewicz ; Yishu Miao ; Xiangwu Meng

【Abstract】: With the rapid growth of social tagging systems, much effort has been devoted to tag-aware personalized recommendation. However, due to uncontrolled vocabularies, social tags are usually redundant, sparse, and ambiguous. In this paper, we propose a deep neural network approach to solve this problem by mapping both the tag-based user and item profiles to an abstract deep feature space, where the deep-semantic similarities between users and their target items (resp., irrelevant items) are maximized (resp., minimized). Due to the huge number of online items, training this model is usually computationally expensive in real-world settings. Therefore, we introduce negative sampling, which significantly increases the model's training efficiency (109.6 times quicker) and ensures scalability in practice. Experimental results show that our model can significantly outperform the state-of-the-art baselines in tag-aware personalized recommendation: e.g., its mean reciprocal rank is between 5.7 and 16.5 times better than the baselines.

【Keywords】: deep neural network; deep-semantic similarity; negative sampling; tag-aware personalized recommendation

204. Personalized Semantic Word Vectors.

【Paper Link】【Pages】:1925-1928

【Authors】: Javid Ebrahimi ; Dejing Dou

【Abstract】: Distributed word representations are able to capture syntactic and semantic regularities in text. In this paper, we present a word representation scheme that incorporates authorship information. While maintaining similarity among related words in the induced distributed space, our word vectors can be effectively used for some text classification tasks too. We build on a log-bilinear document model (lbDm), which extracts document features, and word vectors based on word co-occurrence counts. First, we propose a log-bilinear author model (lbAm), which contains an additional author matrix. We show that by directly learning author feature vectors, as opposed to document vectors, we can learn better word representations for the authorship attribution task. Furthermore, authorship information has been found to be useful for sentiment classification. We enrich the author model with a sentiment tensor, and demonstrate the effectiveness of this hybrid model (lbHm) through our experiments on a movie review-classification dataset.

【Keywords】: author model; document model; word vectors

205. Query Expansion Using Word Embeddings.

【Paper Link】【Pages】:1929-1932

【Authors】: Saar Kuzi ; Anna Shtok ; Oren Kurland

【Abstract】: We present a suite of query expansion methods that are based on word embeddings. Using Word2Vec's CBOW embedding approach, applied over the entire corpus on which search is performed, we select terms that are semantically related to the query. Our methods either use the terms to expand the original query or integrate them with the effective pseudo-feedback-based relevance model. In the former case, retrieval performance is significantly better than that of using only the query, and in the latter case the performance is significantly better than that of the relevance model.

【Keywords】: query models; word embeddings
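The idea of selecting expansion terms that are close to the query in embedding space can be sketched as follows. This is a toy illustration under stated assumptions: the embeddings are a plain dictionary of hand-written vectors (not real Word2Vec output), the query is represented by the centroid of its term vectors, and `expand_query` is a hypothetical helper rather than any of the methods evaluated in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_query(query_terms, embeddings, k=2):
    """Return the k vocabulary terms closest to the query centroid."""
    vecs = [embeddings[t] for t in query_terms if t in embeddings]
    if not vecs:
        return []
    dim = len(vecs[0])
    # Assumption: the query is summarized by the mean of its term vectors.
    centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    scored = [
        (cosine(centroid, vec), term)
        for term, vec in embeddings.items()
        if term not in query_terms
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]


# Toy two-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "car": [1.0, 0.0],
    "auto": [0.9, 0.1],
    "vehicle": [0.8, 0.3],
    "banana": [0.0, 1.0],
}
print(expand_query(["car"], toy_embeddings, k=2))  # ['auto', 'vehicle']
```

The selected terms could then be appended to the original query or used to reweight a feedback model, the two integration options the abstract describes.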

206. Efficient Distributed Regular Path Queries on RDF Graphs Using Partial Evaluation.

【Paper Link】【Pages】:1933-1936

【Authors】: Xin Wang ; Junhu Wang ; Xiaowang Zhang

【Abstract】: We propose an efficient distributed method for answering regular path queries (RPQs) on large-scale RDF graphs using partial evaluation. In local computation, we devise a dynamic programming approach to evaluate local and partial answers of an RPQ on each computing site in parallel. In the assembly phase, an automata-based algorithm is proposed to assemble the partial answers of the RPQ into the final results. The experiments on benchmark RDF graphs show that our method outperforms the state-of-the-art message passing methods by up to an order of magnitude.

【Keywords】: RDF graph; partial evaluation; regular path query

207. Webpage Depth-level Dwell Time Prediction.

【Paper Link】【Pages】:1937-1940

【Authors】: Chong Wang ; Achir Kalra ; Cristian Borcea ; Yi Chen

【Abstract】: The amount of time spent by users at specific page depths within webpages, called dwell time, can be used by web publishers to decide where to place online ads and what type of ads to place at different depths within a webpage. This paper presents a model to predict the dwell time for a given "user, webpage, depth" triplet based on historic data collected by publishers. Dwell time prediction is difficult due to user behavior variability and data sparsity. We adopt the Factorization Machines model because it is able to capture the interaction between users and webpages, overcome the data sparsity issue, and provide flexibility to add auxiliary information such as the visible area of a user's browser. Experimental results using data from a large web publisher demonstrate that our model outperforms deterministic and regression-based comparison models.

【Keywords】: computational advertising; data mining; user behavior
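For readers unfamiliar with Factorization Machines, the sketch below computes the standard second-order model, y = w0 + Σ w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j, using the usual linear-time rewriting of the pairwise term. The parameter values are made up, and how the paper encodes the "user, webpage, depth" triplet into the feature vector `x` is not stated in the abstract, so that part is left out.

```python
def fm_predict(x, w0, w, V):
    """Second-order Factorization Machine prediction.

    x  : feature vector (list of floats)
    w0 : global bias
    w  : per-feature linear weights
    V  : per-feature latent vectors, V[i] has length k
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0])
    pairwise = 0.0
    for f in range(k):
        # Identity: sum_{i<j} v_if v_jf x_i x_j
        #         = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise
```

Because every pair of features interacts through low-dimensional latent vectors rather than a separate weight, interactions between users and pages that never co-occur in training can still be estimated, which is the sparsity advantage the abstract refers to.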

【Paper Link】【Pages】:1941-1944

【Authors】: Li Gao ; Jia Wu ; Zhi Qiao ; Chuan Zhou ; Hong Yang ; Yue Hu

【Abstract】: In event-based social networks, such as Meetup, social groups are self-organized communities of users who share the same interests. In many real-world scenarios, users have social group preferences and join groups of interest in order to attend events. It is therefore necessary to consider the influence of social groups to improve event recommendation performance; however, existing event recommendation models generally consider only users' individual preferences and neglect the influence of social groups. To this end, we propose a new Bayesian latent factor model, SogBmf, that combines social group influence and individual preference for event recommendation. Experiments on real-world data sets demonstrate the effectiveness of the proposed method.

【Keywords】: event recommendation; social group influence

209. Graph-Based Multi-Modality Learning for Clinical Decision Support.

【Paper Link】【Pages】:1945-1948

【Authors】: Ziwei Zheng ; Xiaojun Wan

【Abstract】: The task of clinical decision support (CDS) involves retrieval and ranking of medical journal articles for medical records of diagnosis, test or treatment. Previous studies on this task are based on bag-of-words representations of document texts and general retrieval models. In this paper, we propose to use the paragraph vector technique to learn the latent semantic representation of texts and treat the latent semantic representations and the original bag-of-words representations as two different modalities. We then propose to use the graph-based multi-modality learning algorithm for document re-ranking. Experimental results on two TREC-CDS benchmark datasets demonstrate the excellent performance of our proposed approach.

【Keywords】: TREC; clinical decision support; graph-based multi-modality learning; paragraph vector

210. Where are You Tweeting?: A Context and User Movement Based Approach.

【Paper Link】【Pages】:1949-1952

【Authors】: Zhi Liu ; Yan Huang

【Abstract】: Geotagged tweets allow one to extract geo-information trends, search for local events, and identify natural disasters. In this paper, we propose a Hidden-Markov-based model that integrates tweet contents and user movements for geotagging. A language model is obtained for different locations from training datasets, and the movements of users among cities are analyzed. Home cities of users are considered in modeling the patterns of user movements. Evaluation on a large Twitter dataset shows that our method can significantly improve geotagging accuracy, by 55% for home cities and 2% for non-home cities, as well as reduce error distances by orders of magnitude compared with pure text-based methods.

【Keywords】: geotagging; hidden Markov model; tweet

211. Ensemble Learned Vaccination Uptake Prediction using Web Search Queries.

【Paper Link】【Pages】:1953-1956

【Authors】: Niels Dalum Hansen ; Christina Lioma ; Kåre Mølbak

【Abstract】: We present a method that uses ensemble learning to combine clinical and web-mined time-series data in order to predict future vaccination uptake. The clinical data come from official vaccination registries, and the web data are query frequencies collected from Google Trends. Experiments with official vaccine records show that our method predicts vaccination uptake effectively (4.7 Root Mean Squared Error). Whereas performance is best when combining clinical and web data, using solely web data yields comparable performance. To our knowledge, this is the first study to predict vaccination uptake using web data (with and without clinical data).

【Keywords】: ensemble learning; public health event prediction; vaccination uptake; web search queries

【Paper Link】【Pages】:1957-1960

【Authors】: Yao Lu ; Zhi Qiao ; Chuan Zhou ; Yue Hu ; Li Guo

【Abstract】: In this paper we study the friend recommendation problem in event-based social networks (EBSNs). Effective friend recommendation benefits EBSNs, since it can promote user interaction and accelerate information diffusion for promoted events. Unlike conventional friend recommendation, the aim of making friends in EBSNs is to participate more fully in offline events and enhance user experience. Moreover, friend recommendation in EBSNs involves three types of data, i.e., geographical information, implicit user ratings, and user behavior. These differences imply that existing friend recommendation approaches are no longer adequate for EBSNs. Against this background, we propose a Bayesian latent factor model that jointly models the above three types of data for friend recommendation with better event promotion and user experience. Results on real-world datasets show the efficacy of our approach.

【Keywords】: Bayesian latent factor model; event-based social network; friend recommendation

213. Extracting Skill Endorsements from Personal Communication Data.

【Paper Link】【Pages】:1961-1964

【Authors】: Darshan M. Shankaralingappa ; Gianmarco De Francisci Morales ; Aristides Gionis

【Abstract】: People are increasingly communicating and collaborating via digital platforms, such as email and messaging applications. Data exchanged on these digital communication platforms can be a treasure trove of information on people who participate in the discussions: who they are collaborating with, what they are working on, what their expertise is, and so on. Yet, personal communication data is very rarely analyzed due to the sensitivity of the information it contains. In this paper, we mine personal communication data with the goal of generating skill endorsements of the type "person A endorses person B on skill X." To address privacy concerns, we consider that each person has access only to their own data (i.e., conversations with their peers). By using our method, they can generate endorsements for their peers, which they can inspect and opt to publish. To identify meaningful skills we use a knowledge base created from the StackExchange Q&A forum. We study two different approaches, one based on building a skill graph, and one based on information retrieval techniques. We find that the latter approach outperforms the graph-based algorithms when tested on a dataset of user profiles from StackOverflow. We also conduct a user study on email data from nine volunteers, and we find that the information retrieval-based approach achieves a MAP@10 score of 0.617.

【Keywords】: e-mail mining; personal data; skill endorsements
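Since the abstract reports a MAP@10 score, a short sketch of how such a score is typically computed may be useful. Definitions of average precision at a cut-off differ in their normalizer; the version below divides by min(k, number of relevant items), which is an assumption and may not match the paper's exact evaluation protocol.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """Average precision over the top-k items of one ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumed normalizer: the best achievable number of hits within k.
    return total / min(k, len(relevant))


def map_at_k(runs, k=10):
    """Mean of AP@k over (ranked_list, relevant_items) pairs."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

In the setting of the abstract, each ranked list would hold the skills suggested for one person and the relevant set would hold the skills judged correct for that person.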

214. A Self-Organizing Map for Identifying Influential Communities in Speech-based Networks.

【Paper Link】【Pages】:1965-1968

【Authors】: Sameen Mansha ; Faisal Kamiran ; Asim Karim ; Aizaz Anwar

【Abstract】: Low-literate people are unable to use many mainstream social networks due to their text-based interfaces, even though they constitute a major portion of the world population. Specialized speech-based networks (SBNs) are more accessible to low-literate users through their simple speech-based interfaces. While SBNs have the potential to provide value-adding services to a large segment of society, they have been hampered by the need to operate in low-income segments on low budgets. Knowledge of influential users and communities in such networks can help in optimizing their operations. In this paper, we present a self-organizing map (SOM) for discovering and visualizing influential communities of users in SBNs. We demonstrate how a friendship graph is formed from call data records and present a method for estimating influences between users. Subsequently, we develop a SOM to cluster users based on their influence, thus identifying community-level influences and their roles in information propagation. We test our approach on Polly, an SBN developed for disseminating job ads among low-literate users. For comparison, we identify influential users with the benchmark greedy algorithm and relate them to the discovered communities. The results show that influential users are concentrated in influential communities and that community-level information propagation provides a ready summary of influential users.

【Keywords】: influential users and communities; self organizing maps; speech-based social networks

215. Crowdsourcing-based Urban Anomaly Prediction System for Smart Cities.

【Paper Link】【Pages】:1969-1972

【Authors】: Chao Huang ; Xian Wu ; Dong Wang

【Abstract】: Crowdsourcing has become an emerging data collection paradigm for smart city applications. A new category of crowdsourcing-based urban anomaly reporting systems has been developed to enable pervasive and real-time reporting of anomalies in cities (e.g., noise, illegal use of public facilities, urban infrastructure malfunctions). An interesting challenge in these applications is how to accurately predict an anomaly in a given region of the city before it happens. Prior works have made significant progress in anomaly detection. However, they can only detect anomalies after they happen, which may lead to significant information delay and a lack of preparedness to handle the anomalies efficiently. In this paper, we develop a Crowdsourcing-based Urban Anomaly Prediction Scheme (CUAPS) to accurately predict the anomalies of a city by exploring both the spatial and temporal information embedded in the crowdsourcing data. We evaluated the performance of our scheme and compared it to state-of-the-art baselines using four real-world datasets collected from the 311 service in the city of New York. The results showed that our scheme can predict different categories of anomalies in a city more accurately than the baselines.

【Keywords】: Bayesian inference; anomaly prediction; crowdsourcing; smart cities

216. Near Real-time Geolocation Prediction in Twitter Streams via Matrix Factorization Based Regression.

【Paper Link】【Pages】:1973-1976

【Authors】: Nghia Duong-Trung ; Nicolas Schilling ; Lars Schmidt-Thieme

【Abstract】: Previous research on content-based geolocation has generally developed prediction methods by pre-partitioning and applying classification methods. The input of these methods is the concatenation of individual tweets over a period of time. Unfortunately, these methods have drawbacks: they discard the natural real-valued nature of latitude and longitude and fail to capture geolocation in near real-time. In this work, we develop a novel generative content-based regression model via a matrix factorization technique to tackle the near real-time geolocation prediction problem. With this model, we aim to address two unanswered questions. First, we show that near real-time geolocation prediction can be accomplished if we leave out the concatenation. Second, we account for the real-valued nature of physical coordinates within a regression solution. We apply our model to Twitter datasets as an example to demonstrate its effectiveness and generality. Our experimental results show that the proposed model, in the best scenario, outperforms a set of state-of-the-art regression models, including Support Vector Machines and Factorization Machines, reducing the median localization error by up to 79%.

【Keywords】: geolocation; matrix factorization; regression; twitter
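The abstract evaluates by median localization error. As a rough sketch of how such an error can be computed for regression outputs, the code below measures great-circle distance with the haversine formula and takes the median over all predictions. Whether the paper uses haversine distance or kilometres is not stated in the abstract; both are assumptions here, as are the function names.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(predicted, actual):
    """Median haversine distance between predicted and true (lat, lon) pairs."""
    return median(
        haversine_km(p[0], p[1], a[0], a[1]) for p, a in zip(predicted, actual)
    )
```

A distance-based error of this kind is only meaningful when the model outputs real-valued coordinates, which is the property the regression formulation preserves.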

217. Distilling Word Embeddings: An Encoding Approach.

【Paper Link】【Pages】:1977-1980

【Authors】: Lili Mou ; Ran Jia ; Yan Xu ; Ge Li ; Lu Zhang ; Zhi Jin

【Abstract】: Distilling knowledge from a well-trained cumbersome network to a small one has recently become a new research topic, as lightweight neural networks with high performance are particularly needed in various resource-restricted systems. This paper addresses the problem of distilling word embeddings for NLP tasks. We propose an encoding approach to distill task-specific knowledge from a set of high-dimensional embeddings, so that we can reduce model complexity by a large margin while retaining high accuracy, achieving a good compromise between efficiency and performance. Experiments reveal that distilling knowledge from cumbersome embeddings is better than directly training neural networks with small embeddings.

【Keywords】: model compression; neural networks; word embeddings

218. Regularising Factorised Models for Venue Recommendation using Friends and their Comments.

【Paper Link】【Pages】:1981-1984

【Authors】: Jarana Manotumruksa ; Craig MacDonald ; Iadh Ounis

【Abstract】: Venue recommendation is an important capability of Location-Based Social Networks such as Yelp and Foursquare. Matrix Factorisation (MF) is a collaborative filtering-based approach that can effectively recommend venues that are relevant to the users' preferences, by training upon either implicit or explicit feedback (e.g. check-ins or venue ratings) that these users express about venues. However, MF suffers in that users may have rated only very few venues. To alleviate this problem, recent literature has leveraged additional sources of evidence, e.g. using users' social friendships to reduce the complexity of - or regularise - the MF model, or identifying similar venues based on their comments. This paper argues for a combined regularisation model, where the venues suggested for a user are influenced by friends with similar tastes (as defined by their comments). We propose an MF regularisation technique that seamlessly incorporates both social network information and textual comments, by exploiting word embeddings to estimate the semantic similarity of friends based on their explicit textual feedback, to regularise the complexity of the factorised model. Experiments on a large existing dataset demonstrate that our proposed regularisation model is promising, and can enhance the prediction accuracy of several state-of-the-art matrix factorisation-based approaches.

【Keywords】: matrix factorisation; venue recommendation; word embeddings

219. Improving Search Results with Prior Similar Queries.

【Paper Link】【Pages】:1985-1988

【Authors】: Yashar Moshfeghi ; Kristiyan Velinov ; Peter Triantafillou

【Abstract】: This paper describes a novel approach to re-ranking search engine result pages (SERPs). Its fundamental principle is to re-rank the results for a given query by exploiting evidence gathered from past similar search queries. Our approach is inspired by collaborative filtering, with the main challenge being to find the set of similar queries while also taking efficiency into account. In particular, our approach addresses this challenge by combining a similarity graph with a locality sensitive hashing scheme. We construct a set of features from our similarity graph and build a prediction model using the Hoeffding decision tree algorithm. We have evaluated the effectiveness of our model in terms of P@1, MAP@10, and nDCG@10, using the Yandex Data Challenge data set. We have compared the performance of our model against two baselines, namely, the Yandex initial ranking and the decision tree model learnt on the same set of features when extracted based on query repetition (i.e. excluding the evidence of similar queries in our approach). Our results reveal that the proposed approach consistently and (statistically) significantly outperforms both baselines.

【Keywords】: SERP; collaborative filtering; graph

220. The Solitude of Relevant Documents in the Pool.

【Paper Link】【Pages】:1989-1992

【Authors】: Aldo Lipani ; Mihai Lupu ; Evangelos Kanoulas ; Allan Hanbury

【Abstract】: Pool bias is a well-understood problem of test-collection-based benchmarking in information retrieval. The pooling method itself is designed to identify all relevant documents. In practice, 'all' translates to 'as many as possible given some budgetary constraints' and the problem persists, albeit mitigated. Recently, methods to address this pool bias for previously created test collections have been proposed, for the evaluation measure precision at cut-off (P@n). Analyzing previous methods, we make the empirical observation that the distribution of the probability of providing new relevant documents to the pool, over the runs, is log-normal (when the pooling strategy is fixed depth at cut-off). We use this observation to calculate a prior probability of providing new relevant documents, which we then use in a pool bias estimator that improves upon previous estimates of precision at cut-off. Through extensive experimental results, covering 15 test collections, we show that the proposed bias correction method is the new state of the art, providing the closest estimates yet when compared to the original pool.

【Keywords】: P@n; TREC; pool bias; test collections

221. Scarce Feature Topic Mining for Video Recommendation.

【Paper Link】【Pages】:1993-1996

【Authors】: Wei Lu ; Korris Fu-Lai Chung ; Kunfeng Lai

【Abstract】: Recommendation for user-generated content sites has gained significant attention. To satisfy the niche tastes of users, product recommendation poses more challenges due to the data sparsity issue. This work is motivated by a real-world online video recommendation problem, where the click records database suffers from sparseness of video inventory and video tags. Targeting the long-tail phenomenon of user behavior and the sparsity of item features, we propose a personalized compound recommendation framework for online video recommendation called Dirichlet mixture probit model for information scarcity (DPIS). Assuming that each record is generated from a representation of user preferences, DPIS is a probit classifier utilizing record topical clustering on the user part for recommendation. As demonstrated by the real-world application, the proposed DPIS achieves better performance than traditional methods.

【Keywords】: Bayesian approach; recommender system; topic model

222. Learning to Re-Rank Questions in Community Question Answering Using Advanced Features.

【Paper Link】【Pages】:1997-2000

【Authors】: Giovanni Da San Martino ; Alberto Barrón-Cedeño ; Salvatore Romeo ; Antonio Uva ; Alessandro Moschitti

【Abstract】: We study the impact of different types of features for question ranking in community Question Answering: bag-of-words models (BoW), syntactic tree kernels (TKs) and rank features. It should be noted that structural kernels have never been applied to the question reranking task, i.e., question-to-question similarity, where they have to model paraphrase relations. Additionally, the informal text typically present in forums poses new challenges to the use of TKs. We compare our learning to rank (L2R) algorithms against a strong baseline given by the Google rank (GR). The results show that (i) our shallow structures used in TKs are robust enough to handle noisy data and (ii) improving GR requires effective BoW features and TKs along with an accurate model of GR features in the L2R algorithm used.

【Keywords】: community question answering; learning to rank; syntactic structures

223. Learning to Rank System Configurations.

【Paper Link】【Pages】:2001-2004

【Authors】: Romain Deveaud ; Josiane Mothe ; Jian-Yun Nie

【Abstract】: Information Retrieval (IR) systems rely heavily on a large number of parameters, such as the retrieval model or various query expansion parameters, whose values greatly influence the overall retrieval effectiveness. However, setting all these parameters individually can often be a tedious task, since they can all affect one another while also varying across queries. We propose to tackle this problem by dealing with entire system configurations (i.e. a set of parameters representing an IR system) instead of single parameters, and to apply state-of-the-art Learning to Rank techniques to select the most appropriate configuration for a given query. The experiments we conducted on two TREC AdHoc collections show that this approach is feasible and significantly outperforms the traditional way of configuring a system, as well as the top performing systems of the TREC tracks. We also present an analysis of the impact of different features on the model's learning capability.

【Keywords】: information retrieval; learning to rank; retrieval system parameters

224. Adaptive Distributional Extensions to DFR Ranking.

【Paper Link】【Pages】:2005-2008

【Authors】: Casper Petersen ; Jakob Grue Simonsen ; Kalervo Järvelin ; Christina Lioma

【Abstract】: Divergence From Randomness (DFR) ranking models assume that informative terms are distributed in a corpus differently than non-informative terms. Different statistical models (e.g. Poisson, geometric) are used to model the distribution of non-informative terms, producing different DFR models. An informative term is then detected by measuring the divergence of its distribution from the distribution of non-informative terms. However, there is little empirical evidence that the distributions of non-informative terms used in DFR actually fit current datasets. Practically this risks providing a poor separation between informative and non-informative terms, thus compromising the discriminative power of the ranking model. We present a novel extension to DFR, which first detects the best-fitting distribution of non-informative terms in a collection, and then adapts the ranking computation to this best-fitting distribution. We call this model Adaptive Distributional Ranking (ADR) because it adapts the ranking to the statistics of the specific dataset being processed each time. Experiments on TREC data show ADR to outperform DFR models (and their extensions) and be comparable in performance to a query likelihood language model (LM).

【Keywords】: adaptive retrieval; divergence from randomness

225. CyberRank: Knowledge Elicitation for Risk Assessment of Database Security.

【Paper Link】【Pages】:2009-2012

【Authors】: Hagit Grushka-Cohen ; Oded Sofer ; Ofer Biller ; Bracha Shapira ; Lior Rokach

【Abstract】: Security systems for databases produce numerous alerts about anomalous activities and policy rule violations. Prioritizing these alerts will help security personnel focus their efforts on the most urgent alerts. Currently, this is done manually by security experts who rank the alerts or define static risk scoring rules. Existing solutions are expensive, consume valuable expert time, and do not dynamically adapt to changes in policy. Adopting a learning approach for ranking alerts is complex due to the effort required from security experts to initially train such a model. The more features used, the more accurate the model is likely to be, but this requires collecting more user feedback and prolongs the calibration process. In this paper, we propose CyberRank, a novel algorithm for automatic preference elicitation that is effective when experts' time is limited and outperforms other algorithms for initial training of the system. We generate synthetic examples and annotate them using a model produced by Analytic Hierarchical Processing (AHP) to bootstrap a preference learning algorithm. We evaluate different approaches with a new dataset of expert-ranked pairs of database transactions, in terms of their risk to the organization. Evaluated using manual risk assessments of transaction pairs, CyberRank outperforms all other methods in the cold-start scenario, with an error reduction of 20%.

【Keywords】: cold start; cyber security; preference elicitation; ranking; risk assessment; semi supervised

226. Online Food Recipe Title Semantics: Combining Nutrient Facts and Topics.

【Paper Link】【Pages】:2013-2016

【Authors】: Tomasz Kusmierczyk ; Kjetil Nørvåg

【Abstract】: Dietary pattern analysis is an important research area, and recently the availability of rich resources in food-focused social networks has enabled new opportunities in that field. However, there is little understanding of how online textual content is related to actual health factors, e.g., nutritional values. To address this lack of knowledge, we present a novel approach to mine and model online food content by combining text topics with related nutrient facts. Our empirical analysis reveals a strong correlation between them, and our experiments show the extent to which it is possible to predict nutrient facts from meal names.

【Keywords】: LDA; online food recipe; social media mining; text regression

227. A Non-Parametric Topic Model for Short Texts Incorporating Word Coherence Knowledge.

【Paper Link】【Pages】:2017-2020

【Authors】: Yuhao Zhang ; Wenji Mao ; Daniel Dajun Zeng

【Abstract】: Mining topics in short texts (e.g. tweets, instant messages) can help people grasp essential information and understand key contents, and is widely used in many applications related to social media and text analysis. The sparsity and noise of short texts often restrict the performance of traditional topic models like LDA. The recently proposed Biterm Topic Model (BTM), which models word co-occurrence patterns directly, has been shown to be effective for topic detection in short texts. However, BTM has two main drawbacks. First, the number of topics must be specified manually, which is difficult to determine accurately when facing new corpora. Second, BTM assumes that the two words in the same biterm belong to the same topic, an assumption that is often too strong because it does not differentiate between two types of words (i.e. general words and topical words). To tackle these problems, we propose a non-parametric topic model, npCTM, that makes this distinction. Our model incorporates the Chinese restaurant process (CRP) into BTM to determine the number of topics automatically. Our model also distinguishes general words from topical words by jointly considering the distribution of these two word types for each word as well as word coherence information as prior knowledge. We carry out experimental studies on a real-world Twitter dataset. The results demonstrate the effectiveness of our method in discovering coherent topics compared with the baseline methods.

【Keywords】: Bayesian nonparametric model; text mining; topic model
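
Illustrative note (not taken from the paper): the abstract above relies on the Chinese restaurant process to let the number of topics grow with the data. The sketch below shows only the generic CRP seating rule, in which a new customer joins an existing table with probability proportional to its size and opens a new table with probability proportional to a concentration parameter. The function name, the `alpha` parameter, and the fixed seed are assumptions made for illustration; this is not the npCTM inference procedure.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one by one and return the list of table sizes.

    Hypothetical helper for illustration; alpha is the concentration
    parameter (larger alpha tends to open more tables).
    """
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for i in range(n_customers):
        # i customers are already seated, so the total weight is i + alpha.
        r = rng.uniform(0, i + alpha)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if r < cumulative:
                tables[k] += 1  # join table k with probability size / (i + alpha)
                break
        else:
            tables.append(1)  # open a new table with probability alpha / (i + alpha)
    return tables


if __name__ == "__main__":
    sizes = chinese_restaurant_process(100, alpha=1.0)
    print(len(sizes), "tables for", sum(sizes), "customers")
```

In a topic-model setting each table would correspond to a topic, so the number of topics does not have to be fixed in advance.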

228. Forecasting Seasonal Time Series Using Weighted Gradient RBF Network based Autoregressive Model.

【Paper Link】【Pages】:2021-2024

【Authors】: Wenjie Ruan ; Quan Z. Sheng ; Peipei Xu ; Nguyen Khoi Tran ; Nickolas J. G. Falkner ; Xue Li ; Wei Emma Zhang

【Abstract】: Accurately forecasting seasonal time series is very important for many business areas such as marketing decisions, production planning, and profit estimation. In this paper, we propose a weighted gradient Radial Basis Function Network based AutoRegressive (WGRBF-AR) model for modeling and predicting nonlinear and non-stationary seasonal time series. This WGRBF-AR model is a synthesis of the weighted gradient RBF network and the functional-coefficient autoregressive (FAR) model, using the WGRBF networks to approximate the varying coefficients of the FAR model. It not only takes advantage of the FAR model's ability to describe nonlinear dynamics but also inherits the capability of the WGRBF network to deal with non-stationarity. We test our model using ten years of retail sales data on five different commodities in the US. The results demonstrate that the proposed WGRBF-AR model can achieve competitive prediction accuracy compared with the state of the art.

【Keywords】: ANN; neural network; prediction; seasonal data; time series

229. When Sensor Meets Tensor: Filling Missing Sensor Values Through a Tensor Approach.

【Paper Link】【Pages】:2025-2028

【Authors】: Wenjie Ruan ; Peipei Xu ; Quan Z. Sheng ; Nguyen Khoi Tran ; Nickolas J. G. Falkner ; Xue Li ; Wei Emma Zhang

【Abstract】: In the era of the Internet of Things, an enormous number of sensors have been deployed in different locations, generating massive time-series sensory data with geo-tags. However, such sensory readings are easily lost for various reasons such as hardware malfunction, connection errors, and data corruption. This paper focuses on this challenge: how to accurately yet efficiently recover the missing values for corrupted time-series sensor data with geo-stamps. We formulate the time-series sensor data as a 3-order tensor that naturally preserves sensors' temporal and spatial dependencies. Then we exploit its low-rank and sparse-noise structures by drawing upon recent advances in Robust Principal Component Analysis (RPCA) and tensor completion theory. The main novelty of this paper is that we design a highly efficient optimization method that combines the alternating direction method of multipliers and accelerated proximal gradient to recover the data tensor. Besides testing our method using synthetic data, we also design a real-world testbed using passive RFID (Radio Frequency IDentification) sensors. The results demonstrate the effectiveness and accuracy of our approach.

【Keywords】: ADMM; RFID; sensor data; tensor

230. PEQ: An Explainable, Specification-based, Aspect-oriented Product Comparator for E-commerce.

【Paper Link】【Pages】:2029-2032

【Authors】: Abhishek Sikchi ; Pawan Goyal ; Samik Datta

【Abstract】: While purchasing a product, consumers often rely on specifications as well as online reviews of the product for decision-making. While comparing, one often has in mind a specific aspect or a set of aspects which are of interest to them. Previous work has used comparative sentences, where two entities are compared directly in a single sentence by the review author, towards the comparison task. In this paper, we extend the existing model by incorporating the feature specifications of the products, which are easily available, and learn the importance to be associated with each of them. To validate these product ranking measures, we test them comprehensively on a digital camera dataset from Amazon.com; the results show that they empirically outperform the state-of-the-art baselines.

【Keywords】: comparison mining; decision support systems

231. Forecasting Geo-sensor Data with Participatory Sensing Based on Dropout Neural Network.

【Paper Link】【Pages】:2033-2036

【Authors】: Jyun-Yu Jiang ; Cheng-Te Li

【Abstract】: Nowadays, geosensor data, such as air quality and traffic flow, have become more and more essential in people's daily life. However, installing geosensors or hiring volunteers at every location and at every time is very expensive. Some organizations may have only a few facilities or a limited budget to sense these data. Moreover, people usually want to know the forecast rather than ongoing observations, but the limited number of sensors (or volunteers) is a hurdle to making precise predictions. In this paper, we propose a novel concept to forecast geosensor data with participatory sensing. Given a limited number of sensors or volunteers, participatory sensing assumes each of them can observe and collect data at different locations and at different times. By aggregating these sparse past observations, we propose a neural network based approach to forecast future geosensor data at any location in an urban area. Extensive experiments have been conducted with large-scale datasets of the air quality in three cities and the traffic of bike sharing systems in two cities. Experimental results show that our predictive model can precisely forecast air quality and bike rental traffic as geosensor data.

【Keywords】: geo-sensor data forecasting; participatory sensing; urban computing

232. Iterative Search using Query Aspects.

【Paper Link】【Pages】:2037-2040

【Authors】: Manmeet Singh ; W. Bruce Croft

【Abstract】: Pseudo-relevance feedback (PRF) via query expansion has proven to be effective in many information retrieval tasks. In most existing work, the top-ranked documents from an initial search are assumed to be relevant and used for feedback. There are some drawbacks to this approach. One limitation is that there might be other relevant documents which were not retrieved or considered for the feedback process. Another issue is that one or more of the top retrieved documents may be non-relevant, which can introduce noise into the feedback mechanism. Term-level diversification, on the other hand, uses an effective technique for identifying terms associated with query aspects or subtopics. We propose a new iterative feedback method that combines PRF with aspect generation to improve feedback effectiveness. In our experiments, we discovered a new property of convergence of feedback terms that was incorporated into the PRF process. We show that the resulting method significantly outperforms the baseline relevance model.

【Keywords】: iterative search; pseudo-relevance feedback; search diversification
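
Illustrative note (not taken from the paper): the baseline idea described above, treating the top-ranked documents as relevant and expanding the query with their terms, can be sketched as follows. This is a deliberately naive frequency-based expansion; the function name, the parameters `k` and `n_terms`, and the toy documents are assumptions for illustration, and the paper's iterative aspect-based method is not reproduced here.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, k=2, n_terms=2):
    """Naive pseudo-relevance feedback (hypothetical helper).

    Assumes the top-k documents of an initial ranking are relevant and
    appends their n_terms most frequent non-query terms to the query.
    Each document is a list of already tokenized terms.
    """
    feedback = Counter()
    for doc in ranked_docs[:k]:
        feedback.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in feedback.most_common(n_terms)]


if __name__ == "__main__":
    docs = [
        ["jaguar", "car", "speed"],
        ["jaguar", "car", "engine"],
        ["jaguar", "cat"],  # ranked third, so ignored when k=2
    ]
    print(expand_query(["jaguar"], docs, k=2, n_terms=1))
```

The sketch also makes the second drawback visible: if one of the top-k documents is non-relevant, its terms enter the expanded query unchecked.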

233. A Preference Approach to Reputation in Sponsored Search.

【Paper Link】【Pages】:2041-2044

【Authors】: Aritra Ghosh ; Dinesh Gaurav ; Rahul Agrawal

【Abstract】: Determining reputation of an advertiser in sponsored search is a recent important problem with direct impact on revenue for web publishers and relevance of ads. Individual performance of advertisers is usually expressed through observed click through rate, which depends on advertiser reputation, ad relevance and position. However, advertiser reputation has not been explicitly modeled in click prediction literature. Using traditional approaches in web page popularity for organic search in this context is not reasonable as the notion of link-structure in web is not directly applicable to sponsored search. In this study, we motivate and propose a pairwise preference relation model to study the advertiser reputation problem. Pairwise comparisons of advertisers give information over and above the information available in their individual historical performances. We relate the notion of preference among the advertisers to the spectral properties of the preference graph. We provide empirical evidence of the existence of reputation bias in click behavior. Consequently, we experiment with this signal to improve click prediction.

【Keywords】: advertiser reputation; click prediction; sponsored search

234. Clustering Speed in Multi-lane Traffic Networks.

【Paper Link】【Pages】:2045-2048

【Authors】: Bing Zhang ; Goce Trajcevski ; Feiying Liu

【Abstract】: We address the problem of efficient spatio-temporal clustering of speed data in road segments with multiple lanes. We postulate that the navigation/route plans typically reported by different providers as a single value need not be accurate in multi-lane networks. Our methodology generates a lane-aware distribution of speed from GPS data and agglomerates the basic space and time units into larger clusters. Thus, we achieve a compact description of speed variations which can subsequently be used for more accurate trip planning. We provide experiments that demonstrate the benefits of our proposed approaches.

【Keywords】: multi-lane roads; speed clustering; trajectories

235. Learning to Rank Non-Factoid Answers: Comment Selection in Web Forums.

【Paper Link】【Pages】:2049-2052

【Authors】: Kateryna Tymoshenko ; Daniele Bonadiman ; Alessandro Moschitti

【Abstract】: Recent initiatives in the IR community have shown the importance of going beyond factoid Question Answering (QA) in order to design useful real-world applications. Questions asking for descriptions or explanations are much more difficult to solve, e.g., the machine learning models cannot focus on specific answer words or their lexical type. Thus, researchers have started to explore powerful methods for feature engineering. Two of the most promising methods are convolution tree kernels (CTKs) and convolutional neural networks (CNNs) as they have been shown to obtain high performance in the task of answer sentence selection in factoid QA. In this paper, we design state-of-the-art models for non-factoid QA also carried out on noisy data. In particular, we study and compare models for comment selection in a community QA (cQA) scenario, where the majority of questions regard descriptions or explanations. To deal with such a complex task, we incorporate relational information holding between questions and comments as well as domain-specific features into both convolutional models above. Our experiments on a cQA corpus show that both CTK and CNN achieve the state of the art, also according to a direct comparison with the results obtained by the best systems of the SemEval cQA challenge.

【Keywords】: community question answering; convolutional neural networks; kernel methods; question answering

236. A Theoretical Framework on the Ideal Number of Classifiers for Online Ensembles in Data Streams.

【Paper Link】【Pages】:2053-2056

【Authors】: Hamed R. Bonab ; Fazli Can

【Abstract】: A priori determining the ideal number of component classifiers of an ensemble is an important problem. The volume and velocity of big data streams make this even more crucial in terms of prediction accuracies and resource requirements. There is a limited number of studies addressing this problem for batch mode and none for online environments. Our theoretical framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy. We prove the existence of an ideal number of classifiers for an ensemble, using the weighted majority voting aggregation rule. In our experiments, we use two state-of-the-art online ensemble classifiers with six synthetic and six real-world data streams. The violation of providing independent component classifiers for our theoretical framework makes determining the exact ideal number of classifiers nearly impossible. We suggest upper bounds for the number of classifiers that gives the highest accuracy. An important implication of our study is that comparing online ensemble classifiers should be done based on these ideal values, since comparing based on a fixed number of classifiers can be misleading.

【Keywords】: big data stream; ensemble size; weighted majority voting
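
Illustrative note (not taken from the paper): the aggregation rule named in the abstract above, weighted majority voting, can be sketched in a few lines. Each component classifier casts a vote for a class label, votes are summed per label using the classifier's weight, and the label with the largest total wins. The function name and the example weights are assumptions for illustration; how the weights are learned and the paper's analysis of the ideal ensemble size are not shown.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total weight (hypothetical helper).

    predictions[i] is the label predicted by component classifier i and
    weights[i] is that classifier's non-negative voting weight.
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # One strong classifier can outvote two weak ones.
    print(weighted_majority_vote(["a", "b", "b"], [0.9, 0.3, 0.3]))
```

With equal weights the rule reduces to plain majority voting.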

237. User Modeling on Twitter with WordNet Synsets and DBpedia Concepts for Personalized Recommendations.

【Paper Link】【Pages】:2057-2060

【Authors】: Guangyuan Piao ; John G. Breslin

【Abstract】: User modeling of individual users on Social Web platforms such as Twitter plays a significant role in providing personalized recommendations and filtering interesting information from social streams. Recently, researchers proposed the use of concepts (e.g., DBpedia entities) for representing user interests instead of word-based approaches, since Knowledge Bases such as DBpedia provide cross-domain background knowledge about concepts, and thus can be used for extending user interest profiles. Even so, not all concepts can be covered by a Knowledge Base, especially in the case of microblogging platforms such as Twitter where new concepts/topics emerge every day. In this short paper, instead of using concepts alone, we propose using synsets from WordNet and concepts from DBpedia for representing user interests. We evaluate our proposed user modeling strategies by comparing them with other bag-of-concepts approaches. The results show that using synsets and concepts together for representing user interests improves the quality of user modeling significantly in the context of link recommendations on Twitter.

【Keywords】: personalization; recommender systems; user modeling

238. Improving Entity Ranking for Keyword Queries.

【Paper Link】【Pages】:2061-2064

【Authors】: John Foley ; Brendan O'Connor ; James Allan

【Abstract】: Knowledge bases about entities are an important part of modern information retrieval systems. A strong ranking of entities can be used to enhance query understanding and document retrieval or can be presented as another vertical to the user. Given a keyword query, our task is to provide a ranking of the entities present in the collection of interest. We are particularly interested in approaches to this problem that generalize to different knowledge bases and different collections. In the past, this kind of problem has been explored in the enterprise domain through Expert Search. Recently, a dataset was introduced for entity ranking from news and web queries from more general TREC collections. Approaches from prior work leverage a wide variety of lexical resources: e.g., natural language processing and relations in the knowledge base. We address the question of whether we can achieve competitive performance with minimal linguistic resources. We propose a set of features that do not require index-time entity linking, and demonstrate competitive performance on the new dataset. As this paper is the first non-introductory work to leverage this new dataset, we also find and correct certain aspects of the benchmark. To support a fair evaluation, we collect 38% more judgments and contribute annotator agreement information.

【Keywords】: information retrieval; knowledge-bases

239. The Healing Power of Poison: Helpful Non-relevant Documents in Feedback.

【Paper Link】【Pages】:2065-2068

【Authors】: Mostafa Dehghani ; Samira Abnar ; Jaap Kamps

【Abstract】: The use of feedback information is an effective approach to address the vocabulary gap between a user's query and the relevant documents. It has been shown that some relevant documents act like "poison pills," i.e. they hurt the performance of feedback systems despite the fact that they are relevant. In this paper, we study the positive counterpart of this by investigating the helpfulness of non-relevant documents in feedback. In general, we find that although documents that are explicitly judged as non-relevant are normally assumed to be poisonous for feedback systems, sometimes considering high-scored non-relevant documents as positive feedback helps to improve the performance of retrieval. In our experimental data, we observe a considerable fraction of non-relevant documents in higher ranked positions of the initial retrieval run, for most of the topics. Hence, by ignoring the potential value of non-relevant documents, we may lose a lot of useful information. We investigate the potential contribution of non-relevant documents using existing state-of-the-art feedback methods. Our main findings are the following. First, we find that some of the non-relevant documents are exclusively helpful, i.e. they improve retrieval on their own, and others are complementarily helpful, i.e. they lead to further improvement when added to a set of relevant documents. Second, we discover that, on average, exclusively helpful non-relevant documents have a higher contribution to the performance improvement, compared to the complementary ones. Third, we show that non-relevant documents in topics with poor average precision in the initial retrieval are more likely to help in the feedback.

【Keywords】: helpful non-relevant documents; relevance feedback

240. Probabilistic Approaches to Controversy Detection.

【Paper Link】【Pages】:2069-2072

【Authors】: Myungha Jang ; John Foley ; Shiri Dori-Hacohen ; James Allan

【Abstract】: Recently, the problem of automated controversy detection has attracted a lot of interest in the information retrieval community. Existing approaches to this problem have set forth a number of detection algorithms, but there has been little effort to model the probability of controversy in a document directly. In this paper, we propose a probabilistic framework to detect controversy on the web, and investigate two models. We first recast a state-of-the-art controversy detection algorithm into a model in our framework. Based on insights from social science research, we also introduce a language modeling approach to this problem. We evaluate different methods of creating controversy language models based on a diverse set of public datasets including Wikipedia, Web and News corpora. Our automatically derived language models show a significant relative improvement of 18% in AUC over prior work, and 23% over two manually curated lexicons.

【Keywords】: controversy detection; critical literacy; language modeling

241. Evaluating Document Retrieval Methods for Resource Selection in Clustered P2P IR.

【Paper Link】【Pages】:2073-2076

【Authors】: Rami Suleiman Alkhawaldeh ; Joemon M. Jose ; Deepak P

【Abstract】: Resource Selection (or Query Routing) is an important step in P2P IR. Though analogous to document retrieval in the sense of choosing a relevant subset of resources, resource selection methods have evolved independently from those for document retrieval. Among the reasons for such divergence is that document retrieval targets scenarios where underlying resources are semantically homogeneous, whereas peers would manage diverse content. We observe that semantic heterogeneity is mitigated in the clustered 2-tier P2P IR architecture resource selection layer by way of usage of clustering, and posit that this necessitates a re-look at the applicability of document retrieval methods for resource selection within such a framework. This paper empirically benchmarks document retrieval models against the state-of-the-art resource selection models for the problem of resource selection in the clustered P2P IR architecture, using classical IR evaluation metrics. Our benchmarking study illustrates that document retrieval models significantly outperform other methods for the task of resource selection in the clustered P2P IR architecture. This indicates that clustered P2P IR framework can exploit advancements in document retrieval methods to deliver corresponding improvements in resource selection, indicating potential convergence of these fields for the clustered P2P IR architecture.

【Keywords】: clustering peers; content-based; document retrieval methods; evaluation; information retrieval; peer-to-peer; resource selection

242. Detecting and Ranking Conceptual Links between Texts Using a Knowledge Base.

【Paper Link】【Pages】:2077-2080

【Authors】: Martin Tutek ; Goran Glavas ; Jan Snajder ; Natasa Milic-Frayling ; Bojana Dalbelo Basic

【Abstract】: Recent research has explored the use of Knowledge Bases (KBs) to represent documents as subgraphs of a KB concept graph and define metrics to characterize semantic relatedness of documents in terms of properties of the document concept graphs. However, none of the studies so far have examined to what degree such metrics capture a user-perceived relatedness of documents. Considering the users' explanations of how pairs of documents are related, the aim is to identify concepts in a KB graph that express the same notion of document relatedness. Our algorithm generates paths through the KB graph that originate from the terms in two documents. KB concepts where these paths intersect capture the semantic relatedness of the two starting terms and therefore the two documents. We consider how such intersecting concepts relate to the concepts in the users' explanations. The higher the users' concepts appear in the ranked list of intersecting concepts, the better the method is at capturing the users' notion of document relatedness. Our experiments show that our approach outperforms a simpler graph method that uses properties of the concept nodes alone.

【Keywords】: content analysis; knowledge base graph; semantic relatedness

243. DePP: A System for Detecting Pages to Protect in Wikipedia.

【Paper Link】【Pages】:2081-2084

【Authors】: Kelsey Suyehira ; Francesca Spezzano

【Abstract】: Wikipedia is based on the idea that anyone can make edits to the website in order to create reliable and crowd-sourced content. Yet with the cover of internet anonymity, some users make changes to the website that do not align with Wikipedia's intended uses. For this reason, Wikipedia allows for some pages of the website to become protected, where only certain users can make revisions to the page. This allows administrators to protect pages from vandalism, libel, and edit wars. However, with over five million pages on Wikipedia, it is impossible for administrators to monitor all pages and manually enforce page protection. In this paper we consider for the first time the problem of deciding whether a page should be protected or not in a collaborative environment such as Wikipedia. We formulate the problem as a binary classification task and propose a novel set of features to decide which pages to protect based on (i) users' page revision behavior and (ii) page categories. We tested our system, called DePP, on a new dataset we built consisting of 13.6K pages (half protected and half unprotected) and 1.9M edits. Experimental results show that DePP reaches 93.24% classification accuracy and significantly improves over baselines.

【Keywords】: automated detection; page protection; wikipedia; wikis and open collaboration

244. Hashtag Recommendation Based on Topic Enhanced Embedding, Tweet Entity Data and Learning to Rank.

【Paper Link】【Pages】:2085-2088

【Authors】: Quanzhi Li ; Sameena Shah ; Armineh Nourbakhsh ; Xiaomo Liu ; Rui Fang

【Abstract】: In this paper, we present a new approach for recommending hashtags for tweets. It uses a Learning to Rank algorithm to incorporate features built from topic enhanced word embeddings, tweet entity data, hashtag frequency, hashtag temporal data and tweet URL domain information. Experiments using millions of tweets and hashtags show that the proposed approach outperforms three baseline methods: the LDA topic, the tf.idf based and the general word embedding approaches.

【Keywords】: hashtag recommendation; learning to rank; social media; topic enhanced word embedding; tweet

Poster Session II: Extended Short Papers 52

245. An Experimental Comparison of Iterative MapReduce Frameworks.

【Paper Link】【Pages】:2089-2094

【Authors】: Haejoon Lee ; Minseo Kang ; Sun-Bum Youn ; Jae-Gil Lee ; YongChul Kwon

【Abstract】: MapReduce has become a dominant framework in big data analysis, and thus there have been significant efforts to implement various data analysis algorithms in MapReduce. Many data analysis algorithms are inherently iterative, repeating the same set of tasks until convergence. To efficiently support iterative algorithms at scale, a few variants of Hadoop and new platforms have been proposed and actively developed in both academia and industry. Representative systems include HaLoop, iMapReduce, Twister, and Spark. In this paper, we experimentally compare Hadoop and the aforementioned systems using various workloads and metrics. The five systems are compared through four iterative algorithms---PageRank, recursive query, k-means, and logistic regression---on 50 Amazon EC2 machines (200 cores in total). We thoroughly explore the effectiveness of their new caching, communication, and scheduling mechanisms in support of iterative computation. Our evaluation also shows how performance depends on data skewness and memory residency. Overall, we believe that our evaluation and interpretation will be useful for designing a new framework or improving the existing ones.

【Keywords】: benchmark; hadoop; iterative algorithms; mapreduce; spark

246. A Density-Based Approach to the Retrieval of Top-K Spatial Textual Clusters.

【Paper Link】【Pages】:2095-2100

【Authors】: Dingming Wu ; Christian S. Jensen

【Abstract】: Spatial keyword queries retrieve spatial textual objects that are near a query location and are relevant to query keywords. The paper defines the top-k spatial textual clusters (k-STC) query that returns the top-k clusters that are located close to a given query location, contain relevant objects with regard to given query keywords, and have an object density that exceeds a given threshold. This query aims to support users who wish to explore nearby regions with many relevant objects. To compute this query, the paper proposes a basic and an advanced algorithm that rely on on-line density-based clustering. An empirical study offers insight into the performance properties of the proposed algorithms.

【Keywords】:

247. Top-N Recommendation on Graphs.

【Paper Link】【Pages】:2101-2106

【Authors】: Zhao Kang ; Chong Peng ; Ming Yang ; Qiang Cheng

【Abstract】: Recommender systems play an increasingly important role in online applications to help users find what they need or prefer. Collaborative filtering algorithms that generate predictions by analyzing the user-item rating matrix perform poorly when the matrix is sparse. To alleviate this problem, this paper proposes a simple recommendation algorithm that fully exploits the similarity information among users and items and the intrinsic structural information of the user-item matrix. The proposed method constructs a new representation which preserves affinity and structure information in the user-item rating matrix and then performs the recommendation task. To capture proximity information about users and items, two graphs are constructed. The idea of manifold learning is used to constrain the new representation to be smooth on these graphs, so as to enforce user and item proximities. Our model is formulated as a convex optimization problem, for which we only need to solve the well-known Sylvester equation. We carry out extensive empirical evaluations on six benchmark datasets to show the effectiveness of this approach.

【Keywords】: collaborative filtering; laplacian graph; top-n recommendation

248. KB-Enabled Query Recommendation for Long-Tail Queries.

【Paper Link】【Pages】:2107-2112

【Authors】: Zhipeng Huang ; Bogdan Cautis ; Reynold Cheng ; Yudian Zheng

【Abstract】: In recent years, query recommendation algorithms have been designed to provide related queries for search engine users. Most of these solutions, which perform extensive analysis of users' search history (or query logs), are largely insufficient for long-tail queries that rarely appear in query logs. To handle such queries, we study a new solution, which makes use of a knowledge base (or KB), such as YAGO and Freebase. A KB is a rich information source that describes how real-world entities are connected. We extract entities from a query, and use these entities to explore new ones in the KB. Those discovered entities are then used to suggest new queries to the user. As shown in our experiments, our approach provides better recommendation results for long-tail queries than existing solutions.

【Keywords】: knowledge base; meta path; query log; query recommendation

249. RAP: Scalable RPCA for Low-rank Matrix Recovery.

【Paper Link】【Pages】:2113-2118

【Authors】: Chong Peng ; Zhao Kang ; Ming Yang ; Qiang Cheng

【Abstract】: Recovering low-rank matrices is a problem common in many applications of data mining and machine learning, such as matrix completion and image denoising. Robust Principal Component Analysis (RPCA) has emerged for handling such kinds of problems; however, the existing RPCA approaches are usually computationally expensive, due to the fact that they need to obtain the singular value decomposition (SVD) of large matrices. In this paper, we propose a novel RPCA approach that eliminates the need for SVD of large matrices. Scalable algorithms are designed for several variants of our approach, which are crucial for real world applications on large scale data. Extensive experimental results confirm the effectiveness of our approach both quantitatively and visually.

【Keywords】: anomaly detection; background-foreground separation; fixed rank; low-rank recovery; robust pca; shadow removal

250. Query Answering Efficiency in Expert Networks Under Decentralized Search.

【Paper Link】【Pages】:2119-2124

【Authors】: Liang Ma ; Mudhakar Srivatsa ; Derya Cansever ; Xifeng Yan ; Sue Kase ; Michelle Vanni

【Abstract】: Expert networks are formed by a group of expert-professionals with different specialties to collaboratively resolve specific queries. In such networks, when a query reaches an expert who does not have sufficient expertise, this query needs to be routed to other experts for further processing until it is completely solved; therefore, query answering efficiency is sensitive to the underlying query routing mechanism being used. Among all possible query routing mechanisms, decentralized search, operating purely on each expert's local information without any knowledge of network global structure, represents the most basic and scalable routing mechanism. However, there is still a lack of fundamental understanding of the efficiency of decentralized search in expert networks. In this regard, we investigate decentralized search by quantifying its performance under a variety of network settings. Our key findings reveal the existence of network conditions, under which decentralized search can achieve significantly short query routing paths (i.e., between O(log n) and O(log^2 n) hops, n: total number of experts in the network). Based on such theoretical foundation, we then study how the unique properties of decentralized search in expert networks are related to the anecdotal small-world phenomenon. To the best of our knowledge, this is the first work studying fundamental behaviors of decentralized search in expert networks. The developed performance bounds, confirmed by real datasets, can assist in predicting network performance and designing complex expert networks.

【Keywords】: decentralized search; expert networks; performance bounds; query answering

251. A Study of Realtime Summarization Metrics.

【Paper Link】【Pages】:2125-2130

【Authors】: Matthew Ekstrand-Abueg ; Richard McCreadie ; Virgil Pavlu ; Fernando Diaz

【Abstract】: Unexpected news events, such as natural disasters or other human tragedies, create a large volume of dynamic text data from official news media as well as less formal social media. Automatic real-time text summarization has become an important tool for quickly transforming this overabundance of text into clear, useful information for end-users including affected individuals, crisis responders, and interested third parties. Despite the importance of real-time summarization systems, their evaluation is not well understood as classic methods for text summarization are inappropriate for real-time and streaming conditions. The TREC 2013-2015 Temporal Summarization (TREC-TS) track was one of the first evaluation campaigns to tackle the challenges of real-time summarization evaluation, introducing new metrics, ground-truth generation methodology and dataset. In this paper, we present a study of TREC-TS track evaluation methodology, with the aim of documenting its design, analyzing its effectiveness, as well as identifying improvements and best practices for the evaluation of temporal summarization systems.

【Keywords】: metrics; real-time summarization; summarization; summarization evaluation; temporal summarization; trec

252. Framing Mobile Information Needs: An Investigation of Hierarchical Query Sequence Structure.

【Paper Link】【Pages】:2131-2136

【Authors】: Shuguang Han ; Xing Yi ; Zhen Yue ; Zhigeng Geng ; Alyssa Glass

【Abstract】: When using search engines, people often issue multiple related queries to accomplish a complex search task. A simple query-task structure may not fully capture the complexity of query relations since people may divide a task into multiple subtasks. As a result, this paper applies a three-level hierarchical structure with query, goal and mission - a mission includes several goals, and a goal consists of multiple queries. Particularly, we focus on analyzing query-goal-mission structure for mobile web search because of its increasing popularity and lack of investigation in the literature. This study has three main contributions: (1) we study the query-goal-mission structure for mobile web search, which was not studied before. (2) We identify several differences between mobile and desktop search patterns in terms of goal/mission length, duration and interleaving. (3) We demonstrate that the query-goal-mission structure can be applied to design better user satisfaction metrics. Specifically, goal-based search success rate and mission-based abandonment rate are better aligned with users' long-term engagement than query and session based metrics.

【Keywords】: mobile web search; query structure; search goal; search mission; user engagement

253. A Context-aware Collaborative Filtering Approach for Urban Black Holes Detection.

【Paper Link】【Pages】:2137-2142

【Authors】: Li Jin ; Zhuonan Feng ; Ling Feng

【Abstract】: Urban black holes, as traffic anomalies, have caused many catastrophic accidents in big cities. Traditional methods depend on single-source data (e.g., taxi trajectories) to design black hole detection algorithms from one point of view, which describes the regional crowd flow incompletely. In this paper, we model the urban black holes in each region of New York City (NYC) at different time intervals with a 3-dimensional tensor by fusing cross-domain data sources. Supplementing the missing entries of the tensor through a context-aware tensor decomposition approach, we leverage the knowledge from geographical features, 311 complaint features and human mobility features to recover the black hole situation throughout NYC. The information can facilitate decision making by local residents and officials. We evaluate our model with five datasets related to NYC, diagnosing urban black holes that cannot be identified by a single dataset, or identifying them earlier than a single dataset allows. Experimental results demonstrate the advantages over four baseline methods.

【Keywords】: cross-domain; tensor decomposition; urban black holes

254. Combining Powers of Two Predictors in Optimizing Real-Time Bidding Strategy under Constrained Budget.

【Paper Link】【Pages】:2143-2148

【Authors】: Chi-Chun Lin ; Kun-Ta Chuang ; Wush Chi-Hsuan Wu ; Ming-Syan Chen

【Abstract】: We address the bidding strategy design problem faced by a Demand-Side Platform (DSP) in Real-Time Bidding (RTB) advertising. An RTB campaign consists of various parameters and usually a predefined budget. Under the budget constraint of a campaign, designing an optimal strategy for bidding on each impression to acquire as many clicks as possible is a main job of a DSP. State-of-the-art bidding algorithms rely on a single predictor, namely the clickthrough rate (CTR) predictor, to calculate the bidding value for each impression. This provides reasonable performance if the predictor has appropriate accuracy in predicting the probability of a user clicking. However, when the predictor gives only moderate accuracy, classical algorithms fail to capture optimal results. We improve the situation by incorporating an additional winning price predictor in the bidding process. In this paper, a method combining the powers of two prediction models is proposed, and experiments with real-world RTB datasets that benchmark the new algorithm against a classic CTR-only method are presented. The proposed algorithm performs better with regard to both number of clicks achieved and effective cost per click in many different settings of budget constraints.

【Keywords】: bidding strategy design; bidding with click and winning price predictors; demand-side platform; display advertising; real-time bidding

255. Attractiveness versus Competition: Towards an Unified Model for User Visitation.

【Paper Link】【Pages】:2149-2154

【Authors】: Thanh-Nam Doan ; Ee-Peng Lim

【Abstract】: Modeling user check-in behavior provides useful insights about venues as well as the users visiting them. These insights can be used in urban planning and recommender system applications. Unlike previous works that focus on modeling distance effect on user's choice of check-in venues, this paper studies check-in behaviors affected by two venue-related factors, namely, area attractiveness and neighborhood competitiveness. The former refers to the ability of an area with multiple venues to collectively attract check-ins from users, while the latter represents the ability of a venue to compete with its neighbors in the same area for check-ins. We first embark on a data science study to ascertain the two factors using two Foursquare datasets gathered from users and venues in Singapore and Jakarta, two major cities in Asia. We then propose the VAN model incorporating user-venue distance, area attractiveness and neighborhood competitiveness factors. The results from real datasets show that VAN model outperforms the various baselines in two tasks: home location prediction and check-in prediction.

【Keywords】: area attractiveness; location-based social network; neighborhood competition

256. OptMark: A Toolkit for Benchmarking Query Optimizers.

【Paper Link】【Pages】:2155-2160

【Authors】: Zhan Li ; Olga Papaemmanouil ; Mitch Cherniack

【Abstract】: Query optimizers have long been considered among the most complex components of a database engine, while the assessment of an optimizer's quality remains a challenging task. Indeed, existing performance benchmarks for database engines (like TPC benchmarks) produce a performance assessment of the query runtime system rather than its query optimizer. To address this challenge, this paper introduces OptMark, a toolkit for evaluating the quality of a query optimizer. OptMark is designed to offer a number of desirable properties. First, it decouples the quality of an optimizer from the quality of its underlying execution engine. Second, it evaluates independently both the effectiveness of an optimizer (i.e., quality of the chosen plans) and its efficiency (i.e., optimization time). OptMark also includes a generic benchmarking toolkit that is minimally invasive to the DBMS that wishes to use it. Any DBMS can provide a system-specific implementation of a simple API that allows OptMark to run and generate benchmark scores for the specific system. This paper discusses the metrics we propose for evaluating an optimizer's quality, the benchmark's design and the toolkit's API and functionality. We have implemented OptMark on the open-source MySQL engine as well as two commercial database systems. Using these implementations we are able to assess the quality of the optimizers on these three systems based on the TPC-DS benchmark queries.

【Keywords】: benchmarking; database optimizer

257. Multi-Dueling Bandits and Their Application to Online Ranker Evaluation.

【Paper Link】【Pages】:2161-2166

【Authors】: Brian Brost ; Yevgeny Seldin ; Ingemar J. Cox ; Christina Lioma

【Abstract】: Online ranker evaluation focuses on the challenge of efficiently determining, from implicit user feedback, which ranker out of a finite set of rankers is the best. It can be modeled by dueling bandits, a mathematical model for online learning under limited feedback from pairwise comparisons. Comparisons of pairs of rankers are performed by interleaving their result sets and examining which documents users click on. The dueling bandits model addresses the key issue of which pair of rankers to compare at each iteration. Methods for simultaneously comparing more than two rankers have recently been developed. However, the question of which rankers to compare at each iteration was left open. We address this question by proposing a generalization of the dueling bandits model that uses simultaneous comparisons of an unrestricted number of rankers. We evaluate our algorithm on standard large-scale online ranker evaluation datasets. Our experiments show that the algorithm yields orders of magnitude gains in performance compared to state-of-the-art dueling bandit algorithms.

【Keywords】: dueling bandits; multi-dueling bandits; online ranker evaluation

258. Robust Contextual Outlier Detection: Where Context Meets Sparsity.

【Paper Link】【Pages】:2167-2172

【Authors】: Jiongqian Liang ; Srinivasan Parthasarathy

【Abstract】: Outlier detection is a fundamental data science task with applications ranging from data cleaning to network security. Recently, a new class of outlier detection algorithms has emerged, called contextual outlier detection, and has shown improved performance when studying anomalous behavior in a specific context. However, as we point out in this article, such approaches have limited applicability in situations where the context is sparse (i.e., lacking a suitable frame of reference). Moreover, approaches developed to date do not scale to large datasets. To address these problems, here we propose a novel and robust approach alternative to the state-of-the-art called RObust Contextual Outlier Detection (ROCOD). We utilize a local and global behavioral model based on the relevant contexts, which is then integrated in a natural and robust fashion. We run ROCOD on both synthetic and real-world datasets and demonstrate that it outperforms other competitive baselines on the axes of efficacy and efficiency. We also drill down and perform a fine-grained analysis to shed light on the rationale for the performance gains of ROCOD and reveal its effectiveness when handling objects with sparse contexts.

【Keywords】: behavioral attributes; contextual attributes; outlier detection; scalable algorithm

259. Credibility Assessment of Textual Claims on the Web.

【Paper Link】【Pages】:2173-2178

【Authors】: Kashyap Popat ; Subhabrata Mukherjee ; Jannik Strötgen ; Gerhard Weikum

【Abstract】: There is an increasing amount of false claims in news, social media, and other web sources. While prior work on truth discovery has focused on the case of checking factual statements, this paper addresses the novel task of assessing the credibility of arbitrary claims made in natural-language text - in an open-domain setting without any assumptions about the structure of the claim, or the community where it is made. Our solution is based on automatically finding sources in news and social media, and feeding these into a distantly supervised classifier for assessing the credibility of a claim (i.e., true or fake). For inference, our method leverages the joint interaction between the language of articles about the claim and the reliability of the underlying web sources. Experiments with claims from the popular website snopes.com and from reported cases of Wikipedia hoaxes demonstrate the viability of our methods and their superior accuracy over various baselines.

【Keywords】: credibility analysis; rumor and hoax detection; text mining

【Paper Link】【Pages】:2179-2184

【Authors】: Xinyue Liu ; Xiangnan Kong ; Yanhua Li

【Abstract】: Traffic prediction has become an important and active research topic in the last decade. Existing solutions mainly focus on exploiting the past and current traffic data, collected from various kinds of sensors, such as loop detectors, GPS devices, etc. In real-world road systems, only a small fraction of the road segments are equipped with sensors. For all the other road segments without sensors or historical traffic data, previous methods may no longer work. In this paper, we propose to use location-based social media, which captures a much larger area of the road systems than deployed sensors, to predict the traffic conditions. A simple but effective method called CTP is proposed to incorporate location-based social media semantics into the learning process. CTP also exploits complex dependencies among different regions to improve the prediction performance through collective inference. Empirical studies using traffic data and tweets collected in the Los Angeles area demonstrate the effectiveness of CTP.

【Keywords】: collective inference; data mining; social media; traffic prediction

261. Recommendations For Streaming Data.

【Paper Link】【Pages】:2185-2190

【Authors】: Karthik Subbian ; Charu C. Aggarwal ; Kshiteesh Hegde

【Abstract】: Recommender systems have become increasingly popular in recent years because of the broader popularity of many web-enabled electronic commerce applications. However, most recommender systems today are designed in the context of an offline setting. The online setting is, however, much more challenging because the existing methods do not work very effectively for very large-scale systems. In many applications, it is desirable to provide real-time recommendations in large-scale scenarios. The main problem in applying streaming algorithms for recommendations is that the in-core storage space for memory-resident operations is quite limited. In this paper, we present a probabilistic neighborhood-based algorithm for performing recommendations in real-time. We present experimental results, which show the effectiveness of our approach in comparison to state-of-the-art methods.

【Keywords】: data streams; neighborhood model; recommender systems; streaming recommendations

262. PRO: Preference-Aware Recurring Query Optimization.

【Paper Link】【Pages】:2191-2196

【Authors】: Zhongfang Zhuang ; Chuan Lei ; Elke A. Rundensteiner ; Mohamed Y. Eltabakh

【Abstract】: While recurring queries over evolving data are the bedrock of analytical applications, the resources demanded to process a large amount of data for each recurring execution can be a fatal bottleneck in cost-sensitive cloud computing environments. It is thus imperative to design a system responsive to users' preferences regarding how resources should be utilized. In this work, we propose PRO, a preference-aware recurring query processing system that optimizes recurring query executions complying with user preferences. First, we show that finding an optimal solution is an NP-complete problem due to the cost interdependencies between consecutive executions. We propose an execution relation graph (ERG) model that effectively incorporates these dependencies between executions. This model enables us to transform our problem into a well-known graph problem. We then design a graph-based approach (called PRO-OPT) leveraging dynamic programming and pruning techniques with pseudo-polynomial complexity. Our experiments confirm that PRO consistently outperforms state-of-the-art solutions by 9-fold in processing time under a rich variety of circumstances on the Wikipedia datasets.

【Keywords】: execution selection; preference-aware; recurring query

263. Discovering Temporal Purchase Patterns with Different Responses to Promotions.

【Paper Link】【Pages】:2197-2202

【Authors】: Ling Luo ; Bin Li ; Irena Koprinska ; Shlomo Berkovsky ; Fang Chen

【Abstract】: Supermarkets often use sales promotions to attract customers and create brand loyalty. They would often like to know if their promotions are effective for various customers, so that better timing and more suitable rates can be planned in the future. Given a transaction data set collected by an Australian national supermarket chain, in this paper we conduct a case study aimed at discovering customers' long-term purchase patterns, which may be induced by preference changes, as well as short-term purchase patterns, which may be induced by promotions. Since purchase events of individual customers may be too sparse to model, we propose to discover a number of latent purchase patterns from the data. The latent purchase patterns are modeled via a mixture of non-homogeneous Poisson processes where each Poisson intensity function is composed of long-term and short-term components. Through the case study, 1) we validate that our model can accurately estimate the occurrences of purchase events; 2) we discover easy-to-interpret long-term gradual changes and short-term periodic changes in different customer groups; 3) we identify the customers who are receptive to promotions through the correlation between behavior patterns and the promotions, which is particularly worthwhile for target marketing.

【Keywords】: customer behaviors; customer segmentation; mixture modeling; non-homogeneous poisson process; temporal modeling

264. ZEST: A Hybrid Model on Predicting Passenger Demand for Chauffeured Car Service.

【Paper Link】【Pages】:2203-2208

【Authors】: Hua Wei ; Yuandong Wang ; Tianyu Wo ; Yaxiao Liu ; Jie Xu

【Abstract】: Chauffeured car service based on mobile applications like Uber or Didi suffers from supply-demand disequilibrium, which can be alleviated by proper prediction of the distribution of passenger demand. In this paper, we propose a Zero-Grid Ensemble Spatio Temporal model (ZEST) to predict passenger demand with four predictors: a temporal predictor and a spatial predictor to model the influences of local and spatial factors separately, an ensemble predictor to combine the results of the former two predictors comprehensively, and a Zero-Grid predictor to predict zero-demand areas specifically, since any cruising within these areas wastes the driver's energy and time. We demonstrate the performance of ZEST on actual operational data from ride-hailing applications with more than 6 million order records and 500 million GPS points. Experimental results indicate our model outperforms 5 other baseline models by over 10% both in MAE and sMAPE on the three-month datasets.

【Keywords】: chauffeured car service; demand prediction; spatiotemporal data mining

265. A Filtering-based Clustering Algorithm for Improving Spatio-temporal Kriging Interpolation Accuracy.

【Paper Link】【Pages】:2209-2214

【Authors】: Qiao Kang ; Wei-keng Liao ; Ankit Agrawal ; Alok N. Choudhary

【Abstract】: Geostatistical interpolation is the process that uses existing data and statistical models as inputs to predict data in unobserved spatio-temporal contexts as output. Kriging is a well-known geostatistical interpolation method that minimizes mean square error of prediction. The result interpolated by Kriging is accurate when consistency of statistical properties in data is assumed. However, without this assumption, Kriging interpolation has poor accuracy. To address this problem, this paper presents a new filtering-based clustering algorithm that partitions data into clusters such that the interpolation error within each cluster is significantly reduced, which in turn improves the overall accuracy. Comparisons to traditional Kriging are made with two real-world datasets using two error criteria: normalized mean square error (NMSE) and χ² test statistics for normalized deviation measurement. Our method has reduced NMSE by more than 50% for both datasets over traditional Kriging. Moreover, χ² tests have also shown significant improvements of our approach over traditional Kriging.

【Keywords】: kriging; spatio-temporal clustering; spatio-temporal interpolation

266. Reuse-based Optimization for Pig Latin.

【Paper Link】【Pages】:2215-2220

【Authors】: Jesús Camacho-Rodríguez ; Dario Colazzo ; Melanie Herschel ; Ioana Manolescu ; Soudip Roy Chowdhury

【Abstract】: Pig Latin is a popular language which is widely used for parallel processing of massive data sets. Currently, subexpressions occurring repeatedly in Pig Latin scripts are executed as many times as they appear, and the current Pig Latin optimizer does not identify reuse opportunities. We present a novel optimization approach aiming at identifying and reusing repeated subexpressions in Pig Latin scripts. Our optimization algorithm, named PigReuse, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and reuses their results as needed in order to compute exactly the same output as the original scripts. Our experiments demonstrate the effectiveness of our approach.

【Keywords】: linear programming; piglatin; reuse-based optimization

267. Discriminative View Learning for Single View Co-Training.

【Paper Link】【Pages】:2221-2226

【Authors】: Joseph St. Amand ; Jun Huan

【Abstract】: Co-training, a popular semi-supervised learning technique, is severely limited as it is applicable only to datasets which have a natural division of the feature space into two or more distinct views. In this paper, we investigate techniques to apply co-training to single-view data sets. We develop a view learning technique which takes a single view dataset and learns multiple views. These learned views balance the available discriminatory information in the dataset, while still meeting Blum's co-training criteria. In addition, we constrain the views such that pairs of learned view embedding functions exhibit sparsity in a complementary pattern, which aids in increasing diversity. Finally, we demonstrate the efficacy of our approach via experimental means on several real-world datasets from different domains.

【Keywords】: discriminative features; multi-view learning; semi-supervised learning; single-view co-training

【Paper Link】【Pages】:2227-2232

【Authors】: Dawei Chen ; Cheng Soon Ong ; Lexing Xie

【Abstract】: The problem of recommending tours to travellers is an important and broadly studied area. Suggested solutions include various approaches of points-of-interest (POI) recommendation and route planning. We consider the task of recommending a sequence of POIs that simultaneously uses information about POIs and routes. Our approach unifies the treatment of various sources of information by representing them as features in machine learning algorithms, enabling us to learn from past behaviour. Information about POIs is used to learn a POI ranking model that accounts for the start and end points of tours. Data about previous trajectories are used for learning transition patterns between POIs that enable us to recommend probable routes. In addition, a probabilistic model is proposed to combine the results of POI ranking and the POI to POI transitions. We propose a new F1 score on pairs of POIs that captures the order of visits. Empirical results show that our approach improves on recent methods, and demonstrate that combining points and routes enables better trajectory recommendations.

【Keywords】: learning to rank; planning; trajectory recommendation

269. Towards Representation Independent Similarity Search Over Graph Databases.

【Paper Link】【Pages】:2233-2238

【Authors】: Yodsawalai Chodpathumwan ; Amirhossein Aleyasen ; Arash Termehchy ; Yizhou Sun

【Abstract】: Finding similar entities is a fundamental problem in graph data analysis. Similarity search algorithms usually leverage the structural properties of the database to quantify the degree of similarity between entities. However, the same information can be represented in different structures and the structural properties observed over particular representations may not hold for the alternatives. These algorithms are effective on some representations and ineffective on others. We define the property of representation independence for similarity search algorithms as their robustness against transformations that modify the structure of databases but preserve the information content. We introduce a widespread group of such transformations called relationship reorganizing. We propose an algorithm called R-PathSim, which is provably robust under relationship reorganizing. Our empirical results show that current algorithms except R-PathSim are highly sensitive to the data representation and R-PathSim is as efficient and effective as other algorithms.

【Keywords】: database transformation; representation independence; structural similarity search

270. Why Did You Cover That Song?: Modeling N-th Order Derivative Creation with Content Popularity.

【Paper Link】【Pages】:2239-2244

【Authors】: Kosetsu Tsukuda ; Masahiro Hamasaki ; Masataka Goto

【Abstract】: Many amateur creators now create derivative works and put them on the web. Although there are several factors that inspire the creation of derivative works, such factors cannot usually be observed on the web. In this paper, we propose a model for inferring latent factors from sequences of derivative work posting events. We assume a sequence to be a stochastic process incorporating the following three factors: (1) the original work's attractiveness, (2) the original work's popularity, and (3) the derivative work's popularity. To characterize content popularity, we use content ranking data and incorporate rank-biased popularity based on the creators' browsing behavior. Our main contributions are three-fold: (1) to the best of our knowledge, this is the first study modeling derivative creation activity, (2) by using a real-world dataset of music-related derivative work creation to evaluate our model, we showed the effectiveness of adopting all three factors to model derivative creation activity and considering creators' browsing behavior, and (3) we carried out qualitative experiments and showed that our model is useful in analyzing derivative creation activity in terms of category characteristics, temporal development of factors that trigger derivative work posting events, etc.

【Keywords】: derivative creation; latent variable model; user generated content

271. Anomalies in the Peer-review System: A Case Study of the Journal of High Energy Physics.

【Paper Link】【Pages】:2245-2250

【Authors】: Sandipan Sikdar ; Matteo Marsili ; Niloy Ganguly ; Animesh Mukherjee

【Abstract】: The peer-review system has long been relied upon for bringing quality research to the notice of the scientific community and also preventing flawed research from entering into the literature. The need for the peer-review system has often been debated, as in numerous cases it has failed in its task, and in most of these cases the editors and the reviewers were thought to be responsible for not being able to correctly judge the quality of the work. This raises a question: "Can the peer-review system be improved?" Since editors and reviewers are the most important pillars of a reviewing system, we, in this work, attempt to address a related question: given the editing/reviewing history of the editors or reviewers, "can we identify the under-performing ones?", with citations received by the edited/reviewed papers being used as a proxy for quantifying performance. We term such reviewers and editors anomalous, and we believe identifying and removing them will improve the performance of the peer-review system. Using a massive dataset of the Journal of High Energy Physics (JHEP) consisting of 29k papers submitted between 1997 and 2015 with 95 editors and 4035 reviewers and their review history, we identify several factors which point to anomalous behavior of referees and editors. In fact, the anomalous editors and reviewers account for 26.8% and 14.5% of the total editors and reviewers respectively, and for most of these anomalous reviewers the performance degrades alarmingly over time.

【Keywords】: citation; editor; peer-review system; reviewer

272. Multi-source Hierarchical Prediction Consolidation.

【Paper Link】【Pages】:2251-2256

【Authors】: Chenwei Zhang ; Sihong Xie ; Yaliang Li ; Jing Gao ; Wei Fan ; Philip S. Yu

【Abstract】: In big data applications such as healthcare data mining, due to privacy concerns, it is necessary to collect predictions from multiple information sources for the same instance, with raw features being discarded or withheld when aggregating multiple predictions. Besides, crowd-sourced labels need to be aggregated to estimate the ground truth of the data. Due to the imperfection caused by predictive models or human crowdsourcing workers, noisy and conflicting information is ubiquitous and inevitable. Although state-of-the-art aggregation methods have been proposed to handle label spaces with flat structures, as the label space is becoming more and more complicated, aggregation under a hierarchical label structure becomes necessary but has been largely ignored. These label hierarchies can be quite informative as they are usually created by domain experts to make sense of highly complex label correlations such as protein functionality interactions or disease relationships. We propose a novel multi-source hierarchical prediction consolidation method that effectively exploits the complicated hierarchical label structures to resolve the noisy and conflicting information that inherently originates from multiple imperfect sources. We formulate the problem as an optimization problem with a closed-form solution. The consolidation result is inferred in a totally unsupervised, iterative fashion. Experimental results on both synthetic and real-world data sets show the effectiveness of the proposed method over existing alternatives.

【Keywords】: crowdsourcing; ensemble; hierarchy; unsupervised learning

273. Probabilistic Knowledge Graph Construction: Compositional and Incremental Approaches.

【Paper Link】【Pages】:2257-2262

【Authors】: Dongwoo Kim ; Lexing Xie ; Cheng Soon Ong

【Abstract】: Knowledge graph construction consists of two tasks: extracting information from external resources (knowledge population) and inferring missing information through a statistical analysis on the extracted information (knowledge completion). In many cases, insufficient external resources in the knowledge population hinder the subsequent statistical inference. The gap between these two processes can be reduced by an incremental population approach. We propose a new probabilistic knowledge graph factorisation method that benefits from the path structure of existing knowledge (e.g. syllogism) and enables a common modelling approach to be used for both incremental population and knowledge completion tasks. More specifically, the probabilistic formulation allows us to develop an incremental population algorithm that trades off exploitation-exploration. Experiments on three benchmark datasets show that the balanced exploitation-exploration helps the incremental population, and the additional path structure helps to predict missing information in knowledge completion.

【Keywords】: active learning; knowledge graph; thompson sampling

274. Explaining Sentiment Spikes in Twitter.

【Paper Link】【Pages】:2263-2268

【Authors】: Anastasia Giachanou ; Ida Mele ; Fabio Crestani

【Abstract】: Tracking public opinion in social media provides important information to enterprises or governments during a decision making process. In addition, identifying and extracting the causes of sentiment spikes allows interested parties to redesign and adjust strategies with the aim of attracting more positive sentiments. In this paper, we focus on the problem of tracking sentiment towards different entities, detecting sentiment spikes and on the problem of extracting and ranking the causes of a sentiment spike. Our approach combines the LDA topic model with Relative Entropy. The former is used for extracting the topics discussed in the time window before the sentiment spike. The latter allows us to rank the detected topics based on their contribution to the sentiment spike.

【Keywords】: sentiment spikes; tracking sentiment; twitter

275. Qualitative Cleaning of Uncertain Data.

【Paper Link】【Pages】:2269-2274

【Authors】: Henning Köhler ; Sebastian Link

【Abstract】: We propose a new view on data cleaning: Not data itself but the degrees of uncertainty attributed to data are dirty. Applying possibility theory, tuples are assigned degrees of possibility with which they occur, and constraints are assigned degrees of certainty that say to which tuples they apply. Classical data cleaning modifies some minimal set of tuples. Instead, we marginally reduce their degrees of possibility. This reduction leads to a new qualitative version of the vertex cover problem. Qualitative vertex cover can be mapped to a linear-weighted constraint satisfaction problem. However, no off-the-shelf solver can solve the problem more efficiently than classical vertex cover. Instead, we utilize the degrees of possibility and certainty to develop a dedicated algorithm that is fixed parameter tractable in the size of the qualitative vertex cover. Experiments show that our algorithm is faster than solvers for the classical vertex cover problem by several orders of magnitude, and performance improves with higher numbers of uncertainty degrees.

【Keywords】: database repair; possibility theory; vertex cover

276. APAM: Adaptive Eager-Lazy Hybrid Evaluation of Event Patterns for Low Latency.

【Paper Link】【Pages】:2275-2280

【Authors】: Ilyeop Yi ; Jae-Gil Lee ; Kyu-Young Whang

【Abstract】: Event pattern detection refers to identifying combinations of events matched to a user-specified query event pattern from a real-time event stream. Latency is an important measure of the performance of an event pattern detection system. Existing methods can be classified into the eager evaluation method and the lazy evaluation method depending on when each event arrival is evaluated. These methods have advantages and disadvantages in terms of latency depending on the event arrival rate. In this paper, we propose a hybrid eager-lazy evaluation method that combines the advantages of both methods. For each event type, the hybrid method, which we call APAM (Adaptive Partitioning-And-Merging), determines which method to use: eager or lazy. We also propose a formal cost model to estimate the latency and propose a method of finding the optimal partition based on the cost model. Finally, we show through experiments that our method can improve the latency by up to 361.48 times over the eager evaluation method and 27.94 times over the lazy evaluation method using a synthetic data set.

【Keywords】: complex event processing; event pattern detection; hybrid; latency; optimization

277. OrientStream: A Framework for Dynamic Resource Allocation in Distributed Data Stream Management Systems.

【Paper Link】【Pages】:2281-2286

【Authors】: Chunkai Wang ; Xiaofeng Meng ; Qi Guo ; Zujian Weng ; Chen Yang

【Abstract】: Distributed data stream management systems (DDSMS) are usually composed of upper layer relational query systems (RQS) and lower layer stream processing systems (SPS). When users submit new queries to RQS, a query planner needs to be converted into a directed acyclic graph (DAG) consisting of tasks which are running on SPS. Based on different query requests and data stream properties, SPS need to configure different deployment strategies. However, how to dynamically predict deployment configurations of SPS to ensure the processing throughput and low resource usage is a great challenge. This article presents OrientStream, a framework for dynamic resource allocation in DDSMS using incremental machine learning techniques. By introducing a four-level feature extraction mechanism (data level, query plan level, operator level and cluster level), we first use the different query workloads as training sets to predict the resource usage of DDSMS and then select the optimal resource configuration from candidate settings based on the current query requests and stream properties. Finally, we validate our approach on the open source SPS Storm. Experiments show that OrientStream can reduce CPU usage by 8%-15% and memory usage by 38%-48%.

【Keywords】: incremental learning; modeling and prediction; relational query system; stream processing system

278. Tag2Word: Using Tags to Generate Words for Content Based Tag Recommendation.

【Paper Link】【Pages】:2287-2292

【Authors】: Yong Wu ; Yuan Yao ; Feng Xu ; Hanghang Tong ; Jian Lu

【Abstract】: Tag recommendation is helpful for the categorization and searching of online content. Existing tag recommendation methods can be divided into collaborative filtering methods and content based methods. In this paper, we put our focus on the content based tag recommendation due to its wider applicability. Our key observation is the tag-content co-occurrence, i.e., many tags have appeared multiple times in the corresponding content. Based on this observation, we propose a generative model (Tag2Word), where we generate the words based on the tag-word distribution as well as the tag itself. Experimental evaluations on real data sets demonstrate that the proposed method outperforms several existing methods in terms of recommendation accuracy, while enjoying linear scalability.

【Keywords】: generative model; tag recommendation; tag-content co-occurrence

279. Digesting Multilingual Reader Comments via Latent Discussion Topics with Commonality and Specificity.

【Paper Link】【Pages】:2293-2298

【Authors】: Bei Shi ; Wai Lam ; Lidong Bing ; Yinqing Xu

【Abstract】: Many news websites from different regions in the world allow readers to write comments in their own languages about an event. Digesting such an enormous amount of comments in different languages is difficult. One elegant way to digest and organize these comments is to detect latent discussion topics with the consideration of language attributes. Some discussion topics are common topics shared between languages whereas some topics are specifically dominated by a particular language. To tackle this task of discovering discussion topics that exhibit commonality or specificity from news reader comments written in different languages, we propose a new model called TDCS based on graphical models, which can cope with the language gap and detect language-common and language-specific latent discussion topics simultaneously. Our TDCS model also exploits comment-oriented clues via a scalable Dirichlet Multinomial Regression method. To learn the model parameters, we develop an inference method which alternates between EM and Gibbs sampling. Experimental results show that our proposed TDCS model can provide an effective way to digest multilingual news reader comments.

【Keywords】: commonality and specificity; latent discussion topics; multilingual news reader comments

280. Digesting News Reader Comments via Fine-Grained Associations with Event Facets and News Contents.

【Paper Link】【Pages】:2299-2304

【Authors】: Bei Shi ; Wai Lam

【Abstract】: News articles from different sources reporting the same event are often associated with an enormous amount of reader comments, which makes it difficult to digest the comments manually. Some of these comments, despite coming from different sources, discuss a certain facet of the event. On the other hand, some comments discuss the specific topic of the corresponding news article. We propose a framework that can digest reader comments automatically via fine-grained associations with event facets and news. We propose an unsupervised model called DRC, based on collective matrix factorization, and develop a multiplicative-update method to infer the parameters. Experimental results show that our proposed DRC model can provide an effective way to digest news reader comments.

【Keywords】: event facets; matrix factorization; news comments

281. Efficient Algorithms for the Two Locus Problem in Genome-Wide Association Study: Algorithms for the Two Locus Problem.

【Paper Link】【Pages】:2305-2310

【Authors】: Sanguthevar Rajasekaran ; Subrata Saha

【Abstract】: Advances made in sequencing technology have resulted in the sequencing of thousands of genomes. Novel analysis tools are needed to process these data and extract useful information. Such tools could aid in personalized medicine. As an example, we could identify the causes for a disease by comparing the genomes of people who have the disease and those who do not have this disease. Given that human variability happens due to single nucleotide polymorphisms (SNPs), we could focus our attention on these SNPs. Investigations that try to understand human variability using SNPs fall under genome-wide association study (GWAS). A crucial step in GWAS is the identification of the correlation between genotypes (SNPs) and phenotypes (i.e., characteristics such as the presence of a disease). This step can be modeled as the k-locus problem (where k is any integer). A number of algorithms have been proposed in the literature for this problem when k = 2. In this paper we present an algorithm for solving the 2-locus problem that is up to two orders of magnitude faster than the previous best known algorithms.

【Keywords】: genome-wide association study; string correlations; two-locus problem

【Paper Link】【Pages】:2311-2316

【Authors】: Thomas Niebler ; Martin Becker ; Daniel Zoller ; Stephan Doerfel ; Andreas Hotho

【Abstract】: Social tagging systems have established themselves as a quick and easy way to organize information by annotating resources with tags. In recent work, user behavior in social tagging systems was studied, that is, how users assign tags, and consume content. However, it is still unclear how users make use of the navigation options they are given. Understanding their behavior and differences in behavior of different user groups is an important step towards assessing the effectiveness of a navigational concept and improving it to better suit the users' needs. In this work, we investigate navigation trails in the popular scholarly social tagging system BibSonomy from six years of log data. We discuss dynamic browsing behavior of the general user population and show that different navigational subgroups exhibit different navigational traits. Furthermore, we provide strong evidence that the semantic nature of the underlying folksonomy is an essential factor for explaining navigation.

【Keywords】: bibsonomy; folksonomy; navigation analysis; semantic similarity; social tagging system; tagging

283. Memory-Optimized Distributed Graph Processing through Novel Compression Techniques.

【Paper Link】【Pages】:2317-2322

【Authors】: Panagiotis Liakos ; Katia Papakonstantinopoulou ; Alex Delis

【Abstract】: A multitude of contemporary applications now involve graph data whose size continuously grows and this trend shows no signs of subsiding. This has caused the emergence of many distributed graph processing systems including Pregel and Apache Giraph. However, the unprecedented scale now reached by real-world graphs hardens the task of graph processing even in distributed environments and the current memory usage patterns rapidly become a primary concern for such contemporary graph processing systems. We seek to address this challenge by exploiting empirically-observed properties demonstrated by graphs that are generated by human activity. In this paper, we propose three space-efficient adjacency list representations that can be applied to any distributed graph processing system. Our suggested compact representations reduce respective memory requirements for accommodating the graph elements up to 5 times if compared with state-of-the-art methods. At the same time, our memory-optimized methods retain the efficiency of uncompressed structures and enable the execution of algorithms for large scale graphs in settings where contemporary alternative structures fail due to memory errors.

【Keywords】: distributed computing; graph compression; pregel

284. Tracking the Evolution of Congestion in Dynamic Urban Road Networks.

【Paper Link】【Pages】:2323-2328

【Authors】: Tarique Anwar ; Chengfei Liu ; Hai L. Vu ; Md. Saiful Islam

【Abstract】: The congestion scenario on a road network is often represented by a set of differently congested partitions having a homogeneous level of congestion inside. Due to the changing traffic, these partitions evolve with time. In this paper, we propose a two-layer method to incrementally update the differently congested partitions from those at the previous time point in an efficient manner, and thus track their evolution. The physical layer performs low-level computations to incrementally update a set of small-sized road network building blocks, and the logical layer provides an interface to query the physical layer about the congested partitions. At each time point, the unstable road segments are identified and moved to their most suitable building blocks. Our experimental results on different datasets show that the proposed method is much more efficient than the existing re-partitioning methods without significant sacrifice in accuracy.

【Keywords】: incremental partitioning; road network motif; tracking congestion evolution; urban road networks

285. The Rich and the Poor: A Markov Decision Process Approach to Optimizing Taxi Driver Revenue Efficiency.

【Paper Link】【Pages】:2329-2334

【Authors】: Huigui Rong ; Xun Zhou ; Chang Yang ; M. Zubair Shafiq ; Alex X. Liu

【Abstract】: Taxi services play an important role in the public transportation system of large cities. Improving taxi business efficiency is an important societal problem since it could improve the income of the drivers and reduce gas emissions and fuel consumption. The recent research on seeking strategies may not be optimal for the overall revenue over an extended period of time as they ignored the important impact of passengers' destinations on future passenger seeking. To address these issues, this paper investigates how to increase the revenue efficiency (revenue per unit time) of taxi drivers, and models the passenger seeking process as a Markov Decision Process (MDP). For each one-hour time slot, we learn a different set of parameters for the MDP from data and find the best move for a vacant taxi to maximize the total revenue in that time slot. A case study and several experimental evaluations on a real dataset from a major city in China show that our proposed approach improves the revenue efficiency of inexperienced drivers by up to 15% and outperforms a baseline method in all the time slots.

【Keywords】: markov decision process; revenue efficiency; taxi driver

286. Ensemble of Anchor Adapters for Transfer Learning.

【Paper Link】【Pages】:2335-2340

【Authors】: Fuzhen Zhuang ; Ping Luo ; Sinno Jialin Pan ; Hui Xiong ; Qing He

【Abstract】: In the past decade, there have been a large number of transfer learning algorithms proposed for various real-world applications. However, most of them are vulnerable to negative transfer since their performance is even worse than traditional supervised models. Aiming at more robust transfer learning models, we propose an ENsemble framework of anCHOR adapters (ENCHOR for short), in which an anchor adapter adapts the features of instances based on their similarities to a specific anchor (i.e., a selected instance). Specifically, the more similar an instance is to the anchor instance, the more of its original features remain unchanged in the adapted representation, and vice versa. This adapted representation for the data actually expresses the local structure around the corresponding anchor, and then any transfer learning method can be applied to this adapted representation for a prediction model, which focuses more on the neighborhood of the anchor. Next, based on multiple anchors, multiple anchor adapters can be built and combined into an ensemble for final output. Additionally, we develop an effective measure to select the anchors for ensemble building to achieve further performance improvement. Extensive experiments on hundreds of text classification tasks are conducted to demonstrate the effectiveness of ENCHOR. The results show that when traditional supervised models perform poorly, ENCHOR (based on only 8 selected anchors) achieves a 6%-13% increase in terms of average accuracy compared with the state-of-the-art methods, and it greatly alleviates negative transfer.

【Keywords】: classification; transfer learning

287. Incremental Mining of High Utility Sequential Patterns in Incremental Databases.

【Paper Link】【Pages】:2341-2346

【Authors】: Jun-Zhe Wang ; Jiun-Long Huang

【Abstract】: High utility sequential pattern (HUSP) mining is an emerging topic in pattern mining, and only a few algorithms have been proposed to address it. In practice, most sequence databases usually grow over time, and it is inefficient for existing algorithms to mine HUSPs from scratch when databases grow with a small portion of updates. In view of this, we propose the IncUSP-Miner algorithm to mine HUSPs incrementally. Specifically, to avoid redundant computations, we propose a tighter upper bound of the utility of a sequence, called TSU, and then design a novel data structure, called the candidate pattern tree, to maintain the sequences whose TSU values are greater than or equal to the minimum utility threshold. Accordingly, to avoid keeping a huge amount of utility information for each sequence, a set of auxiliary utility information is designed to be stored in each tree node. Moreover, for those nodes whose utilities have to be updated, a strategy is also proposed to reduce the amount of computation, thereby improving the mining efficiency. Experimental results on three real datasets show that IncUSP-Miner is able to efficiently mine HUSPs incrementally.

【Keywords】: high utility sequential pattern mining; incremental high utility sequential pattern mining; incremental mining

288. Understanding Stability of Noisy Networks through Centrality Measures and Local Connections.

【Paper Link】【Pages】:2347-2352

【Authors】: Vladimir Ufimtsev ; Soumya Sarkar ; Animesh Mukherjee ; Sanjukta Bhowmick

【Abstract】: Networks created from real-world data contain some inaccuracies or noise, manifested as small changes in the network structure. An important question is whether these small changes can significantly affect the analysis results. In this paper, we study the effect of noise in changing ranks of the high centrality vertices. We compare, using the Jaccard Index (JI), how many of the top-k high centrality nodes from the original network are also part of the top-k ranked nodes from the noisy network. We deem a network as stable if the JI value is high. We observe two features that affect the stability. First, the stability is dependent on the number of top-ranked vertices considered. When the vertices are ordered according to their centrality values, they group into clusters. Perturbations to the network can change the relative ranking within the cluster, but vertices rarely move from one cluster to another. Second, the stability is dependent on the local connections of the high ranking vertices. The network is highly stable if the high ranking vertices are connected to each other. Our findings show that the stability of a network is affected by the local properties of high centrality vertices, rather than the global properties of the entire network. Based on these local properties we can identify the stability of a network, without explicitly applying a noise model.

【Keywords】: betweenness; closeness; noise; rich-club; stability
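
The stability measure described in the abstract above, the Jaccard Index between the top-k high centrality nodes of the original and the noisy network, can be sketched as follows. The centrality scores are hypothetical values supplied directly as dictionaries; computing them from a graph (for example betweenness or closeness) is outside this sketch, and the helper names are illustrative rather than taken from the paper.

```python
def top_k(scores, k):
    """Return the set of the k nodes with the highest centrality score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard(a, b):
    """Jaccard Index |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical centrality scores before and after adding noise.
original = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
noisy = {"a": 0.85, "b": 0.4, "c": 0.6, "d": 0.2}

ji = jaccard(top_k(original, 2), top_k(noisy, 2))
print(round(ji, 3))
```

Here the top-2 sets are {a, b} and {a, c}, so they share one node out of three distinct nodes and the index is 1/3. Repeating the computation for several values of k would show how the score depends on the number of top-ranked vertices considered.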

289. Online Adaptive Topic Focused Tweet Acquisition.

【Paper Link】【Pages】:2353-2358

【Authors】: Mehdi Sadri ; Sharad Mehrotra ; Yaming Yu

【Abstract】: Twitter provides a public streaming API that is strictly limited, making it difficult to simultaneously achieve good coverage and relevance when monitoring tweets for a specific topic of interest. In this paper, we address the tweet acquisition challenge to enhance monitoring of tweets based on the client/application needs in an online adaptive manner such that the quality and quantity of the results improve over time. We propose a Tweet Acquisition System (TAS) that iteratively selects phrases to track based on an explore-exploit strategy. Our experimental studies show that TAS significantly improves recall of relevant tweets and the performance improves when the topics are more specific.

【Keywords】: adaptive data acquisition; data crawling; data preprocessing; data quality; explore exploit; information retrieval; social media data; topic focused crawler; tweet acquisition

290. Optimizing Nugget Annotations with Active Learning.

【Paper Link】【Pages】:2359-2364

【Authors】: Gaurav Baruah ; Haotian Zhang ; Rakesh Guttikonda ; Jimmy J. Lin ; Mark D. Smucker ; Olga Vechtomova

【Abstract】: Nugget-based evaluations, such as those deployed in the TREC Temporal Summarization and Question Answering tracks, require human assessors to determine whether a nugget is present in a given piece of text. This process, known as nugget annotation, is labor-intensive. In this paper, we present two active learning techniques that prioritize the sequence in which candidate nugget/sentence pairs are presented to an assessor, based on the likelihood that the sentence contains a nugget. Our approach builds on the recognition that nugget annotation is similar to high-recall retrieval, and we adapt proven existing solutions. Simulation experiments with four existing TREC test collections show that our techniques yield far more matches for a given level of effort than baselines that are typically deployed in previous nugget-based evaluations.

【Keywords】: nugget-based evaluations; question answering; temporal summarization; trec

【Paper Link】【Pages】:2365-2370

【Authors】: Prudhvi Ratna Badri Satya ; Kyumin Lee ; Dongwon Lee ; Thanh Tran ; Jason (Jiasheng) Zhang

【Abstract】: As the commercial implications of Likes in online social networks multiply, the number of fake Likes also increases rapidly. To maintain a healthy ecosystem, however, it is critically important to prevent and detect such fake Likes. Toward this goal, in this paper, we investigate the problem of detecting the so-called "fake likers" who frequently make fake Likes for illegitimate reasons. To uncover fake Likes in online social networks, we: (1) first collect a substantial number of profiles of both fake and legitimate Likers using linkage and honeypot approaches, (2) analyze the characteristics of both types of Likers, (3) identify effective features exploiting the learned characteristics and apply them in supervised learning models, and (4) thoroughly evaluate their performances against three baseline methods and under two attack models. Our experimental results show that our proposed methods with effective features significantly outperformed baseline methods, with accuracy = 0.871, false positive rate = 0.1, and false negative rate = 0.14.

【Keywords】: facebook; fake likers; fiverr; microworkers; online social networks

292. Where to Place Your Next Restaurant?: Optimal Restaurant Placement via Leveraging User-Generated Reviews.

【Paper Link】【Pages】:2371-2376

【Authors】: Feng Wang ; Li Chen ; Weike Pan

【Abstract】: When opening a new restaurant, geographical placement is of prime importance in determining whether it will thrive. Although some methods have been developed to assess the attractiveness of candidate locations for a restaurant, the accuracy is limited as they mainly rely on traditional data sources, such as demographic studies or consumer surveys. With the advent of abundant user-generated restaurant reviews, there is a potential to leverage these reviews to gain some insights into users' preferences for restaurants. In this paper, we particularly take advantage of user-generated reviews to construct predictive features for assessing the attractiveness of candidate locations to expand a restaurant. Specifically, we investigate three types of features: review-based market attractiveness, review-based market competitiveness and geographic characteristics of a location under consideration for a prospective restaurant. We devise the three sets of features and incorporate them into a regression model to predict the number of check-ins that a prospective restaurant at a candidate location would be likely to attract. We then conduct an experiment with real-world restaurant data, which demonstrates the predictive power of features we constructed in this paper. Moreover, our experimental results suggest that market attractiveness and market competitiveness features mined solely from user-generated restaurant reviews are more predictive than geographic features.

【Keywords】: geographic features; market attractiveness features; market competitiveness features; optimal restaurant placement; user-generated reviews

【Paper Link】【Pages】:2377-2382

【Authors】: Justin Sampson ; Fred Morstatter ; Liang Wu ; Huan Liu

【Abstract】: The automatic and early detection of rumors is of paramount importance as the spread of information with questionable veracity can have devastating consequences. This became starkly apparent when, in early 2013, a compromised Associated Press account issued a tweet claiming that there had been an explosion at the White House. This tweet resulted in a significant drop for the Dow Jones Industrial Average. Most existing work in rumor detection leverages conversation statistics and propagation patterns; however, such patterns tend to emerge slowly, requiring a conversation to have a significant number of interactions in order to become eligible for classification. In this work, we propose a method for classifying conversations within their formative stages as well as improving accuracy within mature conversations through the discovery of implicit linkages between conversation fragments. In our experiments, we show that current state-of-the-art rumor classification methods can leverage implicit links to significantly improve the ability to properly classify emergent conversations when very little conversation data is available. Adopting this technique allows rumor detection methods to continue to provide a high degree of classification accuracy on emergent conversations with as few as a single tweet. This improvement virtually eliminates the delay of conversation growth inherent in current rumor classification methods while significantly increasing the number of conversations considered viable for classification.

【Keywords】: implicit network; rumor detection; social media

294. Automatical Storyline Generation with Help from Twitter.

【Paper Link】【Pages】:2383-2388

【Authors】: Ting Hua ; Xuchao Zhang ; Wei Wang ; Chang-Tien Lu ; Naren Ramakrishnan

【Abstract】: Storyline detection aims to connect seemingly irrelevant single documents into meaningful chains, which provides opportunities for understanding how events evolve over time and what triggers such evolutions. Most previous work generated the storylines through unsupervised methods that can hardly reveal underlying factors driving the evolution process. This paper introduces a Bayesian model to generate storylines from massive documents and infer the corresponding hidden relations and topics. In addition, our model is the first attempt that utilizes Twitter data as human input to "supervise" the generation of storylines. Through extensive experiments, we demonstrate our proposed model can achieve significant improvement over baseline methods and can be used to discover interesting patterns for real world cases.

【Keywords】: storyline; topic modeling; twitter

295. A Comparative Study of Query-biased and Non-redundant Snippets for Structured Search on Mobile Devices.

【Paper Link】【Pages】:2389-2394

【Authors】: Nikita V. Spirin ; Alexander S. Kotov ; Karrie G. Karahalios ; Vassil Mladenov ; Pavel A. Izhutov

【Abstract】: To investigate what kind of snippets are better suited for structured search on mobile devices, we built an experimental mobile search application and conducted a task-oriented interactive user study with 36 participants. Four different versions of a search engine result page (SERP) were compared by varying the snippet type (query-biased vs. non-redundant) and the snippet length (two vs. four lines per result). We adopted a within-subjects experiment design and made each participant do four realistic search tasks using different versions of the application. During the study sessions, we collected search logs, "think-aloud" comments, and post-task surveys. Each session was finalized with an interview. We found that with non-redundant snippets the participants were able to complete the tasks faster and find more relevant results. Most participants preferred non-redundant snippets and wanted to see more information about each result on the SERP for any snippet type. Yet, the participants felt that the version with query-biased snippets was easier to use. We conclude with a set of practical design recommendations.

【Keywords】: mobile search; snippet; structured search; user study

296. Content-Agnostic Malware Detection in Heterogeneous Malicious Distribution Graph.

【Paper Link】【Pages】:2395-2400

【Authors】: Ibrahim M. Alabdulmohsin ; Yufei Han ; Yun Shen ; Xiangliang Zhang

【Abstract】: Malware detection has been widely studied by analysing either file dropping relationships or characteristics of the file distribution network. This paper, for the first time, studies a global heterogeneous malware delivery graph fusing file dropping relationship and the topology of the file distribution network. The integration offers a unique ability of structuring the end-to-end distribution relationship. However, it brings large heterogeneous graphs to analysis. In our study, an average daily generated graph has more than 4 million edges and 2.7 million nodes that differ in type, such as IPs, URLs, and files. We propose a novel Bayesian label propagation model to unify the multi-source information, including content-agnostic features of different node types and topological information of the heterogeneous network. Our approach does not need to examine the source codes nor inspect the dynamic behaviours of a binary. Instead, it estimates the maliciousness of a given file through a semi-supervised label propagation procedure, which has a linear time complexity w.r.t. the number of nodes and edges. The evaluation on 567 million real-world download events validates that our proposed approach efficiently detects malware with a high accuracy.

【Keywords】: algorithm; bayesian inference; data mining; download activity graph; label propagation; malware detection; malware mitigation; semi-supervised learning

Industry Track Short Papers 6

【Paper Link】【Pages】:2401-2404

【Authors】: Liang Wang ; Kuang-chih Lee ; Quan Lu

【Abstract】: User attributes including online behavior history and demographic information are the keys to decide whether a user is the right audience for an advertisement. When a user visits a website, the website generally plugs a browser cookie string ("bcookie" for short). The bcookie is then used as an identifier to collect the user's online behavior, as well as the joint key to link user profile attributes, such as demographic information and browsing history. However, the same users can have different bcookies across different browsers and devices. Moreover, bcookies can expire after some period or be cleared by browsers or users. This situation of bcookie discounting typically introduces both performance and delivery problems in online advertising since it is hard for advertisers to find the most receptive audiences based on the user profile information. In this paper, we try to tackle this problem by using an "assistant identifier" to find the linkage between different bcookies. For most Internet companies, in addition to the bcookie information, there are always other identifiers such as IP address, user agent, OS type and version, etc., stored in the serving log data. Therefore, we propose a unified framework to link different bcookies from the same users according to those assistant identifiers. Specifically, our proposed method first constructs a bipartite graph with linkages between the assistant identifiers and the bcookies. Next all attributes associated with each bcookie are propagated along the graph using the state-of-the-art random walk model. Offline comparative experimental studies are conducted to confirm that by enriching the bcookie attributes we can recover 20% more online users whose bcookie information is lost, which is greatly helpful to deliver more budget spending with a little loss in precision of predicting converted users. On-product evaluation further confirms the effectiveness of the proposed method.

【Keywords】: advertisement recommendation; probability mass propagation; user attribute enrichment

298. Balanced Supervised Non-Negative Matrix Factorization for Childhood Leukaemia Patients.

【Paper Link】【Pages】:2405-2408

【Authors】: Ali Braytee ; Daniel R. Catchpoole ; Paul J. Kennedy ; Wei Liu

【Abstract】: Supervised feature extraction methods have received considerable attention in the data mining community due to their capability to improve the classification performance of the unsupervised dimensionality reduction methods. With increasing dimensionality, several methods based on supervised feature extraction are proposed to achieve a feature ranking especially on microarray gene expression data. This paper proposes a method with twofold objectives: it implements a balanced supervised non-negative matrix factorization (BSNMF) to handle the class imbalance problem in supervised non-negative matrix factorization techniques. Furthermore, it proposes an accurate gene ranking method based on our proposed BSNMF for microarray gene expression datasets. To the best of our knowledge, this is the first work to handle the class imbalance problem in supervised feature extraction methods. This work is part of a Human Genome project at The Children's Hospital at Westmead (TB-CHW), Australia. Our experiments indicate that the factorized components using the supervised feature extraction approach have more classification capability than the unsupervised one, but the approach fails drastically in the presence of the class imbalance problem. Our proposed method outperforms the state-of-the-art methods and shows promise in overcoming this concern.

【Keywords】: gene selection; imbalance class problem; non-negative matrix factorization; supervised feature extraction

【Paper Link】【Pages】:2409-2412

【Authors】: Minh-Tien Nguyen ; Chien-Xuan Tran ; Duc-Vu Tran ; Minh-Le Nguyen

【Abstract】: This paper presents a dataset named SoLSCSum for social context summarization. The dataset includes 157 open-domain articles along with their comments collected from Yahoo News. The articles and their comments were manually annotated by two annotators to extract standard summaries. The inter-annotator agreement is 74.5% and Cohen's Kappa is 0.5845. To illustrate the potential use of our dataset, a learning to rank model was trained by using a set of local and cross features. Experimental results demonstrate that: (1) our model trained by Ranking SVM obtains significant improvements from 5.5% to 14.8% of ROUGE-1 over state-of-the-art baselines in document summarization and (2) our dataset can be used to train summary methods such as SVM.

【Keywords】: dataset; information retrieval; learning to rank; nlp; social context summarization; text summarization
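
The abstract above reports both a raw agreement rate and Cohen's Kappa. As a hedged illustration of how the two relate, the sketch below computes Kappa for two annotators; the toy labels are invented for the example and are not drawn from SoLSCSum.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy data (not from the dataset): 1 = sentence selected for the summary.
ann_a = [1, 1, 0, 0, 1, 0, 1, 0]
ann_b = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(ann_a, ann_b)

# Chance agreement implied by the reported figures, assuming the 74.5%
# is raw observed agreement: p_e = (p_o - kappa) / (1 - kappa).
implied_p_e = (0.745 - 0.5845) / (1 - 0.5845)
```

If the reported 74.5% is read as raw observed agreement, rearranging the Kappa formula gives an implied chance agreement of roughly 0.39, which is what `implied_p_e` evaluates.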

300. Distributed Deep Learning for Question Answering.

【Paper Link】【Pages】:2413-2416

【Authors】: Minwei Feng ; Bing Xiang ; Bowen Zhou

【Abstract】: This paper is an empirical study of the distributed deep learning for question answering subtasks: answer selection and question classification. Comparison studies of SGD, MSGD, ADADELTA, ADAGRAD, ADAM/ADAMAX, RMSPROP, DOWNPOUR and EASGD/EAMSGD algorithms have been presented. Experimental results show that the distributed framework based on the message passing interface can accelerate the convergence speed at a sublinear scale. This paper demonstrates the importance of distributed training. For example, with 48 workers, a 24x speedup is achievable for the answer selection task and running time is decreased from 138.2 hours to 5.81 hours, which will increase the productivity significantly.

【Keywords】: deep learning; distributed training; question answering
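
The speedup quoted in the abstract above can be checked with simple arithmetic. The sketch below assumes that 138.2 hours is the single-worker baseline and 5.81 hours is the 48-worker run; the helper name is hypothetical.

```python
def speedup_and_efficiency(serial_hours, parallel_hours, workers):
    """Return (speedup, parallel efficiency) for a distributed run."""
    speedup = serial_hours / parallel_hours
    return speedup, speedup / workers

# Figures from the abstract: 138.2 h down to 5.81 h with 48 workers.
speedup, efficiency = speedup_and_efficiency(138.2, 5.81, 48)
```

Under that assumption the speedup is about 23.8x (rounded to 24x in the abstract), a parallel efficiency of just under 50%, which matches the "sublinear" characterization.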

301. Bus Routes Design and Optimization via Taxi Data Analytics.

【Paper Link】【Pages】:2417-2420

【Authors】: Seong Ping Chuah ; Huayu Wu ; Yu Lu ; Liang Yu ; Stéphane Bressan

【Abstract】: Public bus services are often planned in the context of urban planning. For a city with an efficient and extensive public transportation network like Singapore, enhancing the existing coverage of bus service to meet the dynamic mobility needs of the population requires a data mining approach. Specifically, frequent taxi rides between two locations during a given period may suggest poor coverage by public transport, if not a complete lack of it. In this paper, we describe a proof-of-concept effort to discover this weakness in the public transportation system, and to improve on it, by mining a taxi ride dataset. We cluster the taxi ride dataset to determine some popular taxi rides in Singapore. From the clustered taxi rides, we filter and select only the clusters for which commuting via existing public transport is tortuous, if not impossible door-to-door. Based on the discovered travel patterns, we propose new bus routes that serve the passengers of these clusters. We formulate the bus planning problem as an optimization of a directed cycle graph, and present its preliminary solution and results. We showcase our idea in the case of Singapore.

【Keywords】: bus route design & optimization; clustering; taxi rides

302. Routing an Autonomous Taxi with Reinforcement Learning.

【Paper Link】【Pages】:2421-2424

【Authors】: Miyoung Han ; Pierre Senellart ; Stéphane Bressan ; Huayu Wu

【Abstract】: Singapore's vision of a Smart Nation encompasses the development of effective and efficient means of transportation. The government's target is to leverage new technologies to create services for a demand-driven intelligent transportation model including personal vehicles, public transport, and taxis. Singapore's government is strongly encouraging and supporting research and development of technologies for autonomous vehicles in general and autonomous taxis in particular. The design and implementation of intelligent routing algorithms is one of the keys to the deployment of autonomous taxis. In this paper we demonstrate that a reinforcement learning algorithm of the Q-learning family, based on a customized exploration and exploitation strategy, is able to learn optimal actions for routing autonomous taxis in a real scenario at the scale of the city of Singapore, with pick-up and drop-off events for a fleet of one thousand taxis.

【Keywords】: exploration; reinforcement learning
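
To make the Q-learning idea in the abstract above concrete, here is a minimal tabular sketch on a six-node toy road graph. The graph, the reward values, and the plain epsilon-greedy exploration are all assumptions made for illustration; the paper's customized exploration and exploitation strategy is not described in this listing and is not reproduced here.

```python
import random

# Toy road network (hypothetical): node -> reachable neighbour nodes.
ROADS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D", "E"],
    "D": ["B", "C", "F"],
    "E": ["C", "F"],
    "F": [],  # drop-off point, terminal
}

def train(goal="F", episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning; an action is the neighbour node to drive to."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s, nbrs in ROADS.items() for a in nbrs}
    starts = [s for s in ROADS if s != goal]
    for _ in range(episodes):
        state = rng.choice(starts)
        for _ in range(50):  # cap episode length
            actions = ROADS[state]
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            nxt = action
            reward = 10.0 if nxt == goal else -1.0
            future = max((q[(nxt, a)] for a in ROADS[nxt]), default=0.0)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = nxt
            if state == goal:
                break
    return q

def greedy_route(q, start, goal="F", max_hops=10):
    """Follow the highest-valued action from each node until the goal."""
    route = [start]
    while route[-1] != goal and len(route) <= max_hops:
        state = route[-1]
        route.append(max(ROADS[state], key=lambda a: q[(state, a)]))
    return route

q = train()
route = greedy_route(q, "A")
```

On this toy graph every shortest route from A to F has three hops, so the learned greedy route should have four nodes. The per-move penalty of -1 is what pushes the learner toward short routes.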

Industry Track Demo Papers 19

303. XKnowSearch!: Exploiting Knowledge Bases for Entity-based Cross-lingual Information Retrieval.

【Paper Link】【Pages】:2425-2428

【Authors】: Lei Zhang ; Michael Färber ; Achim Rettinger

【Abstract】: In recent years, the amount of entities in large knowledge bases available on the Web has been increasing rapidly, making it possible to propose new ways of intelligent information access. Within the context of globalization, there is a clear need for techniques and systems that can enable multilingual and cross-lingual information access. In this paper, we present XKnowSearch!, a novel entity-based system for multilingual and cross-lingual information retrieval, which supports keyword search and also allows users to influence the search process according to their search intents. By leveraging the multilingual knowledge base on the Web, keyword queries and documents can be represented in their semantic forms, which can facilitate query disambiguation and expansion, and can also overcome the language barrier between queries and documents in different languages.

【Keywords】: cross-lingual; entity-based; information retrieval; knowledge bases

304. TweetSift: Tweet Topic Classification Based on Entity Knowledge Base and Topic Enhanced Word Embedding.

【Paper Link】【Pages】:2429-2432

【Authors】: Quanzhi Li ; Sameena Shah ; Xiaomo Liu ; Armineh Nourbakhsh ; Rui Fang

【Abstract】: Classifying tweets into topic categories is necessary and important for many applications, since tweets are about a variety of topics and users are only interested in certain topical areas. Many tweet classification approaches fail to achieve high accuracy due to the data sparseness issue. A tweet, as a special type of short text, has, in addition to its text, other metadata that can be used to enrich its context, such as user name, mention, hashtag and embedded link. In this demonstration, we present TweetSift, an efficient and effective real-time tweet topic classifier. TweetSift exploits external tweet-specific entity knowledge to provide more topical context for a tweet, and integrates it with topic-enhanced word embeddings for topic classification. The demonstration will show how TweetSift works and how it is incorporated into our social media event detection system.

【Keywords】: entity knowledge base; topic enhanced word embedding; tweet topic classification; twitter

305. PARC: Privacy-Aware Data Cleaning.

【Paper Link】【Pages】:2433-2436

【Authors】: Dejun Huang ; Dhruv Gairola ; Yu Huang ; Zheng Zheng ; Fei Chiang

【Abstract】: Poor data quality has become a persistent challenge for organizations as data continues to grow in complexity and size. Existing data cleaning solutions focus on identifying repairs to the data to minimize either a cost function or the number of updates. These techniques, however, fail to consider underlying data privacy requirements that exist in many real data sets containing sensitive and personal information. In this demonstration, we present PARC, a Privacy-AwaRe data Cleaning system that corrects data inconsistencies w.r.t. a set of FDs, and limits the disclosure of sensitive values during the cleaning process. The system core contains modules that evaluate three key metrics during the repair search, and solves a multi-objective optimization problem to identify repairs that balance the privacy vs. utility tradeoff. This demonstration will enable users to understand: (1) the characteristics of a privacy-preserving data repair; (2) how to customize data cleaning and data privacy requirements using two real datasets; and (3) the distinctions among the repair recommendations via visualization summaries.

【Keywords】: constraint based cleaning; data quality; information disclosure

306. Ease the Process of Machine Learning with Dataflow.

【Paper Link】【Pages】:2437-2440

【Authors】: Tianyou Guo ; Jun Xu ; Xiaohui Yan ; Jianpeng Hou ; Ping Li ; Zhaohui Li ; Jiafeng Guo ; Xueqi Cheng

【Abstract】: Machine learning algorithms have become key components in many big data applications. However, the full potential of machine learning is still far from being realized because using machine learning algorithms is hard, especially on distributed platforms such as Hadoop and Spark. The key barriers come not only from the implementation of the algorithms themselves, but also from the process of applying them to real applications, which often involves multiple steps and different algorithms. In this demo we present a general-purpose dataflow-based system for easing the process of applying machine learning algorithms to real-world tasks. In the system, a learning task is formulated as a directed acyclic graph (DAG) in which each node represents an operation (e.g., a machine learning algorithm), and each edge represents the flow of the data from one node to its descendants. A graphical user interface is implemented to let users create, configure, submit, and monitor a task in a drag-and-drop manner. Advantages of the system include 1) lowering the barriers of defining and executing machine learning tasks; 2) sharing and re-using the implementations of the algorithms, the task dataflow DAGs, and the (intermediate) experimental results; 3) seamlessly integrating the stand-alone algorithms as well as the distributed algorithms in one task. The system has been deployed as a machine learning service and can be accessed from the Internet.

【Keywords】: dataflow; directed acyclic graph; machine learning process
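
The DAG formulation in the abstract above (nodes as operations, edges as data flow) can be sketched with Python's standard `graphlib` module. The `run_dataflow` helper and the four toy operations are hypothetical and only illustrate dependency-ordered execution; they are not the demonstrated system's API.

```python
from graphlib import TopologicalSorter

def run_dataflow(nodes, edges, source_input):
    """Run operations in dependency order.

    nodes: {name: function taking a list of upstream outputs}
    edges: list of (upstream, downstream) pairs
    """
    preds = {name: [] for name in nodes}
    for up, down in edges:
        preds[down].append(up)
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        # Nodes without predecessors receive the task's source input.
        inputs = [results[p] for p in preds[name]] or [source_input]
        results[name] = nodes[name](inputs)
    return results

# Toy task: load data, then scale and average it, then combine both.
nodes = {
    "load": lambda ins: ins[0],
    "scale": lambda ins: [x / max(ins[0]) for x in ins[0]],
    "mean": lambda ins: sum(ins[0]) / len(ins[0]),
    "report": lambda ins: {"scaled": ins[0], "mean": ins[1]},
}
edges = [("load", "scale"), ("load", "mean"), ("scale", "report"), ("mean", "report")]
results = run_dataflow(nodes, edges, [2.0, 4.0, 8.0])
```

Because intermediate outputs are kept in `results`, a node's output can be inspected or reused after the run, which loosely mirrors the sharing of intermediate results mentioned in the abstract.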

307. FIN10K: A Web-based Information System for Financial Report Analysis and Visualization.

【Paper Link】【Pages】:2441-2444

【Authors】: Yu-Wen Liu ; Liang-Chih Liu ; Chuan-Ju Wang ; Ming-Feng Tsai

【Abstract】: In this demonstration, we present FIN10K, a web-based information system that facilitates the analysis of textual information in financial reports. The proposed system has three main components: (1) a 10-K Corpus, including an inverted index of financial reports on Form 10-K, several numerical finance measures, and pre-trained word embeddings; (2) an information retrieval system; and (3) two data visualizations of the analyzed results. The system can be of great help in revealing valuable insights within large amounts of textual information. The system is now available online at http://clip.csie.org/10K/.

【Keywords】: data visualization; text mining; web-based system

308. FeatureMiner: A Tool for Interactive Feature Selection.

【Paper Link】【Pages】:2445-2448

【Authors】: Kewei Cheng ; Jundong Li ; Huan Liu

【Abstract】: The recent popularity of big data has brought immense quantities of high-dimensional data, which presents challenges to traditional data mining tasks due to the curse of dimensionality. Feature selection has been shown to be effective in preparing such high-dimensional data for a variety of learning tasks. To provide easy access to feature selection algorithms, we provide an interactive feature selection tool, FeatureMiner, based on our recently released feature selection repository scikit-feature. FeatureMiner eases the process of performing feature selection for practitioners by providing an interactive user interface. Meanwhile, it also gives users some practical guidance in finding a suitable feature selection algorithm among many, given a specific dataset. In this demonstration, we show (1) how to conduct data preprocessing after loading a dataset; (2) how to apply feature selection algorithms; and (3) how to choose a suitable algorithm through visualized performance evaluation.

【Keywords】: data mining; feature selection; interactive user interface

309. Deola: A System for Linking Author Entities in Web Document with DBLP.

【Paper Link】【Pages】:2449-2452

【Authors】: Yinan Liu ; Wei Shen ; Xiaojie Yuan

【Abstract】: In this paper, we present Deola, an Online system for Author Entity Linking with DBLP. Unlike most existing entity linking systems which focus on linking entities with Wikipedia and depend largely on the special features associated with Wikipedia (e.g., Wikipedia articles), Deola links author names appearing in the web document which belongs to the domain of computer science with their corresponding entities existing in the DBLP network. This task is helpful for the enrichment of the DBLP network and the understanding of the domain-specific document. This task is challenging due to name ambiguity and limited knowledge existing in DBLP. Given a fragment of domain-specific web document belonging to the domain of computer science, Deola can return the mapping entity in DBLP for each author name appearing in the input document.

【Keywords】: author name disambiguation; domain-specific entity linking; entity linking

310. ConHub: A Metadata Management System for Docker Containers.

【Paper Link】【Pages】:2453-2455

【Authors】: Chris Xing Tian ; Aditya Pan ; Yong Chiang Tay

【Abstract】: For many years now, enterprises and cloud providers have been using virtualization to run their workloads. Until recently, this means running an application in a virtual machine (hardware virtualization). However, virtual machines are increasingly replaced by containers (operating system virtualization), as evidenced by the rapid rise of Docker. A containerized software environment can generate a large amount of metadata. If properly managed, these metadata can greatly facilitate the management of containers themselves. This demonstration introduces ConHub, a PostgreSQL-based container metadata management system. Visitors will see that (1) ConHub has a language CQL that supports Docker commands; (2) it has a user-friendly interface for querying and visualizing container relationships; and (3) they can use CQL to formulate sophisticated queries to facilitate container management.

【Keywords】: OS virtualization; container metadata; relational database

311. BIGtensor: Mining Billion-Scale Tensor Made Easy.

【Paper Link】【Pages】:2457-2460

【Authors】: Namyong Park ; ByungSoo Jeon ; Jungwoo Lee ; U. Kang

【Abstract】: Many real-world data are naturally represented as tensors, or multi-dimensional arrays. Tensor decomposition is an important tool to analyze tensors for various applications such as latent concept discovery, trend analysis, clustering, and anomaly detection. However, existing tools for tensor analysis either do not scale well to billion-scale tensors or offer limited functionality. In this paper, we propose BIGtensor, a large-scale tensor mining library that tackles both of the above problems. Carefully designed for scalability, BIGtensor decomposes at least 100× larger tensors than the current state of the art. Furthermore, BIGtensor provides a variety of distributed tensor operations and tensor generation methods. We demonstrate how BIGtensor can help users discover hidden concepts and analyze trends from large-scale tensors that are hard to process with existing tensor tools.

【Keywords】: distributed computing; tensor; tensor decompositions

312. eGraphSearch: Effective Keyword Search in Graphs.

【Paper Link】【Pages】:2461-2464

【Authors】: Mehdi Kargar ; Lukasz Golab ; Jaroslaw Szlichta

【Abstract】: In a node-labeled graph, keyword search finds subtrees of the graph whose nodes contain all of the query keywords. This provides a way to query graph databases that neither requires mastery of a query language such as SPARQL, nor a deep knowledge of the database schema. We demonstrate eGraphSearch, a new system for effective keyword search in graph databases. Previous work ranks answer trees using combinations of structural and content-based metrics, such as path length between keywords or relevance of the labels in the answer tree to the query keywords. However, different nodes in the graph might have different importance, which affects the utility of the answer. In the proposed system, we implemented two new ways to rank keyword search results over graphs: the first one takes node importance into account while the second one is a bi-objective optimization of edge weights and node importance. In the demonstration, participants will execute keyword queries against several popular graph datasets.

【Keywords】: graphs databases; keyword search

313. EnerQuery: Energy-Aware Query Processing.

【Paper Link】【Pages】:2465-2468

【Authors】: Amine Roukh ; Ladjel Bellatreche ; Carlos Ordonez

【Abstract】: Energy consumption is increasingly important in large-scale query processing. This problem requires revisiting traditional query processing in actual DBMSs to identify the potential for energy saving, and to study the trade-offs between energy consumption and performance. In this paper, we propose EnerQuery, a tool built on top of a traditional DBMS to capitalize on the efforts invested in building energy-aware query optimizers, which have the lion's share in energy consumption. Energy consumption is estimated on all query plan steps and integrated into a mathematical linear cost model used to select the best query plans. To increase end users' energy awareness, EnerQuery features a diagnostic GUI to visualize energy consumption per step and its savings when tuning key parameters during query execution.

【Keywords】: database design; energy efficiency; query processing

314. TGraph: A Temporal Graph Data Management System.

【Paper Link】【Pages】:2469-2472

【Authors】: Haixing Huang ; Jinghe Song ; Xuelian Lin ; Shuai Ma ; Jinpeng Huai

【Abstract】: Temporal graphs are a class of graphs whose nodes and edges, together with the associated properties, continuously change over time. Recently, systems have been developed to support snapshot queries over temporal graphs. However, these systems barely support aggregate time range queries. Moreover, these systems cannot guarantee ACID transactions, an important feature for data management systems as long as concurrent processing is involved. To solve these issues, we design and develop TGraph, a temporal graph data management system, that assures the ACID transaction feature, and supports fast temporal graph queries.

【Keywords】: data management; graph databases; temporal graphs

315. Analyzing Data Relevance and Access Patterns of Live Production Database Systems.

【Paper Link】【Pages】:2473-2475

【Authors】: Martin Boissier ; Carsten Alexander Meyer ; Timo Djürken ; Jan Lindemann ; Kathrin Mao ; Pascal Reinhardt ; Tim Specht ; Tim Zimmermann ; Matthias Uflacker

【Abstract】: Access to real-world database systems and their workloads is an invaluable source of information for database researchers. However, usually such full access is not possible due to tracing overheads, data protection, or legal reasons. In this paper, we present a tool set to analyze and compare synthetic and real-world database workloads, their characteristics, and access patterns. This tool set processes SQL workload traces and collects fine-grained access information without requiring direct read access to the production system. To gain insights into large real-world systems, we traced a live production enterprise system of a Global 2000 company and compare it with the synthetic benchmarks TPC-C and TPC-E.

【Keywords】: ERP; OLXP; access patterns; data relevance; data smartist; data tiering; mixed workloads; production systems

316. Thymeflow, A Personal Knowledge Base with Spatio-temporal Data.

【Paper Link】【Pages】:2477-2480

【Authors】: David Montoya ; Thomas Pellissier Tanon ; Serge Abiteboul ; Fabian M. Suchanek

【Abstract】: The typical Internet user has data spread over several devices and across several online systems. We demonstrate an open-source system for integrating user's data from different sources into a single Knowledge Base. Our system integrates data of different kinds into a coherent whole, starting with email messages, calendar, contacts, and location history. It is able to detect event periods in the user's location data and align them with calendar events. We will demonstrate how to query the system within and across different dimensions, and perform analytics over emails, events, and locations.

【Keywords】: data integration; open-source; personal information; querying

317. Inferring Traffic Incident Start Time with Loop Sensor Data.

【Paper Link】【Pages】:2481-2484

【Authors】: Mingxuan Yue ; Liyue Fan ; Cyrus Shahabi

【Abstract】: Traffic incidents and their impacts have been largely studied to improve road safety and to reduce incurred life and economic losses. However, the inaccuracy of incident data collected from transportation agencies, especially the start time, poses a great challenge to traffic incident research. We present INFIT, a system that infers the incident start time utilizing traffic data collected by loop sensors. The core of INFIT is IIG, our newly developed inference algorithm. The key idea is that IIG considers the traffic speed at multiple upstream locations, to mitigate the randomness in traffic data and to distinguish among multiple impact factors. INFIT includes an interactive interface with real-world incident datasets. We demonstrate INFIT with three exploratory use cases and show the usefulness of our inference algorithms.

【Keywords】: impact propagation; traffic data; traffic incidents

318. TEAMOPT: Interactive Team Optimization in Big Networks.

【Paper Link】【Pages】:2485-2487

【Authors】: Liangyue Li ; Hanghang Tong ; Nan Cao ; Kate Ehrlich ; Yu-Ru Lin ; Norbou Buchler

【Abstract】: The science of team science is a rapidly emerging research field that studies strategies to understand and enhance the process and outcomes of collaborative, team-based research. An interesting research question we address in this work is how to maintain and optimize the team performance should certain changes happen to the team. In particular, we take the network approach to understanding the teams and consider optimizing the teams with several operations (e.g., replacement, expansion, shrinkage). We develop TEAMOPT, a system to assist users in optimizing the team performance interactively to support the changes to a team. TEAMOPT takes as input a large network of individuals (e.g., co-author network of researchers) and is able to assist users in assembling a team with specific requirements and optimizing the team in response to the changes made to the team. It is effective in finding the best candidates, and interactive with users' feedback in the loop. The system is developed using HTML5, JavaScript, D3.js (front-end) and Python CGI (back-end). A prototype system is already deployed. We will invite the audience to experiment with our TEAMOPT in terms of its effectiveness, efficiency and applicability to various scenarios.

【Keywords】: big networks; graph kernel; team optimization

319. GStreamMiner: A GPU-accelerated Data Stream Mining Framework.

【Paper Link】【Pages】:2489-2492

【Authors】: Chandima Hewa Nadungodage ; Yuni Xia ; John Jaehwan Lee

【Abstract】: Due to the continuous, unbounded, and dynamic characteristics of streaming data, mining data streams is a very challenging task. When analyzing online data streams, it is necessary to produce accurate results in a very short amount of time. The parallel processing power of Graphics Processing Units (GPUs) can be used to accelerate the processing and produce results in a timely manner. In this paper, we present GStreamMiner, a GPU-accelerated data stream mining framework, and demonstrate its application using outlier detection over continuous streaming data as a case study. The demo software provides a visual interface which is continuously updated with new results as the data stream progresses. It also lets users compare the performance of the GPU and CPU versions of the outlier detection algorithm.

【Keywords】: GPU; data stream mining; outlier detection

320. QART: A Tool for Quality Assurance in Real-Time in Contact Centers.

【Paper Link】【Pages】:2493-2496

【Authors】: Ragunathan Mariappan ; Balaji Peddamuthu ; Preethi R. Raajaratnam ; Sandipan Dandapat ; Neeta Pande ; Shourya Roy

【Abstract】: In this paper, we describe QART (pronounced "cart"), an automatic real-time quality assurance system for contact center chats. QART performs multi-faceted analysis on dialogue utterances, as they happen, using sophisticated statistical and rule-based natural language processing (NLP) techniques. It covers various aspects inspired by today's Quality Assurance and Customer Satisfaction Scoring (C-Sat) practices, and also introduces novel components such as an incremental dialogue summarization capability. The QART front end is an interactive dashboard providing views of ongoing dialogues at different granularities, enabling contact center supervisors to monitor and take corrective actions as needed. It is developed on the state-of-the-art stream computing platform Apache Spark Streaming, with an HBase datastore and a Python Flask front end.

【Keywords】: contact center automation; customer behavior; dialogue summarization; real-time quality assurance

321. A Fatigue Strength Predictor for Steels Using Ensemble Data Mining: Steel Fatigue Strength Predictor.

【Paper Link】【Pages】:2497-2500

【Authors】: Ankit Agrawal ; Alok N. Choudhary

【Abstract】: Fatigue strength is one of the most important mechanical properties of steel. The high cost and time of fatigue testing, and the potentially disastrous consequences of fatigue failures, motivate the development of predictive models for this property. We have developed advanced data-driven ensemble predictive models for this purpose with an extremely high cross-validated accuracy of >98%, and have deployed these models in a user-friendly online web-tool, which can make very fast predictions of fatigue strength for a given steel represented by its composition and processing information. Such a tool with fast and accurate models is expected to be a very useful resource for materials science researchers and practitioners to assist in their search for new and improved quality steels. The web-tool is available at http://info.eecs.northwestern.edu/SteelFatigueStrengthPredictor

【Keywords】: ensemble learning; fatigue strength; materials informatics; steels; supervised learning

Workshops 6

322. CyberSafety 2016: The First International Workshop on Computational Methods in CyberSafety.

【Paper Link】【Pages】:2501-2502

【Authors】: Shivakant Mishra ; Qin Lv ; Richard Han ; Jeremy Blackburn

【Abstract】: The theme of cybersafety is an important emerging research topic on the Internet that manifests itself daily as users navigate the Web and networked applications. Examples of cybersafety issues include cyberbullying, cyberthreats, recruiting minors via Internet services for nefarious purposes, using deceptive means to dupe vulnerable populations, exhibiting misbehaving behaviors such as using profanity or flashing in online video chats, and many others. These issues have a direct negative impact on the social, psychological and in some cases physical well-being of the end users. An important characteristic of these issues is that they fall in a grey legal area, where perpetrators may claim freedom of speech or rights to free expression despite causing harm. The main goal of this inaugural workshop on cybersafety is to bring together the researchers and practitioners from academia, industry, government and research labs working in the area of cybersafety to discuss the unique challenges in addressing various cybersafety issues and to share experiences, solutions, tools, and techniques. The focus is on the detection, prevention and mitigation of various cybersafety issues, as well as education and promoting safe practices. Topics of interest include but are not limited to the following: Cyberbullying in social media, Cyberthreats, coercion, and undue social pressure, Misbehaving users in online video chat services, Trolls in chat rooms, discussion boards and other social media, Deception to shape opinion, such as spinning, Deceptive techniques targeted at vulnerable populations such as the elderly and K-12 minors, Bad actors in social media, Online exposure of inappropriate material to minors, Education and promoting safe practices, and Remedies for preventing or thwarting cybersafety issues.

【Keywords】: bad actors; cyberbullying; cybersafety; cyberthreats; misbehaving users; trolls

【Paper Link】【Pages】:2503-2504

【Authors】: Carlos Castillo ; Fernando Diaz ; Yu-Ru Lin ; Jie Yin

【Abstract】: The proliferation of social media platforms together with the wide adoption of smartphone devices has transformed how we communicate and share news. During large-scale emergencies, such as natural disasters or armed attacks, victims, responders, and volunteers increasingly use social media to post situation updates and to request and offer help. The use of social media for emergency and disaster response has been a prominent application of information and knowledge management techniques in recent years. There are a number of challenges associated with near real-time processing of vast volumes of information in a way that makes sense for people directly affected, for volunteer organizations, and for official emergency response agencies. As massive amounts of messages posted by users are transformed into semi-structured records via information extraction and natural language processing techniques, there is a growing need for developing advanced techniques to aggregate this large-scale data to gain an understanding of the "big picture" of an emergency, and to detect and predict how a disaster could develop. This workshop seeks to provide a platform for the exchange of ideas, identification of important problems, and discovery of possible synergies. It will enable interesting discussions and encourage collaboration between various disciplines, with information and knowledge management approaches at the core of the workshop.

【Keywords】: disaster relief; emergency management; social media

324. BigNet 2016: First Workshop on Big Network Analytics.

【Paper Link】【Pages】:2505-2506

【Authors】: Jie Tang ; Keke Cai ; Zhong Su ; Hanghang Tong ; Michalis Vazirgiannis ; Yang Yang

【Abstract】: The first ACM international workshop on big network analytics is held in Indianapolis, Indiana, USA on October 24, 2016 and co-located with the ACM 25th Conference on Information and Knowledge Management (CIKM). The main objective of the workshop is to provide a forum for presenting the most recent advances in mining big networks to unearth rich knowledge. It is related to information retrieval, Web mining, social network analysis, and computational advertising. The anticipated outcome includes a fruitful discussion about the emerging challenges in this field, the development of novel theories for mining big networks, and motivating the interesting applications. The broader anticipated outcome includes: fostering future research directions, publishing high quality papers, attracting new researchers to this field, and concrete solutions to the existing problems.

【Keywords】: big network

325. DDTA 2016: The Workshop on Data-Driven Talent Acquisition.

【Paper Link】【Pages】:2507-2508

【Authors】: Yi Fang ; Maarten de Rijke ; Huangming Xie

【Abstract】: Expertise search is a well-established field in information retrieval. In recent years, the increasing availability of data enables accumulation of evidence of talent and expertise from a wide range of domains. The availability of big data significantly benefits employers and recruiters. By analyzing the massive amounts of structured and unstructured data, organizations may be able to find the exact skill sets and talent they need to grow their business. The aim of this workshop is to provide a forum for industry and academia to discuss the recent progress in talent search and management, and how the use of big data and data-driven decision making can advance talent acquisition and human resource management.

【Keywords】: data-driven; expertise retrieval; human resource management; talent acquisition

326. ACM DAVA'16: 2nd International Workshop on DAta mining meets Visual Analytics at Big Data Era.

【Paper Link】【Pages】:2509

【Authors】: Lei Shi ; Hanghang Tong ; Chaoli Wang ; Leman Akoglu

【Abstract】: The theme of this workshop is to bridge data mining and visual analytics for information and knowledge management. The topics include, but are not limited to, the following: Big data mining and visual analytics, theory and foundations -- Knowledge discovery with data mining and visual analytics technologies -- Fusion, mining and visualization of rich and heterogeneous data sources -- Security and privacy issues in data mining and visual analytics systems -- Information, social and biological graph mining and visualization -- Novel methods on visualization-oriented data mining -- Visual representations and interaction techniques of data mining results -- Data management and knowledge representation including scalable data representations -- Mathematical foundations and algorithms in data mining to allow interactive visual analysis -- Analytical reasoning including the human analytic, knowledge discovery, perception, and collaborative visual analytics -- Evaluation methods for data mining algorithms and visual analytics systems -- Applications of visual analytics and data mining techniques, including but not limited to applications in science, engineering, public safety, commerce, etc. The DAVA'16 workshop includes 3 invited keynote talks, 2 paper sessions and some posters. Authors of accepted oral papers give a 20-minute presentation on their papers. Three keynote speakers from both data mining and visualization give invited talks in this workshop (40 minutes each). The DAVA'16 organization committee selects one paper of the highest quality to receive the DAVA'16 best paper award and a cash award of $300. Extended versions of the selected papers will be recommended to the Chinese Journal of Electronics (SCI-indexed) or the International Journal of Software and Informatics (IJSI) for a special issue on visual analytics.

【Keywords】: data mining; visual analytics

327. DTMBIO 2016: The Tenth International Workshop on Data and Text Mining in Biomedical Informatics.

【Paper Link】【Pages】:2511-2512

【Authors】: Sangwoo Kim ; Jake Y. Chen ; Vincenzo Cutello ; Doheon Lee

【Abstract】: Started in 2006 as a specialized workshop in the field of text mining applied to biomedical informatics, DTMBIO (ACM international workshop on Data and Text Mining in Biomedical Informatics) has been held annually in conjunction with one of the largest data management conferences, CIKM, bringing together researchers working in the computer science and bioinformatics areas. The purpose of DTMBIO is to foster discussions regarding the state-of-the-art applications of data and text mining to biomedical research problems. DTMBIO 2016 will help scientists navigate emerging trends and opportunities in the evolving area of informatics-related techniques and problems in the context of biomedical research.

【Keywords】: algorithm; bioinformatics; biomedical informatics; data mining; management; text mining