SDM 2019:Calgary, Alberta, Canada

Proceedings of the 2019 SIAM International Conference on Data Mining, SDM 2019, Calgary, Alberta, Canada, May 2-4, 2019. SIAM 【DBLP Link】

Paper Num: 90 || Session Num: 0

1. Incremental Multi-view Support Vector Machine.

【Paper Link】【Pages】:1-9

【Authors】: Peng Zhou ; Yi-Dong Shen ; Liang Du ; Fan Ye

【Abstract】: Multi-view classification has received considerable attention in recent years. We observed that the existing multi-view classification methods learn a consensus result by collecting all views and thus have two critical limitations. First, it is not scalable. Second, in many applications views of data are available over time; it is in-feasible to apply the existing multi-view learning methods to such streaming views. To address the two limitations, in this paper we propose a novel incremental multi-view SVM method, i.e., instead of processing all views simultaneously, we integrate them one by one in an incremental way. We first learn an initial model from the first view; next when a new view is available, we update the model and then apply it to learn a new consensus result. This incremental method is scalable and applicable to streaming views. We present a block coordinate descent algorithm whose convergence is theoretically guaranteed to optimize the induced objective function. Experimental results on several benchmark data sets further demonstrate the effectiveness of our method.

【Keywords】:

2. Deep Multimodality Model for Multi-task Multi-view Learning.

【Paper Link】【Pages】:10-18

【Authors】: Lecheng Zheng ; Yu Cheng ; Jingrui He

【Abstract】: Many real-world problems exhibit the coexistence of multiple types of heterogeneity, such as view heterogeneity (i.e., multi-view property) and task heterogeneity (i.e., multi-task property). For example, in an image classification problem containing multiple poses of the same object, each pose can be considered as one view, and the detection of each type of object can be treated as one task. Furthermore, in some problems, the data type of multiple views might be different. In a web classification problem, for instance, we might be provided an image and text mixed data set, where the web pages are characterized by both images and texts. A common strategy to solve this kind of problem is to leverage the consistency of views and the relatedness of tasks to build the prediction model. In the context of deep neural network, multitask relatedness is usually realized by grouping tasks at each layer, while multi-view consistency is usually enforced by finding the maximal correlation coefficient between views. However, there is no existing deep learning algorithm that jointly models task and view dual heterogeneity, particularly for a data set with multiple modalities (text and image mixed data set or text and video mixed data set, etc.). In this paper, we bridge this gap by proposing a deep multi-task multi-view learning framework that learns a deep representation for such dual-heterogeneity problems. Empirical studies on multiple real-world data sets demonstrate the effectiveness of our proposed Deep-MTMV algorithm.

【Keywords】:

3. Augmented Multi-Task Learning by Optimal Transport.

【Paper Link】【Pages】:19-27

【Authors】: Boyang Liu ; Pang-Ning Tan ; Jiayu Zhou

【Abstract】: Multi-task learning (MTL) provides an effective approach to improve generalization error for multiple related prediction tasks by learning the tasks jointly, assuming there is a common structure shared by their model parameters. Despite its successes, the shared parameter assumption is ineffective when the sample sizes for some tasks are too small to infer the task relationships correctly from data. To overcome this limitation, we propose a novel framework for increasing the effective sample size of each task by augmenting it with pseudo-labeled instances generated from the training data of other related tasks. Incorporating training data from other tasks is a challenge for regression problems as their data distributions may not be consistent due to the co-variate shift and response drift problems. Our proposed framework addresses this challenge by coupling multitask regression with a series of optimal transport steps to iteratively learn the pseudo-labeled instances by identifying relevant training instances from other source domains and refining the pseudo-labels until they are consistent with the training instances of the target domain. Experimental results on both synthetic and real-world data showed that our framework consistently outperformed other state-of-the-art MTL methods.

【Keywords】:

4. Calibrating Probability Estimation Trees using Venn-Abers Predictors.

【Paper Link】【Pages】:28-36

【Authors】: Ulf Johansson ; Tuwe Löfström ; Henrik Boström

【Abstract】: Class labels output by standard decision trees are not very useful for making informed decisions, e.g., when comparing the expected utility of various alternatives. In contrast, probability estimation trees (PETs) output class probability distributions rather than single class labels. It is well known that estimating class probabilities in PETs by relative frequencies often lead to extreme probability estimates, and a number of approaches to provide more well-calibrated estimates have been proposed. In this study, a recent model-agnostic calibration approach, called Venn-Abers predictors is, for the first time, considered in the context of decision trees. Results from a large-scale empirical investigation are presented, comparing the novel approach to previous calibration techniques with respect to several different performance metrics, targeting both predictive performance and reliability of the estimates. All approaches are considered both with and without Laplace correction. The results show that using Venn-Abers predictors for calibration is a highly competitive approach, significantly outperforming Platt scaling, Isotonic regression and no calibration, with respect to almost all performance metrics used, independently of whether Laplace correction is applied or not. The only exception is AUC, where using non-calibrated PETs together with Laplace correction, actually is the best option, which can be explained by the fact that AUC is not affected by the absolute, but only relative, values of the probability estimates.

【Keywords】:

5. Deep Multi-view Information Bottleneck.

【Paper Link】【Pages】:37-45

【Authors】: Qi Wang ; Claire Boudreau ; Qixing Luo ; Pang-Ning Tan ; Jiayu Zhou

【Abstract】: In many classification problems, the predictions can be enhanced by fusing information from different data views. In particular, when the information from different views complement each other, it is expected that multi-view learning will lead to improved predictive performance. In this paper, we proposed a supervised multi-view learning framework based on the information bottleneck principle to filter out irrelevant and noisy information from multiple views and learn an accurate joint representation. Specifically, our proposed method maximizes the mutual information between the labels and the learned joint representation while minimizing the mutual information between the learned latent representation of each view and the original data representation. As the relationships between different views are often complicated and nonlinear, we employed deep neural networks to learn the latent representation and to disentangle their complex dependencies. However, since the computation of mutual information can be intractable, we employed the variational inference method to efficiently solve the optimization problem. We performed extensive experiments on various synthetic and real-world datasets to demonstrate the effectiveness of the framework.

【Keywords】:

6. Forest Packing: Fast Parallel, Decision Forests.

【Paper Link】【Pages】:46-54

【Authors】: James Browne ; Disa Mhembere ; Tyler M. Tomita ; Joshua T. Vogelstein ; Randal C. Burns

【Abstract】: Decision Forests are popular machine learning techniques that assist scientists to extract knowledge from massive data sets. This class of tool remains popular because of their interpretability and ease of use, unlike other modern machine learning methods, such as kernel machines and deep learning. Decision forests also scale well for use with large data because training and run time operations are trivially parallelizable allowing for high inference throughputs. A negative aspect of these forests, and an untenable property for many real time applications, is their high inference latency caused by the combination of large model sizes with random memory access patterns. We present memory packing techniques and a novel tree traversal method to overcome this deficiency. The result of our system is a grouping of trees into a hierarchical structure. At low levels, we pack the nodes of multiple trees into contiguous memory blocks so that each memory access fetches data for multiple trees. At higher levels, we use leaf cardinality to identify the most popular paths through a tree and collocate those paths in contiguous cache lines. We extend this layout with a re-ordering of the tree traversal algorithm to take advantage of the increased memory throughput provided by out-of-order execution and cache-line prefetching. Together, these optimizations increase the performance and parallel scalability of classification in ensembles by a factor of ten over an optimized C++ implementation and a popular R-language implementation.

【Keywords】:

7. Fast Unsupervised Location Category Inference from Highly Inaccurate Mobility Data.

【Paper Link】【Pages】:55-63

【Authors】: Jinfeng Yi ; Qi Lei ; Wesley M. Gifford ; Ji Liu ; Junchi Yan ; Bowen Zhou

【Abstract】: Understanding a mobile user's behavior, e.g., to infer if she is exercising in a gym or dining in a restaurant, is the key to a variety of applications. However, in many real-world scenarios, precisely determining user visitation is extremely challenging due to the uncertainty present in mobile location updates, where errors can be hundreds of meters or even more. We consider the location uncertainty circle determined by the reported location coordinates as the center and the associated location error as the radius. Such a location uncertainty circle is likely to cover multiple location categories, especially in densely populated areas. Worse still, in many cases, mobile users are anonymous, and we have no access to their personal information or other labeled data, which compels us to develop an unsupervised learning approach to solve this problem. Using a user-time-location category tensor, we capture the user behavior and propose a novel tensor factorization framework to accurately infer the location categories visited by mobile users. This framework leverages several key observations including the negative-unlabeled nature of the data and the intrinsic correlations between users. Also, the proposed algorithm can predict where users are even in the absence of location information. To efficiently solve the proposed framework, we propose a parameter-free and scalable optimization algorithm by effectively exploring the sparse and low-rank structure of the tensor. Our empirical studies show that the proposed algorithm is both effective and scalable: it can solve problems with millions of users and billions of location updates, and also provide superior prediction accuracies on real-world location update and check-in datasets.

【Keywords】:

【Paper Link】【Pages】:64-72

【Authors】: Sha Li ; Chao Zhang ; Dongming Lei ; Ji Li ; Jiawei Han

【Abstract】: Recent studies have demonstrated inspiring success in leveraging geo-tagged social media data for applications such as event detection, location recommendation and mobile healthcare. However, in most real-life social media streams, only a small percentage of data have explicit geo-location metadata, which hinders the power of social media from being fully unleashed. We study the problem of inferring geo-locations from social media messages. While a number of text-based geo-locating techniques have been proposed, they either fall short of automatically identifying indicative keywords from noisy social media posts or do not integrate rich prior knowledge of geological regions. We propose an attentive memory network called GeoAttn for localization of social media messages. To capture indicative keywords for location inference, GeoAttn consists of an attentive message encoder, which selectively focuses on location-indicative terms to derive a discriminative message representation. The message embedding is then fed into a memory network, which selectively attends to relevant Points-of-Interest (POIs) for location prediction. The message encoder and key-value memory network are jointly trained in an end-to-end manner. The attention mechanisms in GeoAttn not only alleviate noisy information for higher prediction accuracy, but also provide interpretable attention scores that rationalize the predictions. Our experiments on a million-scale geo-tagged tweet dataset show that GeoAttn outperforms previous state-of-the-art location prediction methods by 15.5% in mean error distance, and is capable of locating over half of the tweets within 5km.

【Keywords】:

9. Bus Travel Speed Prediction using Attention Network of Heterogeneous Correlation Features.

【Paper Link】【Pages】:73-81

【Authors】: Yidan Sun ; Guiyuan Jiang ; Siew-Kei Lam ; Shicheng Chen ; Peilan He

【Abstract】: Accurate bus travel speed prediction can lead to improved urban mobility by enabling passengers to reliably plan their trips in advance and traffic administrators to manage the bus operations more effectively. However, the increasing complexity of public transportation networks pose a significant challenge to existing prediction methods as the bus operations are affected by numerous factors such as varying traffic conditions, tight bus operation schedules, wide-ranging travel demands, frequent accelerations/decelerations at bus stops, delays at intersections, etc. This paper aims to achieve accurate bus speed prediction by identifying important intrinsic and extrinsic features that impact the bus speed, and their significance in specific situations. We propose to jointly incorporate multiple feature components that provide discriminating information to train the prediction model by exploring the spatial correlation, temporal correlation, as well as contextual information (e.g. road characteristics and weather conditions). In particular, we introduce an attribute-driven attention network model to integrate the feature components, which considers the heterogeneous influence of different feature components on bus speed and dynamically assigns weights to the learned latent features based on specific traffic situations. Extensive experiments using real bus travel data involving 42 bus services show that our proposed method outperforms six well-known methods.

【Keywords】:

10. A Nondeterministic Normalization based Scan Statistic (NN-scan) towards Robust Hotspot Detection: A Summary of Results.

【Paper Link】【Pages】:82-90

【Authors】: Yiqun Xie ; Shashi Shekhar

【Abstract】: Hotspot detection aims to find sub-regions of a space that have higher probability density of generating certain events (e.g., disease, crimes) than the other regions. Finding hotspots has important applications in many domains including public health, crime analysis, transportation, etc. Existing methods of hotspot detection rely on test statistics (e.g., likelihood ratio, density) that do not consider spatial nondeterminism, leading to false and missing detections. We provide theoretical insights into the limitations of related work, and propose a new framework, namely, Nondeterministic Normalization based scan statistic (NN-scan), to address the issues. We also propose a DynamIc Linear Approximation (DILA) algorithm to improve NN-scan's efficiency. In experiments, we show that NN-scan can significantly improve the precision and recall of hotspot detection and DILA can greatly reduce the computational cost.

【Keywords】:

11. A Fast Algorithm for Combinatorial Hotspot Mining Based on Spatial Scan Statistic.

【Paper Link】【Pages】:91-99

【Authors】: Shin-ichi Minato ; Jun Kawahara ; Fumio Ishioka ; Masahiro Mizuta ; Koji Kurihara

【Abstract】: It is a popular and classical problem to detect a hotspot cluster from a statistical data which is partitioned by geographical regions such as prefectures or cities. Spatial scan statistic is a standard measure of likelihood ratio which has been widely used for testing hotspot clusters. In this work, we propose a very fast algorithm to enumerate all combinatorial regions which are more significant than a given threshold value. Our algorithm features the fast exploration by pruning the search space based on the partial monotonicity of the spatial scan statistic. Experimental results for a nation-wide 47 prefectures dataset show that our method generates the highest-ranked hotspot cluster in a time a million or more times faster than the previous naive search method. Our method works practically for a dataset with several hundreds of regions, and it will drastically accelerate hotspot analysis in various fields.

【Keywords】:

12. A Deep Spatio-Temporal Fuzzy Neural Network for Passenger Demand Prediction.

【Paper Link】【Pages】:100-108

【Authors】: Xiaoyuan Liang ; Guiling Wang ; Martin Renqiang Min ; Yi Qi ; Zhu Han

【Abstract】: In spite of its importance, passenger demand prediction is a highly challenging problem, because the demand is simultaneously influenced by the complex interactions among many spatial and temporal factors and other external factors such as weather. To address this problem, we propose a Spatio-TEmporal Fuzzy neural Network (STEF-Net) to accurately predict passenger demands incorporating the complex interactions of all known important factors. We design an end-to-end learning framework with different neural networks modeling different factors. Specifically, we propose to capture spatio-temporal feature interactions via a convolutional long short-term memory network and model external factors via a fuzzy neural network that handles data uncertainty significantly better than deterministic methods. To keep the temporal relations when fusing two networks and emphasize discriminative spatio-temporal feature interactions, we employ a novel feature fusion method with a convolution operation and an attention layer. As far as we know, our work is the first to fuse a deep recurrent neural network and a fuzzy neural network to model complex spatial-temporal feature interactions with additional uncertain input features for predictive learning. Experiments on a large-scale real-world dataset show that our model achieves more than 10% improvement over the state-of-the-art approaches.

【Keywords】:

13. AMAS: Attention Model for Attributed Sequence Classification.

【Paper Link】【Pages】:109-117

【Authors】: Zhongfang Zhuang ; Xiangnan Kong ; Elke A. Rundensteiner

【Abstract】: Classification over sequential data is important for a wide range of applications from information retrieval, anomaly detection to genomic analysis. Neural network approaches, in particular recurrent neural networks, have been widely used in such tasks due to their strong capability of feature learning. However, recent innovations in sequence classification learn from not only the sequences but also the associated attributes, called attributed sequences. While recent work shows the attributed sequences to be useful in real-world applications, neural attention models have not yet been explored for attributed sequence classification. This paper is the first to study the problem of attributed sequence classification with the neural attention mechanism. This is challenging that now we need to assess the importance of each item in each sequence considering both the sequence itself and the associated metadata. We propose a framework, called AMAS, to classify attributed sequences using the information from the sequences, metadata, and the computed attention. Empirical results on real-world datasets demonstrate that the proposed AMAS framework significantly improves the performance of classification over the state-of-the-art methods on attributed sequences.

【Keywords】:

14. MM-Pred: A Deep Predictive Model for Multi-attribute Event Sequence.

【Paper Link】【Pages】:118-126

【Authors】: Li Lin ; Lijie Wen ; Jianmin Wang

【Abstract】: Event sequence prediction has wide applications on economics, electronic health and social media monitoring. Accurate prediction of event sequences can help provide better service to customers and prevent risks. Recent works try to address the problem aiming at learning the impact of past events on the future events using deep learning methods. Such works often take the past event sequences as input and model the self-change transformations of the events, and few of them concerned the effect of event attributes. We propose an RNN-based predictive model to encode multiple attributes as attached information of the event for predicting next event and its attributes given past sequences. To learn how important each attribute is for the event, we design a component modulator to customize weights for representations of the event and its attributes. The more important the information is, the relevant weight will be higher. Finally, the prediction of next event and its attributes are conducted simultaneously with a different modulator for each predictive task. The performance of the proposed model was evaluated on 5 real-life datasets, containing two different types of event logs. The results show that our model outperforms the baselines and the state-of-the-art, not only on the prediction of next event and its attributes but also the generation of event sequence suffix.

【Keywords】:

15. Constraint Programming for Association Rules.

【Paper Link】【Pages】:127-135

【Authors】: Mohamed-Bachir Belaid ; Christian Bessiere ; Nadjib Lazaar

【Abstract】: Discovering association rules among items in a dataset is one of the fundamental problems in data mining. It has recently been shown that constraint programming is a flexible way to tackle data mining tasks. In this paper we propose a declarative model based on constraint programming to capture association rules. Our model also allows us to specify any additional property and/or user's constraints on the kind of rules the user is looking for. To implement our model, we introduce a new global constraint, Confident, for ensuring the confidence of rules. We prove that completely propagating Confident is NP-hard. We thus provide a decomposition of Confident. In addition to user's constraints on the items composing body and head of the rules, we show that we can capture the popular minimal non-redundant property of association rules. An experimental analysis shows the practical effectiveness of our approach compared to existing approaches.

【Keywords】:

16. Linear Time Motif Discovery in Time Series.

【Paper Link】【Pages】:136-144

【Authors】: Xiaosheng Li ; Jessica Lin

【Abstract】: The discovery of motifs (repeated patterns) is an important task in time series data mining. The task can be formulated as finding the most similar non-overlapping pair of subsequences in a given time series. Existing exact motif discovery methods have quadratic time complexities in the length of the time series. In this work, we present an algorithm that can find the exact motif of a given time series in a linear expected time complexity. The algorithm is further modified to find all pairs of subsequences whose distances are below a given threshold value. In practice, if true motifs exist in the data or the threshold is set to a small value, the algorithms are very fast. The proof of correctness and time complexity are detailed and experiments are conducted to verify the analysis. We applied the proposed method to analyze the real-world bird sound and electrical consumption data to demonstrate its effectiveness.

【Keywords】:

17. Deep Skip-Gram Networks for Text Classification.

【Paper Link】【Pages】:145-153

【Authors】: Chaochun Liu ; Yaliang Li ; Hongliang Fei ; Ping Li

【Abstract】: Text classification is one of the indispensable tasks for natural language processing, which has many real applications. However, existing methods for text classification still cannot well effectively capture long-range and local-pattern features within texts due to the huge variation of text expression. Motivated by this, we propose skip-gram convolution to extract non-consecutive local n-gram patterns, which provide much more comprehensive information for varying text expressions, and help us to understand the human text better. We also employ the recurrent neural network to extract the long-range features from localized level to sequential and global level via the chain-like architecture. To demonstrate the effectiveness of our deep skip-gram networks, we conduct comprehensive experiments on eight large-scale datasets that are widely used for the text classification task. Experimental results show that our deep skip-gram networks can outperform most of competing state-of-the-art methods, especially significant on more complex and challenging datasets. Moreover, our model is very robust and can be generalized very well on different datasets, even without tuning the hyper-parameters for specific dataset.

【Keywords】:

18. Mining Rules Incrementally over Large Knowledge Bases.

【Paper Link】【Pages】:154-162

【Authors】: Xiaofeng Zhou ; Ali Sadeghian ; Daisy Zhe Wang

【Abstract】: Multiple web-scale Knowledge Bases, e.g., Freebase, YAGO, NELL, have been constructed using semi-supervised or unsupervised information extraction techniques and many of them, despite their large sizes, are continuously growing. Much research effort has been put into mining inference rules from knowledge bases. To address the task of rule mining over evolving web-scale knowledge bases, we propose a parallel incremental rule mining framework. Our approach is able to efficiently mine rules based on the relational model and apply updates to large knowledge bases; we propose an alternative metric that reduces computation complexity without compromising quality; we apply multiple optimization techniques that reduce runtime by more than 2 orders of magnitude. Experiments show that our approach efficiently scales to web-scale knowledge bases and saves over 90% time compared to the state-of-the-art batch rule mining system. We also apply our optimization techniques to the batch rule mining algorithm, reducing runtime by more than half compared to the state-of-the-art. To the best of our knowledge, our incremental rule mining system is the first that handles updates to web-scale knowledge bases.

【Keywords】:

19. Optimally rotated coordinate systems for adaptive least-squares regression on sparse grids.

【Paper Link】【Pages】:163-171

【Authors】: Bastian Bohn ; Michael Griebel ; Jens Oettershagen

【Abstract】: For low-dimensional data sets with a large amount of data points, standard kernel methods are usually not feasible for regression anymore. Besides simple linear models or involved heuristic deep learning models, grid-based discretizations of larger (kernel) model classes lead to algorithms, which naturally scale linearly in the amount of data points. For moderate-dimensional or high-dimensional regression tasks, these grid-based discretizations suffer from the curse of dimensionality. Here, sparse grid methods have proven to circumvent this problem to a large extent. In this context, space- and dimension-adaptive sparse grids, which can detect and exploit a given low effective dimensionality of nominally high-dimensional data, are particularly successful. They nevertheless rely on an axis-aligned structure of the solution and exhibit issues for data with predominantly skewed and rotated coordinates. In this paper we propose a preprocessing approach for these adaptive sparse grid algorithms that determines an optimized, problem-dependent coordinate system and, thus, reduces the effective dimensionality of a given data set in the ANOVA sense. We provide numerical examples on synthetic data as well as real-world data to show how an adaptive sparse grid least squares algorithm benefits from our preprocessing method.

【Keywords】:

20. ℓ0-Regularized Sparsity for Probabilistic Mixture Models.

【Paper Link】【Pages】:172-180

【Authors】: Dzung T. Phan ; Tsuyoshi Idé

【Abstract】: This paper revisits a classical task of learning probabilistic mixture models. Our major goal is to sparsely learn the mixture weights to automatically determine the right number of clusters. The key idea is to use a novel Bernoulli prior on the mixture weights in a Bayesian learning framework, and formalize the task of determining the mixture weights as an ℓ0-regularized optimization problem. By leveraging a specific mathematical structure, we derive a quadratic time algorithm for efficiently solving the non-convex ℓ0-based problem. In experiments, we evaluate the performance of our proposed approach over existing methods in recovery capability and anomaly detection for synthetic as well as real-world data sets.

【Keywords】:

21. Intrinsic Dimensionality Estimation within Tight Localities.

【Paper Link】【Pages】:181-189

【Authors】: Laurent Amsaleg ; Oussama Chelly ; Michael E. Houle ; Ken-ichi Kawarabayashi ; Milos Radovanovic ; Weeris Treeratanajaru

【Abstract】: Accurate estimation of Intrinsic Dimensionality (ID) is of crucial importance in many data mining and machine learning tasks, including dimensionality reduction, outlier detection, similarity search and subspace clustering. However, since their convergence generally requires sample sizes (that is, neighborhood sizes) on the order of hundreds of points, existing ID estimation methods may have only limited usefulness for applications in which the data consists of many natural groups of small size. In this paper, we propose a local ID estimation strategy stable even for ‘tight’ localities consisting of as few as 20 sample points. The estimator applies MLE techniques over all available pairwise distances among the members of the sample, based on a recent extreme-value-theoretic model of intrinsic dimensionality, the Local Intrinsic Dimension (LID). Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.

【Keywords】:

22. Identifying When Effect Restoration Will Improve Estimates of Causal Effect.

【Paper Link】【Pages】:190-198

【Authors】: Hüseyin Oktay ; Akanksha Atrey ; David D. Jensen

【Abstract】: Several methods have been developed that combine multiple models learned on different data sets and then use that combination to reach conclusions that would not have been possible with any one of the models alone. We examine one such method—effect restoration—which was originally developed to mitigate the effects of poorly measured confounding variables in a causal model. We show how effect restoration can be used to combine results from different machine learning models and how the combined model can be used to estimate causal effects that are not identifiable from either of the original studies alone. We characterize the performance of effect restoration by using both theoretical analysis and simulation studies. Specifically, we show how conditional independence tests and common assumptions can help distinguish when effect restoration should and should not be applied, and we use empirical analysis to show the limited range of conditions under which effect restoration should be applied in practical situations.

【Keywords】:

23. We Are Not Your Real Parents: Telling Causal from Confounded using MDL.

【Paper Link】【Pages】:199-207

【Authors】: David Kaltenpoth ; Jilles Vreeken

【Abstract】: Given data over variables (X1, …, Xm, Y) we consider the problem of finding out whether X jointly causes Y or whether they are all confounded by an unobserved latent variable Z. To do so, we take an information-theoretic approach based on Kolmogorov complexity. In a nutshell, we follow the postulate that first encoding the true cause, and then the effects given that cause, results in a shorter description than any other encoding of the observed variables. The ideal score is not computable, and hence we have to approximate it. We propose to do so using the Minimum Description Length (MDL) principle. We compare the MDL scores under the models where X causes Y and where there exists a latent variables Z confounding both X and Y and show our scores are consistent. To find potential confounders we propose using latent factor modeling, in particular, probabilistic PCA (PPCA). Empirical evaluation on both synthetic and real-world data shows that our method, CoCa, performs very well—even when the true generating process of the data is far from the assumptions made by the models we use. Moreover, it is robust as its accuracy goes hand in hand with its confidence.

【Keywords】:

24. Query-Aware Bayesian Committee Machine for Scalable Gaussian Process Regression.

【Paper Link】【Pages】:208-216

【Authors】: Jiayuan He ; Jianzhong Qi ; Kotagiri Ramamohanarao

【Abstract】: The Gaussian process (GP) model is a powerful tool for regression problems. However, the high computational costs of the GP model has constrained its applications over large-scale data sets. To overcome this limitation, aggregation models employ distributed GP submodels (experts) for parallel training and predicting, and then merge the predictions of all submodels to produce an approximated result. The state-of-the-art aggregation models are based on Bayesian committee machines, where a prior is assumed at the start and then updated by each submodel. In this paper, we investigate the impact of the prior on the accuracy of aggregations. We propose a query-aware Bayesian committee machine (QBCM). The QBCM model partitions the testing data (i.e., queries) into subsets, and incorporates a query-aware prior when merging the predictions of submodels. This model improves the prediction accuracy, while retaining the advantages of aggregation models, i.e., closed-form inference and parallelizability. We conduct both theoretical analysis and empirical experiments on real data. The results confirm the effectiveness and efficiency of the proposed model QBCM.

【Keywords】:

25. A Novel Regularizer for Temporally Stable Learning with an Application to Twitter Topic Classification.

【Paper Link】【Pages】:217-225

【Authors】: Yakun Wang ; Ga Wu ; Mohamed Reda Bouadjenek ; Scott Sanner ; Sen Su ; Zhongbao Zhang

【Abstract】: Supervised topic classifiers for Twitter and other media sources are important in a variety of long-term topic tracking tasks. Unfortunately, over long periods of time, features that are predictive during the training period may prove ephemeral and fail to generalize to prediction at future times. For example, if we trained a classifier to identify tweets concerning the topic of “Celebrity Death”, individual celebrity names and terms associated with these celebrities such as “Nelson Mandela” or “South Africa” would prove to be temporally unstable since they would not generalize over long periods of time; in contrast, terms like “RIP” (rest in peace) would prove to be temporally stable predictors of this topic over long periods of time. In this paper, we aim to design supervised learning methods for Twitter topic classifiers that are capable of automatically downweighting temporally unstable features to improve future generalization. To do this, we first begin with an oracular approach that chooses temporally stable features based on knowledge of both train and test data labels. We then search for feature metrics evaluated on only the training data that are capable of recovering the temporally stable features identified by our oracular definition. We next embed the top-performing metric as a temporal stability regularizer in logistic regression with the important property that the overall training objective retains convexity, hence enabling a globally optimal solution. Finally, we train our topic classifiers on 6 Twitter topics over roughly one year of data and evaluate on the following year of data, showing that logistic regression with our temporal stability regularizer generally outperforms logistic regression without such regularization across the full precision-recall continuum. Overall, these results establish a novel regularizer for training long-term temporally stable topic classifiers for Twitter and beyond.

【Keywords】:

26. Efficient Summarizing of Evolving Events from Twitter Streams.

【Paper Link】【Pages】:226-234

【Authors】: Thi-Huyen Nguyen ; Tuan-Anh Hoang ; Wolfgang Nejdl

【Abstract】: Twitter has been heavily used for users to report and share information about real-world events. However, understanding the multiple aspects of an event as it happens is a very challenging task due to the prevalent noise and redundant in tweets as well as the evolution of the event. In this paper, we present a graph-based method for summarizing evolutionary events from tweet streams. Unlike existing approaches that either require prior information, result in less readable summaries, or are not scalable, our proposed method can automatically extract sets of representative tweets as concise summaries for the events. Moreover, the method also allows the summaries to be updated efficiently using an incremental procedure, thus can scale up to large data streams. The experiments on five datasets reveal that our proposed method significantly outperforms several baselines.

【Keywords】:

【Paper Link】【Pages】:235-243

【Authors】: Lu Cheng ; Ruocheng Guo ; Yasin N. Silva ; Deborah L. Hall ; Huan Liu

【Abstract】: Cyberbullying has become one of the most pressing online risks for young people and has raised serious concerns in society. The emerging literature identifies cyberbullying as repetitive acts that occur over time rather than one-off incidents. Yet, there has been relatively little work to model the hierarchical structure of social media sessions and the temporal dynamics of cyberbullying in online social network sessions. We propose a hierarchical attention network for cyberbullying detection that takes these aspects of cyberbullying into account. The primary distinctive characteristics of our approach include: (i) a hierarchical structure that mirrors the structure of a social media session; (ii) levels of attention mechanisms applied at the word and comment level, thereby enabling the model to pay different amounts of attention to words and comments, depending on the context; and (iii) a cyberbullying detection task that also predicts the interval of time between two adjacent comments. These characteristics allow the model to exploit the commonalities and differences across these two tasks to improve the performance of cyberbullying detection. Experiments on a real-world dataset from Instagram, the social media platform on which the highest percentage of users have reported experiencing cyberbullying, reveal that the proposed architecture outperforms the state-of-the-art method.

【Keywords】:

28. SPMM: A Soft Piecewise Mapping Model for Bilingual Lexicon Induction.

【Paper Link】【Pages】:244-252

【Authors】: Yan Fan ; Chengyu Wang ; Boxing Chen ; Zhongkai Hu ; Xiaofeng He

【Abstract】: Bilingual Lexicon Induction (BLI) aims at inducing word translations in two distinct languages. The generated bilingual dictionaries via BLI are essential for cross-lingual NLP applications. Most existing methods assume that a mapping matrix can be learned to project the embedding of a word in the source language to that of a word in the target language which shares the same meaning. However, a single matrix may not be able to provide sufficiently large parameter space and to tailor to the semantics of words across different domains and topics due to the complicated nature of linguistic regularities. In this paper, we propose a Soft Piecewise Mapping Model (SPMM). It generates word alignments in two languages by learning multiple mapping matrices with orthogonal constraint. Each matrix encodes the embedding translation knowledge over a distribution of latent topics in the embedding spaces. Such learning problem can be formulated as an extended version of the Wahba's problem, with a closed-form solution derived. To address the limited size of training data for low-resourced languages and emerging domains, an iterative boosting method based on SPMM is used to augment training dictionaries. Experiments conducted on both general and domain-specific corpora show that SPMM is effective and outperforms previous methods.

【Keywords】:

29. Aspect-Based Sentiment Analysis with Minimal Guidance.

【Paper Link】【Pages】:253-261

【Authors】: Honglei Zhuang ; Timothy P. Hanratty ; Jiawei Har

【Abstract】: Aspect-based sentiment analysis is an important tool to understand user opinions in a fine-grained manner. Although extensively studied, developing such a tool for a specific domain remains an expensive process. Most existing methods either rely on massive labeled data for training or external language resource and tools which are not necessarily available or accurate. We propose to study the aspect-based sentiment analysis with only a small set of aspect and sentiment seed words as guidance on a target corpus. We first expand the aspect and sentiment lexicons from the given seed words by features created by frequent pattern mining. Then, we develop a generative model to characterize the aspect and sentiment mentions based on their word embedding, and infer the sentiment polarity for sentiment words accordingly. The effectiveness of our method is verified by experiments on two real world data sets.

【Keywords】:

【Paper Link】【Pages】:262-270

【Authors】: Liangzhe Chen ; B. Aditya Prakash

【Abstract】: Microblogging websites, like Twitter and Weibo, are used by billions of people to create and spread information. This activity depends on various factors such as the friendship links between users, their topic interests and social influence between them. Social influence can be thought of as a latent factor, that may alter users posting and linking behaviors. Making sense of these behaviors is very important for fully understanding and utilizing these platforms. Most prior work in this space either ignores the effect of social influence, or considers its effect only on link formation or post generation. In contrast, we propose PoLIM, leveraging simple weak supervision, a novel model which jointly models the effect of influence on both link and post generation. We also give PoLIM-FIT, an efficient parallel inference algorithm which scales to large datasets. In our experiments on a large tweets corpus, we detect meaningful topical communities, celebrities, as well as the influence strengths patterns among them. Further, we find that there are significant portions of posts and links that are caused by influence, and this portion increases when the data focuses on a specific event. We also show that differentiating and identifying these influenced content benefits other specific quantitative downstream tasks as well, like predicting future tweets and link formation, where we significantly outperform state-of-the-art.

【Keywords】:

31. Positive-Unlabeled Classification under Class Prior Shift and Asymmetric Error.

【Paper Link】【Pages】:271-279

【Authors】: Nontawat Charoenphakdee ; Masashi Sugiyama

【Abstract】: Bottlenecks of binary classification from positive and unlabeled data (PU classification) are the requirements that given unlabeled patterns are drawn from the same distribution as the test distribution, and the penalty of the false positive error is identical to the false negative error. However, such requirements are often not fulfilled in practice. In this paper, we generalize PU classification to the class prior shift and asymmetric error scenarios. Based on the analysis of the Bayes optimal classifier, we show that given a test class prior, PU classification under class prior shift is equivalent to PU classification with asymmetric error. Then, we propose two different frameworks to handle these problems, namely, a risk minimization framework and density ratio estimation framework. Finally, we demonstrate the effectiveness of the proposed frameworks through experiments.

【Keywords】:

32. Fast Training for Large-Scale One-versus-All Linear Classifiers using Tree-Structured Initialization.

【Paper Link】【Pages】:280-288

【Authors】: Huang Fang ; Minhao Cheng ; Cho-Jui Hsieh ; Michael P. Friedlander

【Abstract】: We consider the problem of training one-versus-all (OVA) linear classifiers for multiclass or multilabel classification when the number of labels is large. A naive extension of OVA to this problem, even with hundreds of cores, usually requires hours for training on large real world datasets. We propose a novel algorithm called OVA-Primal++ that speeds up the training of OVA by using a tree-structured training order, where each classifier is trained using its parent's classifier as initialization. OVA-Primal++ is both theoretically and empirically faster than the naive OVA algorithm, and yet still enjoys the same highly parallelizability and small memory footprint. Extensive experiments on multiclass and multilabel classification datasets validate the effectiveness of our method.

【Keywords】:

33. Towards Identifying Causal Relation Between Instances and Labels.

【Paper Link】【Pages】:289-297

【Authors】: Tian-Zuo Wang ; Sheng-Jun Huang ; Zhi-Hua Zhou

【Abstract】: Multi-Instance Multi-Label (MIML) learning is a popular framework in machine learning, where each object is represented by a bag of instances, and associated with multiple labels. While MIML learning has achieved success in many applications, it is less clear how the labels are related to the instances. In this paper, we propose to study the causal relation between instances and labels, which on one hand can improve the interpretability of complicated MIML models, and on the other hand may further improve the prediction performance at both instance and bag levels. We exploit prototypes in the instance space as a bridge to represent the examples, and then propose an efficient algorithm to identify the causal relations from prototypes to class labels, which are further utilized for model training and key instance detection. Experiments on various datasets show that in addition to superior classification performance, our approach can identify reasonable causal relations between instances and labels.

【Keywords】:

34. Discovering and Controlling for Latent Confounds in Text Classification Using Adversarial Domain Adaptation.

【Paper Link】【Pages】:298-305

【Authors】: Virgile Landeiro ; Tuan Tran ; Aron Culotta

【Abstract】: In text classification, the testing data often systematically differ from the training data, a problem called dataset shift. In this paper, we investigate a type of dataset shift we call confounding shift. Such a setting exists when two conditions are met: (a) there is a confound variable Z that influences both text features X and class label Y; (b) the relationship between Z and Y changes from training to testing. While recent work in this area has required confounds to be known ahead of time, this is unrealistic for many settings. To address this shortcoming, we propose a method both to discover and to control for potential confounds. The approach first uses neural network-based topic modeling to discover potential confounds that differ between training and testing data, then uses adversarial training to fit a classification model that is invariant to these discovered confounds. We find the resulting method to improve over state-of-the-art domain adaptation method, while also producing results that are competitive with those obtained when confounds are known ahead of time.

【Keywords】:

35. A Bayesian Approach for Accurate Classification-Based Aggregates.

【Paper Link】【Pages】:306-314

【Authors】: Q. A. Meertens ; C. G. H. Diks ; H. Jaap van den Herik ; Frank W. Takes

【Abstract】: In this paper, we study the accuracy of values aggregated over classes predicted by a classification algorithm. The problem is that the resulting aggregates (e.g., sums of a variable) are known to be biased. The bias can be large even for highly accurate classification algorithms, in particular when dealing with class-imbalanced data. To correct this bias, the algorithm's classification error rates have to be estimated. In this estimation, two issues arise when applying existing bias correction methods. First, inaccuracies in estimating classification error rates have to be taken into account. Second, impermissible estimates, such as a negative estimate for a positive value, have to be dismissed. We show that both issues are relevant in applications where the true labels are known only for a small set of data points. We propose a novel bias correction method using Bayesian inference. The novelty of our method is that it imposes constraints on the model parameters. We show that our method solves the problem of biased classification-based aggregates as well as the two issues above, in the general setting of multi-class classification. In the empirical evaluation, using a binary classifier on a real-world dataset of company tax returns, we show that our method outperforms existing methods in terms of mean squared error.

【Keywords】:

36. Decision List Optimization based on Continuous Relaxation.

【Paper Link】【Pages】:315-323

【Authors】: Yuzuru Okajima ; Kunihiko Sadamasa

【Abstract】: The interpretability of machine learning models is a crucial factor if human experts have to check and explain why decisions are made. Because of the growing interest in in-terpretability, a simple rule-based model called a decision list has recently regained attention, and several techniques are proposed to improve the accuracy of decision lists by exploiting the power of modern high-speed processors. However, even with the help of modern processors, optimizing decision lists is still difficult and inefficient because of the discrete nature of decision lists. In this paper, we propose an algorithm that optimizes decision lists in a continuously relaxed form, instead of optimizing them in the original discrete form. First, from a given rule set, we construct a continuous decision list that is a continuously relaxed version of a decision list. It is optimized using a continuous optimization method and is finally converted into a decision list of the original discrete form. This approach can exploit the efficiency of continuous optimization methods supported by the optimization frameworks recently developed for deep learning. In experiments, our algorithm performed more efficiently than the state-of-the-art decision list learning algorithms, i.e., it achieved significantly higher accuracies in shorter times on large datasets.

【Keywords】:

37. Mining Maximal Induced Bicliques using Odd Cycle Transversals.

【Paper Link】【Pages】:324-332

【Authors】: Kyle Kloster ; Blair D. Sullivan ; Andrew van der Poel

【Abstract】: Many common graph data mining tasks take the form of identifying dense subgraphs (e.g. clustering, clique-finding, etc). In biological applications, the natural model for these dense substructures is often a complete bipartite graph (biclique), and the problem requires enumerating all maximal bicliques (instead of identifying just the largest or densest). The best known algorithm in general graphs is due to Dias et al., and runs in time O(M|V|4), where M is the number of maximal induced bicliques (MIBs) in the graph. When the graph being searched is itself bipartite, Zhang et al. give a faster algorithm where the time per MIB depends on the number of edges in the graph. In this work, we present a new algorithm for enumerating MIBs in general graphs, whose run time depends on how “close to bipartite” the input is. Specifically, the runtime is parameterized by the size k of an odd cycle transversal (OCT), a vertex set whose deletion results in a bipartite graph. Our algorithm runs in time O(M|V‖E|k23k/3), which is an improvement on Dias et al. whenever k ≤ 3 log3 |V|. We implement our algorithm alongside a variant of Dias et al.'s in open-source C++ code, and experimentally verify that the OCT-based approach is faster in practice on graphs with a wide variety of sizes, densities, and OCT decompositions.

【Keywords】:

38. DSL: Discriminative Subgraph Learning via Sparse Self-Representation.

【Paper Link】【Pages】:333-341

【Authors】: Lin Zhang ; Petko Bogdanov

【Abstract】: The goal in network state prediction (NSP) is to classify the global state (label) associated with features embedded in a graph. This graph structure encoding feature relationships is the key distinctive aspect of NSP compared to classical supervised learning. NSP arises in various applications: gene expression samples embedded in a protein-protein interaction (PPI) network, temporal snapshots of infrastructure or sensor networks, and fMRI coherence network samples from multiple subjects to name a few. Instances from these domains are typically “wide” (more features than samples), and thus, feature sub-selection is required for robust and generalizable prediction. How to best employ the network structure in order to learn succinct connected subgraphs encompassing the most discriminative features becomes a central challenge in NSP. Prior work employs connected subgraph sampling or graph smoothing within optimization frameworks, resulting in either large variance of quality or weak control over the connectivity of selected subgraphs. In this work we propose an optimization framework for discriminative subgraph learning (DSL) which simultaneously enforces (i) sparsity, (ii) connectivity and (iii) high discriminative power of the resulting subgraphs of features. Our optimization algorithm is a single-step solution for the NSP and the associated feature selection problem. It is rooted in the rich literature on maximal-margin optimization, spectral graph methods and sparse subspace self-representation. DSL simultaneously ensures solution interpretability and superior predictive power (up to 16% improvement in challenging instances compared to baselines), with execution times up to an hour for large instances.

【Keywords】:

【Paper Link】【Pages】:342-350

【Authors】: Carl Yang ; Kevin Chang

【Abstract】: On social networks, while nodes bear rich attributes, we often lack the ‘semantics’ of why each link is formed–and thus we are missing the ‘road signs’ to navigate and organize the complex social universe. How to identify relationship semantics without labeled links? Founded on the prevalent homophily principle, we propose the novel problem of Attribute-based Relationship Profiling (ARP), to profile the closeness w.r.t. the underlying relationships (e.g., schoolmate) between users based on their similarity in the corresponding attributes (e.g., schools) and, as output, learn a set of social affinity graphs, where each link is weighted by its probabilities of carrying the relationships. As requirements, ARP should be systematic and complete to profile every link for every relationship– our challenges lie in effectively modeling homophily. We propose a novel reverse smoothness principle by observing that the similarity-closeness duality of homophily is consistent with the well-known smoothness assumption in graph-based semi-supervised learning– only the direction of inference is reversed. To realize smoothness over noisy social graphs, we further propose a novel holistic closeness modeling approach to capture ‘high-order’ smoothness by extending closeness from edges to paths. Extensive experiments on three real-world datasets demonstrate the efficacy of our proposed algorithm for ARP.

【Keywords】:

40. Edge Replacement Grammars : A Formal Language Approach for Generating Graphs.

【Paper Link】【Pages】:351-359

【Authors】: Revanth Reddy ; Sarath Chandar ; Balaraman Ravindran

【Abstract】: Graphs are increasingly becoming ubiquitous as models for structured data. A generative model that closely mimics the structural properties of a given set of graphs has utility in a variety of domains. Much of the existing work require that a large number of parameters, in fact exponential in size of the graphs, be estimated from the data. We take a slightly different approach to this problem, leveraging the extensive prior work in the formal graph grammar literature. In this paper, we propose a graph generation model based on Probabilistic Edge Replacement Grammars (PERGs). We propose a variant of PERG called Restricted PERG (RPERG), which is analogous to PCFGs in string grammar literature. With this restriction, we are able to derive a learning algorithm for estimating the parameters of the grammar from graph data. We empirically demonstrate on real life datasets that RPERGs outperform existing methods for graph generation. We improve on the performance of the state-of-the-art Hyperedge Replacement Grammar based graph generative model. Despite being a context free grammar, the proposed model is able to capture many of the structural properties of real networks, such as degree distributions, power law and spectral characteristics.

【Keywords】:

41. Temporal Graph Regression via Structure-Aware Intrinsic Representation Learning.

【Paper Link】【Pages】:360-368

【Authors】: Chao Han ; Xi Hang Cao ; Marija Stanojevic ; Mohamed F. Ghalwash ; Zoran Obradovic

【Abstract】: Temporal graph regression is a frequently encountered research problem in many studies of graph analytics. A temporal graph is a sequence of attributed graphs where node features and target variables change over time, but network structure stays constant. The task of temporal graph regression is to predict the target variables associated with nodes at future time-points given historical snapshots of the graph. Existing methods tackle this problem mostly by conducting structured regression for all target variables. However, those methods have limited performance due to redundant information. Although several techniques have been proposed recently to learn lower dimensional embedding for the target space, the problem of how to effectively exploit the structure of the temporal graph in such embeddings is still unsolved. Other recent works only study node embedding of the stationary graphs only, and this is not applicable to temporal attributed graphs. In this paper, we introduced a Structure-Aware Intrinsic Representation Learning model (SAIRL) to jointly learn lower dimensional embeddings of the target space and feature space via structure-aware graph abstraction and feature-aware target embedding learning. To solve this problem, we have developed a derivative-free block coordinate descent algorithm with closed-form solutions. To characterize the quality of embedding-based learned with SAIRL, we conducted extensive experiments on a variety of different real-world temporal graphs. The results indicate that the proposed method can be more accurate than the state-of-the-art embedding learning methods, regardless of regressors.

【Keywords】:

42. Multi-Dimensional, Multilayer, Nonlinear and Dynamic HITS.

【Paper Link】【Pages】:369-377

【Authors】: Francesca Arrigo ; Francesco Tudisco

【Abstract】: We introduce a ranking model for temporal multidimensional weighted and directed networks based on the Perron eigenvector of a multi-homogeneous order-preserving map. The model extends to the temporal multilayer setting the HITS algorithm and defines five centrality vectors: two for the nodes, two for the layers, and one for the temporal stamps. Nonlinearity is introduced in the standard HITS model in order to guarantee existence and uniqueness of these centrality vectors for any network, without any requirement on its connectivity structure. We introduce a globally convergent power iteration like algorithm for the computation of the centrality vectors. Numerical experiments on real-world networks are performed in order to assess the effectiveness of the proposed model and showcase the performance of the accompanying algorithm.

【Keywords】:

43. Flow-Based Local Graph Clustering with Better Seed Set Inclusion.

【Paper Link】【Pages】:378-386

【Authors】: Nate Veldt ; Christine Klymko ; David F. Gleich

【Abstract】: Flow-based methods for local graph clustering have received significant recent attention for their theoretical cut improvement and runtime guarantees. In this work we present two improvements for using flow-based methods in real-world semi-supervised clustering problems. Our first contribution is a generalized objective function that allows practitioners to place strict and soft penalties on excluding specific seed nodes from the output set. This feature allows us to avoid the tendency, often exhibited by previous flow-based methods, to contract a large seed set into a small set of nodes that does not contain all or even most of the seed nodes. Our second contribution is a fast algorithm for minimizing our generalized objective function, based on a variant of the push-relabel algorithm for computing preflows. We make our approach very fast in practice by implementing a global relabeling heuristic and employing a warm-start procedure to quickly solve related cut problems. In practice our algorithm is faster than previous related flow-based methods, and is also more robust in detecting ground truth target regions in a graph thanks to its ability to better incorporate semi-supervised information about target clusters.

【Keywords】:

44. Spherical Principal Component Analysis.

【Paper Link】【Pages】:387-395

【Authors】: Kai Liu ; Qiuwei Li ; Hua Wang ; Gongguo Tang

【Abstract】: Principal Component Analysis (PCA) is one of the most broadly used methods to analyze high-dimensional data. However, most existing studies on PCA aim to minimize the reconstruction error measured by the Euclidean distance, although in some fields, such as text analysis in information retrieval, analysis using the angle distance is known to be more effective. In this paper, we propose a novel PCA formulation by adding a constraint on the factors to unify the Euclidean distance and the angle distance. Because the objective and constraints are nonconvex, the optimization problem is difficult to solve in general. To tackle the optimization problem, we propose an alternating linearized minimization method with guaranteed convergence and provable convergence rate. Experiments on synthetic data and real-world data sets have validated the effectiveness of our new method and demonstrated its advantages over state-of-art competing methods.

【Keywords】:

45. Auto-Summarization: A Step Towards Unsupervised Learning of a Submodular Mixture.

【Paper Link】【Pages】:396-404

【Authors】: Chandrashekhar Lavania ; Jeff A. Bilmes

【Abstract】: We introduce an approach that requires the specification of only a handful of hyperparameters to determine a mixture of submodular functions for use in data science applications. Two techniques, applied in succession, are used to achieve this. The first involves training an autoencoder neural network constrainedly so that the bottleneck features have the following characteristic: the larger a feature's value, the more an input sample should have an automatically learnt property. This is analogous to bag of-words features, but where the “words” are learnt automatically. The second technique instantiates a mixture of submodular functions, each of which consists of a concave composed with a modular function comprised of the learnt neural network features. We introduce a mixture weight learning approach that does not (as is common) directly utilize supervised summary information. Instead, it optimizes a set of meta-objectives each of which corresponds to a likely necessary condition on what constitutes a good summarization objective. While hyperparameter optimization is often the bane of unsupervised methods, our approach reduces the learning of a summarization function (which most generally involves learning 2n parameters) down to the problem of selecting only a handful of hyperparameters. Empirical results on three very different modalities of data (i.e., image, text, and machine learning training data) show that our method produces functions that perform significantly better than a variety of unsupervised baseline methods.

【Keywords】:

46. Learning Diverse Gaussian Graphical Models and Interpreting Edges.

【Paper Link】【Pages】:405-413

【Authors】: Roozbeh Yousefzadeh ; Diane Oyen ; Nina Lanza

【Abstract】: Gaussian graphical models are used to discover patterns of variable dependencies in many scientific applications, yet there is no guarantee that the assumption of identically distributed (IID) samples is correct. In many cases, the non-IID nature of the data is due to the fact that some observations are produced by a different underlying process than the other samples. Therefore, it is informative to the analyst who is trying to understand patterns in their data to explore different graphical models produced by various subsets of the data. Learning a graphical model for every one of a combinatoric number of data subsets is intractable. We solve the problem with an interactive machine learning approach, by first learning a Gaussian graphical model from data, then finding a different subset of the data that would produce the most different Gaussian graphical model, allowing the user to explore the most diverse Gaussian graphical models that fit various subsets of their data. To find the most different Gaussian graphical model, we define an optimization problem that can be solved by gradient-based algorithms. Furthermore, to gain insight into the learned Gaussian graphical model, we explain an edge in the graph by finding the subset of observations in the dataset that are critical for defining the correlation that the edge represents. That is, we interpret edges by finding subsets of data such that their removal from the dataset will lead to eliminating an edge in the graph. This relational information enables analysts to interpret each edge in terms of its robustness and relationship to observations in the data. Our method finds patterns in data from the Mars rover and online recipes. By bringing transparency and interpretability, we enable the practitioners to create and use models that they are confident and insightful about.

【Keywords】:

47. Deep Co-Clustering.

【Paper Link】【Pages】:414-422

【Authors】: Dongkuan Xu ; Wei Cheng ; Bo Zong ; Jingchao Ni ; Dongjin Song ; Wenchao Yu ; Yuncong Chen ; Haifeng Chen ; Xiang Zhang

【Abstract】: Co-clustering partitions instances and features simultaneously by leveraging the duality between them and it often yields impressive performance improvement over traditional clustering algorithms. The recent development in learning deep representations has demonstrated the advantage in extracting effective features. However, the research on leveraging deep learning frameworks for co-clustering is limited for two reasons: 1) current deep clustering approaches usually decouple feature learning and cluster assignment as two separate steps, which cannot yield the task-specific feature representation; 2) existing deep clustering approaches cannot learn representations for instances and features simultaneously. In this paper, we propose a deep learning model for co-clustering called DeepCC. DeepCC utilizes the deep autoencoder for dimension reduction, and employs a variant of Gaussian Mixture Model (GMM) to infer the cluster assignments. A mutual information loss is proposed to bridge the training of instances and features. DeepCC jointly optimizes the parameters of the deep autoencoder and the mixture model in an end-to-end fashion on both the instance and the feature spaces, which can help the deep autoencoder escape from local optima and the mixture model circumvent the Expectation-Maximization (EM) algorithm. To the best of our knowledge, DeepCC is the first deep learning model for co-clustering. Experimental results on various dataseis demonstrate the effectiveness of DeepCC.

【Keywords】:

48. Discovering Multiple Co-Clusterings in Subspaces.

【Paper Link】【Pages】:423-431

【Authors】: Shixin Yao ; Guoxian Yu ; Xing Wang ; Jun Wang ; Carlotta Domeniconi ; Maozu Guo

【Abstract】: Multiple clustering approaches aim at exploring alternative ways of organizing a given collection of data into various clusters from different perspectives. Although multiple one-way clusterings have been studied for more than a decade, how to explore alternative two-way clusterings (or co-clusterings) still remains an untouched topic, and an important one from an application standpoint. To solve this interesting but yet unexplored topic, we assume the existence of alternative co-clusterings embedded in different subspaces and simultaneously pursue multiple co-clusterings therein. We initially specify a subspace indicator matrix for each feature subspace, and employ matrix tri-factorization to seek row-wise and column-wise cluster indicator matrices in each subspace. To ensure diversity, we quantify the redundancy between pairwise co-clusterings using the cluster indicator and the subspace indicator matrices. We further introduce a unified objective function to simultaneously account for the two pursues, and an alternating optimization solution to iteratively optimize cluster indicator and feature indicator matrices. Our empirical study shows that the proposed solution can explore multiple meaningful co-clusterings and generally achieves better results than state-of-the-art methods.

【Keywords】:

49. Accelerated Experimental Design for Pairwise Comparisons.

【Paper Link】【Pages】:432-440

【Authors】: Yuan Guo ; Jennifer G. Dy ; Deniz Erdogmus ; Jayashree Kalpathy-Cramer ; Susan Ostmo ; J. Peter Campbell ; Michael F. Chiang ; Stratis Ioannidis

【Abstract】: Pairwise comparison labels are more informative and less variable than class labels, but generating them poses a challenge: their number grows quadratically in the dataset size. We study a natural experimental design objective, namely, D-optimality, that can be used to identify which K pairwise comparisons to generate. This objective is known to perform well in practice, and is submodular, making the selection approximable via the greedy algorithm. A naïve greedy implementation has O(N2 d2 K) complexity, where N is the dataset size, d is the feature space dimension, and K is the number of generated comparisons. We show that, by exploiting the inherent geometry of the dataset–namely, that it consists of pairwise comparisons–the greedy algorithm's complexity can be reduced to O(N2 (K + d) + N(dK + d2) + d2 K). We apply the same acceleration also to the so-called lazy greedy algorithm. When combined, the above improvements lead to an execution time of less than 1 hour for a dataset with 108 comparisons; the naïve greedy algorithm on the same dataset would require more than 10 days to terminate.

【Keywords】:

50. Region-Based Active Learning with Hierarchical and Adaptive Region Construction.

【Paper Link】【Pages】:441-449

【Authors】: Zhipeng Luo ; Milos Hauskrecht

【Abstract】: Learning of classification models in practice often relies on human annotation effort in which humans assign class labels to data instances. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To solve this problem, instead of soliciting instance-based annotation we explore region-based annotation as the human feedback. A region is defined as a hyper-cubic subspace of the input space X and it covers a subpopulation of data instances that fall into this region. Each region is labeled with a number in [0,1] (in binary classification setting), representing a human estimate of the positive (or negative) class proportion in the subpopulation. To quickly discover pure regions (in terms of class proportion) in the data, we have developed a novel active learning framework that constructs regions in a hierarchical and adaptive way. Hierarchical means that regions are incrementally built into a hierarchical tree, which is done by repeatedly splitting the input space. Adaptive means that our framework can adaptively choose the best heuristic for each of the region splits. Through experiments on numerous datasets we demonstrate that our framework can identify pure regions in very few region queries. Thus our approach is shown to be effective in learning classification models from very limited human feedback.

【Keywords】:

51. Feature selection as Monte-Carlo Search in Growing Single Rooted Directed Acyclic Graph by Best Leaf Identification.

【Paper Link】【Pages】:450-458

【Authors】: Aurélien Pélissier ; Atsuyoshi Nakamura ; Koji Tabata

【Abstract】: Monte Carlo tree search (MCTS) has received considerable interest due to its spectacular success in the difficult problem of computer Go and also proved beneficial in a range of other domains. A major issue that has received little attention in the MCTS literature is the fact that, in most games, different actions can lead to the same state, that may lead to a high degree of redundancy in tree representation and unnecessary additional computational cost. We extend MCTS to single rooted directed acyclic graph (SR-DAG), and consider the Best Arm Identification (BAI) and the Best Leaf Identification (BLI) problem of an expanding SR-DAG of arbitrary depth. We propose algorithms that are (∊, σ)-correct in the fixed confidence setting, and prove an asymptotic upper bounds of sample complexity for our BAI algorithm. As a major application for our BLI algorithm, a novel approach for Feature Selection is proposed by representing the feature set space as a SR-DAG and repeatedly evaluating feature subsets until a candidate for the best leaf is returned, a proof of concept is shown on benchmark data sets.

【Keywords】:

52. Relative distance comparisons with confidence judgements.

【Paper Link】【Pages】:459-467

【Authors】: Stefan Mojsilovic ; Antti Ukkonen

【Abstract】: Relative distance comparisons, or “triplets”, are statements of the form “item a is closer to item b than c”. When eliciting such comparisons from human annotators, it is often the case that some comparisons are easy, while others are more ambiguous. However, also for the latter cases annotators are forced to choose one of the alternatives, despite possibly having a low confidence with their selection. To alleviate this problem, we discuss a variant of the distance comparison query where annotators are allowed to explicitly state their degree of confidence for each triplet. We propose algorithms both for learning the underlying pairwise distances, as well as computing an embedding of the items from such triplets. For the distance learning problem we devise an approach based on solving a system of linear equations, while for the embedding task we modify the t-STE algorithm to handle the confidence statements. We report experiments with synthetic and real data, including a novel study in which we collected the proposed type of triplets from 80 volunteers.

【Keywords】:

53. Doubly Robust Prediction and Evaluation Methods Improve Uplift Modeling for Observational Data.

【Paper Link】【Pages】:468-476

【Authors】: Yuta Saito ; Hayato Sakata ; Kazuhide Nakata

【Abstract】: Uplift modeling aims to optimize treatment allocation by predicting the net effect of a treatment on each individual (ITE) and is expected to achieve causal-based personalization in medicine, marketing, etc. This approach needs specialized methods to train and evaluate ITE prediction models because the true ITE is unobservable. The conventional uplift modeling requires data to be gathered through randomized controlled trials (RCTs), on the other hand, for non-RCT data, the transformed outcome (TO) is commonly used as an unbiased estimator of ITE. However, it is often impossible to conduct RCTs for ethical and economic reasons, and, in observational data, the unbiasedness of TO is based on the unrealistic assumption that the propensity score of each individual is given. In this paper, we theoretically and quantitatively show TO becomes an unreliable proxy ITE when the propensity score estimator is biased or has a large degree of heterogeneity. We then propose a novel proxy outcome, Switch Doubly Robust, turning on and off the effect of propensity score estimator on the outcome prediction models. We theoretically prove SDR achieves better bias-variance trade-off as a proxy ITE than TO and develop novel prediction (SDRM) and evaluation (SDR-MSE) methods. Furthermore, we experimentally show our methods outperformed existing approaches on synthetic datasets. In addition, we applied them to the Right Heart Catheterization dataset and discovered 20% of patients are actually curable, even though the conventional causal inference methods only showed the average treatment effect is negative. We anticipate our methods to be a standard practice of uplift modeling for observational data and lead to optimized personalization in various fields.

【Keywords】:

54. EstImAgg: A Learning Framework for Groupwise Aggregated Data.

【Paper Link】【Pages】:477-485

【Authors】: Avradeep Bhowmik ; Minmin Chen ; Zhengming Xing ; Suju Rajan

【Abstract】: Aggregation is a common technique in data-driven applications for handling issues like privacy, scalability and reliability in a vast range of domains including healthcare, sensor networks and web applications. However, despite the ubiquitousness, extending machine learning methods to the aggregation context is unfortunately not well-studied. In this work, we consider the problem of learning individual level predictive models when the target variables used for training are only available as aggregates. In particular, this problem is a critical bottleneck in designing effective bidding strategies in the context of online advertising where ground-truth cost-per-click (CPC) data is aggregated before being released to advertisers. We introduce a novel learning framework that can use aggregates computed at varying levels of granularity for building individual-level predictive models. We generalise our modelling and algorithmic framework to handle data from diverse domains, and extend our techniques to cover arbitrary aggregation paradigms like sliding windows and overlapping/non-uniform aggregation. We show empirical evidence for the efficacy of our techniques with experiments on both synthetic data and real data from the online advertising domain as well as healthcare to demonstrate the wider applicability of our framework.

【Keywords】:

55. DTEC: Distance Transformation Based Early Time Series Classification.

【Paper Link】【Pages】:486-494

【Authors】: Liuyi Yao ; Yaliang Li ; Yezheng Li ; Hengtong Zhang ; Mengdi Huai ; Jing Gao ; Aidong Zhang

【Abstract】: In many time-sensitive applications, knowing the classification results as early as possible while preserving the accuracy is extremely important for further actions. Shapelet-based early classification methods are popular due to their natural interpretability. However, most of the existing shapelet-based methods ignore the distance information between the shapelets and the time series. The distance information, though may contain some noise, can reflect more information between the shapelets and the time series. Some existing works adopt the distance information, but are not robust to the noise in the distance information. To tackle this challenge, we present a novel distance transformation based early classification (DTEC) framework, which transfers the original time series into the distance space. Upon the distance space, a probabilistic classifier is trained, and a novel classification criterion confidence area is proposed in order to overcome the noise brought by the training phase and the dataset. The effectiveness of the proposed framework is validated on three time series benchmarks as well as the extensive datasets selected from UCR time series archive.

【Keywords】:

56. Leveraging Subsequence-orders for Univariate and Multivariate Time-series Classification.

【Paper Link】【Pages】:495-503

【Authors】: Shoumik Roychoudhury ; Fang Zhou ; Zoran Obradovic

【Abstract】: Highly discriminative short time-series subsequences, known as shapelets, are used to classify a time-series. The existing shapelet-based methods for time-series classification assume that shapelets are independent of each other. However, they neglect temporal dependencies among pairs of shapelets, which are informative features that exist in many applications. Within this new framework, we explore a scheme to extract informative orders among shapelets by considering the time gap between two shapelets. In addition, we propose a novel model, Pairwise Shapelet-Orders Discovery, which extracts both informative shapelets and shapelet-orders and incorporates the shapelet-transformed space with shapelet-order space for time-series classification. The hypothesis of the study is that the extracted orders could increase the confidence of the prediction and further improves the classification performance. The results of extensive experiments conducted on 75 univariate and 6 multivariate real-world datasets provide evidence that the proposed model could significantly improve accuracy on average over baseline methods.

【Keywords】:

57. Branch and Border: Partition-Based Change Detection in Multivariate Time Series.

【Paper Link】【Pages】:504-512

【Authors】: Bryan Hooi ; Christos Faloutsos

【Abstract】: Given multivariate time series data, how do we detect changes in the behavior of the time series: for example, the onset of illnesses or complications in patients? Can we do this without making strong assumptions about the data? We propose BnB (Branch and Border), an online, nonparametric change detection method that detects multiple changes in multivariate data. Unlike existing methods, BnB approaches change detection by separating points before and after the change using an ensemble of random partitions. BnB is (a) scalable: it scales linearly in the number of time ticks and dimensions, and is online, thus using bounded memory and bounded time per iteration; (b) effective: providing theoretical guarantees on the false positive rate, and achieving 70% or more increased F-measure over baselines in experiments averaged over 11 datasets; (c) general: it is nonparametric, and works on mixed data, including numerical, categorical, and ordinal data.

【Keywords】:

58. Spatial Context-Aware Networks for Mining Temporal Discriminative Period in Land Cover Detection.

【Paper Link】【Pages】:513-521

【Authors】: Xiaowei Jia ; Sheng Li ; Ankush Khandelwal ; Guruprasad Nayak ; Anuj Karpatne ; Vipin Kumar

【Abstract】: Detecting land use and land cover changes is critical to monitor natural resources and analyze global environmental changes. In this paper, we investigate the land cover detection using the remote sensing data from earth-observing satellites. Due to the natural disturbances, e.g., clouds and aerosoles, and the data acquisition errors by devices, remote sensing data frequently contain much noise. Also, many land covers cannot be easily identified in most dates of a year. Instead, they show distinctive temporal patterns only during certain period of a year, which is also referred to as the discriminative period. To address these challenges, we propose a novel framework which combines the spatial context knowledge with the LSTM-based temporal modeling for land cover detection. Specifically, the framework learns the spatial context knowledge selectively from its neighboring locations. Then we propose two approaches for discriminative period detection based on multi-instance learning and local attention mechanism, respectively. Our evaluations in two real-world applications demonstrate the effectiveness of the proposed method in identifying land covers and detecting discriminative periods.

【Keywords】:

59. Elastic bands across the path: A new framework and method to lower bound DTW.

【Paper Link】【Pages】:522-530

【Authors】: Chang Wei Tan ; François Petitjean ; Geoffrey I. Webb

【Abstract】: The Nearest Neighbour algorithm coupled with the Dynamic Time Warping similarity measure (NN-DTW) is at the core of state-of-the-art classification algorithms including Ensemble of Elastic Distances and Collection of Transformation-Based Ensemble. DTW's complexity makes NN-DTW highly computationally demanding. To combat this, lower bounds to DTW are used to minimize the number of times the expensive DTW need be computed during NN-DTW search. Effective lower bounds must balance ‘time to calculate’ vs ‘tightness to DTW.‘ On the one hand, the tighter the bound the fewer the calls to the full DTW. On the other, calculating tighter bounds usually requires greater computation. Numerous lower bounds have been proposed. Different bounds provide different trade-off between computational time and tightness. In this work, we present a new class of lower bounds that are tighter than the popular Keogh lower bound, while requiring similar computation time. Our new lower bounds take advantage of the DTW boundary condition, monotonicity and continuity constraints. In contrast to most existing bounds, they remain relatively tight even for large windows. A single parameter to these new lower bounds controls the speed-tightness trade-off. We demonstrate that these new lower bounds provide an exceptional balance between computation time and tightness for the NN-DTW time series classification task, resulting in greatly improved efficiency for NN-DTW lower bound search.

【Keywords】:

60. Coupled Clustering of Time-Series and Networks.

【Paper Link】【Pages】:531-539

【Authors】: Yike Liu ; Linhong Zhu ; Pedro A. Szekely ; Aram Galstyan ; Danai Koutra

【Abstract】: Motivated by the problem of human-trafficking, where it is often observed that criminal organizations are linked and behave similarly over time, we introduce the problem of Coupled Clustering of Time-series and their underlying Network. The goal is to find tightly connected subgroups of nodes that also have similar node-specific time series (temporal—not necessarily structural—behavior). We formulate the problem as a coupled matrix factorization for the time series, combined with regularization for network smoothness. We propose CCTN, and an incrementally-updated counterpart, CCTN-inc, which efficiently handles network updates. Extensive experiments show that CCTN is up to 4x more accurate than baselines that consider graph structure or time series alone, and CCTN-inc is up to 55x faster than CCTN. As an application, we explore an exclusive database with millions of online ads on human trafficking, and successfully deploy our technique to detect criminal organizations.

【Keywords】:

61. Classifying Heterogeneous Sequential Data by Cyclic Domain Adaptation: An Application in Land Cover Detection.

【Paper Link】【Pages】:540-548

【Authors】: Xiaowei Jia ; Guruprasad Nayak ; Ankush Khandelwal ; Anuj Karpatne ; Vipin Kumar

【Abstract】: Recent advances in processing remote sensing data have provided unprecedented potential for monitoring land covers. However, it is extremely challenging to deploy an automated monitoring system for different regions and across different years given the involved data heterogeneity over space and over time. The heterogeneity exists on two aspects. First, for many land covers, the distinguishing temporal patterns are only visible in certain discriminative period. Due to the change of weather conditions, the discriminative period can shift across space and time, which causes heterogeneity to the sequential data. Second, the collected remote sensing data are affected by acquisition devices and natural variables, e.g., precipitation and sunlight. In this paper, we introduce a novel framework to effectively detect land covers using the sequential remote sensing data. At the same time, we propose new learning strategies based on attention networks and domain adaptation to addresses the aforementioned challenges. The evaluation on two real-world applications - cropland mapping and burned area detection, demonstrate that the proposed method can effectively detect land covers under different weather conditions.

【Keywords】:

62. Incomplete Conditional Density Estimation for Fast Materials Discovery.

【Paper Link】【Pages】:549-557

【Authors】: Phuoc Nguyen ; Truyen Tran ; Sunil Gupta ; Santu Rana ; Matthew Barnett ; Svetha Venkatesh

【Abstract】: Designing new physical products and processes requires enormous experimentation. The scientific simulators play a fundamental role for such design tasks. To design a new product with certain target characteristics, a search is performed in the design space by trying out a large number of design combinations through simulators before reaching to the target characteristics. However, searching for the target design using simulators is generally expensive and becomes prohibitive when the target is either revised or only partially specified. To address this problem, we use a machine learning model to predict the design in single step using the target product specifications as input. We overcome two technical challenges: the first caused due to one-to-many mapping when learning the inverse problem and the second caused due to a user specifying the target specifications only partially. We unify a conditional variational auto-encoder model (to address the partial target specification) with mixture density networks (to address the one-to-many mapping) and train an end-to-end model to predict the optimum design.

【Keywords】:

63. Physics Guided RNNs for Modeling Dynamical Systems: A Case Study in Simulating Lake Temperature Profiles.

【Paper Link】【Pages】:558-566

【Authors】: Xiaowei Jia ; Jared Willard ; Anuj Karpatne ; Jordan S. Read ; Jacob Zwart ; Michael S. Steinbach ; Vipin Kumar

【Abstract】: This paper proposes a physics-guided recurrent neural network model (PGRNN) that combines RNNs and physics-based models to leverage their complementary strengths and improve the modeling of physical processes. Specifically, we show that a PGRNN can improve prediction accuracy over that of physical models, while generating outputs consistent with physical laws, and achieving good generalizability. Standard RNNs, even when producing superior prediction accuracy, often produce physically inconsistent results and lack generalizability. We further enhance this approach by using a pre-training method that leverages the simulated data from a physics-based model to address the scarcity of observed data. Although we present and evaluate this methodology in the context of modeling the dynamics of temperature in lakes, it is applicable more widely to a range of scientific and engineering disciplines where mechanistic (also known as process-based) models are used, e.g., power engineering, climate science, materials science, computational chemistry, and biomedicine.

【Keywords】:

【Paper Link】【Pages】:567-575

【Authors】: Miao Fan ; Chao Feng ; Mingming Sun ; Ping Li ; Haifeng Wang

【Abstract】: The e-commerce websites are ready to build the community question answering (CQA) service, as it can facilitate questioners (potential buyers) to obtain satisfying answers from experienced customers and furthermore stimulate consumption. Given that more than 50% product-related questions only anticipate a binary response (i.e., “Yes” or “No”), the research on product-related question answering (PQA), which aims to automatically provide instant and correct replies to questioners, emerges rapidly. The mainstream approaches on PQA generally employ customer reviews as the evidence to help predict answers to the questions which are product-specific and concerned more about subjective personal experiences. However, the supportive features either extracted by heuristic rules or acquired from unsupervised manners are not able to perform well on PQA. In this paper, we contribute an end-to-end neural architecture directly fed by the raw text of product-related questions and customer reviews to predict the answers. Concretely, it teaches machines to generate and to synthesize multiple question-aware review representations in a reading comprehension fashion to make the final decision. We also extract a real-world dataset crawled from 9 categories in Amazon.com for PQA to assess the performance of our neural reading architecture (NRA) and other mainstream approaches such as COR-L [12], MOQA [12], and AAP [21]. Experimental results show that our NRA sets up a new state-of-the-art performance on this dataset, significantly outperforming existing algorithms.

【Keywords】:

65. An Interpretable Fast Model for Predicting The Risk of Heart Failure.

【Paper Link】【Pages】:576-584

【Authors】: Xianli Zhang ; Buyue Qian ; Xiaoyu Li ; Jishang Wei ; Yingqian Zheng ; Lingyun Song ; Qinghua Zheng

【Abstract】: Lately, thanks to the huge amount of Electronic Health Records (EHR) data, deep learning models have been successfully applied to a variety of clinical prediction problems. Existing state-of-the-art clinical predicting models are usually built with recurrent neural network (RNN) and attention mechanism. However, such RNN based approaches mainly suffer from three limitations on clinical predictions, which if addressed would significantly widen their applicability. (i) Accuracy: The performance of RNN based models drops fast when the length of EHR sequences increases. (ii) Interpretability: The prediction results of RNN based models are hard to interpret due to the nature of deep models. (iii) Efficiency: The sequential property of RNN based models makes the parallelization of computation impossible, and accordingly hurts the efficiency of such models in practice. In this paper, we propose an efficient attention-based model to address the above three challenges simultaneously. In the context of heart failure prediction task, we demonstrate the interpretation capability of our model by visualizing relative connections between events and prediction result, and the high computational efficiency comparing to other baseline methods. Meanwhile, we show that the accuracy of our prediction model is comparable or better than those of other state-of-the-art prediction models in healthcare applications.

【Keywords】:

66. LSCP: Locally Selective Combination in Parallel Outlier Ensembles.

【Paper Link】【Pages】:585-593

【Authors】: Yue Zhao ; Zain Nasrullah ; Maciej K. Hryniewicki ; Zheng Li

【Abstract】: In unsupervised outlier ensembles, the absence of ground truth makes the combination of base outlier detectors a challenging task. Specifically, existing parallel outlier ensembles lack a reliable way of selecting competent base detectors, affecting accuracy and stability, during model combination. In this paper, we propose a framework—called Locally Selective Combination in Parallel Outlier Ensembles (LSCP)—which addresses the issue by defining a local region around a test instance using the consensus of its nearest neighbors in randomly selected feature subspaces. The top-performing base detectors in this local region are selected and combined as the model's final output. Four variants of the LSCP framework are compared with seven widely used parallel frameworks. Experimental results demonstrate that one of these variants, LSCP_AOM, consistently outperforms baselines on the majority of twenty real-world datasets.

【Keywords】:

67. Deep Anomaly Detection on Attributed Networks.

【Paper Link】【Pages】:594-602

【Authors】: Kaize Ding ; Jundong Li ; Rohit Bhanushali ; Huan Liu

【Abstract】: Attributed networks are ubiquitous and form a critical component of modern information infrastructure, where additional node attributes complement the raw network structure in knowledge discovery. Recently, detecting anomalous nodes on attributed networks has attracted an increasing amount of research attention, with broad applications in various high-impact domains, such as cybersecurity, finance, and healthcare. Most of the existing attempts, however, tackle the problem with shallow learning mechanisms by ego-network or community analysis, or through subspace selection. Undoubtedly, these models cannot fully address the computational challenges on attributed networks. For example, they often suffer from the network sparsity and data nonlinearity issues, and fail to capture the complex interactions between different information modalities, thus negatively impact the performance of anomaly detection. To tackle the aforementioned problems, in this paper, we study the anomaly detection problem on attributed networks by developing a novel deep model. In particular, our proposed deep model: (1) explicitly models the topological structure and nodal attributes seamlessly for node embedding learning with the prevalent graph convolutional network (GCN); and (2) is customized to address the anomaly detection problem by virtue of deep autoencoder that leverages the learned embeddings to reconstruct the original data. The synergy between GCN and autoencoder enables us to spot anomalies by measuring the reconstruction errors of nodes from both the structure and the attribute perspectives. Extensive experiments on real-world attributed network datasets demonstrate the efficacy of our proposed algorithm.

【Keywords】:

68. ABACUS: Unsupervised Multivariate Change Detection via Bayesian Source Separation.

【Paper Link】【Pages】:603-611

【Authors】: Wenyu Zhang ; Daniel E. Gilbert ; David S. Matteson

【Abstract】: Change detection involves segmenting sequential data such that observations in the same segment share some desired properties. Multivariate change detection continues to be a challenging problem due to the variety of ways change points can be correlated across channels and the potentially poor signal-to-noise ratio on individual channels. In this paper, we are interested in locating additive outliers (AO) and level shifts (LS) in the unsupervised setting. We propose ABACUS, Automatic BAyesian Changepoints Under Sparsity, a Bayesian source separation technique to recover latent signals while also detecting changes in model parameters. Multi-level sparsity achieves both dimension reduction and modeling of signal changes. We show ABACUS has competitive or superior performance in simulation studies against state-of-the-art change detection methods and established latent variable models. We also illustrate ABACUS on two real application, modeling genomic profiles and analyzing household electricity consumption.

【Keywords】:

69. Learning On-the-Job to Re-rank Anomalies from Top-1 Feedback.

【Paper Link】【Pages】:612-620

【Authors】: Hemank Lamba ; Leman Akoglu

【Abstract】: In many anomaly mining scenarios, a human expert verifies the anomaly at-the-top (as ranked by an anomaly detector) before they move on to the next. This verification produces a label—true positive (TP) or false positive (FP). In this work, we show how to leverage this label feedback for the top-1 instance to quickly re-rank the anomalies in an online fashion. In contrast to a detector that ranks once and goes offline, we propose a detector called OJRank that works alongside the human and continues to learn (how to rank) on-the-job, i.e., from every feedback. The benefits OJRank provides are two-fold; it reduces (i) the false positive rate by ‘muting’ the anomalies similar to FP instances; as well as (ii) the expert effort by elevating to the top the anomalies similar to a TP instance. We show that OJRank achieves statistically significant improvement on both detection precision and human effort over the offline detector as well as existing state-of-the-art ranking strategies, while keeping the per feedback response time (to re-rank) well below a second.

【Keywords】:

70. SMF: Drift-Aware Matrix Factorization with Seasonal Patterns.

【Paper Link】【Pages】:621-629

【Authors】: Bryan Hooi ; Kijung Shin ; Shenghua Liu ; Christos Faloutsos

【Abstract】: Consider a stream of time-stamped events, such as taxi rides, where we record the start and end locations of each ride. How do we learn a matrix factorization model which takes into account seasonal patterns (such as: rides toward office areas occur more frequently in the morning), and use it to forecast taxi rides tomorrow? Also, how can we model drift (such as population growth), and detect sudden changes (or anomalies)? Existing matrix factorization algorithms do not take seasonal patterns into account. We propose SMF (Seasonal Matrix Factorization), a matrix factorization model for seasonal data, and a streaming algorithm for fitting it. SMF is (a) accurate in forecasting: outperforming baselines by 13% to 60% in RMSE; (b) online: requiring fixed memory even as more data is received over time, and scaling linearly; (c) effective: providing interpretable results. In addition, we propose SMF-A, an algorithm which detects anomalies in a computationally feasible way, without forecasting every observation in the matrix.

【Keywords】:

71. Multi-Stage Variational Auto-Encoders for Coarse-to-Fine Image Generation.

【Paper Link】【Pages】:630-638

【Authors】: Lei Cai ; Hongyang Gao ; Shuiwang Ji

【Abstract】: Variational auto-encoder (VAE) is a powerful unsupervised learning framework for image generation. One drawback of VAE is that it generates blurry images due to its Gaussianity assumption and thus ℓ2 loss. To allow the generation of high quality images by VAE, we increase the capacity of decoder network by employing residual blocks and skip connections, which also enable efficient optimization. To overcome the limitation of ℓ2 loss, we propose to generate images in a multi-stage manner from coarse to fine. In the simplest case, the proposed multi-stage VAE divides the decoder into two components in which the second component generates refined images based on the course images generated by the first component. Since the second component is independent of the VAE model, it can employ other loss functions beyond the ℓ2 loss and different model architectures. The proposed framework can be easily generalized to contain more than two components. Experiment results on the MNIST and CelebA datasets demonstrate that the proposed multi-stage VAE can generate sharper images as compared to those from the original VAE.

【Keywords】:

72. Path-based Attribute-aware Representation Learning for Relation Prediction.

【Paper Link】【Pages】:639-647

【Authors】: Ying Shen ; Desi Wen ; Yaliang Li ; Nan Du ; Hai-Tao Zheng ; Min Yang

【Abstract】: Knowledge graphs (KGs) have been applied to many semantic-driven applications, including knowledge interchange and semantic inference. However, most KGs are far from complete and are growing rapidly. Although significant progress has been made in the symbolic representation learning of KGs with structural information, the textual knowledge that plays a crucial role in relation prediction is underutilized, and the issues of redundancy and noise path remain to be settled. In this paper, a Path-based Attribute-aware Representation Learning model (PARL) has been proposed to perform path denoising and path representation learning for the relation prediction task. We develop a novel text-enhanced relation prediction architecture, which interactively learns KG structural and textual representations to vary the sparsity and reliability of KG. Moreover, a path denoising algorithm is presented to emphasize paths with rich information and reduce the impact of redundancy and noise path. Experiments on a public dataset demonstrate that PARL consistently outperforms state-of-the-art methods on relation prediction and KG completion tasks.

【Keywords】:

73. Spatial Variational Auto-Encoding via Matrix-Variate Normal Distributions.

【Paper Link】【Pages】:648-656

【Authors】: Zhengyang Wang ; Hao Yuan ; Shuiwang Ji

【Abstract】: The key idea of variational auto-encoders (VAEs) resembles that of traditional auto-encoder models in which spatial information is supposed to be explicitly encoded in the latent space. However, the latent variables in VAEs are vectors, which can be interpreted as multiple feature maps of size 1x1. Such representations can only convey spatial information implicitly when coupled with powerful decoders. In this work, we propose spatial VAEs that use feature maps of larger size as latent variables to explicitly capture spatial information. This is achieved by allowing the latent variables to be sampled from matrix-variate normal (MVN) distributions whose parameters are computed from the encoder network. To increase dependencies among locations on latent feature maps and reduce the number of parameters, we further propose spatial VAEs via low-rank MVN distributions. Experimental results show that the proposed spatial VAEs outperform original VAEs in capturing rich structural and spatial information.

【Keywords】:

74. Multi-dimensional Graph Convolutional Networks.

【Paper Link】【Pages】:657-665

【Authors】: Yao Ma ; Suhang Wang ; Charu C. Aggarwal ; Dawei Yin ; Jiliang Tang

【Abstract】: Convolutional neural networks (CNNs) leverage the great power in representation learning on regular grid data such as image and video. Recently, increasing attention has been paid on generalizing CNNs to graph or network data which is highly irregular. Some focus on graph-level representation learning while others aim to learn node-level representations. These methods have been shown to boost the performance of many graph-level tasks such as graph classification and node-level tasks such as node classification. Most of these methods have been designed for single-dimensional graphs where a pair of nodes can only be connected by one type of relation. However, many real-world graphs have multiple types of relations and they can be naturally modeled as multi-dimensional graphs with each type of relation as a dimension. Multi-dimensional graphs bring about richer interactions between dimensions, which poses tremendous challenges to the graph convolutional neural networks designed for single-dimensional graphs. In this paper, we study the problem of graph convolutional networks for multidimensional graphs and propose a multi-dimensional convolutional neural network model mGCN aiming to capture rich information in learning node-level representations for multi-dimensional graphs. Comprehensive experiments on real-world multi-dimensional graphs demonstrate the effectiveness of the proposed framework.

【Keywords】:

75. Autonomous Deep Learning: Continual Learning Approach for Dynamic Environments.

【Paper Link】【Pages】:666-674

【Authors】: Andri Ashfahani ; Mahardhika Pratama

【Abstract】: The feasibility of deep neural networks (DNNs) to address data stream problems still requires intensive study because of the static and offline nature of conventional deep learning approaches. A deep continual learning algorithm, namely autonomous deep learning (ADL), is proposed in this paper. Unlike traditional deep learning methods, ADL features a flexible structure where its network structure can be constructed from scratch with the absence of initial network structure via the self-constructing network structure. ADL specifically addresses catastrophic forgetting by having a different-depth structure which is capable of achieving a trade-off between plasticity and stability. Network significance (NS) formula is proposed to drive the hidden nodes growing and pruning mechanism. Drift detection scenario (DDS) is put forward to signal distributional changes in data streams which induce the creation of a new hidden layer. Maximum information compression index (MICI) method plays an important role as a complexity reduction module eliminating redundant layers. The efficacy of ADL is numerically validated under the prequential test-then-train procedure in lifelong environments using nine popular data stream problems. The numerical results demonstrate that ADL consistently outperforms recent continual learning methods while characterizing the automatic construction of network structures.

【Keywords】:

76. Deep Transfer Reinforcement Learning for Text Summarization.

【Paper Link】【Pages】:675-683

【Authors】: Yaser Keneshloo ; Naren Ramakrishnan ; Chandan K. Reddy

【Abstract】: Deep neural networks are data hungry models and thus face difficulties when attempting to train on small text datasets. Transfer learning is a potential solution but their effectiveness in the text domain is not as explored as in areas such as image analysis. In this paper, we study the problem of transfer learning for text summarization and discuss why existing state-of-the-art models fail to generalize well on other (unseen) datasets. We propose a reinforcement learning framework based on a self-critic policy gradient approach which achieves good generalization and state-of-the-art results on a variety of datasets. Through an extensive set of experiments, we also show the ability of our proposed framework to fine-tune the text summarization model using only a few training samples. To the best of our knowledge, this is the first work that studies transfer learning in text summarization and provides a generic solution that works well on unseen data.

【Keywords】:

77. TMSA: A Mutual Learning Model for Topic Discovery and Word Embedding.

【Paper Link】【Pages】:684-692

【Authors】: Dingcheng Li ; Jingyuan Zhang ; Ping Li

【Abstract】: Both topic modeling and word embedding map documents onto a low-dimensional space, with the former clustering words into a global topic space and the latter into a local continuous embedding space. In this study, we propose the TMSA framework to unify these two complementary patterns by the construction of a mutual learning mechanism between word-cooccurrence based topic modeling and autoencoder. In our model, word topics generated with topic modeling are passed into auto-encoder to impose topic sparsity so that auto-encoder can learn topic-relevant word representations. In return, word embedding learned by autoencoder is sent back to topic modeling to improve the quality of topic generations. Empirical studies show the effectiveness of the proposed TMSA model in discovering topics and embedding words.

【Keywords】:

78. Attentional Heterogeneous Graph Neural Network: Application to Program Reidentification.

【Paper Link】【Pages】:693-701

【Authors】: Shen Wang ; Zhengzhang Chen ; Ding Li ; Zhichun Li ; Lu-An Tang ; Jingchao Ni ; Junghwan Rhee ; Haifeng Chen ; Philip S. Yu

【Abstract】: Program or process is an integral part of almost every IT/OT system. Can we trust the identity/ID (e.g., executable name) of the program? To avoid detection, malware may disguise itself using the ID of a legitimate program, and a system tool (e.g., PowerShell) used by the attackers may have the fake ID of another common software, which is less sensitive. However, existing intrusion detection techniques often overlook this critical program reidentification problem (i.e., checking the program's identity). In this paper, we propose an attentional heterogeneous graph neural network model (DeepHGNN) to verify the program's identity based on its system behaviors. The key idea is to leverage the representation learning of the heterogeneous program behavior graph to guide the reidentification process. We formulate the program reidentification as a graph classification problem and develop an effective attentional heterogeneous graph embedding algorithm to solve it. Extensive experiments — using real-world enterprise monitoring data and real attacks — demonstrate the effectiveness of DeepHGNN across multiple popular metrics and the robustness to the normal dynamic changes like program version upgrades.

【Keywords】:

79. GPU Accelerated Sub-Sampled Newton's Method for Convex Classification Problems.

【Paper Link】【Pages】:702-710

【Authors】: Sudhir B. Kylasa ; Fred (Farbod) Roosta ; Michael W. Mahoney ; Ananth Grama

【Abstract】: First order optimization methods, which rely only on gradient information, are commonly used in diverse machine learning (ML) applications, owing to their simplicity of implementations and low per-iteration computational/storage costs. However, they suffer from significant disadvantages; most notably, their performance degrades with increasing problem ill-conditioning. Furthermore, they often involve a large number of hyperparameters, and are notoriously sensitive to parameters such as the step-size. By incorporating additional information from the Hessian, second-order methods, have been shown to be resilient to many such adversarial effects. However, these advantages come at the expense of higher per-iteration costs, which in “big data” regimes, can be computationally prohibitive. In this paper, we show that, contrary to conventional belief, second-order methods, when designed suitably, can be much more efficient than first-order alternatives for large-scale ML applications. In convex settings, we show that variants of classical Newton's method in which the Hessian and/or gradient are randomly subsampled, coupled with efficient GPU implementations, far outperform state of the art implementations of existing techniques in popular ML software packages such as TensorFlow. We show that our proposed methods (i) achieve better generalization errors in significantly lower wall-clock time – orders of magnitude faster, compared to first-order alternatives (in TensorFlow) and, (ii) offers significantly smaller (and easily parameterized) hyperparameter space making our methods highly robust.

【Keywords】:

80. CP Tensor Decomposition with Cannot-Link Intermode Constraints.

【Paper Link】【Pages】:711-719

【Authors】: Jette Henderson ; Bradley A. Malin ; Joshua C. Denny ; Abel N. Kho ; Jimeng Sun ; Joydeep Ghosh ; Joyce C. Ho

【Abstract】: Tensor factorization is a methodology that is applied in a variety of fields, ranging from climate modeling to medical informatics. A tensor is an n-way array that captures the relationship between n objects. These multiway arrays can be factored to study the underlying bases present in the data. Two challenges arising in tensor factorization are 1) the resulting factors can be noisy and highly overlapping with one another and 2) they may not map to insights within a domain. However, incorporating supervision to increase the number of insightful factors can be costly in terms of the time and domain expertise necessary for gathering labels or domain-specific constraints. To meet these challenges, we introduce CANDECOMP/PARAFAC (CP) tensor factorization with Cannot-Link Intermode Constraints (CP-CLIC), a framework that achieves succinct, diverse, interpretable factors. This is accomplished by gradually learning constraints that are verified with auxiliary information during the decomposition process. We demonstrate CP-CLIC's potential to extract sparse, diverse, and interpretable factors through experiments on simulated data and a real-world application in medical informatics.

【Keywords】:

81. Find the dimension that counts: Fast dimension estimation and Krylov PCA.

【Paper Link】【Pages】:720-728

【Authors】: Shashanka Ubaru ; Abd-Krim Seghouane ; Yousef Saad

【Abstract】: High dimensional data and systems with many degrees of freedom are often characterized by covariance matrices. In this paper, we consider the problem of simultaneously estimating the dimension of the principal (dominant) subspace of these covariance matrices and obtaining an approximation to the subspace. This problem arises in the popular principal component analysis (PCA), and in many applications of machine learning, data analysis, signal and image processing, and others. We first present a novel method for estimating the dimension of the principal subspace. We then show how this method can be coupled with a Krylov subspace method to simultaneously estimate the dimension and obtain an approximation to the subspace. The dimension estimation is achieved at no additional cost. The proposed method operates on a model selection framework, where the novel selection criterion is derived based on random matrix perturbation theory ideas. We present theoretical analyses which (a) show that the proposed method achieves strong consistency (i.e., yields optimal solution as the number of data-points n → ∞), and (b) analyze conditions for exact dimension estimation in the finite n case. Using recent results, we show that our algorithm also yields near optimal PCA. The proposed method avoids forming the sample covariance matrix (associated with the data) explicitly and computing the complete eigen-decomposition. Therefore, the method is inexpensive, which is particularly advantageous in modern data applications where the covariance matrices can be very large. Numerical experiments illustrate the performance of the proposed method in various applications.

【Keywords】:

82. Boolean matrix factorization meets consecutive ones property.

【Paper Link】【Pages】:729-737

【Authors】: Nikolaj Tatti ; Pauli Miettinen

【Abstract】: Boolean matrix factorization is a natural and a popular technique for summarizing binary matrices. In this paper, we study a problem of Boolean matrix factorization where we additionally require that the factor matrices have consecutive ones property (OBMF). A major application of this optimization problem comes from graph visualization: standard techniques for visualizing graphs are circular or linear layout, where nodes are ordered in circle or on a line. A common problem with visualizing graphs is clutter due to too many edges. The standard approach to deal with this is to bundle edges together and represent them as ribbon. We also show that we can use OBMF for edge bundling combined with circular or linear layout techniques. We demonstrate that not only this problem is NP-hard but we cannot have a polynomial-time algorithm that yields a multiplicative approximation guarantee (unless P = NP). On the positive side, we develop a greedy algorithm where at each step we look for the best 1-rank factorization. Since even obtaining 1-rank factorization is NP-hard, we propose an iterative algorithm where we fix one side and and find the other, reverse the roles, and repeat. We show that this step can be done in linear time using pq-trees. We also extend the problem to cyclic ones property and symmetric factorizations. Our experiments show that our algorithms find high-quality factorizations and scale well.

【Keywords】:

83. Robust Factorization Machine: A Doubly Capped Norms Minimization.

【Paper Link】【Pages】:738-746

【Authors】: Chenghao Liu ; Teng Zhang ; Jundong Li ; Jianwen Yin ; Peilin Zhao ; Jianling Sun ; Steven C. H. Hoi

【Abstract】: Factorization Machine (FM) is a general supervised learning framework for many AI applications due to its powerful capability of feature engineering. Despite being extensively studied, existing FM methods have several limitations in common. First of all, most existing FM methods often adopt the squared loss in the modeling process, which can be very sensitive when the data for learning contains noises and outliers. Second, some recent FM variants often explore the low-rank structure of the feature interactions matrix by relaxing the low-rank minimization problem as a trace norm minimization, which cannot always achieve a tight approximation to the original one. To address the aforementioned issues, this paper proposes a new scheme of Robust Factorization Machine (RFM) by exploring a doubly capped norms minimization approach, which employs both a capped squared trace norm in achieving a tighter approximation of the rank minimization and a capped ℓ1-norm loss to enhance the robustness of the empirical loss minimization from noisy data. We develop an efficient algorithm with a rigorous convergence proof of RFM. Experiments on public real-world datasets show that our method outperforms the state-of-the-art FM methods significantly.

【Keywords】:

84. Collaborative Filtering with Noisy Ratings.

【Paper Link】【Pages】:747-755

【Authors】: Dongsheng Li ; Chao Chen ; Zhilin Gong ; Tun Lu ; Stephen M. Chu ; Ning Gu

【Abstract】: User ratings on items are noisy in real-world recommender systems, which raises challenges to matrix approximation (MA)-based collaborative filtering (CF) algorithms — the learned models will be easily biased to the noisy training data and yield low generalization performance. This paper proposes a noise-resilient matrix approximation (NORMA) method, which can achieve less biased matrix approximation and thus more accurate collaborative filtering. In NORMA, an adaptive weighting strategy is proposed to decrease the gradient updates of noisy ratings, so that the learned MA models will be less prone to the noisy ratings. Theoretical analyses show that NORMA can achieve better generalization performance than standard matrix approximation methods. Experimental studies on real-world datasets demonstrate that NORMA can outperform state-of-the-art matrix approximation-based collaborative filtering methods in recommendation accuracy.

【Keywords】:

85. A Full Probabilistic Model for Yes/No Type Crowdsourcing in Multi-Class Classification.

【Paper Link】【Pages】:756-764

【Authors】: Belén Saldías-Fuentes ; Pavlos Protopapas ; Karim Pichara B.

【Abstract】: Crowdsourcing has become widely used in supervised scenarios where training sets are scarce and difficult to obtain. Most crowdsourcing models in the literature assume labelers can provide answers to full questions. In classification contexts, full questions require a labeler to discern among all possible classes. Unfortunately, discernment is not always easy in realistic scenarios. Labelers may not be experts in differentiating all classes. In this work, we provide a full probabilistic model for a shorter type of queries. Our shorter queries only require “yes” or “no” responses. Our model estimates a joint posterior distribution of matrices related to labelers' confusions and the posterior probability of the class of every object. We developed an approximate inference approach, using Monte Carlo Sampling and Black Box Variational Inference, which provides the derivation of the necessary gradients. We built two realistic crowdsourcing scenarios to test our model. The first scenario queries for irregular astronomical time-series. The second scenario relies on the image classification of animals. We achieved results that are comparable with those of full query crowdsourcing. Furthermore, we show that modeling labelers' failures plays an important role in estimating true classes. Finally, we provide the community with two real datasets obtained from our crowdsourcing experiments.

【Keywords】:

86. Predicting Multiple Demographic Attributes with Task Specific Embedding Transformation and Attention Network.

【Paper Link】【Pages】:765-773

【Authors】: Raehyun Kim ; Hyunjae Kim ; Janghyuk Lee ; Jaewoo Kang

【Abstract】: Most companies utilize demographic information to develop their strategy in a market. However, such information is not available to most retail companies. Several studies have been conducted to predict the demographic attributes of users from their transaction histories, but they have some limitations. First, they focused on parameter sharing to predict all attributes but capturing task-specific features is also important in multi-task learning. Second, they assumed that all transactions are equally important in predicting demographic attributes. However, some transactions are more useful than others for predicting a certain attribute. Furthermore, decision making process of models cannot be interpreted as they work in a black-box manner. To address the limitations, we propose an Embedding Transformation Network with Attention (ETNA) model which shares representations at the bottom of the model structure and transforms them to task-specific representations using a simple linear transformation method. In addition, we can obtain more informative transactions for predicting certain attributes using the attention mechanism. The experimental results show that our model outperforms the previous models on all tasks. In our qualitative analysis, we show the visualization of attention weights, which provides business managers with some useful insights.

【Keywords】:

87. Semantics-Aware Hidden Markov Model for Human Mobility.

【Paper Link】【Pages】:774-782

【Authors】: Hongzhi Shi ; Hancheng Cao ; Xiangxin Zhou ; Yong Li ; Chao Zhang ; Vassilis Kostakos ; Funing Sun ; Fanchao Meng

【Abstract】: Understanding human mobility benefits numerous applications such as urban planning, traffic control and city management. Previous work mainly focuses on modeling spatial and temporal patterns of human mobility. However, the semantics of trajectory are ignored, thus failing to model people's motivation behind mobility. In this paper, we propose a novel semantics-aware mobility model that captures human mobility motivation using large-scale semantics-rich spatial-temporal data from location-based social networks. In our system, we first develop a multimodal embedding method to project user, location, time, and activity on the same embedding space in an unsupervised way while preserving original trajectory semantics. Then, we use hidden Markov model to learn latent states and transitions between them in the embedding space, which is the location embedding vector, to jointly consider spatial, temporal, and user motivations. In order to tackle the sparsity of individual mobility data, we further propose a von Mises-Fisher mixture clustering for user grouping so as to learn a reliable and fine-grained model for groups of users sharing mobility similarity. We evaluate our proposed method on two large-scale real-world datasets, where we validate the ability of our method to produce high-quality mobility models. We also conduct extensive experiments on the specific task of location prediction. The results show that our model outperforms state-of-the-art mobility models with higher prediction accuracy and much higher efficiency.

【Keywords】:

88. Dissecting the Learning Curve of Taxi Drivers: A Data-Driven Approach.

【Paper Link】【Pages】:783-791

【Authors】: Menghai Pan ; Yanhua Li ; Xun Zhou ; Zhenming Liu ; Rui Song ; Hui Lu ; Jun Luo

【Abstract】: Many real world human behaviors can be modeled and characterized as sequential decision making processes, such as taxi driver's choices of working regions and times. Each driver possesses unique preferences on the sequential choices over time and improves their working efficiency. Understanding the dynamics of such preferences helps accelerate the learning process of taxi drivers. Prior works on taxi operation management mostly focus on finding optimal driving strategies or routes, lacking in-depth analysis on what the drivers learned during the process and how they affect the performance of the driver. In this work, we make the first attempt to inversely learn the taxi drivers' preferences from data and characterize the dynamics of such preferences over time. We extract two types of features, i.e., profile features and habit features, to model the decision space of drivers. Then through inverse reinforcement learning we learn the preferences of drivers with respect to these features. The results illustrate that self-improving drivers tend to keep adjusting their preferences to habit features to increase their earning efficiency, while keeping the preferences to profile features invariant. On the other hand, experienced drivers have stable preferences over time.

【Keywords】:

89. Recommendations for optimizing the collective user experience.

【Paper Link】【Pages】:792-800

【Authors】: Behzad Golshan ; Evimaria Terzi ; Panayiotis Tsaparas

【Abstract】: Traditional recommender systems aim to satisfy individual users by providing them with recommendations that match their preferences. Such recommender systems don't take into consideration how the number of users recommended to use a particular item affects the users' experience. For example, a highly-recommended restaurant may match the preferences of many users. However, increasing its popularity via recommendations may make the experience unsatisfactory due to high volume of customers, long lines and inevitably slow service. In this paper, we develop a new recommendation-system paradigm that we call collective recommendations. Collective recommendations take into consideration not only the user preferences, but also the effect of the popularity of a venue to the overall user experience. We formally define the algorithmic problems motivated by collective recommendations and develop an algorithmic framework for solving them effectively. Our experiments with real data demonstrate the effectiveness of our methods in practice. Nobody goes there any more. It's too crowded — Yogi Berra

【Keywords】:

90. Fairness in representation: quantifying stereotyping as a representational harm.

【Paper Link】【Pages】:801-809

【Authors】: Mohsen Abbasi ; Sorelle A. Friedler ; Carlos Scheidegger ; Suresh Venkatasubramanian

【Abstract】: While harms of allocation have been increasingly studied as part of the subfield of algorithmic fairness, harms of representation have received considerably less attention. In this paper, we formalize two notions of stereotyping and show how they manifest in later allocative harms within the machine learning pipeline. We also propose mitigation strategies and demonstrate their effectiveness on synthetic datasets.

【Keywords】:

SDM 2019:Calgary, Alberta, Canada

Paper Num: 90 || Session Num: 0

1. Incremental Multi-view Support Vector Machine.

2. Deep Multimodality Model for Multi-task Multi-view Learning.

3. Augmented Multi-Task Learning by Optimal Transport.

4. Calibrating Probability Estimation Trees using Venn-Abers Predictors.

5. Deep Multi-view Information Bottleneck.

6. Forest Packing: Fast Parallel, Decision Forests.

7. Fast Unsupervised Location Category Inference from Highly Inaccurate Mobility Data.

8. GeoAttn: Localization of Social Media Messages via Attentional Memory Network.

9. Bus Travel Speed Prediction using Attention Network of Heterogeneous Correlation Features.

10. A Nondeterministic Normalization based Scan Statistic (NN-scan) towards Robust Hotspot Detection: A Summary of Results.

11. A Fast Algorithm for Combinatorial Hotspot Mining Based on Spatial Scan Statistic.

12. A Deep Spatio-Temporal Fuzzy Neural Network for Passenger Demand Prediction.

13. AMAS: Attention Model for Attributed Sequence Classification.

14. MM-Pred: A Deep Predictive Model for Multi-attribute Event Sequence.

15. Constraint Programming for Association Rules.

16. Linear Time Motif Discovery in Time Series.

17. Deep Skip-Gram Networks for Text Classification.

18. Mining Rules Incrementally over Large Knowledge Bases.

19. Optimally rotated coordinate systems for adaptive least-squares regression on sparse grids.

20. ℓ0-Regularized Sparsity for Probabilistic Mixture Models.

21. Intrinsic Dimensionality Estimation within Tight Localities.

22. Identifying When Effect Restoration Will Improve Estimates of Causal Effect.

23. We Are Not Your Real Parents: Telling Causal from Confounded using MDL.

24. Query-Aware Bayesian Committee Machine for Scalable Gaussian Process Regression.

25. A Novel Regularizer for Temporally Stable Learning with an Application to Twitter Topic Classification.

26. Efficient Summarizing of Evolving Events from Twitter Streams.

27. Hierarchical Attention Networks for Cyberbullying Detection on the Instagram Social Network.

28. SPMM: A Soft Piecewise Mapping Model for Bilingual Lexicon Induction.

29. Aspect-Based Sentiment Analysis with Minimal Guidance.

30. Joint Post and Link-level Influence Modeling on Social Media.

31. Positive-Unlabeled Classification under Class Prior Shift and Asymmetric Error.

32. Fast Training for Large-Scale One-versus-All Linear Classifiers using Tree-Structured Initialization.

33. Towards Identifying Causal Relation Between Instances and Labels.

34. Discovering and Controlling for Latent Confounds in Text Classification Using Adversarial Domain Adaptation.

35. A Bayesian Approach for Accurate Classification-Based Aggregates.

36. Decision List Optimization based on Continuous Relaxation.

37. Mining Maximal Induced Bicliques using Odd Cycle Transversals.

38. DSL: Discriminative Subgraph Learning via Sparse Self-Representation.

39. Relationship Profiling over Social Networks: Reverse Smoothness from Similarity to Closeness.

40. Edge Replacement Grammars : A Formal Language Approach for Generating Graphs.

41. Temporal Graph Regression via Structure-Aware Intrinsic Representation Learning.

42. Multi-Dimensional, Multilayer, Nonlinear and Dynamic HITS.

43. Flow-Based Local Graph Clustering with Better Seed Set Inclusion.

44. Spherical Principal Component Analysis.

45. Auto-Summarization: A Step Towards Unsupervised Learning of a Submodular Mixture.

46. Learning Diverse Gaussian Graphical Models and Interpreting Edges.

47. Deep Co-Clustering.

48. Discovering Multiple Co-Clusterings in Subspaces.

49. Accelerated Experimental Design for Pairwise Comparisons.

50. Region-Based Active Learning with Hierarchical and Adaptive Region Construction.

51. Feature selection as Monte-Carlo Search in Growing Single Rooted Directed Acyclic Graph by Best Leaf Identification.

52. Relative distance comparisons with confidence judgements.

53. Doubly Robust Prediction and Evaluation Methods Improve Uplift Modeling for Observational Data.

54. EstImAgg: A Learning Framework for Groupwise Aggregated Data.

55. DTEC: Distance Transformation Based Early Time Series Classification.

56. Leveraging Subsequence-orders for Univariate and Multivariate Time-series Classification.

57. Branch and Border: Partition-Based Change Detection in Multivariate Time Series.

58. Spatial Context-Aware Networks for Mining Temporal Discriminative Period in Land Cover Detection.

59. Elastic bands across the path: A new framework and method to lower bound DTW.

60. Coupled Clustering of Time-Series and Networks.

61. Classifying Heterogeneous Sequential Data by Cyclic Domain Adaptation: An Application in Land Cover Detection.

62. Incomplete Conditional Density Estimation for Fast Materials Discovery.

63. Physics Guided RNNs for Modeling Dynamical Systems: A Case Study in Simulating Lake Temperature Profiles.

64. Reading Customer Reviews to Answer Product-related Questions.

65. An Interpretable Fast Model for Predicting The Risk of Heart Failure.

66. LSCP: Locally Selective Combination in Parallel Outlier Ensembles.

67. Deep Anomaly Detection on Attributed Networks.

68. ABACUS: Unsupervised Multivariate Change Detection via Bayesian Source Separation.

69. Learning On-the-Job to Re-rank Anomalies from Top-1 Feedback.

70. SMF: Drift-Aware Matrix Factorization with Seasonal Patterns.

71. Multi-Stage Variational Auto-Encoders for Coarse-to-Fine Image Generation.

72. Path-based Attribute-aware Representation Learning for Relation Prediction.

73. Spatial Variational Auto-Encoding via Matrix-Variate Normal Distributions.

74. Multi-dimensional Graph Convolutional Networks.

75. Autonomous Deep Learning: Continual Learning Approach for Dynamic Environments.

76. Deep Transfer Reinforcement Learning for Text Summarization.

77. TMSA: A Mutual Learning Model for Topic Discovery and Word Embedding.

78. Attentional Heterogeneous Graph Neural Network: Application to Program Reidentification.