16. ICDM 2016: Barcelona, Spain

IEEE 16th International Conference on Data Mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain. IEEE 【DBLP Link】

Paper Num: 178 || Session Num: 2

Regular Papers 77

1. Auditing Black-Box Models for Indirect Influence.

【Paper Link】 【Pages】:1-10

【Authors】: Philip Adler ; Casey Falk ; Sorelle A. Friedler ; Gabriel Rybeck ; Carlos Scheidegger ; Brandon Smith ; Suresh Venkatasubramanian

【Abstract】: Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior, and in particular how different features influence the model prediction. This is important when interpreting the behavior of complex models, or asserting that certain problematic attributes (like race or gender) are not unduly influencing decisions. In this paper, we present a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the dataset, without knowing how the models work. Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all. Our approach does not require the black-box model to be retrained. This is important if (for example) the model is only accessible via an API, and contrasts our work with other methods that investigate feature influence like feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available datasets and models. We also validate our procedure using techniques from interpretable learning and feature selection, as well as against other black-box auditing procedures.
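A minimal, hypothetical sketch of the query-only auditing pattern the abstract describes: probe the black box by obscuring one feature at a time and measuring the change in accuracy, with no retraining. This toy version captures only direct influence; the paper's contribution is detecting indirect influence through correlated features, which this sketch does not implement. All names (obscure_and_audit, model_predict) are illustrative.

```python
import numpy as np

def obscure_and_audit(model_predict, X, y, feature_idx, n_trials=10, rng=None):
    """Estimate one feature's (direct) influence on a black-box model."""
    rng = rng or np.random.default_rng(0)
    base_acc = np.mean(model_predict(X) == y)
    drops = []
    for _ in range(n_trials):
        X_obscured = X.copy()
        # Replace the audited column with a random permutation of itself,
        # destroying its relationship with the labels while keeping its marginal.
        X_obscured[:, feature_idx] = rng.permutation(X[:, feature_idx])
        drops.append(base_acc - np.mean(model_predict(X_obscured) == y))
    return float(np.mean(drops))  # larger accuracy drop => stronger influence
```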

【Keywords】: Predictive models; Computational modeling; Computer science; Data models; Testing; Standards; Context

2. Asynchronous Multi-task Learning.

【Paper Link】 【Pages】:11-20

【Authors】: Inci M. Baytas ; Ming Yan ; Anil K. Jain ; Jiayu Zhou

【Abstract】: Many real-world machine learning applications involve several learning tasks which are inter-related. For example, in the healthcare domain, we need to learn a predictive model of a certain disease for many hospitals. The models for each hospital may be different because of the inherent differences in the distributions of the patient populations. However, the models are also closely related because of the nature of the learning tasks modeling the same disease. By simultaneously learning all the tasks, the multi-task learning (MTL) paradigm performs inductive knowledge transfer among tasks to improve the generalization performance. When datasets for the learning tasks are stored at different locations, it may not always be feasible to move the data into a centralized computing environment, due to practical issues such as high data volume and privacy. In this paper, we propose a principled MTL framework for distributed and asynchronous optimization to address the aforementioned challenges. In our framework, a gradient update does not wait to collect the gradient information from all the tasks. Therefore, the proposed method is very efficient when the communication delay is too high for some task nodes. We show that many regularized MTL formulations can benefit from this framework, including the low-rank MTL for shared subspace learning. Empirical studies on both synthetic and real-world datasets demonstrate the efficiency and effectiveness of the proposed framework.
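A toy sketch of the asynchronous pattern the abstract describes, assuming a shared parameter vector and one worker thread per task: the server applies each task's gradient as soon as it arrives, with no synchronization barrier. The MTL regularizer, step-size schedule and convergence analysis are in the paper, not in this sketch; all function names here are hypothetical.

```python
import queue
import threading
import numpy as np

def server_loop(w, grad_queue, lr=0.1, n_updates=100):
    for _ in range(n_updates):
        task_id, grad = grad_queue.get()  # blocks only until *one* gradient arrives
        w -= lr * grad                    # apply immediately; no barrier across tasks
    return w

def worker(task_id, w, X, y, grad_queue, n_steps=25):
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / len(y)  # toy least-squares task gradient
        grad_queue.put((task_id, grad))    # may be computed on a slightly stale w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, q = np.zeros(5), queue.Queue()
    tasks = [(rng.normal(size=(50, 5)), rng.normal(size=50)) for _ in range(4)]
    threads = [threading.Thread(target=worker, args=(i, w, X, y, q))
               for i, (X, y) in enumerate(tasks)]
    for t in threads:
        t.start()
    server_loop(w, q)  # consumes 4 tasks x 25 steps = 100 updates
    for t in threads:
        t.join()
```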

【Keywords】: Optimization; Hospitals; Data models; Predictive models; Servers; Sociology; Statistics

3. Unsupervised Exceptional Attributed Sub-Graph Mining in Urban Data.

【Paper Link】 【Pages】:21-30

【Authors】: Ahmed Anes Bendimerad ; Marc Plantevit ; Céline Robardet

【Abstract】: Geo-located social media provide a wealth of information that describes urban areas based on user descriptions and comments. Such data make it possible to identify meaningful city neighborhoods on the basis of the footprints left by a large and diverse population that uses this type of media. In this paper, we present methods to exhibit the predominant activities and their associated urban areas in order to automatically describe a whole city. Based on a suitable attributed graph model, our approach identifies neighborhoods with homogeneous and exceptional characteristics. We introduce the novel problem of exceptional sub-graph mining in attributed graphs and propose a complete algorithm that benefits from new upper bounds and pruning properties. We also propose an approach to sample the space of exceptional sub-graphs within a given time budget. Experiments performed on 10 real datasets are reported and demonstrate the relevance and the limits of both approaches.

【Keywords】: Urban areas; Upper bound; Algorithm design and analysis; Space exploration; Data mining; Social network services; Weight measurement

4. ADAGIO: Fast Data-Aware Near-Isometric Linear Embeddings.

【Paper Link】 【Pages】:31-40

【Authors】: Jaroslaw Blasiok ; Charalampos E. Tsourakakis

【Abstract】: Many important applications, including signal reconstruction, parameter estimation, and signal processing in a compressed domain, rely on a low-dimensional representation of the dataset that preserves all pairwise distances between the data points and leverages the inherent geometric structure that is typically present. Recently, Hegde, Sankaranarayanan, Yin and Baraniuk [19] proposed the first data-aware near-isometric linear embedding which achieves the best of both worlds. However, their method NuMax does not scale to large-scale datasets. Our main contribution is a simple, data-aware, near-isometric linear dimensionality reduction method which significantly outperforms a state-of-the-art method [19] with respect to scalability while achieving high quality near-isometries. Furthermore, our method comes with strong worst-case theoretical guarantees that allow us to guarantee the quality of the obtained near-isometry. We verify experimentally the efficiency of our method on numerous real-world datasets, where we find that our method (<10 secs) is more than 3000× faster than the state-of-the-art method [19] (>9 hours) on medium scale datasets with 60000 datapoints in 784 dimensions. Finally, we use our method as a preprocessing step to increase the computational efficiency of a classification application and for speeding up approximate nearest neighbor queries.
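For reference, the near-isometry property these embeddings target is the standard one: a linear map Φ is an ε-near-isometry on a dataset X when it preserves every pairwise distance up to a (1 ± ε) factor,

$$ (1-\epsilon)\,\lVert x-y\rVert_2^2 \;\le\; \lVert \Phi x - \Phi y\rVert_2^2 \;\le\; (1+\epsilon)\,\lVert x-y\rVert_2^2 \qquad \forall\, x, y \in X. $$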

【Keywords】: Principal component analysis; Nonlinear distortion; Algorithm design and analysis; Signal processing algorithms; Parameter estimation

5. Causal Inference by Compression.

【Paper Link】 【Pages】:41-50

【Authors】: Kailash Budhathoki ; Jilles Vreeken

【Abstract】: Causal inference is one of the fundamental problems in science. In recent years, several methods have been proposed for discovering causal structure from observational data. These methods, however, focus specifically on numeric data, and are not applicable to nominal or binary data. In this work, we focus on causal inference for binary data. Simply put, we propose causal inference by compression. To this end we propose an inference framework based on solid information theoretic foundations, i.e. Kolmogorov complexity. However, Kolmogorov complexity is not computable, and hence we propose a practical and computable instantiation based on the Minimum Description Length (MDL) principle. To apply the framework in practice, we propose ORIGO, an efficient method for inferring the causal direction from binary data. ORIGO employs the lossless PACK compressor, works directly on the data, and requires no assumptions about either the distributions or the type of causal relations. Extensive evaluation on synthetic, benchmark, and real-world data shows that ORIGO discovers meaningful causal relations, and outperforms state-of-the-art methods by a wide margin.
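A hedged sketch of the MDL-style decision rule behind compression-based causal inference: infer X → Y if describing X first and then Y given X is cheaper than the reverse. ORIGO uses the lossless PACK compressor with a proper conditional encoding; here zlib is a stand-in and the conditional cost is approximated by compressing a concatenation, which is an assumption of this sketch, not the paper's encoding.

```python
import zlib

def L(b: bytes) -> int:
    """Compressed length as a crude computable proxy for Kolmogorov complexity."""
    return len(zlib.compress(b, 9))

def infer_direction(x: bytes, y: bytes) -> str:
    # Approximate L(Y|X) by L(X+Y) - L(X); with this concatenation trick the
    # rule reduces to comparing L(x+y) with L(y+x), a weak but honest heuristic.
    x_to_y = L(x) + (L(x + y) - L(x))  # ~ L(X) + L(Y|X)
    y_to_x = L(y) + (L(y + x) - L(y))  # ~ L(Y) + L(X|Y)
    if x_to_y < y_to_x:
        return "X -> Y"
    if y_to_x < x_to_y:
        return "Y -> X"
    return "undecided"
```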

【Keywords】: Complexity theory; Inference algorithms; Information theory; Benchmark testing; Drugs; Decision trees; Informatics

6. On Dense Subgraphs in Signed Network Streams.

【Paper Link】 【Pages】:51-60

【Authors】: Jose Cadena ; Anil Kumar S. Vullikanti ; Charu C. Aggarwal

【Abstract】: Signed networks remain relatively underexplored despite the fact that many real networks are of this kind. Here, we study the problem of subgraph density in signed networks and show connections to the event detection task. Notions of density have been used in prior studies on anomaly detection, but all existing methods have been developed for unsigned networks. We develop the first algorithms for finding dense subgraphs in signed networks using semi-definite programming based rounding. We give rigorous guarantees for our algorithms, and develop a heuristic EGOSCAN which is significantly faster. We evaluate the performance of EGOSCAN for different notions of density, and observe that it performs significantly better than natural adaptations of prior algorithms for unsigned networks. In particular, the improvement in edge density over previous methods is as much as 85% and usually over 50%. These results are consistent across signed and unsigned networks in different domains. The improvement in performance is even more significant for a constrained version of the problem involving finding subgraphs containing a subset of query nodes. We also develop an event detection method for signed and unsigned networks based on subgraph density. We apply this to three different temporal datasets, and show that our method based on EGOSCAN significantly outperforms existing approaches and baseline methods in terms of the precision-recall tradeoff (by as much as 25-50% in some instances).

【Keywords】: Event detection; Image edge detection; Approximation algorithms; Programming; Social network services; Quadratic programming; Conferences

7. Relief of Spatiotemporal Accessibility Overloading with Optimal Resource Placement.

【Paper Link】 【Pages】:61-70

【Authors】: Chien-Wei Chang ; Hao-Yi Chih ; Dean Chou ; Yu-Chen Shu ; Kun-Ta Chuang

【Abstract】: With the effects of global warming, mosquito-borne epidemic diseases such as dengue fever and the Zika virus have become more serious. It is reported that such epidemic diseases may pose many challenges to hospital management due to unexpected bursts with uncertain causes. Furthermore, imperfect care during the propagation of epidemic diseases such as dengue fever (for which an appropriate treatment is still not well established) may lead to an increasing mortality rate, which should be avoided. In this paper, a novel paradigm for optimizing the placement of medical resources is proposed, in pursuit of reducing overloading cases in hospitals during an epidemic outbreak in an urban area. This is the first paper to explore two important issues: the strategy to evaluate service quality, and the solution to dynamically dispatch medical resources along with the spatial variation of an epidemic outbreak. As validated by experimental results on real data from the 2015 dengue outbreak in Tainan, we demonstrate the feasibility of our framework to deploy a dynamic placement strategy for medical resource assignment.

【Keywords】: Hospitals; Resource management; Diseases; Urban areas; Graphical models; Distribution functions; Data mining

8. Mining Graphlet Counts in Online Social Networks.

【Paper Link】 【Pages】:71-80

【Authors】: Xiaowei Chen ; John C. S. Lui

【Abstract】: Counting subgraphs is a fundamental analysis task for online social networks (OSNs). Given the sheer size and restricted access of online social network data, efficient computation of subgraph counts is highly challenging. Although a number of algorithms have been proposed to estimate the relative counts of subgraphs in OSNs with restricted access, there are only a few works which try to solve a more general problem, i.e., counting subgraph frequencies. In this paper, we propose an efficient random walk-based framework to estimate the subgraph counts. Our framework generates samples by leveraging consecutive steps of the random walk as well as by observing neighbors of visited nodes. Using the importance sampling technique, we derive unbiased estimators of the subgraph counts. To make better use of the degree information of visited nodes, we also design an improved estimator, which increases the efficiency of the estimate at no additional cost. We conduct extensive experimental evaluation on real-world OSNs to confirm our theoretical claims. The experimental results show that our estimators are unbiased, accurate, efficient and better than the state-of-the-art algorithm. For the Weibo graph with more than 58 million nodes, our method produces an estimate of the triangle count with an error of less than 5% using only 20 thousand sampled nodes. Detailed comparison with the state-of-the-art method demonstrates that our algorithm is 4 to 5 times more accurate.
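The unbiasedness claim rests on the standard importance-sampling identity: if each subgraph s is sampled with a known, strictly positive probability q(s), then the reweighted estimator over m samples

$$ \hat{N} = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{q(s_i)}, \qquad \mathbb{E}[\hat{N}] = \sum_{s} q(s)\,\frac{1}{q(s)} = N $$

recovers the true subgraph count N in expectation. The paper's contribution is deriving q(s) for samples produced from consecutive random-walk steps and observed neighbors; that derivation is not reproduced here.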

【Keywords】: Markov processes; Algorithm design and analysis; Monte Carlo methods; Data mining; Computational modeling; Twitter

9. Differentially Private Regression Diagnostics.

【Paper Link】 【Pages】:81-90

【Authors】: Yan Chen ; Ashwin Machanavajjhala ; Jerome P. Reiter ; Andrés F. Barrientos

【Abstract】: Linear and logistic regression are popular statistical techniques for analyzing multi-variate data. Typically, analysts do not simply posit a particular form of the regression model, estimate its parameters, and use the results for inference or prediction. Instead, they first use a variety of diagnostic techniques to assess how well the model fits the relationships in the data and how well it can be expected to predict outcomes for out-of-sample records, revising the model as necessary to improve fit and predictive power. In this article, we develop ε-differentially private diagnostics for regression, beginning to fill a gap in privacy-preserving data analysis. Specifically, we create differentially private versions of residual plots for linear regression and of receiver operating characteristic (ROC) curves for logistic regression. The former helps determine whether or not the data satisfy the assumptions underlying the linear regression model, and the latter is used to assess the predictive power of the logistic regression model. These diagnostics improve the usefulness of algorithms for computing differentially private regression output, which alone does not allow analysts to assess the quality of the posited model. Our empirical studies show that these algorithms are adequate for diagnosing the fit and predictive power of regression models on representative datasets when the size of the dataset times the privacy parameter (ε) is at least 1000.
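A generic Laplace-mechanism sketch (not the authors' exact construction) of the basic primitive behind a private residual plot: bin the residuals and release noisy counts. Under add/remove adjacency each user affects one bin count by at most 1, so Laplace(1/ε) noise per bin gives an ε-differentially private histogram.

```python
import numpy as np

def dp_residual_histogram(residuals, bins, epsilon, rng=None):
    """Release binned residual counts under epsilon-differential privacy."""
    rng = rng or np.random.default_rng(0)
    counts, edges = np.histogram(residuals, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.maximum(noisy, 0), edges  # clamp negatives for plotting only
```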

【Keywords】: Privacy; Algorithm design and analysis; Computational modeling; Predictive models; Analytical models; Data models; Data privacy

10. Vote-and-Comment: Modeling the Coevolution of User Interactions in Social Voting Web Sites.

【Paper Link】 【Pages】:91-100

【Authors】: Alceu Ferraz Costa ; Agma Juci Machado Traina ; Caetano Traina Jr. ; Christos Faloutsos

【Abstract】: In social voting Web sites, how do the user actions - up-votes, down-votes and comments - evolve over time? Are there relationships between votes and comments? What is normal and what is suspicious? These are the questions we focus on. We analyzed over 20,000 submissions corresponding to more than 100 million user interactions from three social voting Web sites: Reddit, Imgur and Digg. Our first contribution is two discoveries: (i) the number of comments grows as a power-law on the number of votes and (ii) the time between a submission creation and a user's reaction obeys a log-logistic distribution. Based on these patterns, we propose VnC (Vote-and-Comment), a parsimonious but accurate and scalable model that captures the coevolution of user activities. In our experiments on real data, VnC outperformed state-of-the-art baselines on accuracy. Additionally, we illustrate VnC's usefulness for forecasting and outlier detection.
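A quick way to check the reported comments-versus-votes pattern on your own data: a power law comments ≈ a · votes^b is a straight line in log-log space, so it can be fit with a degree-1 polynomial on the logs. This is only an illustration of the discovered pattern; the VnC model itself is richer.

```python
import numpy as np

def fit_power_law(votes, comments):
    votes, comments = np.asarray(votes, float), np.asarray(comments, float)
    mask = (votes > 0) & (comments > 0)        # logs need positive values
    b, log_a = np.polyfit(np.log(votes[mask]), np.log(comments[mask]), 1)
    return np.exp(log_a), b                    # comments ~ a * votes**b
```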

【Keywords】: Web sites; Social network services; Data models; Computational modeling; Trajectory; Forecasting; Blogs

11. On Efficient External-Memory Triangle Listing.

【Paper Link】 【Pages】:101-110

【Authors】: Yi Cui ; Di Xiao ; Dmitri Loguinov

【Abstract】: Discovering triangles in large graphs is a well-studied area; however, both the external-memory performance of existing methods and our understanding of the complexity involved leave much room for improvement. To shed light on this problem, we first generalize the existing in-memory algorithms into a single framework of 18 triangle-search techniques. We then develop a novel external-memory approach, which we call Pruned Companion Files (PCF), that supports operation of all 18 algorithms, while significantly reducing I/O compared to the common methods in this area. After finding the best node-traversal order, we build an implementation around it using SIMD instructions for list intersection and PCF for I/O. This method runs 5-10 times faster than the best available implementation and exhibits orders of magnitude less I/O. In one of our graphs, the program finds 1 trillion triangles in 237 seconds using a desktop CPU.

【Keywords】: Nickel; Complexity theory; Partitioning algorithms; Redundancy; Data mining; Random access memory; Taxonomy

12. Efficient Distributed SGD with Variance Reduction.

【Paper Link】 【Pages】:111-120

【Authors】: Soham De ; Tom Goldstein

【Abstract】: Stochastic Gradient Descent (SGD) has become one of the most popular optimization methods for training machine learning models on massive datasets. However, SGD suffers from two main drawbacks: (i) The noisy gradient updates have high variance, which slows down convergence as the iterates approach the optimum, and (ii) SGD scales poorly in distributed settings, typically experiencing rapidly decreasing marginal benefits as the number of workers increases. In this paper, we propose a highly parallel method, CentralVR, that uses error corrections to reduce the variance of SGD gradient updates, and scales linearly with the number of worker nodes. CentralVR enjoys low iteration complexity, provably linear convergence rates, and exhibits linear performance gains up to hundreds of cores for massive datasets. We compare CentralVR to state-of-the-art parallel stochastic optimization methods on a variety of models and datasets, and find that our proposed methods exhibit stronger scaling than other SGD variants.
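For context, the generic variance-reduction idea this family of methods builds on (SVRG-style error correction): correct each stochastic gradient with the difference between the same example's gradient at an anchor point and the full gradient there. This single-node sketch shows only that mechanism; CentralVR's distributed, parameter-server mechanics are in the paper.

```python
import numpy as np

def svrg_least_squares(X, y, lr=0.1, epochs=20, rng=None):
    """Variance-reduced SGD on a least-squares toy problem."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_anchor = w.copy()
        mu = X.T @ (X @ w_anchor - y) / n              # full gradient at anchor
        for i in rng.permutation(n):
            gi = X[i] * (X[i] @ w - y[i])              # stochastic gradient at w
            gi_anchor = X[i] * (X[i] @ w_anchor - y[i])  # same example at anchor
            w -= lr * (gi - gi_anchor + mu)            # variance-reduced step
    return w
```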

【Keywords】: Convergence; Stochastic processes; Error correction; Servers; Indexes; Computational modeling; Approximation algorithms

13. Triply Stochastic Variational Inference for Non-linear Beta Process Factor Analysis.

【Paper Link】 【Pages】:121-130

【Authors】: Kai Fan ; Yizhe Zhang ; Ricardo Henao ; Katherine A. Heller

【Abstract】: We propose a non-linear extension to factor analysis with beta process priors for improved data representation ability. This non-linear Beta Process Factor Analysis (nBPFA) allows data to be represented as a non-linear transformation of a standard sparse factor decomposition. We develop a scalable variational inference framework, which builds upon the ideas of the variational auto-encoder, by allowing latent variables of the model to be sparse. Our framework can be readily used for real-valued, binary and count data. We show theoretically and with experiments that our training scheme, with additive or multiplicative noise on observations, improves performance and prevents overfitting. We benchmark our algorithms on image, text and collaborative filtering datasets. We demonstrate faster convergence rates and competitive performance compared to standard gradient-based approaches.

【Keywords】: Stochastic processes; Standards; Analytical models; Gaussian distribution; Decoding; Convergence; Computational modeling

14. Beyond Points and Paths: Counting Private Bodies.

【Paper Link】 【Pages】:131-140

【Authors】: Maryam Fanaeepour ; Benjamin I. P. Rubinstein

【Abstract】: Mining of spatial data is an enabling technology for mobile services, Internet-connected cars, and the Internet of Things. But the very distinctiveness of spatial data that drives utility, comes at the cost of user privacy. In this work, we continue the tradition of privacy-preserving spatial analytics, focusing not on point or path data, but on planar spatial regions. Such data represents the area of a user's most frequent visitation, such as "around home and nearby shops". Specifically we consider the differentially-private release of data structures that support range queries for counting users' spatial regions. Counting planar regions leads to unique challenges not faced in existing work. A user's spatial region that straddles multiple data structure cells can lead to duplicate counting at query time. We provably avoid this pitfall by leveraging the Euler characteristic. To address the increased sensitivity of range queries to spatial region data, we calibrate privacy-preserving noise using bounded user region size and a constrained inference that uses robust least absolute deviations. Our novel constrained inference reduces noise and introduces covertness by (privately) imposing consistency. We provide a full end-to-end theoretical analysis of both differential privacy and high-probability utility for our approach using concentration bounds. A comprehensive experimental study on several real-world datasets establishes practical validity.
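A schematic reading of the Euler-characteristic trick (illustrative notation, not the paper's exact construction): a simply connected planar body R has χ(R) = V − E + F = 1 for any cellular decomposition, so a range query Q can tally each region exactly once, no matter how many grid cells it straddles, by combining signed coverage counts over vertices, edges and faces:

$$ \chi(R) = V - E + F = 1 \quad\Rightarrow\quad \mathrm{count}(Q) = \sum_{v \in Q} n(v) \;-\; \sum_{e \in Q} n(e) \;+\; \sum_{f \in Q} n(f), $$

where n(·) counts how many user regions cover each vertex v, edge e and face f of the grid inside Q.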

【Keywords】: Histograms; Privacy; Data structures; Data privacy; Spatial databases; Face; Trajectory

15. Efficient Rectangular Maximal-Volume Algorithm for Rating Elicitation in Collaborative Filtering.

【Paper Link】 【Pages】:141-150

【Authors】: Alexander Fonarev ; Alexander Mikhalev ; Pavel Serdyukov ; Gleb Gusev ; Ivan V. Oseledets

【Abstract】: The cold start problem in Collaborative Filtering can be solved by asking new users to rate a small seed set of representative items or by asking representative users to rate a new item. The question is how to build a seed set that can give enough preference information for making good recommendations. One of the most successful approaches, called Representative Based Matrix Factorization, is based on the Maxvol algorithm. Unfortunately, this approach has one important limitation: a seed set of a particular size requires a rating matrix factorization of fixed rank that coincides with that size. This is not necessarily optimal in the general case. In this paper, we introduce a fast algorithm for an analytical generalization of this approach that we call Rectangular Maxvol. It allows the rank of the factorization to be lower than the required size of the seed set. Moreover, the paper includes a theoretical analysis of the method's error, a complexity analysis of the existing methods, and a comparison to state-of-the-art approaches.

【Keywords】: Optimization; Prediction algorithms; Collaboration; Filtering; Algorithm design and analysis; Matrix decomposition; Mathematical model

16. New Robust Clustering Model for Identifying Cancer Genome Landscapes.

【Paper Link】 【Pages】:151-160

【Authors】: Hongchang Gao ; Xiaoqian Wang ; Heng Huang

【Abstract】: In recent decades, the availability of comprehensive genomic data has facilitated insight into the molecular portraits of cancer. Specifically, by conducting cancer clustering, cancer samples can be divided into several groups according to their differences and similarities in molecular characteristics. Traditional cancer clustering usually analyzes cancer samples from a single tissue, but such analysis cannot reveal the connections among different types of cancer. Landscape analysis across human cancers can help discover molecular signatures shared across cancer tissues, providing an opportunity to design new gene therapy tailored for different cancer patients. However, the noise level in genomic data is high, so a robust clustering method is crucial to tackle this problem. In this paper, we propose a new robust clustering method that approaches the landscape analysis of TCGA cancer data from a novel view: eliminate the noise and then perform clustering on the cleaned data, rather than merely weakening the effect of noise as existing noise-resistant norm-based methods do. Extensive experiments on both genomic datasets and clustering benchmark datasets confirm the effectiveness and correctness of our proposed method.

【Keywords】: Cancer; Robustness; Genomics; Bioinformatics; Clustering methods; Matrix decomposition; Tumors

17. Event Series Prediction via Non-Homogeneous Poisson Process Modelling.

【Paper Link】 【Pages】:161-170

【Authors】: James Goulding ; Simon Preston ; Gavin Smith

【Abstract】: Data streams whose events occur at random arrival times rather than at the regular, tick-tock intervals of traditional time series are increasingly prevalent. Event series are continuous, irregular and often highly sparse, differing greatly in nature from the regularly sampled time series traditionally the concern of the hard sciences. As mass sets of such data have become more common, interest in predicting future events in them has grown. Yet repurposing of traditional forecasting approaches has proven ineffective, in part due to issues such as sparsity, but often due to inapplicable underpinning assumptions such as stationarity and ergodicity. In this paper we derive a principled new approach to forecasting event series that avoids such assumptions, based upon: 1. the processing of event series datasets to produce a parameterized mixture model of non-homogeneous Poisson processes, and 2. the application of a technique called parallel forecasting that uses these processes' rate functions to directly generate accurate temporal predictions for new query realizations. This approach uses forerunners of a stochastic process to shed light on the distribution of future events, not for themselves, but for realizations that subsequently follow in their footsteps.
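For intuition about the building block, here is the standard thinning (Lewis-Shedler) procedure for simulating a non-homogeneous Poisson process with rate λ(t) ≤ lam_max on [0, T], the kind of rate function the paper's mixture components provide. This is the textbook algorithm, not the authors' forecasting procedure.

```python
import numpy as np

def simulate_nhpp(rate_fn, lam_max, T, rng=None):
    """Thinning: simulate event times of an NHPP with rate rate_fn <= lam_max."""
    rng = rng or np.random.default_rng(0)
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)  # candidate from homogeneous process
        if t > T:
            return np.array(events)
        if rng.uniform() < rate_fn(t) / lam_max:
            events.append(t)                 # accept with probability lambda(t)/lam_max
```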

【Keywords】: Forecasting; Predictive models; Time series analysis; Markov processes; Mixture models; Windows

18. Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study.

【Paper Link】 【Pages】:171-180

【Authors】: Suyog Gupta ; Wei Zhang ; Fei Wang

【Abstract】: Deep learning with a large number of parameters requires distributed training, where model accuracy and runtime are two important factors to be considered. However, there has been no systematic study of the tradeoff between these two factors during the model training process. This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, mini-batch size, learning rates, and the number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
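One concrete reading of that recommendation (my notation, an assumption rather than the paper's exact rule): hold the aggregate mini-batch fixed at B, so with λ learners the per-learner batch is

$$ b_\lambda = B/\lambda \quad\Rightarrow\quad \lambda \cdot b_\lambda = B \text{ constant;} $$

for example, scaling from λ = 4 to λ = 16 learners cuts each learner's mini-batch by 4×.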

【Keywords】: Servers; Training; Neural networks; Protocols; Computational modeling; Synchronization; Mathematical model

19. Waddling Random Walk: Fast and Accurate Mining of Motif Statistics in Large Graphs.

【Paper Link】 【Pages】:181-190

【Authors】: Guyue Han ; Harish Sethu

【Abstract】: Algorithms for mining very large graphs, such as those representing online social networks, to discover the relative frequency of small subgraphs within them are of high interest to sociologists, computer scientists and marketeers alike. However, the computation of these network motif statistics via naive enumeration is infeasible for either its prohibitive computational costs or access restrictions on the full graph data. Methods to estimate the motif statistics based on random walks by sampling only a small fraction of the subgraphs in the large graph address both of these challenges. In this paper, we present a new algorithm, called the Waddling Random Walk (WRW), which estimates the concentration of motifs of any size. It derives its name from the fact that it sways a little to the left and to the right, thus also sampling nodes not directly on the path of the random walk. The WRW algorithm achieves its computational efficiency by not trying to enumerate subgraphs around the random walk but instead using a randomized protocol to sample subgraphs in the neighborhood of the nodes visited by the walk. In addition, WRW achieves significantly higher accuracy (measured by the closeness of its estimate to the correct value) and higher precision (measured by the low variance in its estimations) than the current state-of-the-art algorithms for mining subgraph statistics. We illustrate these advantages in speed, accuracy and precision using simulations on well-known and widely used graph datasets representing real networks.

【Keywords】: Data mining; Social network services; Protocols; Estimation; Probability; Computers; Microstructure

20. Fusing Similarity Models with Markov Chains for Sparse Sequential Recommendation.

【Paper Link】 【Pages】:191-200

【Authors】: Ruining He ; Julian McAuley

【Abstract】: Predicting personalized sequential behavior is a key task for recommender systems. In order to predict user actions such as the next product to purchase, movie to watch, or place to visit, it is essential to take into account both long-term user preferences and sequential patterns (i.e., short-term dynamics). Matrix Factorization and Markov Chain methods have emerged as two separate but powerful paradigms for modeling the two respectively. Combining these ideas has led to unified methods that accommodate long- and short-term dynamics simultaneously by modeling pairwise user-item and item-item interactions. In spite of the success of such methods for tackling dense data, they are challenged by sparsity issues, which are prevalent in real-world datasets. In recent years, similarity-based methods have been proposed for (sequentially-unaware) item recommendation with promising results on sparse datasets. In this paper, we propose to fuse such methods with Markov Chains to make personalized sequential recommendations. We evaluate our method, Fossil, on a variety of large, real-world datasets. We show quantitatively that Fossil outperforms alternative algorithms, especially on sparse datasets, and qualitatively that it captures personalized dynamics and is able to make meaningful recommendations.

【Keywords】: Markov processes; Predictive models; Sparse matrices; Motion pictures; Heuristic algorithms; Prediction algorithms; Portable computers

21. Adaptive Neighborhood Propagation by Joint L2,1-Norm Regularized Sparse Coding for Representation and Classification.

【Paper Link】 【Pages】:201-210

【Authors】: Lei Jia ; Zhao Zhang ; Lei Wang ; Weiming Jiang ; Ming-Bo Zhao

【Abstract】: We propose a new transductive label propagation method, termed Adaptive Neighborhood Propagation (Adaptive-NP) by joint L2,1-norm regularized sparse coding, for semi-supervised classification. To make the predicted soft labels more accurate for predicting the labels of samples, and to avoid the tricky process of choosing the optimal neighborhood size or kernel width for graph construction, Adaptive-NP seamlessly integrates sparse coding and neighborhood propagation into a unified framework. That is, the sparse reconstruction error and classification error are combined for joint minimization, which clearly differs from traditional methods that explicitly separate graph construction and label propagation into independent steps and may therefore produce inaccurate predictions. Note that Adaptive-NP alternately optimizes the sparse code and soft label matrices, where the sparse codes are used as adaptive weights for neighborhood propagation at each iteration, so the tricky process of determining the neighborhood size or kernel width is avoided. Besides, to enhance sparse coding, we impose the L2,1-norm constraint on both the sparse coding coefficients and the reconstruction error, delivering more accurate and robust representations. Extensive simulations show that our model can deliver state-of-the-art performance on several public datasets for classification.

【Keywords】: Encoding; Sparse matrices; Manifolds; Kernel; Minimization; Robustness; Weight measurement

22. From Sets of Good Redescriptions to Good Sets of Redescriptions.

【Paper Link】 【Pages】:211-220

【Authors】: Janis Kalofolias ; Esther Galbrun ; Pauli Miettinen

【Abstract】: Redescription mining aims at finding pairs of queries over data variables that describe roughly the same set of observations. These redescriptions can be used to obtain different views on the same set of entities. So far, redescription mining methods have aimed at listing all redescriptions supported by the data. Such an approach can result in many redundant redescriptions and hinder the user's ability to understand the overall characteristics of the data. In this work, we present an approach to find a good set of redescriptions, instead of finding a set of good redescriptions. That is, we present a way to remove the redundant redescriptions from a given set of redescriptions. We measure the redundancy using a framework inspired by the subjective interestingness based on maximum-entropy distributions as proposed by De Bie in 2011. Redescriptions, however, raise their unique requirements on the framework, and our solution differs significantly from the existing ones. Notably, our approach can handle disjunctions and conjunctions in the queries, whereas the existing approaches are limited only to conjunctive queries. The framework also reduces the redundancy in the redescription mining results, as we show in our empirical evaluation.

【Keywords】: Data mining; Entropy; Computational modeling; Biological system modeling; Data models; Informatics; Optimization

23. Edge Weight Prediction in Weighted Signed Networks.

【Paper Link】 【Pages】:221-230

【Authors】: Srijan Kumar ; Francesca Spezzano ; V. S. Subrahmanian ; Christos Faloutsos

【Abstract】: Weighted signed networks (WSNs) are networks in which edges are labeled with positive and negative weights. WSNs can capture like/dislike, trust/distrust, and other social relationships between people. In this paper, we consider the problem of predicting the weights of edges in such networks. We propose two novel measures of node behavior: the goodness of a node intuitively captures how much this node is liked/trusted by other nodes, while the fairness of a node captures how fair the node is in rating other nodes' likeability or trust level. We provide axioms that these two notions need to satisfy and show that past work does not meet these requirements for WSNs. We provide a mutually recursive definition of these two concepts and prove that they converge to a unique solution in linear time. We use the two measures to predict the edge weight in WSNs. Furthermore, we show that when compared against several individual algorithms from both the signed and unsigned social network literature, our fairness and goodness metrics almost always have the best predictive power. We then use these as features in different multiple regression models and show that we can predict edge weights on 2 Bitcoin WSNs, an Epinions WSN, 2 WSNs derived from Wikipedia, and a WSN derived from Twitter with more accurate results than past work. Moreover, fairness and goodness metrics form the most significant feature for prediction in most (but not all) cases.
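A sketch of the mutually recursive fairness/goodness iteration the abstract describes, assuming edge weights have been rescaled to [-1, 1]. The exact normalizations below follow common statements of the method and should be checked against the paper before use.

```python
def fairness_goodness(edges, nodes, iters=50):
    """edges: dict mapping (u, v) -> weight in [-1, 1]; nodes: iterable of ids."""
    fairness = {u: 1.0 for u in nodes}   # how fairly u rates others
    goodness = {u: 0.0 for u in nodes}   # how well u is rated
    out = {u: [(v, w) for (a, v), w in edges.items() if a == u] for u in nodes}
    inc = {v: [(u, w) for (u, b), w in edges.items() if b == v] for v in nodes}
    for _ in range(iters):
        for v in nodes:  # goodness: fairness-weighted average of incoming ratings
            if inc[v]:
                goodness[v] = sum(fairness[u] * w for u, w in inc[v]) / len(inc[v])
        for u in nodes:  # fairness: 1 minus u's average rating error, scaled to [0, 1]
            if out[u]:
                err = sum(abs(w - goodness[v]) for v, w in out[u]) / len(out[u])
                fairness[u] = 1.0 - err / 2.0
    return fairness, goodness
```

The two scores can then be fed as features into a regression model to predict unseen edge weights, as the abstract describes.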

【Keywords】: Wireless sensor networks; Prediction algorithms; Social network services; Online banking; Correlation; Weight measurement

24. Transfer Learning for Survival Analysis via Efficient L2,1-Norm Regularized Cox Regression.

【Paper Link】 【Pages】:231-240

【Authors】: Yan Li ; Lu Wang ; Jie Wang ; Jieping Ye ; Chandan K. Reddy

【Abstract】: In survival analysis, the primary goal is to monitor several entities and model the occurrence of a particular event of interest. In such applications, it is quite often the case that the event of interest may not always be observed during the study period and this gives rise to the problem of censoring which cannot be easily handled in the standard regression approaches. In addition, obtaining sufficient labeled training instances for learning a robust prediction model is a very time consuming process and can be extremely difficult in practice. In this paper, we propose a transfer learning based Cox method, called Transfer-Cox, which uses auxiliary data to augment learning when there are insufficient amount of training examples. The proposed method aims to extract "useful" knowledge from the source domain and transfer it to the target domain, thus potentially improving the prediction performance in such time-to-event data. The proposed method uses the l2,1-norm penalty to encourage multiple predictors to share similar sparsity patterns, thus learns a shared representation across source and target domains, potentially improving the model performance on the target task. To speedup the computation, we apply the screening approach and extend the strong rule to sparse survival analysis models in multiple high-dimensional censored datasets. We demonstrate the performance of the proposed transfer learning method using several synthetic and high-dimensional microarray gene expression benchmark datasets and compare with other related competing state-of-the-art methods. Our results show that the proposed screening approach significantly improves the computational efficiency of the proposed algorithm without compromising the prediction performance. We also demonstrate the scalability of the proposed approach and show that the time taken to obtain the results is linear with respect to both the number of instances and features.
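A schematic form of the objective the abstract describes (my notation, an assumption): with T tasks, coefficient matrix W = [w_1, ..., w_T] ∈ R^{p×T}, and ℓ_t the Cox log partial likelihood of task t,

$$ \min_{W}\; \sum_{t=1}^{T} -\ell_t(\mathbf{w}_t) \;+\; \lambda \lVert W\rVert_{2,1}, \qquad \lVert W\rVert_{2,1} = \sum_{j=1}^{p} \Big(\sum_{t=1}^{T} w_{jt}^2\Big)^{1/2}. $$

The row-wise l2,1 penalty zeroes out entire feature rows jointly across the source and target tasks, which produces the shared sparsity pattern; the screening rules then discard features whose rows are guaranteed to be zero before optimization, giving the reported speedup.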

【Keywords】: Analytical models; Learning systems; Data models; Hazards; Predictive models; Prediction algorithms; Standards

25. Interactive Multi-task Relationship Learning.

【Paper Link】 【Pages】:241-250

【Authors】: Kaixiang Lin ; Jiayu Zhou

【Abstract】: Multi-task learning (MTL) is a learning paradigm that provides a principled way to improve the generalization performance of a set of related machine learning tasks by transferring knowledge among the tasks. The past decade has witnessed many successful applications of MTL in different domains. At the center of MTL algorithms is how the relatedness of tasks is modeled and encoded in learning formulations to facilitate knowledge transfer. Among the MTL algorithms, multi-task relationship learning (MTRL) has attracted much attention in the community because it learns the task relationship from data to guide knowledge transfer, instead of imposing a prior task-relatedness assumption. However, this method heavily depends on the quality of the training data. When there is insufficient training data or the data is too noisy, the algorithm could learn an inaccurate task relationship that misleads the learning towards suboptimal models. To address the aforementioned challenge, in this paper we propose a novel interactive multi-task relationship learning (iMTRL) framework that efficiently solicits partial order knowledge of task relationships from human experts and effectively incorporates that knowledge into a proposed knowledge-aware MTRL (kMTRL) formulation. We propose an efficient optimization algorithm for kMTRL and comprehensively study query strategies that identify the critical pairs that are most influential to the learning. We present extensive empirical studies on both synthetic and real datasets to demonstrate the effectiveness of the proposed framework.

【Keywords】: Knowledge transfer; Covariance matrices; Diseases; Training; Data models; Predictive models; Machine learning algorithms

26. Augmented LSTM Framework to Construct Medical Self-Diagnosis Android.

【Paper Link】 【Pages】:251-260

【Authors】: Chaochun Liu ; Huan Sun ; Nan Du ; Shulong Tan ; Hongliang Fei ; Wei Fan ; Tao Yang ; Hao Wu ; Yaliang Li ; Chenwei Zhang

【Abstract】: Given a health-related question (such as "I have a bad stomach ache. What should I do?"), a medical self-diagnosis Android inquires further information from the user, diagnoses the disease, and ultimately recommends the best solutions. One practical challenge in building such an Android is to ask the correct questions and obtain the most relevant information, in order to correctly pinpoint the most likely causes of health conditions. In this paper, we tackle this challenge, named "relevant symptom question generation": given a limited set of patient-described symptoms in the initial question (e.g., "stomach ache"), what are the most critical symptoms to further ask the patient, in order to correctly diagnose their potential problems? We propose an augmented long short-term memory (LSTM) framework, whose network architecture can naturally incorporate the inputs from embedding vectors of patient-described symptoms and an initial disease hypothesis given by a predictive model. The proposed framework then generates the most important symptom questions. The generation process essentially models the conditional probability of observing a new and undisclosed symptom, given a set of symptoms from a patient as well as an initial disease hypothesis. Experimental results show that the proposed model improves over alternative methods by over 30% (in both precision and mean ordinal distance).

【Keywords】: Diseases; Medical diagnostic imaging; Androids; Humanoid robots; Predictive models; Electronic mail; Correlation

27. The Optimal Distribution of Electric-Vehicle Chargers across a City.

【Paper Link】 【Pages】:261-270

【Authors】: Chen Liu ; Ke Deng ; Chaojie Li ; Jianxin Li ; Yanhua Li ; Jun Luo

【Abstract】: It has been estimated that cumulative sales of Electric Vehicles (EVs) will reach 5.9 million and the stock of EVs will reach 20 million by 2020 [1]. As the number of EVs expands, there is a growing need for widely distributed, publicly accessible EV charging facilities. Public EV Chargers (EVCs) will be needed wherever there is on-street parking, as well as at taxi stands and in parking lots at places of employment, hotels, airports, shopping centres, convenience shops, fast food restaurants, coffee houses, etc. In this work, we aim to optimize the distribution of public EVCs across the city such that (i) the overall revenue generated by the EVCs is maximized, subject to (ii) the overall driver discomfort (e.g., queueing time) for EV charging being minimized. This is the first study on EVC distribution in which EVCs are assumed to be installed in almost all regions across a city. The problem is formulated as a bilevel optimization model. We propose an alternating framework to solve it and prove that a local minimum is achievable. Moreover, this work introduces novel methods to extract information to understand the discomfort of petroleum car drivers, EV charging demands, parking times and parking fees across the city. The source data explored include the trajectories of taxis, the distribution of petroleum stations and various local features. The empirical study uses real data sets from Shenzhen, one of the largest cities in China. Extensive tests verify the superiority of the proposed bilevel optimization model in all aspects.
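The generic shape of such a bilevel program (schematic notation of mine, not the paper's exact formulation): the upper level chooses the placement, the lower level models the behavior the placement induces,

$$ \max_{y \in \mathcal{Y}} \ \mathrm{Revenue}\big(y,\, x^{*}(y)\big) \quad \text{s.t.} \quad x^{*}(y) \in \arg\min_{x} \ \mathrm{Discomfort}(x;\, y), $$

where y is the EVC placement decision and x the drivers' induced charging and queueing behavior. The alternating framework mentioned in the abstract fixes one level while optimizing the other.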

【Keywords】: Charging stations; Urban areas; Optimization; Petroleum; Automobiles; Public transportation

28. Streaming Model Selection via Online Factorized Asymptotic Bayesian Inference.

【Paper Link】 【Pages】:271-280

【Authors】: Chunchen Liu ; Lu Feng ; Ryohei Fujimaki

【Abstract】: Recent growing needs for real-time data analytics have increased the importance of streaming model selection. Real-world streaming observations are often generated by dynamically-changing or heterogeneous data sources, and learning machines must identify the complexities of the data generation processes on the fly without prior knowledge. This paper proposes online FAB (OFAB) inference as a general framework for streaming model selection of latent variable models. The key idea in OFAB inference is degeneration, i.e. it intentionally considers a "redundant" latent space and dynamically derives a "non-redundant" latent sub-space using a FAB-unique shrinkage mechanism on demand. By integrating the idea of stochastic variational inference, OFAB automatically and dynamically selects the best dimensionality of latent variables in a streaming and Bayesian principled manner. Empirical results on two applications, density estimation and abnormality detection, show that OFAB outperforms the state-of-the-art online inference methods.

【Keywords】: Data models; Bayes methods; Hidden Markov models; Analytical models; Stochastic processes; Complexity theory; Inference algorithms

29. Robust Multi-View Feature Selection.

【Paper Link】 【Pages】:281-290

【Authors】: Hongfu Liu ; Haiyi Mao ; Yun Fu

【Abstract】: High-throughput technologies have enabled us to rapidly accumulate a wealth of diverse data types. Such multi-view data contain much more information for uncovering the cluster structure than single-view data, which has drawn rising attention in the data mining and machine learning areas. On one hand, many features are extracted to provide enough information for better representations; on the other hand, such abundant features might introduce noisy, redundant and irrelevant information, which harms the performance of learning algorithms. In this paper, we focus on a new topic, multi-view unsupervised feature selection, which aims to discover the discriminative features in each view for better explanation and representation. Although there are some exploratory studies along this direction, most of them employ traditional feature selection by putting the features from different views together, and fail to evaluate the performance in the multi-view setting. The features selected in this way are difficult to explain due to the different meanings of the views, which defeats the goal of feature selection as well. In light of this, we intend to give a correct understanding of multi-view feature selection. Different from existing work, which either incorrectly concatenates the features from different views or takes huge time complexity to learn the pseudo labels, we propose a novel algorithm, Robust Multi-view Feature Selection (RMFS), which applies robust multi-view K-means to obtain robust and high-quality pseudo labels for sparse feature selection in an efficient way. We give the solution by taking derivatives and further provide a K-means-like optimization to update several variables in a unified framework with a convergence guarantee. Extensive experiments on three real-world multi-view data sets illustrate the effectiveness and efficiency of RMFS in terms of both single-view and multi-view evaluations by a large margin.

【Keywords】: Robustness; Feature extraction; Data mining; Algorithm design and analysis; Periodic structures; Optimization; Clustering algorithms

30. KNN Classifier with Self Adjusting Memory for Heterogeneous Concept Drift.

【Paper Link】 【Pages】:291-300

【Authors】: Viktor Losing ; Barbara Hammer ; Heiko Wersing

【Abstract】: Data mining in non-stationary data streams has been gaining more attention recently, especially in the context of the Internet of Things and Big Data. It is a highly challenging task, since the fundamentally different types of possibly occurring drift undermine classical assumptions such as i.i.d. data or stationary distributions. Available algorithms either struggle with certain forms of drift or require a priori knowledge in terms of a task-specific setting. We propose the Self Adjusting Memory (SAM) model for the k Nearest Neighbor (kNN) algorithm, since kNN constitutes a proven classifier within the streaming setting. SAM-kNN can deal with heterogeneous concept drift, i.e., different drift types and rates, using biologically inspired memory models and their coordination. It can be easily applied in practice since an optimization of the meta parameters is not necessary. The basic idea is to construct dedicated models for the current and former concepts and apply them according to the demands of the given situation. An extensive evaluation on various benchmarks, consisting of artificial streams with known drift characteristics as well as real-world datasets, is conducted. Thereby, we explicitly add new benchmarks enabling a precise performance evaluation on multiple types of drift. The highly competitive results throughout all experiments underline the robustness of SAM-kNN as well as its capability to handle heterogeneous concept drift.
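A simplified sketch of the SAM idea under my own assumptions: keep a short-term memory (a recent sliding window) and a long-term memory (accumulated data), run kNN over each, and answer with whichever memory has predicted the recent stream better. The full method also cleans and compresses the long-term memory, which this sketch omits.

```python
from collections import deque
import numpy as np

class SimpleSAMkNN:
    def __init__(self, k=5, stm_size=200, track=50):
        self.k = k
        self.stm = deque(maxlen=stm_size)                 # short-term memory
        self.ltm = []                                     # long-term memory (unbounded here)
        self.hits = {"stm": deque(maxlen=track), "ltm": deque(maxlen=track)}

    def _knn(self, memory, x):
        X = np.array([m[0] for m in memory])
        y = np.array([m[1] for m in memory])
        idx = np.argsort(np.linalg.norm(X - x, axis=1))[: self.k]
        vals, counts = np.unique(y[idx], return_counts=True)
        return vals[np.argmax(counts)]                    # majority vote

    def predict(self, x):
        preds = {name: self._knn(mem, x)
                 for name, mem in (("stm", self.stm), ("ltm", self.ltm))
                 if len(mem) >= self.k}
        if not preds:
            return None
        # answer with the memory that has the better recent accuracy
        best = max(preds, key=lambda m: np.mean(self.hits[m]) if self.hits[m] else 0.5)
        return preds[best]

    def learn(self, x, label):
        for name, mem in (("stm", self.stm), ("ltm", self.ltm)):
            if len(mem) >= self.k:                        # track each memory's accuracy
                self.hits[name].append(self._knn(mem, x) == label)
        self.stm.append((x, label))
        self.ltm.append((x, label))
```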

【Keywords】: Adaptation models; Data mining; Benchmark testing; Biological system modeling; Prediction algorithms; Predictive models; Heuristic algorithms

31. New Probabilistic Multi-graph Decomposition Model to Identify Consistent Human Brain Network Modules.

【Paper Link】 【Pages】:301-310

【Authors】: Dijun Luo ; Zhouyuan Huo ; Yang Wang ; Andrew J. Saykin ; Li Shen ; Heng Huang

【Abstract】: Many recent scientific efforts have been devoted to constructing the human connectome using Diffusion Tensor Imaging (DTI) data for understanding the large-scale brain networks that underlie higher-level cognition in humans. However, suitable network analysis computational tools are still lacking in human brain connectivity research. To address this problem, we propose a novel probabilistic multi-graph decomposition model to identify consistent network modules from the brain connectivity networks of the studied subjects. First, we propose a new probabilistic graph decomposition model to address the high computational complexity issue in existing stochastic block models. After that, we further extend our new probabilistic graph decomposition model to multiple networks/graphs to identify the shared modules across multiple brain networks by simultaneously incorporating multiple networks and predicting the hidden block state variables. We also derive an efficient optimization algorithm to solve the proposed objective and estimate the model parameters. We validate our method by analyzing both the weighted fiber connectivity networks constructed from DTI images and the standard human face image clustering benchmark data sets. The promising empirical results demonstrate the superior performance of our proposed method.

【Keywords】: Brain modeling; Probabilistic logic; Computational modeling; Data models; Diffusion tensor imaging; Stochastic processes; Optimization

32. Efficient Extraction of Non-negative Latent Factors from High-Dimensional and Sparse Matrices in Industrial Applications.

【Paper Link】 【Pages】:311-319

【Authors】: Xin Luo ; Mingsheng Shang ; Shuai Li

【Abstract】: High-dimensional and sparse (HiDS) matrices are commonly encountered in many big data-related industrial applications like recommender systems. When acquiring useful patterns from them, non-negative matrix factorization (NMF) models have proven to be highly effective because of their fine representativeness of non-negative data. However, current NMF techniques suffer from a) inefficiency in addressing HiDS matrices, and b) constrained training schemes lacking flexibility, extensibility and adaptability. To address these issues, this work proposes to factorize industrial-size sparse matrices via a novel Inherently Non-negative Latent Factor (INLF) model. It connects the output factors and decision variables via a single-element-dependent sigmoid function, thereby innovatively removing the non-negativity constraints from its training process without impacting the solution accuracy. Hence, its training process is unconstrained, highly flexible and compatible with general learning schemes. Experimental results on five HiDS matrices generated by industrial applications indicate that INLF is able to acquire non-negative latent factors from them in a more efficient manner than any existing method does.
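One illustrative reading of that sigmoid linkage (my notation, an assumption; the paper's exact mapping may differ): predict each rating from sigmoid-transformed factors,

$$ \hat{x}_{u,i} = \sum_{k} f(p_{u,k})\, f(q_{i,k}), \qquad f(\theta) = \frac{1}{1 + e^{-\theta}}, $$

so the latent factors f(p), f(q) are non-negative by construction while the decision variables p, q remain unconstrained and can be trained with any standard gradient scheme.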

【Keywords】: Manganese; Conferences; Data mining; Sparse matrices; Estimation

33. Robust Graph-Theoretic Clustering Approaches Using Node-Based Resilience Measures.

【Paper Link】 【Pages】:320-329

【Authors】: John Matta ; Tayo Obafemi-Ajayi ; Jeffrey Borwey ; Donald Wunsch ; Gunes Ercal

【Abstract】: This paper examines a schema for graph-theoretic clustering using node-based resilience measures. Node-based resilience measures optimize an objective based on a critical set of nodes whose removal causes some severity of disconnection in the network. Beyond presenting a general framework for the usage of node based resilience measures for variations of clustering problems, we emphasize the unique potential of such methods to accomplish the following properties: (i) clustering a graph in one step without knowing the number of clusters a priori, and (ii) removing noise from noisy data. We first present results of clustering experiments using a β-parametrized generalization of vertex attack tolerance, showing high clustering accuracy for both real datasets and equal density synthetic data sets, as well as successful removal of noise nodes. It is shown that arbitrarily increasing β increases the number of noise nodes removed in some cases, and that internal validation measures can be used to determine the correct number of clusters in a class of datasets. Further results are presented using five different resilience measures with a general node-based resilience clustering technique. In a subset of cases a resilience measure, such as integrity, is able to cluster to high accuracy in one step, giving the correct clustering while also determining the correct number of clusters. Integrity is also shown to be promising with respect to noise removal, removing up to 80% of noise on some datasets.

【Keywords】: Resilience; Noise measurement; Clustering algorithms; Partitioning algorithms; Context; Scattering; Image color analysis

34. Hyperbolae are No Hyperbole: Modelling Communities That are Not Cliques.

【Paper Link】 【Pages】:330-339

【Authors】: Saskia Metzler ; Stephan Günnemann ; Pauli Miettinen

【Abstract】: Cliques are frequently used to model communities: a community is a set of nodes where each pair is equally likely to be connected. But studying real-world communities reveals that they have more structure than that. In particular, the nodes can be ordered in such a way that (almost) all edges in the community lie below a hyperbola. In this paper we present three new models for communities that capture this phenomenon. Our models explain the structure of the communities differently, but we also prove that they are identical in their expressive power. Our models fit to real-world data much better than traditional block models or previously-proposed hyperbolic models, both of which are a special case of our model. Our models also allow for intuitive interpretation of the parameters, enabling us to summarize the shapes of the communities in graphs effectively.

【Keywords】: Computational modeling; Data models; Shape; Peer-to-peer computing; Informatics; Probabilistic logic; YouTube

35. Learning Hierarchically Decomposable Concepts with Active Over-Labeling.

【Paper Link】 【Pages】:340-349

【Authors】: Yuji Mo ; Stephen D. Scott ; Doug Downey

【Abstract】: Many classification tasks target high-level concepts that can be decomposed into a hierarchy of finer-grained sub-concepts. For example, some string entities that are Locations are also Attractions, some Attractions are Museums, etc. Such hierarchies are common in named entity recognition (NER), document classification, and biological sequence analysis. We present a new approach for learning hierarchically decomposable concepts. The approach learns a high-level classifier (e.g., location vs. non-location) by separately learning multiple finer-grained classifiers (e.g., museum vs. non-museum), and then combining the results. Soliciting labels at a finer level of granularity than that of the target concept is a new approach to active learning, which we term active over-labeling. In experiments in NER and document classification tasks, we show that active over-labeling substantially improves area under the precision-recall curve when compared with standard passive or active learning. Finally, because finer-grained labels may be more expensive to obtain, we also present a cost-sensitive active learner that uses a multi-armed bandit approach to dynamically choose the label granularity to target, and show that the bandit-based learner is robust to differences in label cost and labeling budget.
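A hedged epsilon-greedy sketch of cost-sensitive granularity selection: each "arm" is a label granularity (e.g., coarse "Location" vs. fine "Museum") and the reward is validation improvement per unit labeling cost. The paper's bandit strategy may differ; this shows only the mechanism, and all names are illustrative.

```python
import numpy as np

def choose_granularity(rewards, costs, counts, eps=0.1, rng=None):
    """Pick a label-granularity arm by reward-per-cost, with epsilon exploration.

    rewards[a]: cumulative validation improvement credited to arm a
    costs[a]:   cost of one label at granularity a
    counts[a]:  number of times arm a has been pulled
    """
    rng = rng or np.random.default_rng(0)
    if rng.uniform() < eps or min(counts) == 0:
        return int(rng.integers(len(costs)))            # explore (or warm-start)
    value = [r / max(n, 1) / c for r, c, n in zip(rewards, costs, counts)]
    return int(np.argmax(value))                        # exploit best reward-per-cost
```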

【Keywords】: Labeling; Picture archiving and communication systems; Standards; Training; Diamond; Electronic mail; Biology

36. AWarp: Fast Warping Distance for Sparse Time Series.

Paper Link】 【Pages】:350-359

【Authors】: Abdullah Mueen ; Nikan Chavoshi ; Noor Abu-El-Rub ; Hossein Hamooni ; Amanda J. Minnich

【Abstract】: Dynamic Time Warping (DTW) distance has been effectively used in mining time series data in a multitude of domains. However, in its original formulation DTW is extremely inefficient in comparing long sparse time series, containing mostly zeros and some unevenly spaced non-zero observations. The original DTW distance does not take advantage of this sparsity, leading to redundant calculations and a prohibitively large computational cost for long time series. We derive a new time warping similarity measure (AWarp) for sparse time series that works on the run-length encoded representation of sparse time series. The complexity of AWarp is quadratic in the number of observations, as opposed to the time range of the series. Therefore, AWarp can be several orders of magnitude faster than DTW on sparse time series. AWarp is exact for binary-valued time series and a close approximation of the original DTW distance for any-valued series. We discuss useful variants of AWarp: bounded (both upper and lower), constrained, and multidimensional. We show applications of AWarp to three data mining tasks including clustering, classification, and outlier detection, which are otherwise not feasible using classic DTW, while producing equivalent results. Potential areas of application include bot detection, human activity classification, and unusual review pattern mining.
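
【Example】: A minimal sketch of the run-length encoded representation AWarp operates on; the token format below (zero runs collapsed to counts, non-zeros kept) is an illustrative convention, not necessarily the authors' exact encoding.

    def rle_sparse(series):
        """Collapse runs of zeros into ('zeros', run_length) tokens.

        A long, mostly-zero series shrinks to O(number of observations)
        tokens, which is what lets a warping distance skip the redundant
        zero-to-zero comparisons that plain DTW performs.
        """
        encoded, zero_run = [], 0
        for x in series:
            if x == 0:
                zero_run += 1
            else:
                if zero_run:
                    encoded.append(('zeros', zero_run))
                    zero_run = 0
                encoded.append(('value', x))
        if zero_run:
            encoded.append(('zeros', zero_run))
        return encoded

    print(rle_sparse([0, 0, 0, 5, 0, 0, 2, 0]))
    # [('zeros', 3), ('value', 5), ('zeros', 2), ('value', 2), ('zeros', 1)]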

【Keywords】: Time series analysis; Data mining; Signal processing algorithms; Encoding; Heuristic algorithms; Sparse matrices; Time measurement

37. Binary Classifier Calibration Using an Ensemble of Near Isotonic Regression Models.

Paper Link】 【Pages】:360-369

【Authors】: Mahdi Pakdaman Naeini ; Gregory F. Cooper

【Abstract】: Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called Ensemble of Near Isotonic Regression (ENIR). The method can be considered as an extension of BBQ, a recently proposed calibration method, as well as the commonly used calibration method based on isotonic regression (IsoRegC). ENIR is designed to address the key limitation of IsoRegC which is the monotonicity assumption of the predictions. Similar to BBQ, the method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus it can be used with many existing classification models to generate accurate probabilistic predictions. We demonstrate the performance of ENIR on synthetic and real datasets for commonly applied binary classification models. Experimental results show that the method outperforms several common binary classifier calibration methods. In particular on the real data, ENIR commonly performs statistically significantly better than the other methods, and never worse. It is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large scale datasets, as it is O(N log N) time, where N is the number of samples.
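
【Example】: A minimal sketch of the IsoRegC baseline that ENIR generalizes, using scikit-learn's isotonic regression; the ensemble of near-isotonic models itself is not shown.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    # Uncalibrated classifier scores and true labels on a held-out set.
    scores = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
    labels = np.array([0, 0, 1, 0, 1, 1, 1, 1])

    # Fit a monotone map from raw scores to calibrated probabilities.
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
    iso.fit(scores, labels)

    print(iso.predict([0.3, 0.75]))  # calibrated probability estimates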

【Keywords】: Calibration; Data models; Data mining; Predictive models; Computational modeling; Histograms; Probabilistic logic

38. A Scalable and Generic Framework to Mine Top-k Representative Subgraph Patterns.

Paper Link】 【Pages】:370-379

【Authors】: Dheepikaa Natarajan ; Sayan Ranu

【Abstract】: Mining subgraph patterns is an active area of research. Existing research has primarily focused on mining all subgraph patterns in the database. However, due to the exponential subgraph search space, the number of patterns mined is typically too large for any human-mediated analysis. Consequently, deriving insights from the mined patterns is hard for domain scientists. In addition, subgraph pattern mining is posed in multiple forms: the function that models if a subgraph is a pattern varies based on the application, and the database could be over multiple graphs or a single, large graph. In this paper, we ask the following question: Given a subgraph importance function and a budget k, which are the k subgraph patterns that best represent all other patterns of interest? We show that the problem is NP-hard, and propose a generic framework called RESLING that adapts to arbitrary subgraph importance functions and generalizes to both transactional graph databases and single, large graphs. Experiments show that RESLING is up to 20 times more representative of the pattern space and 2 orders of magnitude faster than the state-of-the-art techniques. The executables for the tool are available at http://www.cse.iitm.ac.in/~sayan/software.html.

【Keywords】: Databases; Data mining; Proteins; Social network services; Roads; Redundancy; Conferences

39. What You Will Gain By Rounding: Theory and Algorithms for Rounding Rank.

Paper Link】 【Pages】:380-389

【Authors】: Stefan Neumann ; Rainer Gemulla ; Pauli Miettinen

【Abstract】: When factorizing binary matrices, we often have to make a choice between using expensive combinatorial methods that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. Alternatively, we can first compute a continuous factorization and subsequently apply a rounding procedure to obtain a discrete representation. But what will we gain by rounding? Will this yield lower reconstruction errors? Is it easy to find a low-rank matrix that rounds to a given binary matrix? Does it matter which threshold we use for rounding? Does it matter if we allow for only non-negative factorizations? In this paper, we approach these and further questions by presenting and studying the concept of rounding rank. We show that rounding rank is related to linear classification, dimensionality reduction, and nested matrices. We also report on an extensive experimental study that compares different algorithms for finding good factorizations under the rounding rank model.
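
【Example】: A minimal numpy sketch of the rounding procedure the paper studies: compute a continuous low-rank factorization (truncated SVD here) and round it against a threshold τ to obtain a binary reconstruction.

    import numpy as np

    rng = np.random.default_rng(0)
    B = (rng.random((30, 20)) < 0.3).astype(float)  # binary data matrix

    def round_rank_k(B, k, tau=0.5):
        """Rank-k continuous factorization followed by rounding at tau."""
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]
        return (low_rank >= tau).astype(float)

    for k in (2, 5, 10):
        err = np.abs(B - round_rank_k(B, k)).sum()
        print(k, err)  # reconstruction error typically drops as k grows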

【Keywords】: Matrix decomposition; Complexity theory; Symmetric matrices; Light rail systems; Data mining; Probabilistic logic; Informatics

40. A Fast Iterative Algorithm for Improved Unsupervised Feature Selection.

Paper Link】 【Pages】:390-399

【Authors】: Bruno Ordozgoiti Rubio ; Sandra Gómez Canaval ; Alberto Mozo

【Abstract】: Dimensionality reduction is often a crucial step for the successful application of machine learning and data mining methods. One way to achieve said reduction is feature selection. Due to the impossibility of labelling many data sets, unsupervised approaches are frequently the only option. The column subset selection problem translates naturally to this purpose, and has received considerable attention over the last few years, as it provides simple linear models for data reconstruction. Existing methods, however, often achieve approximation errors that are far from the optimum. In this paper we present a novel algorithm for column subset selection that consistently outperforms state-of-the-art methods in approximation error. We present a series of key derivations that allow an efficient implementation, making it comparable in speed and in some cases faster than other algorithms. We also prove results that make it possible to deal with huge matrices, which has strong implications for other algorithms of this type in the big data field. We validate our claims through experiments on a wide variety of well-known data sets.
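
【Example】: A minimal sketch of the column subset selection objective these methods optimize: choose k columns C of A and measure the residual ||A - C C⁺ A||_F. The greedy routine below is purely illustrative, not the authors' algorithm.

    import numpy as np

    def css_error(A, cols):
        """Frobenius residual of projecting A onto the chosen columns' span."""
        C = A[:, cols]
        proj = C @ np.linalg.pinv(C) @ A
        return np.linalg.norm(A - proj)

    def greedy_css(A, k):
        """Greedily add the column that most reduces the residual."""
        chosen = []
        for _ in range(k):
            rest = [j for j in range(A.shape[1]) if j not in chosen]
            chosen.append(min(rest, key=lambda j: css_error(A, chosen + [j])))
        return chosen

    A = np.random.default_rng(1).random((50, 12))
    cols = greedy_css(A, 4)
    print(cols, css_error(A, cols))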

【Keywords】: Approximation algorithms; Algorithm design and analysis; Linear programming; Prototypes; Data mining; Matrix decomposition; Proposals

41. Sparse Factorization Machines for Click-through Rate Prediction.

Paper Link】 【Pages】:400-409

【Authors】: Zhen Pan ; Enhong Chen ; Qi Liu ; Tong Xu ; Haiping Ma ; Hongjie Lin

【Abstract】: With the rapid development of e-commerce, recent years have witnessed the booming of the online advertising industry, which has raised extensive concern in both academic and business circles. Among all the issues, the task of click-through rate (CTR) prediction plays a central role, as it may influence the ranking and pricing of online ads. To deal with this task, the Factorization Machines (FM) model is designed to better reveal proper combinations of basic features. However, the sparsity of ads transaction data, i.e., a large proportion of zero elements, may severely disturb the performance of FM models. To address this problem, in this paper, we propose a novel Sparse Factorization Machines (SFM) model, in which the Laplace distribution is introduced instead of the traditional Gaussian distribution to model the parameters, as the Laplace distribution can better fit sparse data with a higher ratio of zero elements. Along this line, it becomes possible to select the most important features or conjunctions with the proposed SFM model. Furthermore, we develop a distributed implementation of our SFM model on the Spark platform to support the prediction task on massive datasets in practice. Comprehensive experiments on two large-scale real-world datasets clearly validate both the effectiveness and efficiency of our SFM model compared with several state-of-the-art baselines, and also support our assumption that the Laplace distribution is more suitable for describing online ads transaction data.

【Keywords】: Frequency modulation; Predictive models; Data models; Advertising; Gaussian distribution; Bayes methods; Mathematical model

42. Unsupervised Feature Selection for Outlier Detection by Modelling Hierarchical Value-Feature Couplings.

Paper Link】 【Pages】:410-419

【Authors】: Guansong Pang ; Longbing Cao ; Ling Chen ; Huan Liu

【Abstract】: Proper feature selection for unsupervised outlier detection can improve detection performance but is very challenging due to complex feature interactions, the mixture of relevant features with noisy/redundant features in imbalanced data, and the unavailability of class labels. Little work has been done on this challenge. This paper proposes a novel Coupled Unsupervised Feature Selection framework (CUFS for short) to filter out noisy or redundant features for subsequent outlier detection in categorical data. CUFS quantifies the outlierness (or relevance) of features by learning and integrating both the feature value couplings and feature couplings. Such value-to-feature couplings capture intrinsic data characteristics and distinguish relevant features from those noisy/redundant features. CUFS is further instantiated into a parameter-free Dense Subgraph-based Feature Selection method, called DSFS. We prove that DSFS retains a 2-approximation feature subset to the optimal subset. Extensive evaluation results on 15 real-world data sets show that DSFS obtains an average 48% feature reduction rate, and enables three different types of pattern-based outlier detection methods to achieve substantially better AUC improvements and/or perform orders of magnitude faster than on the original feature set. Compared to its feature selection contender, on average, all three DSFS-based detectors achieve more than 20% AUC improvement.

【Keywords】: Feature extraction; Couplings; Noise measurement; Data mining; Search problems; Data models; Detectors

43. Partition Aware Connected Component Computation in Distributed Systems.

Paper Link】 【Pages】:420-429

【Authors】: Ha-Myung Park ; Namyong Park ; Sung-Hyon Myaeng ; U. Kang

【Abstract】: How can we find all connected components in an enormous graph with billions of nodes and edges? Finding connected components is a fundamental operation for various graph computation tasks such as pattern recognition, reachability, graph compression, etc. Many algorithms have been proposed over the decades, but most of them are not scalable enough to process recent web-scale graphs. Recently, a MapReduce algorithm was proposed to handle such large graphs. However, the algorithm repeatedly reads and writes numerous intermediate data that cause network overload and prolong the running time. In this paper, we propose PACC (Partition-Aware Connected Components), a new distributed algorithm based on graph partitioning for load-balancing and edge-filtering. Experimental results show that PACC significantly reduces the intermediate data, and provides up to 10 times faster performance than the current state-of-the-art MapReduce algorithm on real-world graphs.
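
【Example】: For reference, a minimal single-machine union-find baseline for connected components; PACC's contribution, distributing this computation with partitioning and edge filtering, is not shown.

    def connected_components(n, edges):
        """Union-find over n nodes; returns a component id per node."""
        parent = list(range(n))

        def find(x):  # find with path halving
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for u, v in edges:  # union the endpoints of every edge
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv

        return [find(x) for x in range(n)]

    print(connected_components(6, [(0, 1), (1, 2), (4, 5)]))
    # nodes 0-2 share one id, node 3 is alone, nodes 4-5 share another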

【Keywords】: Partitioning algorithms; Algorithm design and analysis; Distributed algorithms; Memory management; Distributed databases; Phase change random access memory; Computational modeling

44. Iteratively Reweighted Least Squares Algorithms for L1-Norm Principal Component Analysis.

Paper Link】 【Pages】:430-438

【Authors】: Young Woong Park ; Diego Klabjan

【Abstract】: Principal component analysis (PCA) is often used to reduce the dimension of data by selecting a few orthonormal vectors that explain most of the variance structure of the data. L1 PCA uses the L1 norm to measure error, whereas the conventional PCA uses the L2 norm. For the L1 PCA problem minimizing the fitting error of the reconstructed data, we propose an exact reweighted and an approximate algorithm based on iteratively reweighted least squares. We provide convergence analyses, and compare their performance against benchmark algorithms in the literature. The computational experiment shows that the proposed algorithms consistently perform best.
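
【Example】: The IRLS principle the paper builds on, shown on the simpler problem min ||y - Xb||_1 (least absolute deviations); the paper applies the same reweighting idea to L1-norm PCA, which additionally involves a low-rank constraint and is not reproduced here.

    import numpy as np

    def irls_l1_regression(X, y, iters=50, eps=1e-8):
        """Minimize ||y - X b||_1 via repeated weighted least squares.

        The weight w_i = 1 / max(|r_i|, eps) makes the squared loss mimic
        the absolute loss around the current residuals r.
        """
        b = np.linalg.lstsq(X, y, rcond=None)[0]  # L2 warm start
        for _ in range(iters):
            r = y - X @ b
            w = 1.0 / np.maximum(np.abs(r), eps)
            WX = X * w[:, None]
            b = np.linalg.solve(X.T @ WX, X.T @ (w * y))
        return b

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.laplace(scale=0.1, size=100)
    print(irls_l1_regression(X, y))  # close to [1.0, -2.0, 0.5]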

【Keywords】: Principal component analysis; Linear programming; Approximation algorithms; Algorithm design and analysis; Benchmark testing; Iterative methods; Loading

45. Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis.

Paper Link】 【Pages】:439-448

【Authors】: Soujanya Poria ; Iti Chaturvedi ; Erik Cambria ; Amir Hussain

【Abstract】: Technology has enabled anyone with an Internet connection to easily create and share their ideas, opinions and content with millions of other people around the world. Much of the content being posted and consumed online is multimodal. With billions of phones, tablets and PCs shipping today with built-in cameras and a host of new video-equipped wearables like Google Glass on the horizon, the amount of video on the Internet will only continue to increase. It has become increasingly difficult for researchers to keep up with this deluge of multimodal content, let alone organize or make sense of it. Mining useful knowledge from video is a critical need that will grow exponentially, in pace with the global growth of content. This is particularly important in sentiment analysis, as both service and product reviews are gradually shifting from unimodal to multimodal. We present a novel method to extract features from visual and textual modalities using deep convolutional neural networks. By feeding such features to a multiple kernel learning classifier, we significantly outperform the state of the art of multimodal emotion recognition and sentiment analysis on different datasets.

【Keywords】: Sentiment analysis; Kernel; Neurons; Emotion recognition; Feature extraction; Biological neural networks; Visualization

46. Recommending Packages to Groups.

Paper Link】 【Pages】:449-458

【Authors】: Shuyao Qi ; Nikos Mamoulis ; Evaggelia Pitoura ; Panayiotis Tsaparas

【Abstract】: The success of recommender systems has made them the focus of a massive research effort in both industry and academia. Recent work has generalized recommendations to suggest packages of items to single users, or single items to groups of users. However, to the best of our knowledge, the interesting problem of recommending a package to a group of users (P2G) has not been studied to date. This is a problem with several practical applications, such as recommending vacation packages to tourist groups, entertainment packages to groups of friends, or sets of courses to groups of students. In this paper, we formulate the P2G problem, and we propose probabilistic models that capture the preference of a group towards a package, incorporating factors such as user impact and package viability. We also investigate the issue of recommendation fairness. This is a novel consideration that arises in our setting, where we require that no user is consistently slighted by the item selection in the package. We present aggregation algorithms for finding the best packages and compare our suggested models with baseline approaches stemming from previous work. The results show that our models find packages of high quality which consider all special requirements of P2G recommendation.

【Keywords】: Bars; Probabilistic logic; Computational modeling; Algorithm design and analysis; Recommender systems; Motion pictures; Mathematical model

47. Subspace Outlier Detection in Linear Time with Randomized Hashing.

Paper Link】 【Pages】:459-468

【Authors】: Saket Sathe ; Charu C. Aggarwal

【Abstract】: Outlier detection algorithms are often computationally intensive because of their need to score each point in the data. Even simple distance-based algorithms have quadratic complexity. High-dimensional outlier detection algorithms such as subspace methods are often even more computationally intensive because of their need to explore different subspaces of the data. In this paper, we propose an exceedingly simple subspace outlier detection algorithm, which can be implemented in a few lines of code, and whose complexity is linear in the size of the data set and the space requirement is constant. We show that this outlier detection algorithm is much faster than both conventional and high-dimensional algorithms and also provides more accurate results. The approach uses randomized hashing to score data points and has a neat subspace interpretation. Furthermore, the approach can be easily generalized to data streams. We present experimental results showing the effectiveness of the approach over other state-of-the-art methods.

【Keywords】: Detectors; Training; Detection algorithms; Complexity theory; Data models; Electronic mail; Robustness

48. CoreScope: Graph Mining Using k-Core Analysis - Patterns, Anomalies and Algorithms.

Paper Link】 【Pages】:469-478

【Authors】: Kijung Shin ; Tina Eliassi-Rad ; Christos Faloutsos

【Abstract】: What do the k-core structures of real-world graphs look like? What are the common patterns and the anomalies? How can we use them for algorithm design and applications? A k-core is the maximal subgraph where all vertices have degree at least k. This concept has been applied to such diverse areas as hierarchical structure analysis, graph visualization, and graph clustering. Here, we explore pervasive patterns that are related to k-cores and emerge in graphs from several diverse domains. Our discoveries are as follows: (1) Mirror Pattern: the coreness of vertices (i.e., the maximum k such that each vertex belongs to the k-core) is strongly correlated with their degree. (2) Core-Triangle Pattern: the degeneracy of a graph (i.e., the maximum k such that the k-core exists in the graph) obeys a 3-to-1 power law with respect to the count of triangles. (3) Structured Core Pattern: degeneracy-cores are not cliques but have non-trivial structures such as core-periphery and communities. Our algorithmic contributions show the usefulness of these patterns. (1) Core-A, which measures the deviation from Mirror Pattern, successfully finds anomalies in real-world graphs, complementing densest-subgraph based anomaly detection methods. (2) Core-D, a single-pass streaming algorithm based on Core-Triangle Pattern, accurately estimates the degeneracy of billion-scale graphs up to 7× faster than a recent multipass algorithm. (3) Core-S, inspired by Structured Core Pattern, identifies influential spreaders up to 17× faster than top competitors with comparable accuracy.
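
【Example】: A minimal networkx sketch of the quantities the patterns are built on (coreness, degeneracy, and the Mirror Pattern's coreness-degree correlation); the paper's streaming and anomaly-scoring algorithms are not shown.

    import networkx as nx
    import numpy as np

    G = nx.karate_club_graph()

    core = nx.core_number(G)         # coreness of every vertex
    degeneracy = max(core.values())  # largest k with a non-empty k-core

    # Mirror Pattern: coreness is strongly correlated with degree.
    deg = dict(G.degree())
    nodes = list(G.nodes())
    corr = np.corrcoef([deg[v] for v in nodes],
                       [core[v] for v in nodes])[0, 1]
    print(degeneracy, round(corr, 3))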

【Keywords】: Mirrors; Electronic mail; Twitter; Correlation; Algorithm design and analysis; Companies

49. L-EnsNMF: Boosted Local Topic Discovery via Ensemble of Nonnegative Matrix Factorization.

Paper Link】 【Pages】:479-488

【Authors】: Sangho Suh ; Jaegul Choo ; Joonseok Lee ; Chandan K. Reddy

【Abstract】: Nonnegative matrix factorization (NMF) has been widely applied in many domains. In document analysis, it has been increasingly used in topic modeling applications, where a set of underlying topics are revealed by a low-rank factor matrix from NMF. However, the resulting topics often capture only general information in the data, which tends not to convey much insight. To tackle this problem, we propose a novel ensemble model of nonnegative matrix factorization for discovering high-quality local topics. Our method brings the idea of an ensemble model, which has been successful in supervised learning, into an unsupervised topic modeling context. That is, our model successively performs NMF given a residual matrix obtained from previous stages and generates a sequence of topic sets. Our algorithm for updating the input matrix has novelty in two aspects. The first lies in utilizing the residual matrix inspired by a state-of-the-art gradient boosting model, and the second stems from applying a sophisticated local weighting scheme on the given matrix to enhance the locality of topics, which in turn delivers high-quality, focused topics of interest to users. We evaluate our proposed method by comparing it against other topic modeling methods, such as a few variants of NMF and latent Dirichlet allocation, in terms of various evaluation measures representing topic coherence, diversity, coverage, computing time, and so on. We also present qualitative evaluation on the topics discovered by our method using several real-world data sets.
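
【Example】: A minimal sketch of the stage-wise residual idea, assuming scikit-learn's NMF as the per-stage solver; the paper's local weighting scheme, which sharpens topic locality, is omitted.

    import numpy as np
    from sklearn.decomposition import NMF

    def boosted_nmf(X, stages=3, k=5, seed=0):
        """Fit NMF stage by stage on the nonnegative residual so far."""
        residual, topic_sets = X.copy(), []
        for s in range(stages):
            model = NMF(n_components=k, init='random',
                        random_state=seed + s, max_iter=400)
            W = model.fit_transform(residual)
            H = model.components_
            topic_sets.append(H)                          # topics of this stage
            residual = np.maximum(residual - W @ H, 0.0)  # keep it nonnegative
        return topic_sets

    X = np.abs(np.random.default_rng(3).normal(size=(100, 40)))
    print([H.shape for H in boosted_nmf(X)])  # three (5, 40) topic matrices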

【Keywords】: Standards; Data mining; Boosting; Data models; Context modeling; Context; Matrix decomposition

50. Modeling Ambiguity, Subjectivity, and Diverging Viewpoints in Opinion Question Answering Systems.

Paper Link】 【Pages】:489-498

【Authors】: Mengting Wan ; Julian McAuley

【Abstract】: Product review websites provide an incredible lens into the wide variety of opinions and experiences of different people, and play a critical role in helping users discover products that match their personal needs and preferences. To help address questions that can't easily be answered by reading others' reviews, some review websites also allow users to pose questions to the community via a question-answering (QA) system. As one would expect, just as opinions diverge among different reviewers, answers to such questions may also be subjective, opinionated, and divergent. This means that answering such questions automatically is quite different from traditional QA tasks, where it is assumed that a single 'correct' answer is available. While recent work introduced the idea of question-answering using product reviews, it did not account for two aspects that we consider in this paper: (1) Questions have multiple, often divergent, answers, and this full spectrum of answers should somehow be used to train the system, and (2) What makes a 'good' answer depends on the asker and the answerer, and these factors should be incorporated in order for the system to be more personalized. Here we build a new QA dataset with 800 thousand questions (and over 3.1 million answers) and show that explicitly accounting for personalization and ambiguity leads not only to quantitatively better answers, but also to a more nuanced view of the range of supporting, but subjective, opinions.

【Keywords】: Lenses; Knowledge discovery; Cameras; Standards; Predictive models; Data mining; Information retrieval

51. Traffic Speed Prediction and Congestion Source Exploration: A Deep Learning Method.

Paper Link】 【Pages】:499-508

【Authors】: Jingyuan Wang ; Qian Gu ; Junjie Wu ; Guannan Liu ; Zhang Xiong

【Abstract】: Traffic speed prediction is a long-standing and critically important topic in the area of Intelligent Transportation Systems (ITS). Recent years have witnessed the encouraging potentials of deep neural networks for real-life applications of various domains. Traffic speed prediction, however, is still in its initial stage without making full use of spatio-temporal traffic information. In light of this, in this paper, we propose a deep learning method with an Error-feedback Recurrent Convolutional Neural Network structure (eRCNN) for continuous traffic speed prediction. By integrating the spatio-temporal traffic speeds of contiguous road segments as an input matrix, eRCNN explicitly leverages the implicit correlations among nearby segments to improve the predictive accuracy. By further introducing separate error feedback neurons to the recurrent layer, eRCNN learns from prediction errors so as to meet predictive challenges rising from abrupt traffic events such as morning peaks and traffic accidents. Extensive experiments on real-life speed data of taxis running on the 2nd and 3rd ring roads of Beijing city demonstrate the strong predictive power of eRCNN in comparison to some state-of-the-art competitors. The necessity of weight pre-training using a transfer learning notion has also been verified. More interestingly, we design a novel influence function based on the deep learning model, and showcase how to leverage it to recognize the congestion sources of the ring roads in Beijing.

【Keywords】: Neurons; Roads; Convolution; Machine learning; Biological neural networks; Feature extraction; Training

52. Blind Men and The Elephant: Thurstonian Pairwise Preference for Ranking in Crowdsourcing.

Paper Link】 【Pages】:509-518

【Authors】: Xiaolong Wang ; Jingjing Wang ; Luo Jie ; Chengxiang Zhai ; Yi Chang

【Abstract】: Crowdsourcing services make it possible to collect huge amounts of annotations from less trained crowd workers in an inexpensive and efficient manner. However, unlike making binary or pairwise judgements, labeling complex structures such as ranked lists by crowd workers is subject to large variance and low efficiency, mainly due to the huge labeling space and the annotators' non-expert nature. Yet ranked lists offer the most informative knowledge for training and testing in various data mining and information retrieval tasks such as learning to rank. In this paper, we propose a novel generative model called "Thurstonian Pairwise Preference" (TPP) to infer the true ranked list out of a collection of crowdsourced pairwise annotations. The key challenges that TPP addresses are to resolve the inevitable incompleteness and inconsistency of judgements, as well as to model variable query difficulty and different labeling quality resulting from workers' domain expertise and truthfulness. Experimental results on both synthetic and real-world datasets demonstrate that TPP can effectively bind pairwise preferences of the crowd into rankings and substantially outperforms previously published methods.

【Keywords】: Noise measurement; Transmission line measurements; Labeling; Crowdsourcing; Data mining; Training; Testing

53. Regularizing Deep Convolutional Neural Networks with a Structured Decorrelation Constraint.

Paper Link】 【Pages】:519-528

【Authors】: Wei Xiong ; Bo Du ; Lefei Zhang ; Ruimin Hu ; Dacheng Tao

【Abstract】: Deep convolutional networks have achieved successful performance in the data mining field. However, training large networks still remains a challenge, as the training data may be insufficient and the model can easily get overfitted. Hence the training process is usually combined with a model regularization. Typical regularizers include weight decay, Dropout, etc. In this paper, we propose a novel regularizer, named Structured Decorrelation Constraint (SDC), which is applied to the activations of the hidden layers to prevent overfitting and achieve better generalization. SDC impels the network to learn structured representations by grouping the hidden units and encouraging the units within the same group to have strong connections during the training procedure. Meanwhile, it forces the units in different groups to learn non-redundant representations by minimizing the cross-covariance between them. Compared with Dropout, SDC reduces the co-adaptations between the hidden units in an explicit way. Besides, we propose a novel approach called Reg-Conv that can help SDC to regularize the complex convolutional layers. Experiments on extensive datasets show that SDC significantly reduces overfitting and yields very meaningful improvements on classification performance (a 6.22% accuracy improvement on CIFAR-10 and a 9.63% improvement on CIFAR-100).
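
【Example】: A minimal numpy sketch of the cross-group decorrelation term SDC is built around: group the hidden units and penalize the squared cross-covariance between activations of different groups. The grouping strategy and the full regularizer (including Reg-Conv) are not shown.

    import numpy as np

    def cross_group_decorrelation(H, groups):
        """Sum of squared cross-covariances between hidden-unit groups.

        H: (batch, units) activations; groups: list of column-index lists.
        """
        Hc = H - H.mean(axis=0, keepdims=True)  # center the activations
        cov = Hc.T @ Hc / (H.shape[0] - 1)      # units x units covariance
        penalty = 0.0
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                block = cov[np.ix_(groups[a], groups[b])]
                penalty += (block ** 2).sum()
        return penalty

    H = np.random.default_rng(4).normal(size=(64, 8))
    print(cross_group_decorrelation(H, [[0, 1, 2, 3], [4, 5, 6, 7]]))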

【Keywords】: Neurons; Correlation; Decorrelation; Training; Data models; Electronic mail; Biological neural networks

54. Aligned Matrix Completion: Integrating Consistency and Independency in Multiple Domains.

Paper Link】 【Pages】:529-538

【Authors】: Linli Xu ; Zaiyi Chen ; Qi Zhou ; Enhong Chen ; Nicholas Jing Yuan ; Xing Xie

【Abstract】: Matrix completion is the task of recovering a data matrix from a sample of entries, and has received significant attention in theory and practice. Normally, matrix completion considers a single matrix, which can be a noisy image or a rating matrix in recommendation. In practice, however, data is often obtained from multiple domains rather than a single domain. For example, in recommendation, multiple matrices may exist as user × movie and user × book, while correlations among the multiple domains can be reasonably exploited to improve the quality of matrix completion. In this paper, we consider the problem of aligned matrix completion, where multiple matrices are recovered that correspond to different representations of the same group of objects. In the proposed model, we maintain consistency of multiple domains with a shared latent structure, while allowing independent patterns for each separate domain. In addition, we impose the low-rank structure of a matrix with a novel regularizer which provides better approximation than the standard nuclear norm relaxation.

【Keywords】: Algorithm design and analysis; Convergence; Correlation; Minimization; Optimization; Electronic mail; Noise measurement

55. Heterogeneous Representation Learning with Structured Sparsity Regularization.

Paper Link】 【Pages】:539-548

【Authors】: Pei Yang ; Jingrui He

【Abstract】: Motivated by real applications, heterogeneous learning has emerged as an important research area, which aims to model the co-existence of multiple types of heterogeneity. In this paper, we propose a HEterogeneous REpresentation learning model with structured Sparsity regularization (HERES) to learn from multiple types of heterogeneity. HERES aims to leverage two kinds of information to build a robust learning system. One is the rich correlations among heterogeneous data such as task relatedness, view consistency, and label correlation. The other is the prior knowledge of the data in the form of, e.g., the soft-clustering of the tasks. HERES is a generic framework for heterogeneous learning, which integrates multi-task, multi-view, and multi-label learning into a principled framework based on representation learning. The objective of HERES is to minimize the reconstruction loss of using the factor matrices to recover the input matrix for heterogeneous data, regularized by the structured sparsity constraint. The resulting optimization problem is challenging due to the non-smoothness and non-separability of structured sparsity. We develop an iterative updating method to solve the problem. Furthermore, we prove that the reformulation of structured sparsity is separable, which leads to a family of efficient and scalable algorithms for solving structured sparsity penalized problems. The experimental results in comparison with state-of-the-art methods demonstrate the effectiveness of the proposed approach.

【Keywords】: Correlation; Matrix decomposition; Encoding; Data models; Robustness; Learning systems; Optimization

56. POI Recommendation: A Temporal Matching between POI Popularity and User Regularity.

Paper Link】 【Pages】:549-558

【Authors】: Zijun Yao ; Yanjie Fu ; Bin Liu ; Yanchi Liu ; Hui Xiong

【Abstract】: Point of interest (POI) recommendation, which provides personalized recommendation of places to mobile users, is an important task in location-based social networks (LBSNs). However, quite different from traditional interest-oriented merchandise recommendation, POI recommendation is more complex due to the timing effects: we need to examine whether the POI fits a user's availability. While there are some prior studies which included the temporal effect into POI recommendations, they overlooked the compatibility between time-varying popularity of POIs and regular availability of users, which we believe has a non-negligible impact on user decision-making. To this end, in this paper, we present a novel method which incorporates the degree of temporal matching between users and POIs into personalized POI recommendations. Specifically, we first profile the temporal popularity of POIs to show when a POI is popular for visit by mining the spatio-temporal human mobility and POI category data. Secondly, we propose latent user regularities to characterize when a user is regularly available for exploring POIs, which is learned with a user-POI temporal matching function. Finally, results of extensive experiments with real-world POI check-in and human mobility data demonstrate that our proposed user-POI temporal matching method delivers substantial advantages over baseline models for POI recommendation tasks.
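
【Example】: A minimal sketch of the temporal-matching intuition: score a user-POI pair by how much the POI's hourly popularity overlaps the user's hourly availability. The paper learns both profiles jointly; here they are simply given.

    import numpy as np

    def temporal_match(poi_popularity, user_availability):
        """Inner product of 24-hour popularity and availability profiles."""
        return float(np.dot(poi_popularity, user_availability))

    hours = np.arange(24)
    lunch_poi = np.exp(-0.5 * ((hours - 12) / 1.5) ** 2)  # peaks at noon
    night_poi = np.exp(-0.5 * ((hours - 21) / 1.5) ** 2)  # peaks at night
    user = ((hours >= 11) & (hours <= 14)).astype(float)  # free around lunch

    print(temporal_match(lunch_poi, user), temporal_match(night_poi, user))
    # the lunch place scores far higher for this user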

【Keywords】: Public transportation; Pattern matching; Data mining; History; Urban areas; Mobile computing; Mobile communication

57. College Student Scholarships and Subsidies Granting: A Multi-modal Multi-label Approach.

Paper Link】 【Pages】:559-568

【Authors】: Han-Jia Ye ; De-Chuan Zhan ; Xiaolin Li ; Zhen-Chuan Huang ; Yuan Jiang

【Abstract】: Scholarships and financial aid in modern universities are the basic administrative instruments to ensure and promote the completion of academic training and studies for students. Traditional grant allocation procedures are based on manual determination, which costs substantial human resources. In this paper, we investigate an assistance model for improving the granting scheme. We first collect student information from multi-modal channels, including campus consumption behavior, internet usage, and daily trajectory, together with enrollment information. The approval status and amount of funds granted are converted into labels. We propose the College Student Scholarships and Subsidies Granting (CS3G) approach to address this concrete problem. The CS3G approach overcomes three obstacles: complicated multi-label influences, protection of private modal information, and difficulties in label collection. In detail, since scholarships mainly depend on academic achievements, subsidy granting is generally based on students' financial hardships as well as credits, and there are implicit influences among scholarships and subsidies, the CS3G approach handles several types of interactions between multiple labels. Notably, data from different modalities are collected by different divisions of a university, so privacy protection is built into CS3G: features from different modalities do not interact in the model training phase. Besides, due to the confidentiality of the concrete types/amounts of granting, only a portion of the labels is available in this application, so CS3G is trained in a semi-supervised style. Empirical investigations show the good generalization ability of CS3G on benchmark datasets, and a real assessment at a university also validates the power of our approach for tackling this type of problem.

【Keywords】: Scholarships; Resource management; Feature extraction; Training; Internet; Trajectory; Concrete

58. Generalized Independent Subspace Clustering.

Paper Link】 【Pages】:569-578

【Authors】: Wei Ye ; Samuel Maurus ; Nina Hubig ; Claudia Plant

【Abstract】: Data can encapsulate different object groupings in subspaces of arbitrary dimension and orientation. Finding such subspaces and the groupings within them is the goal of generalized subspace clustering. In this work we present a generalized subspace clustering technique capable of finding multiple non-redundant clusterings in arbitrarily-oriented subspaces. We use Independent Subspace Analysis (ISA) to find the subspace collection that minimizes the statistical dependency (redundancy) between clusterings. We then cluster in the arbitrarily-oriented subspaces identified by ISA. Our algorithm ISAAC (Independent Subspace Analysis and Clustering) uses the Minimum Description Length principle to automatically choose parameters that are otherwise difficult to set. We comprehensively demonstrate the effectiveness of our approach on synthetic and real-world data.

【Keywords】: Random variables; Clustering algorithms; Principal component analysis; Probability density function; Kernel; Data models; Redundancy

59. Matrix Profile III: The Matrix Profile Allows Visualization of Salient Subsequences in Massive Time Series.

Paper Link】 【Pages】:579-588

【Authors】: Chin-Chia Michael Yeh ; Helga Van Herle ; Eamonn J. Keogh

【Abstract】: Multidimensional Scaling (MDS) is one of the most versatile tools used for exploratory data mining. It allows a first glimpse of possible structure in the data, which can inform the choice of analyses used. Its uses are multiple. It can give the user an idea as to the clusterability or linear separability of the data. It can help spot outliers, or can hint at the intrinsic dimensionality of the data. Moreover, it can sometimes reveal unexpected latent dimensions in the data. With all these uses, MDS is increasingly used in areas as diverse as marketing, medicine, genetics, music and linguistics. One of the strengths of MDS is that it is essentially agnostic to data type, as we can use any distance measure to create the distance matrix, which is the only required input to the MDS algorithm. In spite of this generality, we make the following claim: MDS is not (well) defined for an increasingly important data type, time series subsequences. In this work we explain why this is the case, and we propose a scalable solution. We demonstrate the utility of our ideas on several diverse real-world datasets. At the core of our approach is a novel Minimum Description Length (MDL) subsequence extraction algorithm. Beyond MDS visualization, this subsequence extraction subroutine may be a useful tool in its own right.

【Keywords】: Time series analysis; Heart rate variability; Clustering algorithms; Data mining; Data visualization; Algorithm design and analysis; Heart beat

60. Random Walk with Restart over Dynamic Graphs.

Paper Link】 【Pages】:589-598

【Authors】: Weiren Yu ; Julie A. McCann

【Abstract】: Random Walk with Restart (RWR) is an appealing measure of proximity between nodes based on graph structures. Since real graphs are often large and subject to minor changes, it is prohibitively expensive to recompute proximities from scratch. Previous methods use LU decomposition and degree reordering heuristics, entailing O(|V|³) time and O(|V|²) memory to compute all |V|² pairs of node proximities in a static graph. In this paper, a dynamic scheme to assess RWR proximities is proposed: (1) For a unit update, we characterize the changes to all-pairs proximities as the outer product of two vectors. We notice that the multiplication of an RWR matrix and its transition matrix, unlike traditional matrix multiplications, is commutative. This can greatly reduce the computation of all-pairs proximities from O(|V|³) to O(|Δ|) time for each update without loss of accuracy, where |Δ| (≪ |V|²) is the number of affected proximities. (2) To avoid O(|V|²) memory for all pairs of outputs, we also devise efficient partitioning techniques for our dynamic model, which can compute all pairs of proximities segment-wisely within O(l|V|) memory and O(⌈|V|/l⌉) I/O costs, where 1 ≤ l ≤ |V| is a user-controlled trade-off between memory and I/O costs. (3) For bulk updates, we also devise aggregation and hashing methods, which can further discard many unnecessary updates and handle chunks of unit updates simultaneously. Our experimental results on various datasets demonstrate that our methods can be 1-2 orders of magnitude faster than other competitors while securing scalability and exactness.
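
【Example】: A minimal numpy check of the outer-product characterization for a unit update, assuming the dense RWR matrix Q = (1 - c)(I - cW)⁻¹ with column-stochastic W and a rank-1 change W' = W + uvᵀ; by the Sherman-Morrison formula the change to Q is exactly one outer product.

    import numpy as np

    c = 0.15
    rng = np.random.default_rng(5)
    n = 6
    W = rng.random((n, n))
    W /= W.sum(axis=0)                       # column-stochastic transitions

    A_inv = np.linalg.inv(np.eye(n) - c * W)
    Q = (1 - c) * A_inv                      # all-pairs RWR proximities

    u = np.zeros(n); u[2] = 0.05             # a small unit change ...
    v = np.zeros(n); v[4] = 1.0              # ... touching column 4

    Au = A_inv @ u                           # Sherman-Morrison pieces
    vA = v @ A_inv
    dQ = (1 - c) * (c / (1 - c * (v @ Au))) * np.outer(Au, vA)

    Q_new = (1 - c) * np.linalg.inv(np.eye(n) - c * (W + np.outer(u, v)))
    print(np.allclose(Q + dQ, Q_new))        # True: the update is rank-1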

【Keywords】: Time measurement; Loss measurement; Symmetric matrices; Noise measurement; Computational modeling; Matrix decomposition; Matrix converters

61. Communities in Preference Networks: Refined Axioms and Beyond.

Paper Link】 【Pages】:599-608

【Authors】: Gang Zeng ; Yuyi Wang ; Juhua Pu ; Xingwu Liu ; Xiaoming Sun ; Jialin Zhang

【Abstract】: Borgs et al. [2016] investigated essential requirements for communities in preference networks. They defined six axioms on community functions, i.e., community detection rules. Though having elegant properties, the practicality of this axiom system is compromised by the intractability of checking two critical axioms, so no nontrivial consistent community function was reported in [Borgs et al., 2016]. By adapting the two axioms in a natural way, we propose two new axioms that are efficiently checkable. We show that most of the desirable properties of the original axiom system are preserved. More importantly, the new axioms provide a general approach to constructing consistent community functions. We further find a natural consistent community function that is also enumerable and samplable, answering an open problem in the literature.

【Keywords】: Lattices; Clustering algorithms; Laboratories; Software; Social network services; Context; Conferences

62. Homophily, Structure, and Content Augmented Network Representation Learning.

Paper Link】 【Pages】:609-618

【Authors】: Daokun Zhang ; Jie Yin ; Xingquan Zhu ; Chengqi Zhang

【Abstract】: Advances in social networking and communication technologies have witnessed an increasing number of applications where data is not only characterized by rich content information, but also connected with complex relationships representing social roles and dependencies between individuals. To enable knowledge discovery from such networked data, network representation learning (NRL) aims to learn vector representations for network nodes, such that off-the-shelf machine learning algorithms can be directly applied. To date, existing NRL methods either primarily focus on network structure or simply combine node content and topology for learning. We argue that in information networks, information mainly originates from three sources: (1) homophily, (2) topology structure, and (3) node content. Homophily refers to the social phenomenon whereby individuals sharing similar attributes (content) tend to be directly connected through local relational ties, while topology structure emphasizes global connections. To ensure effective network representation learning, we propose to combine the three information sources into one learning objective function, so that the interplay among the three parties is enforced by requiring the learned network representations (1) to be consistent with node content and topology structure, and (2) to follow the social homophily constraints in the learned space. Experiments on multi-class node classification demonstrate that the representations learned by the proposed method consistently outperform state-of-the-art NRL methods, especially for very sparsely labeled networks.

【Keywords】: Context; Network topology; Topology; Electronic mail; Social network services; Context modeling; Australia

63. Fixing the Convergence Problems in Parallel Asynchronous Dual Coordinate Descent.

Paper Link】 【Pages】:619-628

【Authors】: Huan Zhang ; Cho-Jui Hsieh

【Abstract】: Solving L2-regularized empirical risk minimization (e.g., linear SVMs and logistic regression) using multiple cores has become an important research topic. Among all the existing algorithms, Parallel ASynchronous Stochastic dual Co-Ordinate DEscent (PASSCoDe) demonstrates superior performance compared with other methods. Although PASSCoDe is fast when it converges, the algorithm has been observed to diverge in several cases, especially when a relatively large number of threads are used. This is mainly due to the delayed parameter access problem: the parameters used for the current update may be delayed and not the latest ones. In theory, the algorithm converges only when the delay is small enough, but in practice the delay depends on the underlying parallel computing environment and cannot be guaranteed. In this work, we propose a simple and computationally efficient way to fix the convergence problem of PASSCoDe. Instead of allowing all worker threads to conduct asynchronous updates wildly, we add periodic check points to the procedure, where all workers need to stop and refine the current solution at each check point. The resulting "semi-asynchronous" algorithm is guaranteed to converge for any problem even when PASSCoDe diverges, and for the cases where PASSCoDe converges the two have almost identical speed.

【Keywords】: Convergence; Instruction sets; Delays; Logistics; Risk management; Message systems; Support vector machines

64. HogWild++: A New Mechanism for Decentralized Asynchronous Stochastic Gradient Descent.

Paper Link】 【Pages】:629-638

【Authors】: Huan Zhang ; Cho-Jui Hsieh ; Venkatesh Akella

【Abstract】: Stochastic Gradient Descent (SGD) is a popular technique for solving large-scale machine learning problems. In order to parallelize SGD on multi-core machines, asynchronous SGD (Hogwild!) has been proposed, where each core updates a global model vector stored in a shared memory simultaneously, without using explicit locks. We show that the scalability of Hogwild! on modern multi-socket CPUs is severely limited, especially on NUMA (Non-Uniform Memory Access) system, due to the excessive cache invalidation requests and false sharing. In this paper we propose a novel decentralized asynchronous SGD algorithm called HogWild++ that overcomes these drawbacks and shows almost linear speedup on multi-socket NUMA systems. The main idea in HogWild++ is to replace the global model vector with a set of local model vectors that are shared by a cluster (a set of cores), keep them synchronized through a decentralized token-based protocol that minimizes remote memory access conflicts and ensures convergence. We present the design and experimental evaluation of HogWild++ on a variety of datasets and show that it outperforms state-of-the-art parallel SGD implementations in terms of efficiency and scalability.

【Keywords】: Scalability; Algorithm design and analysis; Sockets; Multicore processing; Instruction sets; Machine learning algorithms; Computational modeling

65. Predicting COPD Failure by Modeling Hazard in Longitudinal Clinical Data.

Paper Link】 【Pages】:639-648

【Authors】: Jianfei Zhang ; Shengrui Wang ; Josiane Courteau ; Lifei Chen ; Aurélien Bach ; Alain Vanasse

【Abstract】: Chronic obstructive pulmonary disease (COPD) accounts for the highest rate of hospital readmissions and is the third leading cause of death in Canada, the United States and worldwide. Predicting COPD failure provides a prognostic warning of death or readmission, and is crucial to early intervention and decision-making. The aim of this study is to perform COPD failure prediction on longitudinal data. To address the inappropriate estimation of Cox hazard in current approaches, we propose a new representation of hazard to capture the relationship between survival probability and time-varying risk factors in a concise but effective way. To optimize model parameters, we design and maximize a new joint likelihood that comprises two components used to estimate survival status separately for failure and censored patients. A regularized optimization is performed on the joint likelihood to prevent overfitting arising from model learning. Our approach is applied to a real-life COPD data set and outperforms the current state-of-the-art prediction models in terms of the survival AUC, concordance index and Brier score metrics, revealing the great promise of our approach for clinical prediction.

【Keywords】: Hazards; Predictive models; Analytical models; Data models; Mathematical model; Hospitals; Indexes

66. Reliable Gender Prediction Based on Users' Video Viewing Behavior.

Paper Link】 【Pages】:649-658

【Authors】: Jie Zhang ; Kuang Du ; Ruihua Cheng ; Zhi Wei ; Chenguang Qin ; Huaxin You ; Sha Hu

【Abstract】: With the growth of the digital advertising market, it has become more important than ever to target the desired audiences. Among various demographic traits, gender information plays a key role in precisely targeting the potential consumers in online advertising and ecommerce. However, such personal information is generally unavailable to digital media sellers. In this paper, we propose a novel task-specific multi-task learning algorithm to predict users' gender information from their video viewing behaviors. To detect as many desired users as possible, while controlling the Type I error rate at a user-specified level, we further propose Bayes testing and decision procedures to efficiently identify male and female users, respectively. Comprehensive experiments have justified the effectiveness and reliability of our framework.

【Keywords】: Streaming media; Advertising; Media; TV; Data models; Reliability; Motion pictures

67. Probabilistic-Mismatch Anomaly Detection: Do One's Medications Match with the Diagnoses.

Paper Link】 【Pages】:659-668

【Authors】: Lingxiao Zhang ; Xiang Li ; Haifeng Liu ; Jing Mei ; Gang Hu ; Junfeng Zhao ; Yanzhen Zou ; Bing Xie ; Guotong Xie

【Abstract】: Anomaly detection in healthcare data like patient records is no trivial task. The anomalies in these datasets are often caused by mismatches between different types of feature, e.g., medications that do not match with the diagnoses. Existing anomaly detection methods do not perform well when detecting "mismatches" between multiple types of feature, especially when the feature space is high-dimensional and sparse. This paper introduces a novel anomaly detection paradigm: Probabilistic-Mismatch Anomaly Detection (PMAD), which detects mismatches between features by modeling a normal instance with a common latent probability distribution that governs the generation of all types of feature. Under this paradigm, the target of anomaly detection is to find instances with dissimilar latent distributions. We further propose Topical PMAD based on an extended Latent Dirichlet Allocation (LDA) model, which is able to capture the latent relationship between features in a high-dimensional space. Experiments on both synthetic data and real-world patient records show that Topical PMAD can effectively detect anomalies with mismatched features, and is highly robust against high-dimensional data as well as inaccurate model selection. The real-world anomalies detected on a patient record dataset show a promising application prospect.
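
【Example】: A minimal sketch of the mismatch-scoring idea: infer one latent (topic) distribution per feature type and flag records whose distributions diverge. Symmetric KL divergence is used here for concreteness; the paper infers the distributions with an extended LDA model, which is not shown.

    import numpy as np

    def sym_kl(p, q, eps=1e-12):
        """Symmetric KL divergence between two discrete distributions."""
        p = np.asarray(p, float) + eps; p /= p.sum()
        q = np.asarray(q, float) + eps; q /= q.sum()
        return float((p * np.log(p / q)).sum() + (q * np.log(q / p)).sum())

    # Hypothetical topic distributions inferred from one patient's record.
    diag_topics = [0.70, 0.20, 0.10]    # diagnoses: mostly topic 0
    meds_match = [0.65, 0.25, 0.10]     # medications agreeing with diagnoses
    meds_oddity = [0.05, 0.15, 0.80]    # medications pointing elsewhere

    print(sym_kl(diag_topics, meds_match))   # small -> consistent record
    print(sym_kl(diag_topics, meds_oddity))  # large -> probable mismatch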

【Keywords】: Feature extraction; Solid modeling; Context; Medical diagnostic imaging; Medical services; Data models; Context modeling

68. Inferring Latent Network from Cascade Data for Dynamic Social Recommendation.

Paper Link】 【Pages】:669-678

【Authors】: Qin Zhang ; Jia Wu ; Peng Zhang ; Guodong Long ; Ivor W. Tsang ; Chengqi Zhang

【Abstract】: Social recommendation explores social information to improve the quality of a recommender system. It can be further divided into explicit and implicit social network recommendation. The former assumes the existence of explicit social connections between users in addition to the rating data. The latter assumes the availability of only the ratings, not the social connections, since explicit social information may not be available and, when it is, usually consists of binary decision values (e.g., whether two people are friends) that miss the strength of the relationships. Most works in this field use only rating data to infer the latent social networks, ignoring the dynamic nature of users: user preferences drift distinctly over time. To this end, we propose a new Implicit Dynamic Social Recommendation (IDSR) model, which infers the latent social network from cascade data. By mining the cascade data it fully exploits the temporal information, and it identifies dynamic changes in users by using the latest updated social network to make recommendations. Experiments and comparisons on three real-world datasets show that the proposed model outperforms state-of-the-art solutions in both explicit and implicit scenarios.

【Keywords】: Social network services; Data models; Recommender systems; Predictive models; Data mining; Probabilistic logic; Optimization

69. Group Preference Aggregation: A Nash Equilibrium Approach.

Paper Link】 【Pages】:679-688

【Authors】: Hongke Zhao ; Qi Liu ; Yong Ge ; Ruoyan Kong ; Enhong Chen

【Abstract】: Group-oriented services such as group recommendations aim to provide services for a group of users. For these applications, how to aggregate the preferences of different group members is the toughest yet most important problem. Inspired by game theory, in this paper, we propose to explore the idea of Nash equilibrium to simulate the selections of members in a group by a game process. Along this line, we first compute the preferences (group-dependent optimal selections) of each individual member in a given group scene, i.e., an equilibrium solution of this group, with the help of two pruning approaches. Then, to get the aggregated unitary preference of each group from all group members, we design a matrix factorization-based method which aggregates the preferences in latent space and estimates the final group preference in rating space. After obtaining the group preference, group-oriented services (e.g., group recommendation) can be directly provided. Finally, we conduct extensive experiments on two real-world data sets from multiple aspects. The results clearly demonstrate the effectiveness of our method.

【Keywords】: Nash equilibrium; Games; Aggregates; Computer science; Prototypes; Computational modeling

70. Multi-resolution Spatial Event Forecasting in Social Media.

Paper Link】 【Pages】:689-698

【Authors】: Liang Zhao ; Feng Chen ; Chang-Tien Lu ; Naren Ramakrishnan

【Abstract】: Social media has become a significant surrogate for spatial event forecasting. The accuracy and discernibility of a spatial event forecasting model are two key concerns, which respectively determine how accurate and how detailed the model's predictions can be. Existing work pays most attention to accuracy alone, seldom considering accuracy and discernibility simultaneously, because doing so would require a considerably more sophisticated model while still suffering from several challenges: 1) the precise formulation of the trade-off between accuracy and discernibility, 2) the scarcity of social media data with a high spatial resolution, and 3) the characterization of spatial correlation and heterogeneity. This paper proposes a novel feature learning model that concurrently addresses all the above challenges by formulating prediction tasks for different locations with different spatial resolutions, allowing the heterogeneous relationships among the tasks to be characterized. This characterization is then integrated into our new model based on multitask learning, whose parameters are optimized by our proposed algorithm based on the Alternating Direction Method of Multipliers (ADMM). Extensive experimental evaluations on 11 datasets from different domains demonstrate the effectiveness of our proposed approach.

【Keywords】: Spatial resolution; Forecasting; Predictive models; Urban areas; Sensors; Twitter

71. To be or Not to be Friends: Exploiting Social Ties for Venture Investments.

Paper Link】 【Pages】:699-708

【Authors】: Hao Zhong ; Chuanren Liu ; Xinjiang Lu ; Hui Xiong

【Abstract】: Recent years have witnessed the boom of the venture capital industry. Venture capitalists can attain great financial rewards if their invested companies exit successfully, via being acquired or going IPO (Initial Public Offering). The literature has revealed that, from both financial and managerial perspectives, the decision-making process and success rates of venture capital (VC) investments can be greatly improved if the investors know the team members of target startups well. However, much less effort has been made toward understanding the impact of prominent social ties between the members of VC firms and start-up companies on investment decisions. To this end, we propose to study such social relationships and see how this information can contribute to foreseeing investment deals. We aim at providing analytical guidance for venture capitalists in choosing the right investment targets. Specifically, we develop a Social-Adjusted Probabilistic Matrix Factorization (PMF) model to exploit members' social connection information from VC firms and startups for investment recommendations. Unlike previous studies, we make use of the directed relationship between any pair of connected members from the two institutions and quantify the variety of social network groups. As a result, it brings in much more flexibility, and the modeling results inherently provide meaningful managerial implications for the operators of VC firms and startups. Finally, we evaluate our model on both synthetic and real-world data. The results demonstrate that our approach outperforms the baseline algorithms by a significant margin.

【Keywords】: Investment; Venture capital; Companies; Decision making; Predictive models; Twitter; Facebook
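
The Social-Adjusted PMF formulation itself is not given in the abstract. As a rough illustration of the general idea only, here is a minimal sketch of vanilla probabilistic matrix factorization with an added social-regularization pull; the inputs R (observed investor-startup ratings) and S (social ties) are hypothetical names, and this is not the authors' model.

import numpy as np

def social_pmf(R, S, k=8, lr=0.01, reg=0.1, soc=0.1, iters=100, seed=0):
    """Vanilla PMF with an additive social-regularization term (a sketch,
    not the paper's Social-Adjusted PMF). R: dict (i, j) -> rating for
    investor i and startup j; S: dict i -> list of socially tied investors."""
    rng = np.random.default_rng(seed)
    n = max(i for i, _ in R) + 1
    m = max(j for _, j in R) + 1
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(iters):
        for (i, j), r in R.items():
            e = r - U[i] @ V[j]          # rating residual
            U[i] += lr * (e * V[j] - reg * U[i])
            V[j] += lr * (e * U[i] - reg * V[j])
        for i, friends in S.items():
            if friends:                   # pull U[i] toward its social ties
                U[i] -= lr * soc * (U[i] - np.mean(U[friends], axis=0))
    return U, V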

72. Graph-Structured Sparse Optimization for Connected Subgraph Detection.

Paper Link】 【Pages】:709-718

【Authors】: Baojian Zhou ; Feng Chen

【Abstract】: Structured sparse optimization is an important and challenging problem for analyzing high-dimensional data in a variety of applications such as bioinformatics, medical imaging, social networks, and astronomy. Although a number of structured sparsity models have been explored, such as trees, groups, clusters, and paths, connected subgraphs have rarely been explored in the current literature. One of the main technical challenges is that there is no structured sparsity-inducing norm that can directly model the space of connected subgraphs, and there is no exact implementation of a projection oracle for connected subgraphs due to its NP-hardness. In this paper, we explore efficient approximate projection oracles for connected subgraphs, and propose two new efficient algorithms, namely, Graph-IHT and Graph-GHTP, to optimize a generic nonlinear objective function subject to a connectivity constraint on the support of the variables. Our proposed algorithms enjoy strong guarantees analogous to several current methods for sparsity-constrained optimization, such as Projected Gradient Descent (PGD), Approximate Model Iterative Hard Thresholding (AM-IHT), and Gradient Hard Thresholding Pursuit (GHTP), with respect to convergence rate and approximation accuracy. We apply our proposed algorithms to optimize several well-known graph scan statistics in several applications of connected subgraph detection as a case study, and the experimental results demonstrate that our proposed algorithms outperform state-of-the-art methods.

【Keywords】: Approximation algorithms; Algorithm design and analysis; Optimization; Silicon; Bioinformatics; Social network services; Mathematical model
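
For readers unfamiliar with the method family named above (PGD, AM-IHT, GHTP), a minimal sketch of plain iterative hard thresholding follows; per the abstract, Graph-IHT's key difference is replacing the simple top-k projection below with an approximate projection onto connected supports, which is the hard part the paper addresses.

import numpy as np

def iht(grad_f, n, k, lr=0.1, iters=200):
    """Plain iterative hard thresholding: gradient step, then keep only
    the k largest-magnitude coordinates."""
    x = np.zeros(n)
    for _ in range(iters):
        x = x - lr * grad_f(x)
        support = np.argsort(np.abs(x))[-k:]   # top-k support
        mask = np.zeros(n, dtype=bool)
        mask[support] = True
        x[~mask] = 0.0
    return x

# Usage on least squares f(x) = 0.5 * ||Ax - b||^2:
# A = np.random.randn(50, 100); b = A[:, :5] @ np.ones(5)
# x = iht(lambda x: A.T @ (A @ x - b), n=100, k=5)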

73. Bi-Level Rare Temporal Pattern Detection.

Paper Link】 【Pages】:719-728

【Authors】: Dawei Zhou ; Jingrui He ; Yu Cao ; Jae-sun Seo

【Abstract】: Nowadays, temporal data is generated at an unprecedented speed from a variety of applications, such as wearable devices, sensor networks, and wireless networks. In contrast to this large amount of temporal data, it is usually the case that only a small portion of it contains information of interest. For example, among the ECG signals collected by wearable devices, most of them, collected from healthy people, are normal, and only a small number, collected from people with certain heart diseases, are abnormal. Furthermore, even for the abnormal temporal sequences, the abnormal patterns may only be present in a few time segments and are similar among themselves, forming a rare category of temporal patterns. For example, the ECG signal collected from an individual with a certain heart disease may be normal in most time segments, and abnormal in only a few time segments, exhibiting similar patterns. What is even more challenging is that such rare temporal patterns are often non-separable from the normal ones. Existing works on outlier detection for temporal data focus on detecting either the abnormal sequences as a whole, or the abnormal time segments directly, ignoring the relationship between abnormal sequences and abnormal time segments. Moreover, the abnormal patterns are typically treated as isolated outliers instead of a rare category with self-similarity. In this paper, for the first time, we propose a bi-level (sequence-level/segment-level) model for rare temporal pattern detection. It is based on an optimization framework that fully exploits the bi-level structure in the data, i.e., the relationship between abnormal sequences and abnormal time segments. Furthermore, it uses sequence-specific simple hidden Markov models to obtain segment-level labels, and leverages the similarity among abnormal time segments to estimate the model parameters. To solve the optimization framework, we propose the unsupervised algorithm BIRAD, and also the semi-supervised version BIRAD-K which learns from a single labeled example. Experimental results on both synthetic and real data sets demonstrate the performance of the proposed algorithms from multiple aspects, outperforming state-of-the-art techniques on both temporal outlier detection and rare category analysis.

【Keywords】: Hidden Markov models; Data models; Electrocardiography; Optimization; Time series analysis; Heart; Diseases

74. A Bayesian Nonparametric Approach to Dynamic Dyadic Data Prediction.

Paper Link】 【Pages】:729-738

【Authors】: Fengyuan Zhu ; Guangyong Chen ; Pheng-Ann Heng

【Abstract】: An important issue in using matrix factorization for recommender systems is to capture the dynamics of user preference over time for more accurate prediction. We find that considering the existence of clusters among users with respect to the evolution behavior of their preferences can improve performance effectively. This is especially important for commercial recommender systems, where the evolution of preference for different users is heterogeneous, and historical ratings are not enough to estimate the preference of each user individually. Based on this, we propose a novel Bayesian nonparametric method, based on the Dirichlet process, to detect users sharing the same preference-evolution behavior. For each community, we use a vector autoregressive (VAR) model to capture the evolution and explore higher-order dependency on historical user preference, and incorporate this feature with a novel adaptive prior strategy. We also derive a variational inference approach for our method. Finally, we conduct extensive empirical experiments to show the advantage of our method over state-of-the-art algorithms.

【Keywords】: Reactive power; Bayes methods; Adaptation models; Kalman filters; Motion pictures; Heuristic algorithms; Gold

75. Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins.

Paper Link】 【Pages】:739-748

【Authors】: Yan Zhu ; Zachary Zimmerman ; Nader Shakibay Senobari ; Chin-Chia Michael Yeh ; Gareth Funning ; Abdullah Mueen ; Philip Brisk ; Eamonn J. Keogh

【Abstract】: Time series motifs have been in the literature for about fifteen years, but have only recently begun to receive significant attention in the research community. This is perhaps due to the growing realization that they implicitly offer solutions to a host of time series problems, including rule discovery, anomaly detection, density estimation, semantic segmentation, etc. Recent work has improved the scalability to the point where exact motifs can be computed on datasets with up to a million data points in tenable time. However, in some domains, for example seismology, there is an insatiable need to address even larger datasets. In this work we show that a combination of a novel algorithm and a high-performance GPU allows us to significantly improve the scalability of motif discovery. We demonstrate the scalability of our ideas by finding the full set of exact motifs on a dataset with one hundred million subsequences, by far the largest dataset ever mined for time series motifs. Furthermore, we demonstrate that our algorithm can produce actionable insights in seismology and other domains.

【Keywords】: Time series analysis; Earthquakes; Seismology; Scalability; Data mining; Graphics processing units; Indexes
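
As context for the scalability claim, the self-join matrix profile (the object computed at hundred-million-subsequence scale above) can be defined by a brute-force sketch; this naive version is O(n^2 m) and is meant only to pin down the definition, not to reflect the paper's GPU algorithm.

import numpy as np

def matrix_profile_naive(ts, m):
    """Brute-force self-join matrix profile: for each length-m subsequence,
    the z-normalized Euclidean distance to its nearest non-trivial match.
    Assumes no constant subsequences (nonzero standard deviation)."""
    n = len(ts) - m + 1
    subs = np.array([ts[i:i + m] for i in range(n)], dtype=float)
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    profile = np.full(n, np.inf)
    excl = m // 2  # exclusion zone to skip trivial self-matches
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - excl):i + excl + 1] = np.inf
        profile[i] = d.min()
    return profile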

76. Measuring Patient Similarities via a Deep Architecture with Medical Concept Embedding.

Paper Link】 【Pages】:749-758

【Authors】: Zihao Zhu ; Changchang Yin ; Buyue Qian ; Yu Cheng ; Jishang Wei ; Fei Wang

【Abstract】: Evaluating the clinical similarities between pairwise patients is a fundamental problem in healthcare informatics. A proper patient similarity measure enables various downstream applications, such as cohort studies and treatment comparative effectiveness research. One major carrier for conducting patient similarity research is Electronic Health Records (EHRs), which are usually heterogeneous, longitudinal, and sparse. Though existing studies on learning patient similarity from EHRs have proven useful in solving real clinical problems, their applicability is limited due to the lack of medical interpretations. Moreover, most previous methods assume a vector-based representation for patients, which typically requires aggregation of medical events over a certain time period. As a consequence, the temporal information is lost. In this paper, we propose a patient similarity evaluation framework based on temporal matching of longitudinal patient EHRs. Two efficient methods are presented, one unsupervised and one supervised, both of which preserve the temporal properties in EHRs. The supervised scheme takes a convolutional neural network architecture and learns an optimal representation of patient clinical records with medical concept embedding. The empirical results on real-world clinical data demonstrate substantial improvement over the baselines.

【Keywords】: Medical diagnostic imaging; Context; Diseases; Neural networks; Natural language processing

77. ConTrack: A Scalable Method for Tracking Multiple Concepts in Large Scale Multidimensional Data.

Paper Link】 【Pages】:759-768

【Authors】: Ali Zonoozi ; Qirong Ho ; Shonali Krishnaswamy ; Gao Cong

【Abstract】: In industrial domains such as finance, telecommunications, the internet, and sensor monitoring, large volumes of unlabeled temporal data are continuously generated, such as financial transactions, sensor measurements and user activities. From a data analysis standpoint, there is significant utility to be gained by detecting and understanding changes in the data, such as physical activity recognition and content consumption behavior, or anomalies and faults in robots and sensors. However, because the data is unlabeled, it is challenging to visualize and understand in a way that produces interpretable insights; furthermore, the large volume of data imposes a scalability requirement. In the concept drift and stream mining literature, existing methods may focus on one or two, but rarely all three, of the aforementioned aspects: unlabeled data, interpretable output, and scalability. Addressing this need, we propose ConTrack, an unsupervised method that tracks multiple evolving concepts in temporal data, and which is parallelized over a cluster of machines. To enhance interpretability, our method structures its output at a per-user (or actor) level, where users subscribe to one or more evolving concepts. Our method applies to problem settings (multiple concepts, unsupervised data, temporal data, user-oriented data) that cannot be handled by existing concept drift and stream mining methods, and outperforms popular unsupervised baselines from the wider Data Mining and Machine Learning literature.

【Keywords】: Data mining; Legged locomotion; Data models; Activity recognition; Algorithm design and analysis; Robot sensing systems; Scalability

Short Papers 101

78. Fractality of Massive Graphs: Scalable Analysis with Sketch-Based Box-Covering Algorithm.

Paper Link】 【Pages】:769-774

【Authors】: Takuya Akiba ; Kenko Nakamura ; Taro Takaguchi

【Abstract】: Analysis and modeling of networked objects are fundamental pieces of modern data mining. Most real-world networks, from biological to social ones, are known to have common structural properties. These properties allow us to model the growth processes of networks and to develop useful algorithms. One remarkable example is the fractality of networks, which suggests the self-similar organization of global network structure. To determine the fractality of a network, we need to solve the so-called box-covering problem, where preceding algorithms are not feasible for large-scale networks. The lack of an efficient algorithm prevents us from investigating the fractal nature of large-scale networks. To overcome this issue, we propose a new box-covering algorithm based on recently emerging sketching techniques. We theoretically show that it works in near-linear time with a guarantee of solution accuracy. In experiments, we have confirmed that the algorithm enables us to study the fractality of million-scale networks for the first time. We have observed that its outputs are sufficiently accurate and that its time and space requirements are orders of magnitude smaller than those of previous algorithms.

【Keywords】: Fractals; Algorithm design and analysis; Approximation algorithms; Greedy algorithms; Data mining; Biological system modeling
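
For orientation, the box-covering quantity N(r) that the paper accelerates can be approximated with a simple random ball-covering heuristic such as the sketch below (adjacency assumed as a dict of neighbor sets); this is a baseline-style heuristic under assumed inputs, not the paper's sketch-based algorithm. Fractality shows up as N(r) scaling like a power law in r over a range of radii.

import random
from collections import deque

def ball_cover_count(adj, radius, seed=0):
    """Greedy random ball covering: repeatedly pick an uncovered node and
    cover everything within `radius` hops of it. Returns the box count."""
    rng = random.Random(seed)
    uncovered = set(adj)
    nodes = list(adj)
    rng.shuffle(nodes)
    boxes = 0
    for center in nodes:
        if center not in uncovered:
            continue
        seen, frontier = {center}, deque([(center, 0)])
        while frontier:                      # BFS out to `radius`
            v, d = frontier.popleft()
            if d == radius:
                continue
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
        uncovered -= seen
        boxes += 1
    return boxes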

79. Cut Tree Construction from Massive Graphs.

Paper Link】 【Pages】:775-780

【Authors】: Takuya Akiba ; Yoichi Iwata ; Yosuke Sameshima ; Naoto Mizuno ; Yosuke Yano

【Abstract】: The construction of cut trees (also known as Gomory-Hu trees) for a given graph enables the minimum-cut size of the original graph to be obtained for any pair of vertices. Cut trees are a powerful back-end for graph management and mining, as they support various procedures related to the minimum cut, maximum flow, and connectivity. However, the crucial drawback of cut trees is the computational cost of their construction. In theory, a cut tree is built by applying a maximum flow algorithm n times, where n is the number of vertices. Therefore, naive implementations of this approach result in cubic time complexity, which is obviously too slow for today's large-scale graphs. To address this issue, in the present study, we propose a new cut-tree construction algorithm tailored to real-world networks. Using a series of experiments, we demonstrate that the proposed algorithm is several orders of magnitude faster than previous algorithms and that it can construct cut trees for billion-scale graphs.

【Keywords】: Time complexity; Informatics; Computational efficiency; Heuristic algorithms; Conferences; Data mining; Graph theory
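
The cubic baseline mentioned above is the textbook Gusfield construction, which makes n-1 max-flow calls; a sketch using networkx (which also ships a gomory_hu_tree routine of its own) follows, assuming an undirected graph with 'capacity' edge attributes. It is the slow reference point, not the paper's algorithm.

import networkx as nx

def gusfield_cut_tree(G):
    """Textbook Gusfield cut-tree construction via n-1 min-cut calls."""
    nodes = list(G.nodes)
    parent = {v: nodes[0] for v in nodes}
    T = nx.Graph()
    T.add_nodes_from(nodes)
    for i, s in enumerate(nodes[1:], start=1):
        t = parent[s]
        cut_value, (S, _) = nx.minimum_cut(G, s, t, capacity="capacity")
        for v in nodes[i + 1:]:          # re-parent nodes on s's side
            if v in S and parent[v] == t:
                parent[v] = s
        T.add_edge(s, t, weight=cut_value)
    return T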

80. Learning from Your Network of Friends: A Trajectory Representation Learning Model Based on Online Social Ties.

Paper Link】 【Pages】:781-786

【Authors】: Basma Alharbi ; Xiangliang Zhang

【Abstract】: Location-Based Social Networks (LBSNs) capture individuals' whereabouts for a large portion of the population. To utilize this data for user (location)-similarity based tasks, one must map the raw data into a low-dimensional uniform feature space. However, due to the nature of LBSNs, many users have sparse and incomplete check-ins. In this work, we propose to overcome this issue by leveraging the network of friends when learning the new feature space. We first analyze the impact of friends on individuals' mobility, and show that individuals' trajectories are correlated with those of their friends and friends of friends (2-hop friends) in an online setting. Based on our observation, we propose a mixed-membership model that infers global mobility patterns from users' check-ins and their network of friends, without impairing the model's complexity. Our proposed model infers global patterns and learns new representations for both users and locations simultaneously. We evaluate the inferred patterns and compare the quality of the new user representation against baseline methods on a social link prediction problem.

【Keywords】: Trajectory; Data models; Social network services; Complexity theory; Semantics; Labeling; Sparse matrices

81. A Combinatorial Approach to Role Discovery.

Paper Link】 【Pages】:787-792

【Authors】: Albert Arockiasamy ; Aristides Gionis ; Nikolaj Tatti

【Abstract】: We provide a new formulation for the problem of role discovery in graphs. Our definition is structural: two vertices should be assigned to the same role if the roles of their neighbors, when viewed as multi-sets, are similar enough. An attractive characteristic of our approach is that it is based on optimizing a well-defined objective function, and thus, contrary to previous approaches, the role-discovery task can be studied with the tools of combinatorial optimization. We demonstrate that, when fixing the number of roles to be used, the proposed role-discovery problem is NP-hard, while another (seemingly easier) version of the problem is NP-hard to approximate. On the positive side, despite the recursive nature of our objective function, we can show that finding a perfect (zero-cost) role assignment with the minimum number of roles can be solved in polynomial time. We do this by connecting the zero-cost role assignment with the notion of equitable partition. For the more practical version of the problem with a fixed number of roles, we present two natural heuristic methods, and discuss how to make them scalable to large graphs.

【Keywords】: Linear programming; Clustering algorithms; Standards; Heuristic algorithms; Cost function; Vehicle dynamics

82. DESQ: Frequent Sequence Mining with Subsequence Constraints.

Paper Link】 【Pages】:793-798

【Authors】: Kaustubh Beedkar ; Rainer Gemulla

【Abstract】: Frequent sequence mining methods often make use of constraints to control which subsequences should be mined, e.g., length, gap, span, regular-expression, and hierarchy constraints. We show that many subsequence constraints, including and beyond those considered in the literature, can be unified in a single framework. In more detail, we propose a set of simple and intuitive "pattern expressions" to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. A unified treatment allows researchers to study jointly many types of subsequence constraints (instead of each one individually) and helps to improve the usability of pattern mining systems for practitioners.

【Keywords】: Databases; Data mining; Context; Computational modeling; Electronic mail; Usability; Analytical models

83. EXTRACT: Strong Examples from Weakly-Labeled Sensor Data.

Paper Link】 【Pages】:799-804

【Authors】: Davis W. Blalock ; John V. Guttag

【Abstract】: Thanks to the rise of wearable and connected devices, sensor-generated time series comprise a large and growing fraction of the world's data. Unfortunately, extracting value from this data can be challenging, since sensors report low-level signals (e.g., acceleration), not the high-level events that are typically of interest (e.g., gestures). We introduce a technique to bridge this gap by automatically extracting examples of real-world events in low-level data, given only a rough estimate of when these events have taken place. By identifying sets of features that repeat in the same temporal arrangement, we isolate examples of such diverse events as human actions, power consumption patterns, and spoken words with up to 96% precision and recall. Our method is fast enough to run in real time and assumes only minimal knowledge of which variables are relevant or the lengths of events. Our evaluation uses numerous publicly available datasets and over 1 million samples of manually labeled sensor data.

【Keywords】: Time series analysis; Shape; Data mining; Feature extraction; Sparse matrices; Transforms; Acceleration

84. A Theoretical Analysis of the Fuzzy K-Means Problem.

Paper Link】 【Pages】:805-810

【Authors】: Johannes Blömer ; Sascha Brauer ; Kathrin Bujna

【Abstract】: One of the most popular fuzzy clustering techniques is the fuzzy K-means algorithm (also known as the fuzzy-c-means or FCM algorithm). In contrast to the K-means and K-median problems, the underlying fuzzy K-means problem has not been studied from a theoretical point of view. In particular, no algorithms with approximation guarantees similar to those of the famous K-means++ algorithm are known for the fuzzy K-means problem. This work initiates the study of the fuzzy K-means problem from an algorithmic and complexity theoretic perspective. We show that optimal solutions for the fuzzy K-means problem cannot, in general, be expressed by radicals over the input points. Surprisingly, this already holds for simple inputs in one-dimensional space. Hence, one cannot expect to compute optimal solutions exactly. We give the first (1+ε)-approximation algorithms for the fuzzy K-means problem. First, we present a deterministic approximation algorithm whose runtime is polynomial in N and linear in the dimension D of the input set, given that K is constant, i.e., a polynomial-time approximation scheme (PTAS) for fixed K. We achieve this result by showing that for each soft clustering there exists a hard clustering with similar properties. Second, by using techniques known from coreset constructions for the K-means problem, we develop a deterministic approximation algorithm that runs in time almost linear in N but exponential in the dimension D. We complement these results with a randomized algorithm which imposes some natural restrictions on the sought solution and whose runtime is comparable to some of the most efficient approximation algorithms for K-means, i.e., linear in the number of points and the dimension, but exponential in the number of clusters.

【Keywords】: Clustering algorithms; Approximation algorithms; Runtime; Frequency modulation; Complexity theory; Data mining; Algorithm design and analysis
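
For reference, the fuzzy K-means objective analyzed here is, in its textbook form (fuzzifier m > 1),

\[ J_m(U, C) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik}^{m} \, \lVert x_i - c_k \rVert^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} u_{ik} = 1, \; u_{ik} \ge 0, \]

with the standard alternating updates u_{ik} = ( \sum_{l} ( \lVert x_i - c_k \rVert / \lVert x_i - c_l \rVert )^{2/(m-1)} )^{-1} and c_k = \sum_i u_{ik}^{m} x_i / \sum_i u_{ik}^{m}. The non-expressibility-by-radicals result above is one reason such iterations can at best approximate the optimum.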

85. Efficient Sampling-Based Kernel Mean Matching.

Paper Link】 【Pages】:811-816

【Authors】: Swarup Chandra ; Ahsanul Haque ; Latifur Khan ; Charu Aggarwal

【Abstract】: Many real-world applications exhibit scenarios where the distributions represented by training and test data are not similar, but related by a covariate shift, i.e., having equal class conditional distributions with unequal covariate distributions. Traditional data mining techniques struggle to learn a good predictive model in the presence of covariate shift. Recent studies have proposed approaches to address this challenge by weighting training instances based on the density ratio between the test and training data distributions. Kernel Mean Matching (KMM) is a well known method for estimating the density ratio, but has time complexity cubic in the size of the training data. Therefore, KMM is not suitable in real-world applications, especially in cases where the predictive model needs to be updated periodically with large training data. We address this challenge by taking fixed-size samples from the training and test data, performing independent computations on these samples, and combining the results to obtain overall density ratio estimates. Our empirical evaluation demonstrates a large gain in execution time, while also achieving competitive accuracy on numerous benchmark datasets.

【Keywords】: Training; Training data; Kernel; Time complexity; Scalability; Memory management; Data mining
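
The KMM step whose cubic cost motivates the paper is, in its textbook form, the quadratic program

\[ \min_{\beta \in [0, B]^{n}} \Bigl\lVert \frac{1}{n} \sum_{i=1}^{n} \beta_i \, \phi(x_i^{\mathrm{tr}}) - \frac{1}{m} \sum_{j=1}^{m} \phi(x_j^{\mathrm{te}}) \Bigr\rVert_{\mathcal{H}}^{2}, \]

which expands to a quadratic form in the n-by-n kernel matrix of the training points, hence the cubic solve. The remedy described above is to solve this on fixed-size samples independently and then combine the per-sample density-ratio estimates.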

86. DeBot: Twitter Bot Detection via Warped Correlation.

Paper Link】 【Pages】:817-822

【Authors】: Nikan Chavoshi ; Hossein Hamooni ; Abdullah Mueen

【Abstract】: We develop a warped correlation finder to identify correlated user accounts in social media websites such as Twitter. The key observation is that humans cannot be highly synchronous for a long duration; thus, highly synchronous user accounts are most likely bots. Existing bot detection methods are mostly supervised, which requires a large amount of labeled data to train, and do not consider cross-user features. In contrast, our bot detection system works on activity correlation without requiring labeled data. We develop a novel lag-sensitive hashing technique to cluster user accounts into correlated sets in near real-time. Our method, named DeBot, detects thousands of bots per day with 94% precision and generates reports online every day. As of September 2016, DeBot had accumulated about 544,868 unique bots over the preceding year. We compare our detection technique with per-user techniques and with Twitter's suspension system. We observe that some bots can avoid Twitter's suspension mechanism and remain active for months, and, more alarmingly, we show that DeBot detects bots at a rate higher than the rate at which Twitter is suspending them.

【Keywords】: Conferences; Data mining
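
A much-simplified stand-in for the warped-correlation signal is the maximum Pearson correlation over a small window of lags between two users' activity series; DeBot's actual system uses warping plus lag-sensitive hashing to find such pairs at scale, which this sketch does not attempt.

import numpy as np

def max_lagged_corr(a, b, max_lag=5):
    """Best Pearson correlation between two equal-length activity series
    over integer lags in [-max_lag, max_lag]."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        x = a[max(0, lag):len(a) + min(0, lag)]
        y = b[max(0, -lag):len(b) + min(0, -lag)]
        if len(x) > 1:
            best = max(best, np.corrcoef(x, y)[0, 1])
    return best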

87. Interpretable Clustering via Discriminative Rectangle Mixture Model.

Paper Link】 【Pages】:823-828

【Authors】: Junxiang Chen ; Yale Chang ; Brian Hobbs ; Peter J. Castaldi ; Michael H. Cho ; Edwin K. Silverman ; Jennifer G. Dy

【Abstract】: Clustering is a technique that is usually applied as a tool for exploratory data analysis. Because of the exploratory nature of this task, it would be beneficial if a clustering method generated interpretable results and allowed incorporating domain knowledge. This motivates us to develop a probabilistic discriminative model that learns a rectangular decision rule for each cluster, which we call the Discriminative Rectangle Mixture (DReaM) model. DReaM gives interpretable clustering results, because the rectangular decision rules discovered explicitly illustrate how one cluster is defined and differs from other clusters. It also makes it easy to take advantage of existing rules, because we can choose informative prior distributions for the rectangular rules. Moreover, DReaM allows the features used for generating rules to differ from the features used for discovering cluster structure. We approximate the distribution of the rules discovered via variational inference. Experimental results demonstrate that DReaM gives more interpretable clustering results, and yet its performance is comparable to existing clustering methods when solving traditional clustering problems. Furthermore, in real applications, DReaM is able to effectively take advantage of domain knowledge and generate reasonable clustering results.

【Keywords】: Mathematical model; Probabilistic logic; Clustering methods; Genomics; Bioinformatics; Data models; Decision trees

88. Asymptotic Analysis of Equivalences and Core-Structures in Kronecker-Style Graph Models.

Paper Link】 【Pages】:829-834

【Authors】: Alex J. Chin ; Timothy D. Goodrich ; Michael P. O'Brien ; Felix Reidl ; Blair D. Sullivan ; Andrew van der Poel

【Abstract】: Growing interest in modeling large, complex networks has spurred significant research into generative graph models. Kronecker-style models (e.g., SKG and R-MAT) are often used due to their scalability and ability to mimic key properties of real-world networks. Although a few papers theoretically establish these models' behavior for specific parameters, many claims used to justify their use are supported only empirically. In this work, we prove several results using asymptotic analysis which illustrate that empirical studies may not fully capture the true behavior of the models. Paramount to the widespread adoption of Kronecker-style models was the introduction of a linear-time edge-sampling variant (R-MAT), which existing literature typically treats as interchangeable with SKG. We prove that although several R-MAT formulations are asymptotically equivalent, their behavior diverges from that of SKG. Further, we show these results are observable even at relatively small graph sizes. Second, we consider a case where asymptotic analysis reveals unexpected behavior within a given model.

【Keywords】: Analytical models; Computational modeling; Data models; Stochastic processes; Generators; Limiting; Sparse matrices
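
The linear-time edge-sampling variant discussed above (R-MAT) is easy to state concretely: each edge is drawn by recursively choosing one of the four adjacency-matrix quadrants. The sketch below uses commonly cited default quadrant probabilities, which are an assumption here, not the paper's settings.

import random

def rmat_edges(scale, n_edges, a=0.57, b=0.19, c=0.19, seed=0):
    """Classic R-MAT sampler on a 2^scale x 2^scale adjacency matrix:
    descend `scale` levels, picking a quadrant with probabilities
    (a, b, c, 1-a-b-c) at each level."""
    rng = random.Random(seed)
    n = 2 ** scale
    edges = []
    for _ in range(n_edges):
        row = col = 0
        step = n // 2
        while step >= 1:
            r = rng.random()
            if r < a:                # top-left quadrant
                pass
            elif r < a + b:          # top-right
                col += step
            elif r < a + b + c:      # bottom-left
                row += step
            else:                    # bottom-right
                row += step
                col += step
            step //= 2
        edges.append((row, col))
    return edges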

89. Event Grounding from Multimodal Social Network Fusion.

Paper Link】 【Pages】:835-840

【Authors】: Hyunsouk Cho ; Jinyoung Yeo ; Seung-won Hwang

【Abstract】: This paper studies the problem of extracting real-world event information from social media streams. Although existing work focuses on event signals of bursty mentions extracted from a single source of textual streams, these signals are likely to be noisy due to ambiguous occurrences of individual mentions. To extract accurate event signals, we propose a framework capable of "grounding" mentions to unique events using multiple social networks with complementary strengths. We show that our framework, jointly using multiple sources, outperforms the state of the art on publicly available datasets.

【Keywords】: Flickr; Twitter; Grounding; Lattices; Event detection; Metadata

90. Can Active Learning Experience Be Transferred?

Paper Link】 【Pages】:841-846

【Authors】: Hong-Min Chu ; Hsuan-Tien Lin

【Abstract】: Active learning is an important machine learning problem in reducing the human labeling effort. Current active learning strategies are designed from human knowledge, and are applied to each dataset in an immutable manner. In other words, experience about the usefulness of strategies cannot be updated and transferred to improve active learning on other datasets. This paper initiates a pioneering study on whether active learning experience can be transferred. We first propose a novel active learning model that linearly aggregates existing strategies. The linear weights can then be used to represent the active learning experience. We equip the model with the popular linear upper-confidence-bound (LinUCB) algorithm for contextual bandits to update the weights. Finally, we extend our model to transfer the experience across datasets with the technique of biased regularization. Empirical studies demonstrate that the learned experience not only is competitive with existing strategies on most single datasets, but also can be transferred across datasets to improve the performance on future learning tasks.

【Keywords】: Algorithm design and analysis; Labeling; Uncertainty; Context; Context modeling; Probabilistic logic; Joining processes
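
The LinUCB building block named above is standard; a minimal generic implementation is sketched below (not the paper's strategy-aggregation wrapper), where each arm would correspond to an active-learning strategy and x to its context features.

import numpy as np

class LinUCB:
    """Standard LinUCB arm: ridge-regression estimate plus an
    upper-confidence exploration bonus."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # Gram matrix (ridge prior)
        self.b = np.zeros(d)      # reward-weighted context sum
        self.alpha = alpha

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# At each round, pick the arm with the highest score(x), observe the
# reward, and call update(x, reward) on the chosen arm.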

91. Outlier Detection from Network Data with Subnetwork Interpretation.

Paper Link】 【Pages】:847-852

【Authors】: Xuan Hong Dang ; Arlei Silva ; Ambuj K. Singh ; Ananthram Swami ; Prithwish Basu

【Abstract】: Detecting a small number of outliers from a set of data observations is always challenging. This problem is more difficult in the setting of multiple network samples, where computing the anomalous degree of a network sample is generally not sufficient. In fact, explaining why a given network is exceptional, expressed in the form of a subnetwork, is equally important. We develop a novel algorithm to address these two key problems. We treat each network sample as a potential outlier and identify subnetworks that help discriminate it from nearby samples. The algorithm is developed in the framework of network regression combined with constraints on both network topology and L1-norm shrinkage to perform subnetwork discovery. Our method thus goes beyond subspace/subgraph discovery. We also show that the developed method converges to a global optimum. Empirical evaluation on various real-world network datasets demonstrates the advantages of our algorithm over various baseline methods.

【Keywords】: Network topology; Linear programming; Diseases; Support vector machines; Databases; Algorithm design and analysis; Topology

92. Incorporating Expert Feedback into Active Anomaly Discovery.

Paper Link】 【Pages】:853-858

【Authors】: Shubhomoy Das ; Weng-Keen Wong ; Thomas G. Dietterich ; Alan Fern ; Andrew Emmott

【Abstract】: Unsupervised anomaly detection algorithms search for outliers and then predict that these outliers are the anomalies. When deployed, however, these algorithms are often criticized for high false positive and high false negative rates. One cause of poor performance is that not all outliers are anomalies and not all anomalies are outliers. In this paper, we describe an Active Anomaly Discovery (AAD) method for incorporating expert feedback to adjust the anomaly detector so that the outliers it discovers are more in tune with the expert user's semantic understanding of the anomalies. The AAD approach is designed to operate in an interactive data exploration loop. In each iteration of this loop, our algorithm first selects a data instance to present to the expert as a potential anomaly and then the expert labels the instance as an anomaly or as a nominal data point. Our algorithm updates its internal model with the instance label and the loop continues until a budget of B queries is spent. The goal of our approach is to maximize the total number of true anomalies in the B instances presented to the expert. We show that when compared to other state-of-the-art algorithms, AAD is consistently one of the best performers.

【Keywords】: Mathematical model; Detectors; Algorithm design and analysis; Data models; Linear programming; Detection algorithms; Feature extraction

93. Spell: Streaming Parsing of System Event Logs.

Paper Link】 【Pages】:859-864

【Authors】: Min Du ; Feifei Li

【Abstract】: System event logs have been frequently used as a valuable resource in data-driven approaches to enhance system health and stability. A typical procedure in system log analytics is to first parse unstructured logs, and then apply data analysis on the resulting structured data. Previous work on parsing system event logs focused on offline, batch processing of raw log files. But increasingly, applications demand online monitoring and processing. We propose an online streaming method Spell, which utilizes a longest common subsequence based approach, to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in streaming fashion. Evaluation results on large real system logs demonstrate that even compared with the offline alternatives, Spell shows its superiority in terms of both efficiency and effectiveness.

【Keywords】: Printing; Data mining; Data analysis; Stability analysis; Monitoring; Conferences; Batch production systems
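
The core LCS-based matching idea can be sketched compactly. The threshold tau and the linear scan over templates below are simplifications under assumed defaults; the real Spell adds prefix-tree indexing and streaming template refinement.

def lcs_len(a, b):
    """Longest common subsequence length over token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def match_template(tokens, templates, tau=0.5):
    """Assign a tokenized log line to an existing message type if their
    LCS covers more than tau of the line; otherwise start a new type."""
    for t in templates:
        if lcs_len(tokens, t) > tau * len(tokens):
            return t
    templates.append(tokens)
    return tokens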

94. Modeling Time Lags in Citation Networks.

Paper Link】 【Pages】:865-870

【Authors】: Tao-Yang Fu ; Zhen Lei ; Wang-Chien Lee

【Abstract】: The extant work on network analyses has thus far paid little attention to the heterogeneity in time lags and the speed of information propagation along edges. In this paper, we study this novel problem, modeling the time dimension and lags on network edges, in the context of paper and patent citation networks, where the variation in the speed of knowledge flows between connected nodes is apparent. We propose to model time lags in knowledge diffusion in citation networks in one of two ways: deterministic lags and probabilistic lags. Then, we discuss two approaches for computationally working with time lags in edges of citation networks. Experimentally, we study two different applications to demonstrate the importance of the time dimension and lags in citations: (1) the HITS algorithm and (2) patent citation recommendation. We conduct experiments on millions of U.S. patent records and Web of Science (WOS) paper records. Our experiments show that incorporating the time dimension and lags on edges significantly improves network modeling and analyses.

【Keywords】: Probabilistic logic; Knowledge engineering; Patents; Art; Computational modeling; Analytical models; Aging

95. Efficient and Distributed Algorithms for Large-Scale Generalized Canonical Correlations Analysis.

Paper Link】 【Pages】:871-876

【Authors】: Xiao Fu ; Kejun Huang ; Evangelos E. Papalexakis ; Hyun Ah Song ; Partha Pratim Talukdar ; Nicholas D. Sidiropoulos ; Christos Faloutsos ; Tom M. Mitchell

【Abstract】: Generalized canonical correlation analysis (GCCA) aims at extracting common structure from multiple 'views', i.e., high-dimensional matrices representing the same objects in different feature domains, an extension of classical two-view CCA. Existing (G)CCA algorithms have serious scalability issues, since they involve square root factorization of the correlation matrices of the views. The memory and computational complexity associated with this step grow as a quadratic and cubic function of the problem dimension (the number of samples / features), respectively. To circumvent such difficulties, we propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed 100,000, while current approaches can only handle thousands of features / samples. Our second contribution is a distributed algorithm for GCCA, which computes the canonical components of different views in parallel and thus can further reduce the runtime significantly (by ≥ 30% in experiments) if multiple cores are available. Judiciously designed synthetic and real-data experiments using a multilingual dataset are employed to showcase the effectiveness of the proposed algorithms.

【Keywords】: Sparse matrices; Correlation; Distributed algorithms; Algorithm design and analysis; Feature extraction; Complexity theory; Approximation algorithms

96. Service Usage Analysis in Mobile Messaging Apps: A Multi-label Multi-view Perspective.

Paper Link】 【Pages】:877-882

【Authors】: Yanjie Fu ; Junming Liu ; Xiaolin Li ; Xinjiang Lu ; Jingci Ming ; Chu Guan ; Hui Xiong

【Abstract】: Service usage analysis, aiming at identifying customers' messaging behaviors based on encrypted App traffic flows, has become a challenging and emergent task for service providers. Prior literature usually starts by segmenting a traffic sequence into single-usage subsequences, and then classifying the subsequences into different usage types. However, such approaches can suffer from inaccurate traffic segmentation and mixed-usage subsequences. To address this challenge, we exploit a multi-label multi-view learning strategy and develop an enhanced framework for in-App usage analytics. Specifically, we first devise an enhanced traffic segmentation method to reduce mixed-usage subsequences. In addition, we develop a multi-label multi-view logistic classification method, which comprises two alignments. The first alignment makes use of the classification consistency between the packet-length view and the time-delay view of traffic subsequences to improve classification accuracy. The second alignment combines the classification of single-usage subsequences and the post-classification of mixed-usage subsequences into a unified multi-label logistic classification problem. Finally, we present extensive experiments with real-world datasets to demonstrate the effectiveness of our approach.

【Keywords】: Feature extraction; Mobile communication; Internet; Logistics; Data mining; Data collection; Clustering algorithms

97. A Semi-Supervised AUC Optimization Method with Generative Models.

Paper Link】 【Pages】:883-888

【Authors】: Akinori Fujino ; Naonori Ueda

【Abstract】: This paper presents a semi-supervised learning method for improving the performance of AUC-optimized classifiers by using both labeled and unlabeled samples. In actual binary classification tasks, there is often an imbalance between the numbers of positive and negative samples. For such imbalanced tasks, the area under the ROC curve (AUC) is an effective measure with which to evaluate binary classifiers. The proposed method utilizes generative models to assist the incorporation of unlabeled samples in AUC-optimized classifiers. The generative models provide prior knowledge that helps learn the distribution of unlabeled samples. To evaluate the proposed method in text classification, we employed naive Bayes models as the generative models. Our experimental results using three test collections confirmed that the proposed method provided better classifiers for imbalanced tasks than supervised AUC-optimized classifiers and semi-supervised classifiers trained to maximize the classification accuracy of labeled samples. Moreover, the proposed method improved the effect of using unlabeled samples for AUC optimization especially when we used appropriate generative models.

【Keywords】: Data models; Computational modeling; Optimization methods; Semisupervised learning; Parameter estimation; Probability distribution
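
For reference, the AUC criterion being optimized has the usual pairwise form over n_+ positive and n_- negative samples,

\[ \widehat{\mathrm{AUC}}(f) = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \mathbb{I}\bigl[ f(x_i^{+}) > f(x_j^{-}) \bigr], \]

so AUC-optimized training amounts to bipartite ranking rather than maximizing per-sample accuracy, which is why it is a natural fit for imbalanced tasks.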

98. MeGS: Partitioning Meaningful Subgraph Structures Using Minimum Description Length.

Paper Link】 【Pages】:889-894

【Authors】: Sebastian Goebl ; Annika Tonch ; Christian Böhm ; Claudia Plant

【Abstract】: How can we fully structure a graph into pieces of meaningful information, that is, into structures that provide us with insights and carry a meaning beyond simple clustering? How can we also exploit these patterns to compress the graph for fast transmission and easier storage? In many applications of graph analysis, like network analysis or medical information extraction, we are searching for special patterns. Here, it is not sufficient to extract only parts of the relevant information in a graph; we must understand the complete underlying structure. Therefore, we propose our algorithm MeGS (Partitioning Meaningful Subgraph Structures using Minimum Description Length) to fully understand how a graph is constructed. The most common primitives (clique, hub, tree, bipartite, and sparse) serve as models to split a graph into meaningful structures. Using the principle of Minimum Description Length (MDL), structure types and counts are determined by the best fitting model. These structures achieve the best compression of the adjacency matrix. As a result, every node is part of exactly one structure and has an interpretable context. No unknown areas remain in the graph. The more a model compresses its section of the graph, the stronger its match with the corresponding structural assumption. MeGS, a fast and parameter-free split-and-merge algorithm, automatically finds the optimal structures achieving the best compression. We compare against state-of-the-art algorithms to demonstrate MeGS' ability for interpretation and compression.

【Keywords】: Vegetation; Partitioning algorithms; Entropy; Channel coding; Data mining; Information retrieval

99. Probabilistic Formulations of Regression with Mixed Guidance.

Paper Link】 【Pages】:895-900

【Authors】: Aubrey Gress ; Ian Davidson

【Abstract】: Regression problems assume every instance is annotated (labeled) with a real value, a form of annotation we call strong guidance. In order for these annotations to be accurate, they must be the result of a precise experiment or measurement. However, in some cases additional weak guidance might be given by imprecise measurements, a domain expert or even crowd sourcing. Current formulations of regression are unable to use both types of guidance. We propose a regression framework that can also incorporate weak guidance based on relative orderings, bounds, neighboring and similarity relations. Consider learning to predict ages from portrait images: these new types of guidance allow weaker forms of annotation, such as stating that a person is in their 20s or that two people are similar in age. These types of annotations can be easier to generate than strong guidance. We introduce a probabilistic formulation for these forms of weak guidance and show that the resulting optimization problems are convex. Our experimental results show the benefits of these formulations on several data sets.

【Keywords】: Mathematical model; Logistics; Probabilistic logic; Standards; Optimization; Random variables; Maximum likelihood estimation

100. HLGPS: A Home Location Global Positioning System in Location-Based Social Networks.

Paper Link】 【Pages】:901-906

【Authors】: Yulong Gu ; Jiaxing Song ; Weidong Liu ; Lixin Zou

【Abstract】: The rapid spread of the mobile internet and location-acquisition technologies has led to the increasing popularity of Location-Based Social Networks (LBSNs). Users in LBSNs can share their lives by checking in at various venues at any time. In LBSNs, identifying the home locations of users is significant for effective location-based services like personalized search, targeted advertisement, local recommendation and so on. In this paper, we propose a Home Location Global Positioning System, called HLGPS, to tackle the home location identification problem in LBSNs. First, HLGPS uses an influence model named IME to model edges in LBSNs. Then HLGPS uses a global iteration algorithm based on the IME model to position the home locations of users so that the joint probability of generating all the edges in the LBSN is maximized. Extensive experiments on a large real-world LBSN dataset demonstrate that HLGPS significantly outperforms state-of-the-art methods by 14.7%.

【Keywords】: Mathematical model; Social network services; Data models; Urban areas; Atmospheric modeling; Global Positioning System; Data mining

101. Large-Scale Embedding Learning in Heterogeneous Event Data.

Paper Link】 【Pages】:907-912

【Authors】: Huan Gui ; Jialu Liu ; Fangbo Tao ; Meng Jiang ; Brandon Norick ; Jiawei Han

【Abstract】: Heterogeneous events, which are defined as events connecting strongly-typed objects, are ubiquitous in the real world. We propose a HyperEdge-Based Embedding (Hebe) framework for heterogeneous event data, where a hyperedge represents the interaction among a set of involved objects in an event. The Hebe framework models the proximity among objects in an event by predicting a target object given the other participating objects in the event (hyperedge). Since each hyperedge encapsulates more information on a given event, Hebe is robust to data sparseness. In addition, Hebe scales well as the data size grows. Extensive experiments on large-scale real-world datasets demonstrate the efficacy and robustness of Hebe.

【Keywords】: Context; Optimization; Predictive models; Business; Robustness; Context modeling; Data models

102. Direct Mining of Subjectively Interesting Relational Patterns.

Paper Link】 【Pages】:913-918

【Authors】: Tias Guns ; Achille Aknin ; Jefrey Lijffijt ; Tijl De Bie

【Abstract】: Data is typically complex and relational. Therefore, the development of relational data mining methods is an increasingly active topic of research. Recent work has resulted in new formalisations of patterns in relational data and in a way to quantify their interestingness in a subjective manner, taking into account the data analyst's prior beliefs about the data. Yet, a scalable algorithm to find the most interesting such patterns is lacking. We introduce a new algorithm based on two notions: (1) the use of Constraint Programming, which results in a notably shorter development time, faster runtimes, and more flexibility for extensions such as branch-and-bound search, and (2) the direct search for the most interesting patterns only, instead of exhaustive enumeration of patterns before ranking them. Through empirical evaluation, we find that our novel bounds yield speedups of up to several orders of magnitude, especially on dense data with a simple schema. This makes it possible to mine the most subjectively interesting relational patterns present in databases where this was previously impractical or impossible.

【Keywords】: Data mining; Motion pictures; Itemsets; Algorithm design and analysis; Programming; Relational databases

103. Semi-Supervised Multi-label Dimensionality Reduction.

Paper Link】 【Pages】:919-924

【Authors】: Baolin Guo ; Chenping Hou ; Feiping Nie ; Dongyun Yi

【Abstract】: Multi-label data with high dimensionality arise frequently in data mining and machine learning. It is not only time consuming but also computationally unreliable to use high-dimensional data directly. Supervised dimensionality reduction approaches are based on the assumption that there are large amounts of labeled data. It is infeasible to label a large number of training samples in practice, especially in multi-label learning. To address these challenges, we propose a novel algorithm, namely Semi-Supervised Multi-Label Dimensionality Reduction (SSMLDR), which can utilize the information from both labeled data and unlabeled data in an effective way. First, the proposed algorithm propagates the multi-label information from the labeled data to the unlabeled data through a specially designed label propagation method. It then learns a transformation matrix to perform dimensionality reduction by incorporating the enlarged multi-label information. Extensive experiments on a broad range of datasets validate the effectiveness of our approach against other well-established algorithms.

【Keywords】: Correlation; Algorithm design and analysis; Convergence; Eigenvalues and eigenfunctions; Training; Prediction algorithms; Data mining
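
The propagation step above is specially designed for the multi-label setting; as a generic reference point, the classic closed-form propagation (in the style of Zhou et al.) that spreads label information from labeled rows of Y to unlabeled ones can be sketched as follows.

import numpy as np

def propagate_labels(W, Y, alpha=0.9):
    """Closed-form label propagation F = (I - alpha*S)^(-1) Y with the
    symmetrically normalized affinity S. W: (n, n) affinity matrix;
    Y: (n, L) multi-label indicator matrix, zero rows for unlabeled
    points. This is a generic stand-in, not SSMLDR's exact method."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, Y)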

104. A Novel Uncertainty Sampling Algorithm for Cost-Sensitive Multiclass Active Learning.

Paper Link】 【Pages】:925-930

【Authors】: Kuan-Hao Huang ; Hsuan-Tien Lin

【Abstract】: Active learning is a setup that allows the learning algorithm to iteratively and strategically query the labels of some instances for reducing human labeling efforts. One fundamental strategy, called uncertainty sampling, measures the uncertainty of each instance when making querying decisions. Traditional active learning algorithms focus on binary or multiclass classification, but few works have studied active learning for cost-sensitive multiclass classification (CSMCC), which allows charging different costs for different types of misclassification errors. These few works generally calculate the uncertainty of each instance by probability estimation, and can suffer from inaccuracies in that estimation. In this paper, we propose a novel active learning algorithm that relies on a different way of calculating the uncertainty. The algorithm is based on our newly proposed cost embedding approach (CE) for CSMCC. CE embeds the cost information in the distance measure of a special hidden space with non-metric multidimensional scaling, and deals with both symmetric and asymmetric cost information by our carefully designed mirroring trick. The embedding allows the proposed algorithm, active learning with cost embedding (ALCE), to define a cost-sensitive uncertainty measure from the distance in the hidden space. Extensive experimental results demonstrate that ALCE selects more useful instances by taking the cost information into account through the embedding and is superior to existing cost-sensitive active learning algorithms.

【Keywords】: Uncertainty; Algorithm design and analysis; Measurement uncertainty; Estimation; Prediction algorithms; Symmetric matrices; Extraterrestrial measurements

105. SOAL: Second-Order Online Active Learning.

Paper Link】 【Pages】:931-936

【Authors】: Shuji Hao ; Peilin Zhao ; Jing Lu ; Steven C. H. Hoi ; Chunyan Miao ; Chi Zhang

【Abstract】: This paper investigates the problem of online active learning for training classification models from sequentially arriving data. This is more challenging than conventional online learning tasks since the learner not only needs to figure out how to effectively update the classifier but also needs to decide the best time to query the label of an incoming instance given a limited label budget. The existing online active learning approaches are often based on first-order online learning methods, which generally suffer from slow convergence rates and sub-optimal exploitation of available information when querying the labeled data. To overcome these limitations, in this paper, we present a new framework of Second-order Online Active Learning (SOAL), which fully exploits both first-order and second-order information to achieve high learning accuracy with low labeling cost. We conduct both theoretical analysis and extensive empirical studies to evaluate the proposed SOAL algorithm. The encouraging results show clear advantages of the proposed algorithm over a family of state-of-the-art online active learning algorithms.

【Keywords】: Computational modeling; Labeling; Machine learning algorithms; Training; Mathematical model; Learning systems; Algorithm design and analysis

106. Learning Compatibility Across Categories for Heterogeneous Item Recommendation.

Paper Link】 【Pages】:937-942

【Authors】: Ruining He ; Charles Packer ; Julian McAuley

【Abstract】: Identifying relationships between items is a key task of an online recommender system, in order to help users discover items that are functionally complementary or visually compatible. In domains like clothing recommendation, this task is particularly challenging since a successful system should be capable of handling a large corpus of items, a huge amount of relationships among them, as well as the high-dimensional and semantically complicated features involved. Furthermore, the human notion of "compatibility" we aim to capture goes beyond mere similarity: for two items to be compatible (whether jeans and a t-shirt, or a laptop and a charger) they should be similar in some ways, but systematically different in others. In this paper we propose a novel method, Monomer, to learn complicated and heterogeneous relationships between items in product recommendation settings. Recently, scalable methods have been developed that address this task by learning similarity metrics on top of the content of the products involved. Here our method relaxes the metricity assumption inherent in previous work and models multiple localized notions of 'relatedness,' so as to uncover ways in which related items should be systematically similar, and systematically different. Quantitatively, we show that our system achieves state-of-the-art performance on large-scale compatibility prediction tasks, especially in cases where there is substantial heterogeneity between related items.

【Keywords】: Visualization; Extraterrestrial measurements; Training; Computational modeling; Recommender systems; Data models

107. HNP3: A Hierarchical Nonparametric Point Process for Modeling Content Diffusion over Social Media.

Paper Link】 【Pages】:943-948

【Authors】: Seyyed Abbas Hosseini ; Ali Khodadadi ; Ali Arabzadeh ; Hamid R. Rabiee

【Abstract】: This paper introduces a novel framework for modeling temporal events with complex longitudinal dependency that are generated by dependent sources. This framework takes advantage of multidimensional point processes for modeling the timing of events. The intensity function of the proposed process is a mixture of intensities, and its complexity grows with the complexity of the temporal patterns of the data. Moreover, it utilizes a hierarchical dependent nonparametric approach to model the marks of events. These capabilities allow the proposed model to adapt its temporal and topical complexity according to the complexity of the data, which makes it a suitable candidate for real-world scenarios. An online inference algorithm is also proposed that makes the framework applicable to a vast range of applications. The framework is applied to a real-world application, modeling the diffusion of content over networks. Extensive experiments reveal the effectiveness of the proposed framework in comparison with state-of-the-art methods.

【Keywords】: Adaptation models; Data models; Complexity theory; Social network services; Inference algorithms; History; Computational modeling

108. Improved and Scalable Bradley-Terry Model for Collaborative Ranking.

Paper Link】 【Pages】:949-954

【Authors】: Jun Hu ; Ping Li

【Abstract】: In collaborative ranking, the Bradley-Terry (BT) model is widely used for modeling pairwise user preferences. However, when this model is combined with matrix factorization on sparsely observed ratings, a challenging identifiability issue arises since the optimization involves non-convex constraints. Moreover, in some situations, fitting the Bradley-Terry model poses a numerical challenge, as it may involve an objective function that is unbounded from below. In this paper, we discuss and develop a simple strategy to resolve these issues. Specifically, we propose an Improved-BT model by adding a penalty term, and we develop two parallel algorithms to make the Improved-BT model scalable. Through extensive experiments on benchmark datasets, we show that our proposed method outperforms the considered state-of-the-art collaborative ranking approaches in terms of both ranking performance and time efficiency.

【Keywords】: Numerical models; Collaboration; Linear programming; Sparse matrices; Computational modeling; Optimization; Predictive models
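
For reference, the Bradley-Terry likelihood with matrix-factorization scores, whose fitting the paper stabilizes, is

\[ \Pr(u \text{ prefers } i \text{ to } j) = \frac{\exp(p_u^{\top} q_i)}{\exp(p_u^{\top} q_i) + \exp(p_u^{\top} q_j)}, \]

where p_u and q_i are the user and item latent factors. Per the abstract, the Improved-BT idea is to add a penalty term so the resulting objective is bounded and identifiable; the exact penalty is specified in the paper, not here.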

109. Sparse Gaussian Markov Random Field Mixtures for Anomaly Detection.

Paper Link】 【Pages】:955-960

【Authors】: Tsuyoshi Idé ; Ankush Khandelwal ; Jayant Kalagnanam

【Abstract】: We propose a new approach to anomaly detection from multivariate noisy sensor data. We address two major challenges: To provide variable-wise diagnostic information and to automatically handle multiple operational modes. Our task is a practical extension of traditional outlier detection, which is to compute a single scalar for each sample. To consistently define the variable-wise anomaly score, we leverage a predictive conditional distribution. We then introduce a mixture of Gaussian Markov random field and its Bayesian inference, resulting in a sparse mixture of sparse graphical models. Our anomaly detection method is capable of automatically handling multiple operational modes while removing unwanted nuisance variables. We demonstrate the utility of our approach using real equipment data from the oil industry.

【Keywords】: Mathematical model; Data mining; Gaussian distribution; Markov random fields; Noise measurement; Bayes methods
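
The variable-wise score built from a predictive conditional distribution has a clean form for a zero-mean (an assumption here) Gaussian Markov random field with precision matrix Λ:

\[ p(x_i \mid \mathbf{x}_{-i}) = \mathcal{N}\Bigl(x_i \,\Bigm|\, -\tfrac{1}{\Lambda_{ii}} \sum_{j \neq i} \Lambda_{ij} x_j, \; \tfrac{1}{\Lambda_{ii}}\Bigr), \qquad a_i(\mathbf{x}) = -\log p(x_i \mid \mathbf{x}_{-i}), \]

which the paper extends to a sparse mixture of such models so that multiple operational modes are handled automatically.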

110. A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting.

Paper Link】 【Pages】:961-966

【Authors】: Muhammad Imran ; Sanjay Chawla ; Carlos Castillo

【Abstract】: An emerging challenge in the online classification of social media data streams is to keep the categories used for classification up-to-date. In this paper, we propose an innovative framework based on an Expert-Machine-Crowd (EMC) triad to help categorize items by continuously identifying novel concepts in heterogeneous data streams often riddled with outliers. We unify constrained clustering and outlier detection by formulating a novel optimization problem: COD-Means. We design an algorithm to solve the COD-Means problem and show that COD-Means will not only help detect novel categories but also seamlessly discover human annotation errors and improve the overall quality of the categorization process. Experiments on diverse real data sets demonstrate that our approach is both effective and efficient.

【Keywords】: Clustering algorithms; Algorithm design and analysis; Labeling; Social network services; Taxonomy; Electromagnetic compatibility; Optimization

111. Learning Deep Networks from Noisy Labels with Dropout Regularization.

Paper Link】 【Pages】:967-972

【Authors】: Ishan Jindal ; Matthew S. Nokleby ; Xuewen Chen

【Abstract】: Large datasets often have unreliable labels, such as those obtained from Amazon's Mechanical Turk or social media platforms, and classifiers trained on mislabeled datasets often exhibit poor performance. We present a simple, effective technique for accounting for label noise when training deep neural networks. We augment a standard deep network with a softmax layer that models the label noise statistics. Then, we train the deep network and noise model jointly via end-to-end stochastic gradient descent on the (possibly mislabeled) dataset. The augmented model is underdetermined, so in order to encourage the learning of a non-trivial noise model, we apply dropout regularization to the weights of the noise model during training. Numerical experiments on noisy versions of the CIFAR-10 and MNIST datasets show that the proposed dropout technique outperforms state-of-the-art methods.
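
A minimal PyTorch sketch of the idea, assuming a generic base network: the noise model is a row-stochastic transition matrix obtained by a softmax over learned weights, dropout is applied to those weights during training, and the whole stack is trained end to end on the noisy labels. Sizes and initialization are illustrative, not the authors' exact architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLabelWrapper(nn.Module):
    """Base network plus a softmax noise layer, trained jointly end to end."""
    def __init__(self, base, num_classes, p_drop=0.5):
        super().__init__()
        self.base = base
        # Initialized near the identity: "labels are mostly correct" prior.
        self.noise_weights = nn.Parameter(4.0 * torch.eye(num_classes))
        self.p_drop = p_drop

    def forward(self, x):
        clean = F.softmax(self.base(x), dim=1)           # P(true class | x)
        w = F.dropout(self.noise_weights, p=self.p_drop,
                      training=self.training)            # dropout on noise weights
        T = F.softmax(w, dim=1)                          # P(noisy label | true class)
        return clean @ T                                 # P(noisy label | x)

base = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
model = NoisyLabelWrapper(base, num_classes=10)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = F.nll_loss(torch.log(model(x) + 1e-8), y)         # fit the *noisy* labels
loss.backward()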

【Keywords】: Noise measurement; Training; Standards; Neural networks; Numerical models; Stochastic processes; Computational modeling

112. Personalized Ranking in Signed Networks Using Signed Random Walk with Restart.

Paper Link】 【Pages】:973-978

【Authors】: Jinhong Jung ; Woojeong Jin ; Lee Sael ; U. Kang

【Abstract】: How can we rank users in signed social networks? Relationships between nodes in a signed network are represented as positive (trust) or negative (distrust) edges. Many social networks have adopted signed networks to express trust between users. Consequently, ranking friends or enemies in signed networks has received much attention from the data mining community. The ranking problem, however, is challenging because it is difficult to interpret negative edges. Traditional random walk based methods such as PageRank and Random Walk with Restart cannot provide effective rankings in signed networks since they assume only positive edges. Although several methods have been proposed by modifying traditional ranking models, they also fail to produce proper rankings because they cannot model complex edge relations. In this paper, we propose Signed Random Walk with Restart (SRWR), a novel model for personalized ranking in signed networks. We introduce a signed random surfer who accounts for negative edges by flipping her sign as she walks. Our model provides proper rankings reflecting signed edges based on the signed surfer. Through extensive experiments, we demonstrate that SRWR achieves the best accuracy (up to 87%) for sign prediction, and predicts trolls 4× more accurately than other ranking models.
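
The sign-flipping surfer can be sketched with a power iteration that tracks two score vectors, one for the surfer in a positive state and one in a negative state; the toy graph, restart probability c, and the omission of SRWR's balance parameters are all simplifications:

import numpy as np

# Signed toy graph: separate adjacency matrices for + (trust) and - (distrust).
n = 4
A_pos = np.array([[0,1,0,0],[0,0,1,0],[0,0,0,1],[1,0,0,0]], float)
A_neg = np.array([[0,0,1,0],[0,0,0,1],[0,0,0,0],[0,1,0,0]], float)
out_deg = (A_pos + A_neg).sum(axis=1, keepdims=True)
out_deg[out_deg == 0] = 1.0
P_pos = (A_pos / out_deg).T            # walk along positive edges
P_neg = (A_neg / out_deg).T            # walk along negative edges

c, seed = 0.15, 0
q = np.zeros(n); q[seed] = 1.0
r_pos, r_neg = q.copy(), np.zeros(n)
for _ in range(100):
    # Crossing a negative edge flips the surfer's sign (neg x neg = pos).
    new_pos = (1 - c) * (P_pos @ r_pos + P_neg @ r_neg) + c * q
    new_neg = (1 - c) * (P_neg @ r_pos + P_pos @ r_neg)
    r_pos, r_neg = new_pos, new_neg

print(np.round(r_pos - r_neg, 3))      # signed relevance of each node to the seed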

【Keywords】: Computational modeling; Iterative methods; Nickel; Data mining; Social network services; Mathematical model; Computer science

113. ExploreKit: Automatic Feature Generation and Selection.

Paper Link】 【Pages】:979-984

【Authors】: Gilad Katz ; Eui Chul Richard Shin ; Dawn Song

【Abstract】: Feature generation is one of the challenging aspects of machine learning. We present ExploreKit, a framework for automated feature generation. ExploreKit generates a large set of candidate features by combining information in the original features, with the aim of maximizing predictive performance according to user-selected criteria. To overcome the exponential growth of the feature space, ExploreKit uses a novel machine learning-based feature selection approach to predict the usefulness of new candidate features. This approach enables efficient identification of the new features and produces superior results compared to existing feature selection solutions. We demonstrate the effectiveness and robustness of our approach by conducting an extensive evaluation on 25 datasets and 3 different classification algorithms. We show that ExploreKit can achieve classification-error reduction of 20% overall. Our code is available at https://github.com/giladkatz/ExploreKit.

【Keywords】: Machine learning algorithms; Machine learning; Prediction algorithms; Pregnancy; Diabetes; Space exploration; Analytical models

114. Steering Social Media Promotions with Effective Strategies.

Paper Link】 【Pages】:985-990

【Authors】: Kun Kuang ; Meng Jiang ; Peng Cui ; Shiqiang Yang

【Abstract】: On social media platforms, companies, organizations and individuals use the function of sharing or retweeting information to promote their products, policies, and ideas. While a growing body of research has focused on identifying promoters from millions of users, the promoters themselves seek to know which strategies can improve promotional effectiveness, a question rarely studied in the literature. In this work, we study the new problem of promotional strategy effect estimation, which is challenging both in identifying and quantifying promotional strategies and in estimating their effectiveness under the selection bias inherent in observational data. Here we study a series of strategies at both the context and content levels. To alleviate the selection bias issue, we propose a method based on Propensity Score Matching (PSM) to evaluate the effect of each promotional strategy. Our data study provides three interpretable and insightful findings on steering social media promotions: (1) three significant and stable strategies, (2) a critical trade-off, and (3) different concerns for promoters of different popularity. These results provide comprehensive suggestions for practitioners to steer social media promotions with effective strategies.
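
Propensity Score Matching itself is standard, and a compact sketch with scikit-learn looks as follows; the covariates, the synthetic treatment ("used a promotional strategy") and outcome are stand-ins for the paper's data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # user/context covariates
treat = (X[:, 0] + rng.normal(size=1000)) > 0       # e.g. "used the strategy"
y = 2.0 * treat + X[:, 1] + rng.normal(size=1000)   # promotional outcome

# 1. Estimate propensity scores P(treated | X).
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

# 2. Match each treated unit to the control with the nearest propensity.
nn = NearestNeighbors(n_neighbors=1).fit(ps[~treat].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treat].reshape(-1, 1))

# 3. Average treated-minus-matched-control outcome = estimated strategy effect.
att = (y[treat] - y[~treat][idx.ravel()]).mean()
print(f"estimated effect: {att:.2f}")               # should land near 2.0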

【Keywords】: Estimation; Context; Twitter; Companies; Data mining; Computer science

115. Mining Statistically Significant Attribute Associations in Attributed Graphs.

Paper Link】 【Pages】:991-996

【Authors】: Jihwan Lee ; Keehwan Park ; Sunil Prabhakar

【Abstract】: Graphs are widely used to represent many different kinds of real-world data, such as social networks, protein-protein interactions, and road networks. In many cases, each node in a graph is associated with a set of attributes, and it is critical not only to consider the link structure of a graph but also to use the attribute information to achieve more meaningful results in various graph mining tasks. Most previous works dealing with attributed graphs take into account attribute relationships only between individually connected nodes. However, it would be greatly valuable to find out which sets of attributes are associated with each other and whether they are statistically significant over the entire graph. By mining such significant associations, we can uncover novel relationships among the sets of attributes in the graph. We propose an algorithm that finds those attribute associations efficiently and effectively, and show experimental results that confirm the high efficacy of the proposed algorithm.

【Keywords】: Probability; Data mining; Computer science; Electronic mail; Social network services; Conferences; Proteins

116. Time-Aware User Identification with Topic Models.

Paper Link】 【Pages】:997-1002

【Authors】: Clément Lesaege ; François Schnitzler ; Anne Lambert ; Jean-Ronan Vigouroux

【Abstract】: Accounts are often shared by multiple users, each of them having different item-consumption and temporal habits. Identifying the active user can lead to improvements in a variety of services by switching from account-personalized services to user-personalized services. To do so, we develop a topic model extending Latent Dirichlet Allocation with a hidden variable representing the active user, assuming consumption times to be generated by latent time topics. We create a new dataset of composite accounts from real users to test the identification capabilities of our model. We show that our model is able to learn temporal patterns from the whole set of accounts and infer the active user using both the consumption time and the consumed item.

【Keywords】: Biological system modeling; Twitter; Motion pictures; Parameter estimation; TV; History; Resource management

117. Toward Time-Evolving Feature Selection on Dynamic Networks.

Paper Link】 【Pages】:1003-1008

【Authors】: Jundong Li ; Xia Hu ; Ling Jian ; Huan Liu

【Abstract】: Recent years have witnessed the prevalence of networked data in various domains. Among them, a large number of networks are not only topologically structured but also have a rich set of features on nodes. These node features are usually of high dimensionality, with noisy, irrelevant and redundant information that may impede the performance of other learning tasks. Feature selection is useful to alleviate these critical issues. Nonetheless, a vast majority of existing feature selection algorithms are designed for a static setting. In reality, real-world networks are naturally dynamic, characterized by both topology and content changes. It is desirable to capture these changes so as to continuously find relevant features tightly coupled with the network structure, which is of fundamental importance for many applications such as disaster relief and viral marketing. In this paper, we study the novel problem of time-evolving feature selection for dynamic networks in an unsupervised scenario. Specifically, we propose a TeFS framework that leverages the temporal evolution property of dynamic networks to update the feature selection results incrementally. Experimental results show the superiority of TeFS over state-of-the-art batch-mode unsupervised feature selection algorithms.

【Keywords】: Heuristic algorithms; Feature extraction; Optimization; Data mining; Noise measurement; Network topology; Linear programming

118. Concept Based Short Text Stream Classification with Topic Drifting Detection.

Paper Link】 【Pages】:1009-1014

【Authors】: Pei-Pei Li ; Lu He ; Xuegang Hu ; Yuhong Zhang ; Lei Li ; Xindong Wu

【Abstract】: Short text stream classification is a challenging and significant task due to short length, weak signal, high velocity, and especially topic drift in short text streams. However, this challenge has received little attention from the research community. Motivated by this, we propose a new feature-extension approach for short text stream classification using a large-scale, general-purpose semantic network obtained from a web corpus. Our approach is built on an incremental ensemble classification model. First, drawing on the open semantic network, we introduce more semantic context into short texts to compensate for data sparsity. Meanwhile, we disambiguate terms by their semantics to reduce the impact of noise. Second, to effectively track hidden topic drifts, we propose a concept-cluster based topic drift detection method. Finally, extensive experiments demonstrate that our approach detects topic drifts effectively compared to several well-known concept drift detection methods for data streams, and performs best in the classification of text data streams compared to several state-of-the-art short text classification approaches.

【Keywords】: Semantics; Context; Clustering algorithms; Taxonomy; Mathematical model; Europe; Electronic mail

119. Regularized Large Margin Distance Metric Learning.

Paper Link】 【Pages】:1015-1022

【Authors】: Ya Li ; Xinmei Tian ; Dacheng Tao

【Abstract】: Distance metric learning plays an important role in many applications, such as classification and clustering. In this paper, we propose a novel distance metric learning method using two hinge losses in the objective function. One is a pairwise constraint, which pulls similar pairs (same label) closer and pushes dissimilar pairs (different labels) as far apart as possible. The other is a triplet constraint, which forces the largest intra-class pairwise distance to be smaller than the smallest inter-class pairwise distance. Previous works consider only one of these two kinds of constraints. Additionally, unlike the triplets used in previous works, we need only a small number of such special triplets, which improves the efficiency of our proposed method. Considering that we might not have enough labeled samples, we also extend the proposed distance metric learning into a semi-supervised learning framework. Experiments are conducted on several landmark datasets, and the results demonstrate the effectiveness of our proposed method.

【Keywords】: Semisupervised learning; Euclidean distance; Fasteners; Principal component analysis; Linear programming

120. Mutual Reinforcement of Academic Performance Prediction and Library Book Recommendation.

Paper Link】 【Pages】:1023-1028

【Authors】: Defu Lian ; Yuyang Ye ; Wenya Zhu ; Qi Liu ; Xing Xie ; Hui Xiong

【Abstract】: The prediction of academic performance is one of the most important tasks in educational data mining, and has been widely studied in MOOCs and intelligent tutoring systems. Academic performance can be affected by factors like personality, skills, social environment, and the use of library books. However, it remains underexplored how the use of library books affects the academic performance of college students, and whether book-loan history can be leveraged to predict academic performance. To this end, we propose a supervised content-aware matrix factorization for the mutual reinforcement of academic performance prediction and library book recommendation. This model not only addresses the sparsity challenge via explainable dimension-reduction techniques, but also promotes library book recommendation by recommending the "right" books for students based on their performance levels and book meta-information. Finally, we evaluate the proposed model on three years of book-loan history and the cumulative grade point averages of 13,047 undergraduate students at one university. The results show that the proposed model outperforms the competing baselines on both tasks, and that academic performance is not only predictable from the book-loan history but also improves the recommendation of library books for students.

【Keywords】: Libraries; History; Hidden Markov models; Predictive models; Prediction algorithms; Linear programming; Data mining

121. Regularized Content-Aware Tensor Factorization Meets Temporal-Aware Location Recommendation.

Paper Link】 【Pages】:1029-1034

【Authors】: Defu Lian ; Zhenyu Zhang ; Yong Ge ; Fuzheng Zhang ; Nicholas Jing Yuan ; Xing Xie

【Abstract】: Although weighted tensor factorization tailored to implicit feedback has shown superior performance in temporal-aware location recommendation, it suffers from three critical challenges. First, it does not distinguish the confidence of negative preference for time-dependent unvisited locations from that for fully unvisited ones. Second, discontinuity arises from time discretization, and thus an infinitely large margin may exist between different bins of time. Third, geographical constraints of neighboring locations are not taken into account. To address these challenges, we propose a regularized content-aware tensor factorization (RCTF) algorithm, which exploits three strategies. First, it introduces a novel interaction regularization. Second, it represents each bin of time by a feature vector derived from the eigendecomposition of a time-bin similarity matrix, capturing the proximity of neighboring bins of time. Third, it encodes geographical information of locations as discrete spatial distributions, so that spatial proximity constraints can be satisfied simply by feeding them into the location content. The proposed algorithm is then evaluated for time-aware location recommendation on two large-scale location-based social network datasets. The experimental results show the superiority of the proposed algorithm over several competing time-aware recommendation baselines, and verify the significant benefit of the three strategies.
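
The second strategy, representing each bin of time by eigenvectors of a time-bin similarity matrix, can be sketched in a few lines; the circular Gaussian similarity over 24 hourly bins and the number of retained eigenvectors are illustrative choices, not the paper's exact construction:

import numpy as np

bins, k, sigma = 24, 4, 2.0
hours = np.arange(bins)
# Circular distance between bins of time (hour 23 is close to hour 0).
d = np.abs(hours[:, None] - hours[None, :])
d = np.minimum(d, bins - d)
S = np.exp(-(d ** 2) / (2 * sigma ** 2))     # time-bin similarity matrix

# Eigendecomposition; the top-k eigenvectors become smooth per-bin features,
# so neighboring bins of time get nearby feature vectors.
vals, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
features = vecs[:, -k:] * np.sqrt(vals[-k:])
print(features.shape)                         # (24, 4): one vector per bin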

【Keywords】: Tensile stress; Matrix decomposition; Graphical models; Distribution functions; Collaboration; Symmetric matrices; Social network services

122. Whether This Participant will Attract You to This Event? Exploiting Participant Influence for Event Recommendation.

Paper Link】 【Pages】:1035-1040

【Authors】: Yi Liao ; Xinshi Lin ; Wai Lam

【Abstract】: When a user is deciding whether to participate in an event in Event-Based Social Networks (EBSN), one common consideration is who has already agreed to join the event. Existing participants affect the decision of the user, which we refer to as participant influence. However, participant influence has not been well studied in previous work. In this paper, we propose an event recommendation model that exploits the influence of existing participants on the decisions of potential new participants. Specifically, we investigate participant influence in relation to several commonly used contextual aspects of the event based on Poisson factorization. We have conducted extensive experiments on datasets extracted from a real-world EBSN. The results demonstrate that considering participant influence can improve event recommendation.

【Keywords】: Tensile stress; Mathematical model; Context; Sparse matrices; Social network services; Probabilistic logic; Vocabulary

123. HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles for Time Series Classification.

Paper Link】 【Pages】:1041-1046

【Authors】: Jason Lines ; Sarah Taylor ; Anthony Bagnall

【Abstract】: There have been many new algorithms proposed over the last five years for solving time series classification (TSC) problems. A recent experimental comparison of the leading TSC algorithms has demonstrated that one approach is significantly more accurate than all others over 85 datasets. That approach, the Flat Collective of Transformation-based Ensembles (Flat-COTE), achieves superior accuracy through combining predictions of 35 individual classifiers built on four representations of the data into a flat hierarchy. Outside of TSC, deep learning approaches such as convolutional neural networks (CNN) have seen a recent surge in popularity and are now state of the art in many fields. An obvious question is whether CNNs could be equally transformative in the field of TSC. To test this, we implement a common CNN structure and compare performance to Flat-COTE and a recently proposed time series-specific CNN implementation. We find that Flat-COTE is significantly more accurate than both deep learning approaches on 85 datasets. These results are impressive, but Flat-COTE is not without deficiencies. We improve the collective by adding new components and proposing a modular hierarchical structure with a probabilistic voting scheme that allows us to encapsulate the classifiers built on each transformation. We add two new modules representing dictionary and interval-based classifiers, and significantly improve upon the existing frequency domain classifiers with a novel spectral ensemble. The resulting classifier, the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) is significantly more accurate than Flat-COTE and represents a new state of the art for TSC. HIVE-COTE captures more sources of possible discriminatory features in time series and has a more modular, intuitive structure.

【Keywords】: Time series analysis; Machine learning; Machine learning algorithms; Training; Classification algorithms; Prediction algorithms; Dictionaries

124. House Price Modeling over Heterogeneous Regions with Hierarchical Spatial Functional Analysis.

Paper Link】 【Pages】:1047-1052

【Authors】: Bang Liu ; Borislav Mavrin ; Di Niu ; Linglong Kong

【Abstract】: Online real-estate information systems such as Zillow and Trulia have gained increasing popularity in recent years. One important feature offered by these systems is the online home price estimate through automated data-intensive computation based on housing information and comparative market value analysis. State-of-the-art approaches model house prices as a combination of a latent land desirability surface and a regression from house features. However, by using uniformly damping kernels, they are unable to handle irregularly shaped regions or capture land value discontinuities within the same region due to the existence of implicit sub-communities, which are common in real-world scenarios. In this paper, we explore the novel application of recent advances in spatial functional analysis to house price modeling and propose the Hierarchical Spatial Functional Model (HSFM), which decomposes house values into land desirability at both the global scale and hidden local scales as well as the feature regression component. We propose statistical learning algorithms based on finite-element spatial functional analysis and spatial constrained clustering to train our model. Extensive evaluations based on housing data in a major Canadian city show that our proposed approach can reduce the mean relative house price estimation error down to 6.60%.

【Keywords】: Computational modeling; Solid modeling; Data models; Analytical models; Biological system modeling; Kernel; Rivers

125. Context-Aware Sequential Recommendation.

Paper Link】 【Pages】:1053-1058

【Authors】: Qiang Liu ; Shu Wu ; Diyi Wang ; Zhaokang Li ; Liang Wang

【Abstract】: Since sequential information plays an important role in modeling user behaviors, various sequential recommendation methods have been proposed. Methods based on the Markov assumption are widely used, but they independently combine only the several most recent components. Recently, Recurrent Neural Network (RNN) based methods have been successfully applied in several sequential modeling tasks. However, for real-world applications, these methods have difficulty modeling contextual information, which has been shown to be very important for behavior modeling. In this paper, we propose a novel model, named Context-Aware Recurrent Neural Networks (CA-RNN). Instead of using the constant input matrix and transition matrix of conventional RNN models, CA-RNN employs adaptive context-specific input matrices and adaptive context-specific transition matrices. The adaptive context-specific input matrices capture external situations in which user behaviors happen, such as time, location, and weather. The adaptive context-specific transition matrices capture how the lengths of time intervals between adjacent behaviors in historical sequences affect the transition of global sequential features. Experimental results show that the proposed CA-RNN model yields significant improvements over state-of-the-art sequential recommendation methods and context-aware recommendation methods on two public datasets, i.e., the Taobao dataset and the MovieLens-1M dataset.

【Keywords】: Context; Context modeling; Adaptation models; Recurrent neural networks; Mathematical model; History; Business process re-engineering

126. Structure-Preserved Multi-source Domain Adaptation.

Paper Link】 【Pages】:1059-1064

【Authors】: Hongfu Liu ; Ming Shao ; Yun Fu

【Abstract】: Domain adaptation has achieved promising results in many areas, such as image classification and object recognition. Although many algorithms have been proposed to solve tasks with different domain distributions, multi-source unsupervised domain adaptation remains a challenge. In addition, most existing algorithms learn a classifier on the source domain and predict the labels for the target data, which means that only the knowledge derived from the hyperplane is transferred to the target domain while the structure information is ignored. In light of this, we propose a novel algorithm for multi-source unsupervised domain adaptation. Generally speaking, we aim to preserve the whole structure of the source domains and transfer it to serve the task on the target domain. The source and target data are put together for clustering, which simultaneously explores the structures of the source and target domains. The structure-preserved information from the source domains further guides the clustering process on the target domain. Extensive experiments on two widely used databases for object recognition and face identification show the substantial improvement of our proposed approach over several state-of-the-art methods. In particular, our algorithm can make use of multiple source domains and achieves more robust and better performance than single-source domain adaptation methods.

【Keywords】: Linear programming; Clustering algorithms; Prediction algorithms; Computers; Databases; Robustness; Conferences

127. Sublinear Dual Coordinate Ascent for Regularized Loss Minimization.

Paper Link】 【Pages】:1065-1070

【Authors】: Liu Liu ; Dacheng Tao

【Abstract】: We present a sublinear version of the dual coordinate ascent method for solving a group of regularized loss minimization problems in machine learning. The proposed method seamlessly integrates sampling techniques, the dual coordinate ascent method, and a multiplicative update algorithm. The sampling techniques choose the "expected" examples and estimate the corresponding inner products. The dual coordinate ascent method generates an updated iterative step, which outperforms the time-learning step used in the previous sublinear perceptron algorithm. The multiplicative update algorithm updates the example weighting. The proposed method is implemented with an iterative step of order O(log(n)), where n is the number of examples, and achieves better results than other methods, with high probability. We present a theoretical analysis of the sublinear iteration to justify its benefits. We then apply the proposed optimization method to support vector machines and conduct experiments on three large-scale datasets. Our experimental results validate our theoretical findings.

【Keywords】: Runtime; Cost function; Minimization; Algorithm design and analysis; Machine learning algorithms; Australia

128. Efficient and Scalable Detection of Overlapping Communities in Big Networks.

Paper Link】 【Pages】:1071-1076

【Authors】: Tianshu Lyu ; Lidong Bing ; Zhao Zhang ; Yan Zhang

【Abstract】: Community detection is a hot topic for researchers in fields including graph theory, social networks, and biological networks. Generally speaking, a community refers to a group of densely linked nodes in the network. Nodes usually have more than one community label, indicating their multiple roles or functions in the network. Unfortunately, existing solutions for overlapping community detection are not capable of scaling to large networks with millions of nodes and edges. In this paper, we propose a fast overlapping-community-detection algorithm, FOX. In an experiment on a network with 3.9 million nodes and 20 million edges, the detection finishes in 14 minutes and provides the highest-quality results; the second-fastest algorithm takes ten times longer to run. On another network, with 22 million nodes and 127 million edges, our algorithm is the only one that can provide an overlapping community detection result, taking only 238 minutes. Our algorithm draws lessons from potential games, a concept in game theory. We measure the closeness of a node to a community by counting the number of triangles formed by the node and two other nodes from the community. Potential games ensure that the algorithm reaches convergence. We also extend the use of triangles to open triangles, which enlarges the scale of the detected communities.
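
The triangle-based closeness measure can be sketched directly: count the edges between pairs of the node's neighbors that lie inside the candidate community (the open-triangle extension and the potential-game update loop are not shown):

from itertools import combinations

# Toy undirected graph as adjacency sets.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3}}

def closeness(node, community):
    """Number of triangles formed by `node` and two members of `community`."""
    members = (adj[node] & community) - {node}
    return sum(1 for a, b in combinations(members, 2) if b in adj[a])

print(closeness(0, {1, 2, 3}))   # triangles (0,1,2) and (0,2,3) -> 2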

【Keywords】: Games; Approximation algorithms; Image edge detection; Game theory; Heuristic algorithms; Clustering algorithms; Social network services

129. Factorizing Complex Discrete Data "with Finesse".

Paper Link】 【Pages】:1077-1082

【Authors】: Samuel Maurus ; Claudia Plant

【Abstract】: Can we mine latent patterns from discrete, non-numeric heterogeneous data? Many modern data sets contain heterogeneous non-numerical information measured over Boolean, ordinal and ternary scales. Values for features like these are "mixable" in the sense that they have intuitive non-linear analogs to classical "addition" (e.g. logical OR for Boolean data). We present a novel, general and extensible matrix factorization framework for any such "mixable" features. The framework lets us support heterogeneous data and encourages us to deduce other interesting "mixable" features, like those which encapsulate sub-trees over an ontology. We present Finesse, an algorithm with run-time complexity linear in the size of the data. Finesse outperforms state-of-the-art techniques in the special cases they cover, in terms of both effectiveness and efficiency, and yields insightful patterns in its novel application to large real-world heterogeneous data.

【Keywords】: Frequency modulation; Ontologies; Complexity theory; Vegetation; Context; Linear programming; Conferences

130. Towards Scalable Network Delay Minimization.

Paper Link】 【Pages】:1083-1088

【Authors】: Sourav Medya ; Petko Bogdanov ; Ambuj K. Singh

【Abstract】: Reduction of end-to-end network delays is an optimization task with applications in multiple domains. Low delays enable improved information flow in social networks, quick spread of ideas in collaboration networks, low travel times for vehicles on road networks and increased rate of packets in communication networks. Delay reduction can be achieved by both improving the propagation capabilities of individual nodes and adding additional edges in the network. One of the main challenges in such design problems is that the effects of local changes are not independent, and as a consequence, there is a combinatorial search space of possible improvements. Thus, minimizing the cumulative propagation delay requires novel scalable and data-driven approaches. In this paper, we consider the problem of network delay minimization via node upgrades. Although the problem is NP-hard, we show that probabilistic approximation for a restricted version can be obtained. We design scalable and high-quality techniques for the general setting based on sampling that are targeted to different models of delay distribution. Our methods scale almost linearly with the graph size and consistently outperform competitors in quality.

【Keywords】: Delays; Airports; Minimization; Complexity theory; Social network services; Approximation algorithms; Computational modeling

131. Foundations of Perturbation Robust Clustering.

Paper Link】 【Pages】:1089-1094

【Authors】: Jarrod Moore ; Margareta Ackerman

【Abstract】: Clustering is a fundamental data mining tool that aims to divide data into groups of similar items. Intuition about clustering reflects the ideal case: exact data sets endowed with flawless dissimilarities between individual instances. In practice, however, these cases are in the minority, and clustering applications are typically characterized by noisy data sets with approximate pairwise dissimilarities. As such, the efficacy of clustering methods necessitates robustness to perturbations. In this paper, we address foundational questions on perturbation robustness, studying to what extent clustering techniques can exhibit this desirable characteristic. Our results also demonstrate the type of cluster structures required for robustness of popular clustering paradigms.

【Keywords】: Robustness; Clustering algorithms; Algorithm design and analysis; Hamming distance; Classification algorithms; Data mining; Additives

132. Faster Kernels for Graphs with Continuous Attributes via Hashing.

Paper Link】 【Pages】:1095-1100

【Authors】: Christopher Morris ; Nils M. Kriege ; Kristian Kersting ; Petra Mutzel

【Abstract】: While state-of-the-art kernels for graphs with discrete labels scale well to graphs with thousands of nodes, the few existing kernels for graphs with continuous attributes, unfortunately, do not scale well. To overcome this limitation, we present hash graph kernels, a general framework for deriving kernels for graphs with continuous attributes from discrete ones. The idea is to iteratively turn continuous attributes into discrete labels using randomized hash functions. We illustrate hash graph kernels for the Weisfeiler-Lehman subtree kernel and for the shortest-path kernel. The resulting novel graph kernels are shown to be both able to handle graphs with continuous attributes and scalable to large graphs and data sets. This is supported by our theoretical analysis and demonstrated by an extensive experimental evaluation.
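
The core discretization step can be sketched with a p-stable, random-projection hash that maps each node's continuous attribute vector to a discrete label, which a discrete-label kernel such as Weisfeiler-Lehman can then consume; the bucket width and the per-iteration redrawing of hash functions below are illustrative:

import numpy as np

def hash_labels(X, width=1.0, rng=None):
    """Randomized LSH-style discretization of continuous node attributes.

    Each row of X (one node's attribute vector) is projected onto a random
    direction and bucketed, yielding one discrete label per node."""
    rng = rng if rng is not None else np.random.default_rng()
    w = rng.normal(size=X.shape[1])
    b = rng.uniform(0, width)
    return np.floor((X @ w + b) / width).astype(int)

X = np.random.default_rng(0).normal(size=(6, 3))   # 6 nodes, 3-d attributes
for it in range(3):                                # fresh hash per iteration
    labels = hash_labels(X, rng=np.random.default_rng(it))
    print(labels)   # feed these discrete labels to, e.g., a WL subtree kernel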

【Keywords】: Kernel; Bioinformatics; Approximation algorithms; Conferences; Data mining; Social network services; Image analysis

133. Optimizing the Multiclass F-Measure via Biconcave Programming.

Paper Link】 【Pages】:1101-1106

【Authors】: Harikrishna Narasimhan ; Weiwei Pan ; Purushottam Kar ; Pavlos Protopapas ; Harish G. Ramaswamy

【Abstract】: The F-measure and its variants are performance measures of choice for evaluating classification and retrieval tasks in the presence of severe class imbalance. It is thus highly desirable to be able to directly optimize these performance measures on large-scale data. Recent advances have shown that this is possible in the simple binary classification setting. However, scant progress exists in multiclass settings with a large number of classes, where, in addition, class imbalance is much more severe. The lack of progress is especially conspicuous for the macro-averaged F-measure, which is the widely preferred F-measure variant in multiclass settings due to its equal emphasis on rare classes. Known optimization methods scale poorly for the macro F-measure, often requiring run times that are exponential in the number of classes. We develop BEAM-F, the first efficient method for directly optimizing the macro F-measure in multiclass settings. The challenge here is the intractability of optimizing a sum of fractional-linear functions over the space of confusion matrices. We overcome this difficulty by formulating the problem as a biconcave maximization program and solving it with an efficient alternating maximization approach that uses a Frank-Wolfe based iterative solver. Our approach offers guaranteed convergence to a stationary point, and experiments show that, for a range of synthetic data sets and real-world applications, our method offers superior performance on problems exhibiting large class imbalance.

【Keywords】: Optimization; Tuning; Support vector machines; Harmonic analysis; Stochastic processes; Standards; Training

134. Budgeted Batch Bayesian Optimization.

Paper Link】 【Pages】:1107-1112

【Authors】: Vu Nguyen ; Santu Rana ; Sunil Kumar Gupta ; Cheng Li ; Svetha Venkatesh

【Abstract】: Parameter settings profoundly impact the performance of machine learning algorithms and laboratory experiments. Classical trial-and-error methods are exponentially expensive in large parameter spaces, and Bayesian optimization (BO) offers an elegant alternative for global optimization of black-box functions. In situations where the function can be evaluated at multiple points simultaneously, batch Bayesian optimization is used. Current batch BO approaches are restrictive in fixing the number of evaluations per batch, which can be wasteful when the number of specified evaluations is larger than the number of real maxima in the underlying acquisition function. We present budgeted batch Bayesian optimization (B3O) for hyper-parameter tuning and experimental design, in which we identify the appropriate batch size for each iteration in an elegant way. In particular, we use the infinite Gaussian mixture model (IGMM) to automatically identify the number of peaks in the underlying acquisition function. We address the intractability of estimating the IGMM directly from the acquisition function by formulating batch generalized slice sampling to efficiently draw samples from the acquisition function. We perform extensive experiments on benchmark functions and two real-world applications: machine learning hyper-parameter tuning and experimental design for alloy hardening. We show empirically that the proposed B3O outperforms existing fixed-batch BO approaches in finding the optimum while requiring fewer evaluations, thus saving cost and time.

【Keywords】: Optimization; Bayes methods; Machine learning algorithms; Tuning; Gaussian mixture model; Metals

135. One-Pass Logistic Regression for Label-Drift and Large-Scale Classification on Distributed Systems.

Paper Link】 【Pages】:1113-1118

【Authors】: Vu Nguyen ; Tu Dinh Nguyen ; Trung Le ; Svetha Venkatesh ; Dinh Q. Phung

【Abstract】: Logistic regression (LR) for classification is the workhorse in industry, where a set of predefined classes is required. The model, however, fails to work when the class labels are not known in advance, a problem we term label-drift classification. The label-drift classification problem naturally occurs in many applications, especially in streaming settings where the incoming data may contain samples categorized with new classes that have not been seen before. Additionally, in the wave of big data, traditional LR methods may fail due to their running-time expense. In this paper, we introduce a novel variant of LR, namely one-pass logistic regression (OLR), to offer a principled treatment of label-drift and large-scale classification. To handle large-scale classification for big data, we further extend OLR to a distributed setting for parallelization, termed sparkling OLR (Spark-OLR). We demonstrate the scalability of our proposed methods on large-scale datasets with more than one hundred million data points. The experimental results show that the predictive performance of our methods is comparable to or better than that of state-of-the-art baselines, while the execution time is faster by an order of magnitude. In addition, OLR and Spark-OLR are invariant to data shuffling and have no hyperparameters to tune, which significantly benefits data practitioners and avoids the expense of cross-validation over big data to select optimal hyperparameters.
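
The label-drift setting itself, though not the authors' OLR estimator, can be illustrated with a one-pass softmax classifier that grows its weight matrix whenever an unseen class label arrives:

import numpy as np

class LabelDriftClassifier:
    """One-pass multiclass learner that adds a class on first sight of its label.

    Illustrates the label-drift setting; OLR itself uses a different estimator."""
    def __init__(self, dim, lr=0.1):
        self.dim, self.lr = dim, lr
        self.classes, self.W = {}, np.zeros((0, dim))

    def partial_fit(self, x, label):
        if label not in self.classes:              # label drift: unseen class
            self.classes[label] = len(self.classes)
            self.W = np.vstack([self.W, np.zeros(self.dim)])
        scores = self.W @ x
        p = np.exp(scores - scores.max()); p /= p.sum()
        p[self.classes[label]] -= 1.0              # softmax gradient
        self.W -= self.lr * np.outer(p, x)

clf = LabelDriftClassifier(dim=3)
stream = [([1, 0, 0], "a"), ([0, 1, 0], "b"), ([0, 0, 1], "c"), ([1, 0, 0], "a")]
for x, label in stream:
    clf.partial_fit(np.asarray(x, float), label)
print(len(clf.classes), "classes seen")            # 3, discovered on the fly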

【Keywords】: Logistics; Bayes methods; Estimation; Big data; Data models; Industries; Context

136. Self-Grouping Multi-network Clustering.

Paper Link】 【Pages】:1119-1124

【Authors】: Jingchao Ni ; Wei Cheng ; Wei Fan ; Xiang Zhang

【Abstract】: Joint clustering of multiple networks has been shown to be more accurate than performing clustering on individual networks separately. Many multi-view and multi-domain network clustering methods have been developed for joint multi-network clustering. These methods typically assume there is a common clustering structure shared by all networks, and that different networks can provide complementary information on this underlying clustering structure. However, this assumption is too strict to hold in many emerging real-life applications, where multiple networks have diverse data distributions. More commonly, the networks under consideration belong to different underlying groups, and only networks in the same underlying group share similar clustering structures. Better clustering performance can be achieved by treating such groups differently. As a result, an ideal method should be able to automatically detect network groups so that networks in the same group share a common clustering structure. To address this problem, we propose a novel method, ComClus, to simultaneously group and cluster multiple networks. ComClus treats node clusters as features of networks and uses them to differentiate network groups. Network grouping and clustering are coupled and mutually enhanced during the learning process. Extensive experimental evaluation on a variety of synthetic and real datasets demonstrates the effectiveness of our method.

【Keywords】: Clustering methods; Tensile stress; Measurement; Symmetric matrices; Conferences; Data mining; Electrical engineering

137. A Scalable Framework for Stylometric Analysis Query Processing.

Paper Link】 【Pages】:1125-1130

【Authors】: Sarana Nutanong ; Chenyun Yu ; Raheem Sarwar ; Peter Xu ; Dickson Chow

【Abstract】: Stylometry is the statistical analysis of variations in an author's literary style. The technique has been used in many linguistic analysis applications, such as author profiling, authorship identification, and authorship verification. Over the past two decades, authorship identification has been extensively studied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents of similar lengths. In this paper, we propose a novel solution that models authorship attribution as a set similarity problem to overcome the two stated limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that, in comparison to existing stylometry studies, our proposed solution can handle a larger number of documents of different lengths, written by a larger pool of candidate authors, with high accuracy.

【Keywords】: Plagiarism; Feature extraction; Syntactics; Probabilistic logic; Writing; Data mining; Query processing

138. Compressing Random Forests.

Paper Link】 【Pages】:1131-1136

【Authors】: Amichai Painsky ; Saharon Rosset

【Abstract】: Ensemble methods are considered among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. The problem mostly manifests in subscriber-based environments, where a user-specific ensemble needs to be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on Random Forests. Our suggested method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees, and at the same time is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble.

【Keywords】: Vegetation; Dictionaries; Probabilistic logic; Encoding; Mathematical model; Periodic structures; Probability distribution

139. A Fast Factorization-Based Approach to Robust PCA.

Paper Link】 【Pages】:1137-1142

【Authors】: Chong Peng ; Zhao Kang ; Qiang Cheng

【Abstract】: Robust principal component analysis (RPCA) has been widely used for recovering low-rank matrices in many data mining and machine learning problems. It separates a data matrix into a low-rank part and a sparse part. The convex approach has been well studied in the literature. However, state-of-the-art algorithms for the convex approach usually have relatively high complexity due to the need to solve (partial) singular value decompositions of large matrices. A non-convex approach, AltProj, has also been proposed, with lighter complexity and better scalability. Given the true rank r of the underlying low-rank matrix, AltProj has a complexity of O(r^2 dn), where d × n is the size of the data matrix. In this paper, we propose a novel factorization-based model of RPCA with a complexity of O(kdn), where k is an upper bound on the true rank. Our method does not need the precise value of the true rank. In extensive experiments, we observe that AltProj works only when r is known precisely in advance: when the rank parameter is set to a value different from the true rank, AltProj cannot fully separate the two parts, whereas our method succeeds. Even when both work, our method is about 4 times faster than AltProj. Our method can be used as a light-weight, scalable tool for RPCA in the absence of the precise value of the true rank.
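
A generic alternating scheme in this spirit, with soft-thresholding for the sparse part and least squares for a rank-k factorization, looks as follows; these are illustrative updates, not the paper's exact algorithm:

import numpy as np

def factored_rpca(M, k, lam=0.1, iters=50):
    """Split M into low-rank U @ V plus sparse S; k only upper-bounds the rank."""
    d, n = M.shape
    rng = np.random.default_rng(0)
    U, V = rng.normal(size=(d, k)), rng.normal(size=(k, n))
    for _ in range(iters):
        # Sparse part: entrywise soft-thresholding of the residual.
        R = M - U @ V
        S = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
        # Low-rank part: alternating least squares on M - S.
        L = M - S
        V = np.linalg.lstsq(U, L, rcond=None)[0]
        U = np.linalg.lstsq(V.T, L.T, rcond=None)[0].T
    return U, V, S

# Toy data: rank-2 matrix corrupted by a few large sparse errors.
rng = np.random.default_rng(1)
M = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))
M[rng.integers(0, 50, 30), rng.integers(0, 40, 30)] += 10.0
U, V, S = factored_rpca(M, k=4)               # note k=4 > true rank 2
print(np.linalg.norm(M - U @ V - S))          # reconstruction residual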

【Keywords】: Conferences; Data mining

140. Background Check: A General Technique to Build More Reliable and Versatile Classifiers.

Paper Link】 【Pages】:1143-1148

【Authors】: Miquel Perello-Nieto ; Telmo de Menezes e Silva Filho ; Meelis Kull ; Peter A. Flach

【Abstract】: We introduce a powerful technique to make classifiers more reliable and versatile. Background Check equips classifiers with the ability to assess the difference of unlabelled test data from the training data. In particular, Background Check gives classifiers the capability to (i) perform cautious classification with a reject option, (ii) identify outliers, and (iii) better assess the confidence in their predictions. We derive the method from first principles and consider four particular relationships between background and foreground distributions. One of these assumes an affine relationship with two parameters, and we show how this bivariate parameter space naturally interpolates between the above capabilities. We demonstrate the versatility of the approach by comparing it experimentally with published special-purpose solutions for outlier detection and confident classification on 41 benchmark datasets. Results show that Background Check can match and in many cases surpass the performances of specialised approaches.

【Keywords】: Training data; Estimation; Standards; Reliability; Data models; Probability; Conferences

141. Product-Based Neural Networks for User Response Prediction.

Paper Link】 【Pages】:1149-1154

【Authors】: Yanru Qu ; Han Cai ; Kan Ren ; Weinan Zhang ; Yong Yu ; Ying Wen ; Jun Wang

【Abstract】: Predicting user responses, such as clicks and conversions, is of great importance and has found use in many web applications, including recommender systems, web search and online advertising. The data in these applications is mostly categorical and contains multiple fields; a typical representation transforms it into a high-dimensional sparse binary feature vector via one-hot encoding. Faced with this extreme sparsity, traditional models may be limited in their capacity to mine shallow patterns from the data, i.e., low-order feature combinations. Deep models such as deep neural networks, on the other hand, cannot be directly applied to the high-dimensional input because of the huge feature space. In this paper, we propose Product-based Neural Networks (PNN), with an embedding layer to learn a distributed representation of the categorical data, a product layer to capture interactive patterns between inter-field categories, and further fully connected layers to explore high-order feature interactions. Our experimental results on two large-scale real-world ad click datasets demonstrate that PNNs consistently outperform state-of-the-art models on various metrics.
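
A condensed PyTorch sketch of this architecture, with an embedding per field, an inner-product layer over field pairs, and fully connected layers on top; field sizes and hidden widths are placeholders:

import torch
import torch.nn as nn

class PNN(nn.Module):
    def __init__(self, field_dims, embed_dim=8, hidden=32):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(d, embed_dim) for d in field_dims])
        f = len(field_dims)
        self.pairs = torch.triu_indices(f, f, offset=1)   # all field pairs
        in_dim = f * embed_dim + self.pairs.shape[1]      # linear + product signals
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):                 # x: (batch, num_fields) category ids
        E = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.embeds)], 1)
        z = E.flatten(1)                  # embedded (linear) signal
        prods = E @ E.transpose(1, 2)     # all pairwise inner products
        p = prods[:, self.pairs[0], self.pairs[1]]
        return torch.sigmoid(self.mlp(torch.cat([z, p], dim=1))).squeeze(1)

model = PNN(field_dims=[100, 50, 20])     # three categorical fields
x = torch.randint(0, 20, (4, 3))          # ids must be < each field's size
print(model(x))                           # 4 predicted click probabilities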

【Keywords】: Complexity theory; Predictive models; Advertising; Artificial neural networks; Encoding; Data models

142. Efficient Algorithms for the Three Locus Problem in Genome-Wide Association Study.

Paper Link】 【Pages】:1155-1160

【Authors】: Sanguthevar Rajasekaran ; Subrata Saha

【Abstract】: Using recent advances in sequencing technology, thousands of genomes have been sequenced. This sequence data can be fruitfully employed in diagnosis, drug design, etc. Genome-wide Association Study (GWAS) focuses on the important problem of extracting useful information from genomic data. For example, a comparison of different genomes could throw light on the causes of different diseases. Human variability arises largely from single nucleotide polymorphisms (SNPs), so it might suffice to focus on these SNPs when comparing different genomes. One of the important problems in GWAS is identifying the correlation between genotypes (SNPs, for example) and phenotypes (i.e., different characteristics such as addiction or the presence of cancer). Different approaches exist for addressing this problem. One important approach is to model it as the k-locus problem (k being any integer). The case of k = 1 has been studied widely, and some algorithms exist for the case of k = 2. However, the real cause of a disease could involve more than two SNPs, and the case of k > 2 has not been studied in the literature. For the first time, in this paper we present an efficient algorithm for solving the 3-locus problem that is several orders of magnitude faster than the brute-force algorithm. All the software can be obtained from: engr.uconn.edu/~rajasek/ThreeLocus.

【Keywords】: Correlation; Genomics; Hamming distance; Force; Bioinformatics; Diseases; Data mining

143. Scalable Block Scheduling for Efficient Multi-database Record Linkage.

Paper Link】 【Pages】:1161-1166

【Authors】: Thilina Ranbaduge ; Dinusha Vatsalan ; Peter Christen

【Abstract】: Record linkage (RL) is a task in data integration that aims to identify matching records that refer to the same entity across different databases. When records from more than two databases are to be linked, RL is significantly challenged by the intrinsic exponential growth in the number of potential record comparisons to be conducted. We propose a scalable meta-blocking protocol for Multi-Database RL (MDRL) that significantly reduces the complexity of the matching (comparison and classification) phase. Our approach uses a graph structure to schedule the comparison of pairs of blocks, with the aim of minimizing the number of repeated and superfluous comparisons between records. We provide an analysis of our approach and conduct an empirical study on large real-world databases.

【Keywords】: Protocols; Couplings; Distributed databases; Encoding; Data mining; Schedules

144. Sequential Ensemble Learning for Outlier Detection: A Bias-Variance Perspective.

Paper Link】 【Pages】:1167-1172

【Authors】: Shebuti Rayana ; Wen Zhong ; Leman Akoglu

【Abstract】: Ensemble methods for classification have been used effectively for decades, while for outlier detection they have only been studied recently. In this work, we design a new ensemble approach for outlier detection in multi-dimensional point data, which improves accuracy by reducing both the bias and variance components of error, viewing outlier detection as a binary classification task with unobserved labels. Specifically, we propose a sequential ensemble approach called CARE that employs a two-phase aggregation of intermediate results in each iteration to reach the final outcome. Unlike existing outlier ensembles, our ensemble incorporates both parallel and sequential building blocks to reduce bias as well as variance by (i) successively eliminating outliers from the original dataset to build a better data model on which outlierness is estimated (sequentially), and (ii) combining the results from individual base detectors and across iterations (in parallel). Through extensive experiments on 16 real-world datasets, mainly from the UCI machine learning repository [1], we show that CARE performs significantly better than, or at least comparably to, the individual baselines as well as existing state-of-the-art outlier ensembles.
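
The sequential-plus-parallel structure can be sketched with off-the-shelf detectors; the removal fraction, iteration count, and the plain averaging below are illustrative simplifications of CARE's two-phase aggregation:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def care_like(X, iters=3, drop_frac=0.05):
    n = len(X)
    keep = np.arange(n)
    score_sum = np.zeros(n)
    for it in range(iters):
        # Parallel part: average the scores of several base detectors
        # (higher = more outlying, hence the sign flips).
        s_if = -IsolationForest(random_state=it).fit(X[keep]).score_samples(X)
        s_lof = -LocalOutlierFactor(novelty=True).fit(X[keep]).score_samples(X)
        scores = (s_if + s_lof) / 2.0
        score_sum += scores
        # Sequential part: drop the most outlying points before refitting,
        # so the next data model is built on cleaner data.
        thresh = np.quantile(scores[keep], 1 - drop_frac)
        keep = keep[scores[keep] <= thresh]
    return score_sum / iters

X = np.vstack([np.random.default_rng(0).normal(size=(200, 2)),
               [[8.0, 8.0], [9.0, -9.0]]])          # two planted outliers
print(np.argsort(-care_like(X))[:2])                # typically indices 200, 201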

【Keywords】: Detectors; Error analysis; Feature extraction; Bagging; Data models; Aggregates; Complexity theory

145. Learning Independent, Diverse Binary Hash Functions: Pruning and Locality.

Paper Link】 【Pages】:1173-1178

【Authors】: Ramin Raziperchikolaei ; Miguel Á. Carreira-Perpiñán

【Abstract】: Information retrieval in large databases of complex objects, such as images, audio or documents, requires approximate search algorithms in practice, in order to return semantically similar objects to a given query in a reasonable time. One practical approach is supervised binary hashing, where each object is mapped onto a small binary vector so that Hamming distances approximate semantic similarities, and the search is done in the binary space more efficiently. Much work has focused on designing objective functions and optimization algorithms for learning b-bit hash functions from a dataset. Recent work has shown that comparable or better results can be obtained by training b hash functions independently from each other and making them cooperate by introducing diversity with ensemble learning techniques. We show that this can be further improved by two techniques: pruning an ensemble of hash functions, and learning local hash functions. We show how it is possible to train our improved algorithms in datasets orders of magnitude larger than those used by most works on supervised binary hashing.

【Keywords】: Training; Linear programming; Optimization; Binary codes; Laplace equations; Databases; Search problems

146. A Rotation Invariant Latent Factor Model for Moveme Discovery from Static Poses.

Paper Link】 【Pages】:1179-1184

【Authors】: Matteo Ruggero Ronchi ; Joon Sik Kim ; Yisong Yue

【Abstract】: We tackle the problem of learning a rotation invariant latent factor model when the training data is comprised of lower-dimensional projections of the original feature space. The main goal is the discovery of a set of 3-D bases poses that can characterize the manifold of primitive human motions, or movemes, from a training set of 2-D projected poses obtained from still images taken at various camera angles. The proposed technique for basis discovery is data-driven rather than hand-designed. The learned representation is rotation invariant, and can reconstruct any training instance from multiple viewing angles. We apply our method to modeling human poses in sports (via the Leeds Sports Dataset), and demonstrate the effectiveness of the learned bases in a range of applications such as activity classification, inference of dynamics from a single frame, and synthetic representation of movements.

【Keywords】: Training; Data models; Training data; Solid modeling; Image reconstruction; Matrix decomposition; Activity recognition

147. Patterns in Cognitive Rehabilitation of Traumatic Brain Injury Patients: A Text Mining Approach.

Paper Link】 【Pages】:1185-1190

【Authors】: Alejandro García-Rudolph ; Alberto García-Molina ; Eloy Opisso ; Josep María Tormos

【Abstract】: Traumatic Brain Injury (TBI) is a leading cause of disability worldwide: a TBI occurs every 15 seconds, and every 5 minutes someone becomes permanently disabled as a result. Brain injuries lack surgical or pharmacological therapies; therefore, Cognitive Rehabilitation (CR) is the generally adopted treatment. Computerized CR tasks are increasingly replacing traditional "paper and pencil" approaches. Nevertheless, CR plans are manually designed by clinicians from scratch, based on their own experience. There is very little research on the amount and type of practice that occurs during computerized CR treatments and its relationship to patients' outcomes. While task repetition is not the only important feature, it is becoming clear that neuroplastic change and functional improvement occur after specific tasks are performed, but do not occur with others. In this work we focus on the preprocessing, pattern and knowledge extraction phases of a Knowledge Discovery in Databases (KDD) framework. We propose treating CR programs as sequences of sessions and applying pattern search (association rules, classification models, clustering and shallow neural models) to support clinicians in the selection of specific interventions (e.g. task assignments). The proposed framework is applied to 40,000 task executions from a real clinical setting. Results show different execution patterns for patients with positive and negative responses to treatment; the predictive models outperform recent prior work; and therapists are provided with new insights and tools for task selection criteria and the design of CR programs.

【Keywords】: Brain injuries; Text mining; Computational modeling; Tag clouds; Visualization; Medical treatment

148. Detecting Change Processes in Dynamic Networks by Frequent Graph Evolution Rule Mining.

Paper Link】 【Pages】:1191-1196

【Authors】: Erik Scharwächter ; Emmanuel Müller ; Jonathan F. Donges ; Marwan Hassani ; Thomas Seidl

【Abstract】: The analysis of the temporal evolution of dynamic networks is a key challenge for understanding complex processes hidden in graph structured data. Graph evolution rules capture such processes on the level of small subgraphs by describing frequently occurring structural changes within a network. Existing rule discovery methods make restrictive assumptions on the change processes present in networks. We propose EvoMine, a frequent graph evolution rule mining method that, for the first time, supports networks with edge insertions and deletions as well as node and edge relabelings. EvoMine defines embedding-based and event-based support as two novel measures to assess the frequency of rules. These measures are based on novel mappings from dynamic networks to databases of union graphs that retain all evolution information relevant for rule mining. Using these mappings the rule mining problem can be solved by frequent subgraph mining. We evaluate our approach and two baseline algorithms on several real datasets. To the best of our knowledge, this is the first empirical comparison of rule mining algorithms for dynamic networks.

【Keywords】: Databases; Frequency measurement; Data mining; Heuristic algorithms; Network topology; Topology; Conferences

149. Reliable Semi-supervised Learning.

Paper Link】 【Pages】:1197-1202

【Authors】: Junming Shao ; Chen Huang ; Qinli Yang ; Guangchun Luo

【Abstract】: In this paper, we propose a Reliable Semi-Supervised Learning framework, called ReSSL, for both static and streaming data. Instead of relaxing different assumptions, we model the reliability of the cluster assumption, quantify the distinct importance of clusters (or evolving micro-clusters on data streams), and integrate the cluster-level information and labeled data for prediction within a lazy learning framework. Extensive experiments demonstrate that our method performs well compared to state-of-the-art algorithms on data sets in both static and real streaming environments.

【Keywords】: Reliability; Semisupervised learning; Clustering algorithms; Data models; Training data; Supervised learning; Particle separators

150. Online Unsupervised Multi-view Feature Selection.

Paper Link】 【Pages】:1203-1208

【Authors】: Weixiang Shao ; Lifang He ; Chun-Ta Lu ; Xiaokai Wei ; Philip S. Yu

【Abstract】: In this paper, we propose an Online unsupervised Multi-View Feature Selection method, OMVFS, which deals with large-scale/streaming multi-view data in an online fashion. OMVFS embeds unsupervised feature selection into a clustering algorithm via nonnegative matrix factorization with sparse learning. It further incorporates the graph regularization to preserve the local structure information and help select discriminative features. Instead of storing all the historical data, OMVFS processes the multi-view data chunk by chunk and aggregates all the necessary information into several small matrices. By using the buffering technique, the proposed OMVFS can reduce the computational and storage cost while taking advantage of the structure information. Furthermore, OMVFS can capture the concept drifts in the data streams. Extensive experiments on four real-world datasets show the effectiveness and efficiency of the proposed OMVFS method. More importantly, OMVFS is about 100 times faster than the off-line methods.
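
As a rough illustration of the chunk-by-chunk aggregation idea, the generic online-NMF skeleton below keeps only two small sufficient-statistic matrices instead of the raw history; OMVFS itself additionally factorizes multiple views with graph regularization and sparse learning, so this is only a structural sketch:

```python
import numpy as np

eps = 1e-9
rng = np.random.default_rng(0)
d, k = 50, 5                 # feature dimension, number of factors/clusters
W = rng.random((d, k))       # shared basis, refined as chunks arrive
A = np.zeros((k, k))         # running sum of H H^T
B = np.zeros((d, k))         # running sum of X H^T

def process_chunk(X, inner=30):
    """Consume one nonnegative data chunk X (d x n_t), updating the basis W
    from the small aggregates A and B rather than all historical data."""
    global W, A, B
    H = rng.random((k, X.shape[1]))
    for _ in range(inner):                     # solve H with W fixed
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # multiplicative update
    A += H @ H.T
    B += X @ H.T
    W *= B / (W @ A + eps)                     # refresh the shared basis
    return H

for _ in range(10):                            # simulated stream of chunks
    process_chunk(np.abs(rng.normal(size=(d, 200))))
```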

【Keywords】: Linear programming; Optimization; Aggregates; Electronic mail; Sparse matrices; Buffer storage; Laplace equations

151. Prefix and Suffix Invariant Dynamic Time Warping.

Paper Link】 【Pages】:1209-1214

【Authors】: Diego Furtado Silva ; Gustavo E. A. P. A. Batista ; Eamonn J. Keogh

【Abstract】: While there exist a plethora of classification algorithms for most data types, there is an increasing acceptance that the unique properties of time series mean that the combination of nearest neighbor classifiers and Dynamic Time Warping (DTW) is very competitive across a host of domains, from medicine to astronomy to environmental sensors. While there has been significant progress in improving the efficiency and effectiveness of DTW in recent years, in this work we demonstrate that an underappreciated issue can significantly degrade the accuracy of DTW in real-world deployments. This issue has probably escaped the attention of the very active time series research community because of its reliance on static, highly contrived benchmark datasets, rather than real-world dynamic datasets where the problem tends to manifest itself. In essence, the issue is that DTW's eponymous invariance to warping is only true for the main "body" of the two time series being compared. However, for the "head" and "tail" of the time series, the DTW algorithm affords no warping invariance. The effect of this is that tiny differences at the beginning or end of the time series (which may be either consequential or simply the result of poor "cropping") will tend to contribute disproportionately to the estimated similarity, producing incorrect classifications. In this work, we show that this effect is real, and reduces the performance of the algorithm. We further show that we can fix the issue with a subtle redesign of the DTW algorithm, and that we can learn an appropriate setting for the extra parameter we introduced. We further demonstrate that our generalization is amenable to all the optimizations that make DTW tractable for large datasets.
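
The endpoint relaxation the abstract describes can be sketched as follows: the warping path is allowed to enter and leave the alignment matrix up to r points away from the corners, with r standing in for the extra parameter the authors learn (an illustrative reconstruction, not their exact formulation):

```python
import numpy as np

def relaxed_dtw(x, y, r=3):
    """DTW whose path may skip up to r points at the head and tail of
    either series, removing the endpoint sensitivity described above.
    Requires r < min(len(x), len(y))."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :r + 1] = 0.0               # free entry along the head of y
    D[:r + 1, 0] = 0.0               # free entry along the head of x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # free exit along the tails of x and y
    return np.sqrt(min(D[n, m - r:].min(), D[n - r:, m].min()))
```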

【Keywords】: Time series analysis; Classification algorithms; Heuristic algorithms; Training; Data mining; Time measurement; Weapons

152. Mining Summaries for Knowledge Graph Search.

Paper Link】 【Pages】:1215-1220

【Authors】: Qi Song ; Yinghui Wu ; Xin Luna Dong

【Abstract】: Mining and searching heterogeneous and large knowledge graphs is challenging under real-world resource constraints such as response time. This paper studies a framework that discovers summaries to facilitate knowledge graph search. 1) We introduce a class of summaries characterized by graph patterns. In contrast to conventional summaries defined by frequent subgraphs, these summaries are capable of adaptively summarizing entities with similar neighbors up to a bounded hop. 2) We formulate the computation of graph summarization as a bi-criteria pattern mining problem. Given a knowledge graph G, the problem is to discover k diversified summaries that maximize an informativeness measure. Although this problem is NP-hard, we show that it is 2-approximable. We also introduce an online mining algorithm that trades off speed and accuracy under given resource constraints. 3) We develop query evaluation algorithms that make use of the summaries as views. These algorithms efficiently compute (approximate) answers with high accuracy, and only refer to a small number of summaries. Our experimental study verifies that online mining over large knowledge graphs is feasible, and can support bounded search in knowledge graphs.

【Keywords】: Query processing; Approximation algorithms; Data mining; Redundancy; Time factors; Knowledge based systems; Inspection

153. Structure Selection for Convolutive Non-negative Matrix Factorization Using Normalized Maximum Likelihood Coding.

Paper Link】 【Pages】:1221-1226

【Authors】: Atsushi Suzuki ; Kohei Miyaguchi ; Kenji Yamanishi

【Abstract】: Convolutive non-negative matrix factorization (CNMF) is a promising method for extracting features from sequential multivariate data. Conventional algorithms for CNMF require that the structure, or the number of bases for expressing the data, be specified in advance. We are concerned with the issue of how we can select the best structure of CNMF from given data. We first introduce a framework of probabilistic modeling of CNMF and reduce this issue to statistical model selection. The problem here is that conventional model selection criteria such as AIC, BIC and MDL cannot straightforwardly be applied, since the probabilistic model for CNMF is irregular in the sense that parameters are not uniquely identifiable. We overcome this problem and propose a novel criterion for best structure selection for CNMF. The key idea is to apply the technique of latent variable completion in combination with the normalized maximum likelihood coding criterion under the minimum description length principle. We empirically demonstrate the effectiveness of our method using artificial and real data sets.
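
For reference, the generic normalized maximum likelihood criterion underlying this kind of structure selection takes the form below (a sketch of the standard NML code length; the paper's contribution is making it computable for the irregular CNMF model via latent variable completion):

```latex
% NML code length for model class k with maximum likelihood estimator
% \hat{\theta}_k: goodness of fit plus parametric complexity. The best
% structure minimizes the total code length.
\mathrm{NML}(x;k) = -\log p\bigl(x \mid \hat{\theta}_k(x)\bigr)
  + \log \int p\bigl(y \mid \hat{\theta}_k(y)\bigr)\,dy,
\qquad
\hat{k} = \arg\min_k \mathrm{NML}(x;k).
```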

【Keywords】: Probabilistic logic; Feature extraction; Convolution; Encoding; Data mining; Information science; Data models

154. Modeling Real Estate for School District Identification.

Paper Link】 【Pages】:1227-1232

【Authors】: Fei Tan ; Chaoran Cheng ; Zhi Wei

【Abstract】: The affiliated school district of a real estate property is often a crucial concern. How to automate the identification of residential homes located in a favorable educational environment, however, has remained largely unexplored. The availability of heterogeneous estate-related data offers a great opportunity for this task. Nevertheless, it is such heterogeneity that poses significant challenges to its amalgamation in a unified fashion. To this end, we develop the G-LRMM model to integrate digital price, textual comments, and geographical location information together. The proposed approach is able to capture the in-depth interactions among multi-type data. The evaluation on a dataset from the Beijing property market justifies the benefits of our approach over baselines. A further comparison among the different components is also conducted and demonstrates their respective roles. Moreover, the proposed model can offer useful insights into modeling heterogeneous data sources.

【Keywords】: Mixture models; Data models; Education; Appraisal; Parameter estimation; Mathematical model; Computational modeling

155. Feature Grouping Using Weighted l1 Norm for High-Dimensional Data.

Paper Link】 【Pages】:1233-1238

【Authors】: Bhanukiran Vinzamuri ; Karthik K. Padthe ; Chandan K. Reddy

【Abstract】: Building effective predictive models from high-dimensional data is an important problem in several domains such as bioinformatics, healthcare analytics and general regression analysis. Extracting feature groups automatically from such data with several correlated features is necessary, in order to use regularizers such as the group lasso which can exploit this deciphered grouping structure to build effective prediction models. Elastic net, fused-lasso and Octagonal Shrinkage Clustering Algorithm for Regression (oscar) are some of the popular feature grouping methods proposed in the literature which recover both sparsity and feature groups from the data. However, their predictive ability is affected adversely when the regression coefficients of adjacent feature groups are similar, but not exactly equal. This happens because these methods erroneously merge such adjacent feature groups, which is also called the misfusion problem. In order to solve this problem, in this paper we propose a weighted l1 norm-based approach which is effective at recovering feature groups despite the proximity of the coefficients of adjacent feature groups, and which builds highly accurate predictive models. This convex optimization problem is solved using the fast iterative soft-thresholding algorithm (FISTA). We show that our approach is more effective at resolving the misfusion problem on synthetic datasets than existing feature grouping methods such as the elastic net, fused-lasso and oscar. We also evaluate the goodness of the model on real-world breast cancer gene expression and 20-Newsgroups datasets.
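
Since the solver is FISTA, a self-contained sketch for a weighted l1 objective is given below; the weight vector w is a placeholder for the paper's grouping-aware weighting scheme:

```python
import numpy as np

def soft_threshold(v, thresh):
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def fista_weighted_l1(A, b, w, lam=0.1, iters=500):
    """Solve min_x 0.5*||Ax - b||^2 + lam * sum_i w_i*|x_i| with FISTA.
    w is a nonnegative per-coordinate weight vector (np.ndarray)."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x_prev = np.zeros(A.shape[1])
    y, t_prev = x_prev.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - grad / L, lam * w / L)   # weighted prox step
        t = (1 + np.sqrt(1 + 4 * t_prev ** 2)) / 2
        y = x + ((t_prev - 1) / t) * (x - x_prev)       # momentum extrapolation
        x_prev, t_prev = x, t
    return x_prev
```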

【Keywords】: high-dimensional data; regression; regularization; feature grouping

156. Learning Task Relational Structure for Multi-task Feature Learning.

Paper Link】 【Pages】:1239-1244

【Authors】: De Wang ; Feiping Nie ; Heng Huang

【Abstract】: In multi-task learning, it is paramount to discover the relational structure of tasks and utilize the learned task structure. Previous works have used the low-rank latent feature subspace to capture the task relations, and some of them aim to learn the group-based relational structure of tasks. However, in many cases, the low-rank subspace may not exist for a specific group of tasks, so this paradigm would not work. To discover the task relational structures, we propose a novel multi-task learning method using structured sparsity-inducing norms to automatically uncover the relations of tasks. Instead of imposing the low-rank constraint, our new model uses a more meaningful assumption, in which tasks from the same relational group should share a common feature subspace. We can discover the group relational structure of tasks and learn the shared feature subspace for each task group, which helps improve the predictive performance. Our proposed algorithm avoids the high computational complexity of integer programming, and thus converges very fast. Empirical studies conducted on both synthetic and real-world data show that our method consistently outperforms related multi-task learning methods.

【Keywords】: Linear programming; Knowledge transfer; Optimization; Algorithm design and analysis; Learning systems; Computational modeling; Computational efficiency

157. Multi-view Clustering via Concept Factorization with Local Manifold Regularization.

Paper Link】 【Pages】:1245-1250

【Authors】: Hao Wang ; Yan Yang ; Tianrui Li

【Abstract】: Real-world datasets often have representations in multiple views or come from multiple sources. Exploiting consistent or complementary information from multi-view data, multi-view clustering aims to achieve better clustering quality than is possible from any individual view. In this paper, we propose a novel multi-view clustering method called multi-view concept clustering, based on concept factorization with local manifold regularization, which drives a common consensus representation for multiple views. The local manifold regularization is incorporated into concept factorization to preserve the locally geometrical structure of the data space. Moreover, the weight of each view is learnt automatically and a co-normalized approach is designed to make fusion meaningful in terms of driving the common consensus representation. An iterative optimization algorithm based on multiplicative rules is developed to minimize the objective function. Experimental results on nine real-world datasets from different fields demonstrate that the proposed method performs better than several state-of-the-art multi-view clustering methods.

【Keywords】: Manifolds; Optimization; Linear programming; Kernel; Sparse matrices; Symmetric matrices; Loss measurement

158. A Fast Distributed Classification Algorithm for Large-Scale Imbalanced Data.

Paper Link】 【Pages】:1251-1256

【Authors】: Huihui Wang ; Yang Gao ; Yinghuan Shi ; Hao Wang

【Abstract】: The Alternating Direction Method of Multipliers (ADMM) has been developed recently for distributed classification. Nevertheless, the widely-existing class imbalance problem has not been well investigated. Furthermore, previous imbalanced classification methods have made little effort to study the complex imbalance problem in a distributed environment. In this paper, we consider the imbalance problem as distributed data imbalance, which includes three imbalance issues: (i) within-node class imbalance, (ii) between-node class imbalance, and (iii) between-node structure imbalance. In order to adequately deal with imbalanced data as well as improve time efficiency, a novel distributed Cost-Sensitive classification algorithm via Group-based ADMM (CS-GADMM) is proposed. Briefly, CS-GADMM derives the classification problem as a series of sub-problems with within-node class imbalance. To alleviate the time delay caused by between-node class imbalance, we propose an extension of the dual coordinate descent method for the sub-problem optimization. Meanwhile, for between-node structure imbalance, we carefully study the relationship between local functions, and combine the resulting local variables intra-group to update the global variables for prediction. The experimental results on various imbalanced datasets validate that CS-GADMM is an efficient algorithm for imbalanced classification.
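
For orientation, the consensus ADMM iterations this line of work builds on have the generic (scaled) form below; CS-GADMM's cost-sensitive, group-based variant modifies these updates:

```latex
% Consensus ADMM over K nodes with local losses f_k, global consensus
% variable z, and scaled dual variables u_k:
x_k^{t+1} = \arg\min_{x_k} \; f_k(x_k) + \tfrac{\rho}{2}\bigl\|x_k - z^{t} + u_k^{t}\bigr\|_2^2,
\qquad
z^{t+1} = \tfrac{1}{K}\sum_{k=1}^{K}\bigl(x_k^{t+1} + u_k^{t}\bigr),
\qquad
u_k^{t+1} = u_k^{t} + x_k^{t+1} - z^{t+1}.
```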

【Keywords】: Distributed databases; Optimization; Delay effects; Convergence; Support vector machines; Data mining; Software

159. Differential Location Privacy for Sparse Mobile Crowdsensing.

Paper Link】 【Pages】:1257-1262

【Authors】: Leye Wang ; Daqing Zhang ; Dingqi Yang ; Brian Y. Lim ; Xiaojuan Ma

【Abstract】: Sparse Mobile Crowdsensing (MCS) has become a compelling approach to acquire and make inference on urban-scale sensing data. However, participants risk their location privacy when reporting data with their actual sensing positions. To address this issue, we adopt ε-differential-privacy in Sparse MCS to provide a theoretical guarantee for participants' location privacy regardless of an adversary's prior knowledge. Furthermore, to reduce the data quality loss caused by differential location obfuscation, we propose a privacy-preserving framework with three components. First, we learn a data adjustment function to fit the original sensing data to the obfuscated location. Second, we apply a linear program to select an optimal location obfuscation function, which aims to minimize the uncertainty in data adjustment. We also propose a fast approximated variant. Third, we propose an uncertainty-aware inference algorithm to improve the inference accuracy of obfuscated data. Evaluations with real environment and traffic datasets show that our optimal method reduces the data quality loss by up to 42% compared to existing differential privacy methods.
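
The guarantee being adopted has the following generic shape (a sketch; the paper's precise definition fixes which pairs of candidate locations are constrained):

```latex
% \varepsilon-differential privacy for location obfuscation: for any two
% candidate true locations l_1, l_2 and any released location o*, the
% obfuscation mechanism P must satisfy
\Pr\bigl[P(l_1) = o^{*}\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[P(l_2) = o^{*}\bigr],
% so no released output lets an adversary confidently distinguish l_1
% from l_2, whatever its prior knowledge.
```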

【Keywords】: Uncertainty; Privacy; Data privacy; Servers; Mobile communication; Temperature sensors

160. Robust Convex Clustering Analysis.

Paper Link】 【Pages】:1263-1268

【Authors】: Qi Wang ; Pinghua Gong ; Shiyu Chang ; Thomas S. Huang ; Jiayu Zhou

【Abstract】: Clustering is an unsupervised learning approach that explores data and seeks groups of similar objects. Many classical clustering models such as k-means and DBSCAN are based on heuristic algorithms and suffer from locally optimal solutions and numerical instability. Recently, convex clustering, which leverages sparsity-inducing norms and enjoys many attractive theoretical properties, has received increasing attention. However, convex clustering is based on Euclidean distance and is thus not robust against outlier features. Since outlier features are very common, especially when the dimensionality is high, this vulnerability has greatly limited the applicability of convex clustering for analyzing many real-world datasets. In this paper, we address the challenge by proposing a novel robust convex clustering method that simultaneously performs convex clustering and identifies outlier features. Specifically, the proposed method learns to decompose the data matrix into a clustering structure component and a group sparse component that captures feature outliers. We develop a block coordinate descent algorithm which iteratively performs convex clustering after outlier features are identified and eliminated. We also propose an efficient algorithm for solving the convex clustering by exploiting the structure of its dual problem. Moreover, to further illustrate the statistical stability, we present a theoretical performance bound for the proposed clustering method. Empirical studies on synthetic data and real-world data demonstrate that the proposed robust convex clustering can detect feature outliers as well as improve cluster quality.

【Keywords】: Robustness; Clustering algorithms; Clustering methods; Optimization; Sparse matrices; Convex functions; Unsupervised learning

161. Bayesian Rule Sets for Interpretable Classification.

Paper Link】 【Pages】:1269-1274

【Authors】: Tong Wang ; Cynthia Rudin ; Finale Doshi-Velez ; Yimin Liu ; Erica Klampfl ; Perry MacNeille

【Abstract】: A Rule Set model consists of a small number of short rules for interpretable classification, where an instance is classified as positive if it satisfies at least one of the rules. The rule set provides reasons for predictions, and also descriptions of a particular class. We present a Bayesian framework for learning Rule Set models, with prior parameters that the user can set to encourage the model to have a desired size and shape in order to conform with a domain-specific definition of interpretability. We use an efficient inference approach for searching for the MAP solution and provide theoretical bounds to reduce computation. We apply Rule Set models to ten UCI data sets and compare the performance with other interpretable and non-interpretable models.
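
The prediction semantics of a Rule Set model fits in a few lines; the sketch below, with hypothetical rules, shows the at-least-one-rule classification the abstract refers to (the paper's Bayesian machinery concerns learning the rules themselves):

```python
# Each rule is a conjunction of conditions on features; an instance is
# classified positive if it satisfies at least one rule. The rules here
# are hypothetical, purely for illustration.
RULES = [
    {"age": lambda v: v > 60, "smoker": lambda v: v is True},  # rule 1
    {"bmi": lambda v: v >= 35},                                # rule 2
]

def predict(instance: dict) -> bool:
    return any(all(cond(instance[feat]) for feat, cond in rule.items())
               for rule in RULES)

print(predict({"age": 67, "smoker": True, "bmi": 24}))  # True, via rule 1
```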

【Keywords】: Bayes methods; Computational modeling; Data models; Simulated annealing; Search problems; Predictive models; Data mining

162. The Development of a Smart Taxicab Scheduling System: A Multi-source Data Fusion Perspective.

Paper Link】 【Pages】:1275-1280

【Authors】: Yang Wang ; Binxin Liang ; Wei Zheng ; Liusheng Huang ; Hengchang Liu

【Abstract】: Recent advances in vehicular networks, GPS and smartphone technologies have changed the paradigm of intelligent taxicab systems. Indeed, taxicab trajectories and online calling information have enabled us to provide more efficient and personalized services. However, existing approaches are not sufficient in exploiting cooperative scheduling techniques and utilizing real time calling information. To this end, in this paper, we model the time-varying regularities of traffic flows, activity ratios of passengers, and unoccupied taxicabs of road segments by mining statistical data on taxicab trajectories. Along this line, we propose a novel approach to calculate the expected revenue of possible routes for individual taxicabs while considering the influence of others, and at the same time, advance a dynamic taxicab scheduling mechanism with online taxicab calling information. Finally, we evaluate our algorithm on real-world taxicab data. Experimental results demonstrate that our approach outperforms existing alternative solutions in terms of average revenue of taxi drivers.

【Keywords】: Roads; Public transportation; Vehicles; Real-time systems; Trajectory; Scheduling

163. Selecting Valuable Customers for Merchants in E-Commerce Platforms.

Paper Link】 【Pages】:1281-1286

【Authors】: Yijun Wang ; Le Wu ; Zongda Wu ; Enhong Chen ; Qi Liu

【Abstract】: An e-commerce website provides a platform for merchants to sell products to customers. While most existing research focuses on providing customers with personalized product suggestions via recommender systems, in this paper we consider the role of merchants and introduce a parallel problem: how to select the most valuable customers for a merchant? Accurately answering this question can not only help merchants gain more profit, but also benefit the ecosystem of e-commerce platforms. To deal with this problem, we propose a general approach that takes into consideration both the interest and the profit of each customer to the merchant, i.e., we select customers who are not only interested in the merchant, ensuring visits to the merchant, but also capable of generating good profits. Specifically, we first generate candidate customers for a given merchant by using traditional recommendation techniques. Then we select a set of valuable customers from the candidates that balances the maximization of the interest and profit metrics. Given the NP-hardness of the balanced maximization formulation, we further introduce efficient techniques to solve the maximization problem by exploiting its inherent submodularity. Finally, extensive experimental results on a real-world dataset demonstrate the effectiveness of our proposed approach.
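
Exploiting submodularity typically means running the standard greedy routine with its (1 - 1/e) approximation guarantee; the sketch below uses a toy coverage function as a stand-in for the paper's balanced interest/profit objective:

```python
def greedy_select(candidates, F, k):
    """Greedy maximization of a monotone submodular set function F
    under a cardinality budget k."""
    S = []
    for _ in range(k):
        best = max((c for c in candidates if c not in S),
                   key=lambda c: F(S + [c]) - F(S))  # largest marginal gain
        S.append(best)
    return S

# Toy objective: how many distinct products the selected customers cover.
covers = {"u1": {1, 2}, "u2": {2, 3, 4}, "u3": {5}}   # hypothetical customers
F = lambda S: len(set().union(*[covers[c] for c in S]))
print(greedy_select(list(covers), F, 2))              # ['u2', 'u1']
```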

【Keywords】: Recommender systems; Measurement; Mathematical model; Data mining; Collaboration; TV; Business

164. Canonical Consistent Weighted Sampling for Real-Value Weighted Min-Hash.

Paper Link】 【Pages】:1287-1292

【Authors】: Wei Wu ; Bin Li ; Ling Chen ; Chengqi Zhang

【Abstract】: Min-Hash, as a member of the Locality Sensitive Hashing (LSH) family for sketching sets, plays an important role in the big data era. It is widely used for efficiently estimating similarities of bag-of-words represented data and has been extended to dealing with multi-sets and real-value weighted sets. Improved Consistent Weighted Sampling (ICWS) has been recognized as the state-of-the-art for real-value weighted Min-Hash. However, the algorithmic implementation of ICWS is flawed because it violates the uniformity of the Min-Hash scheme. In this paper, we propose a Canonical Consistent Weighted Sampling (CCWS) algorithm, which not only retains the same theoretical complexity as ICWS but also strictly complies with the definition of Min-Hash. The experimental results demonstrate that the proposed CCWS algorithm runs faster than the state-of-the-arts while achieving similar classification performance on a number of real-world text data sets.
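
For context, here is a sketch of the ICWS baseline (Ioffe, 2010) that CCWS revisits; fixing the per-dimension random draws with a shared seed is what makes the sampling "consistent" across sets:

```python
import numpy as np

def icws_sample(weights, seed=0):
    """One weighted MinHash draw via Improved Consistent Weighted Sampling.
    `weights` is a dense nonnegative vector; two sets produce the same
    (dimension, t) pair with probability equal to their generalized
    Jaccard similarity."""
    rng = np.random.default_rng(seed)   # shared randomness across all sets
    d = len(weights)
    r = rng.gamma(2.0, 1.0, d)          # per-dimension Gamma(2,1) draws
    c = rng.gamma(2.0, 1.0, d)
    beta = rng.uniform(0.0, 1.0, d)
    k = np.flatnonzero(weights > 0)     # restrict to active dimensions
    t = np.floor(np.log(weights[k]) / r[k] + beta[k])
    y = np.exp(r[k] * (t - beta[k]))
    a = c[k] / (y * np.exp(r[k]))
    i = np.argmin(a)
    return k[i], t[i]
```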

【Keywords】: Quantization (signal); Indexes; Complexity theory; Approximation algorithms; Australia; Electronic mail; Big data

165. Spectrum-Revealing Cholesky Factorization for Kernel Methods.

Paper Link】 【Pages】:1293-1298

【Authors】: Jianwei Xiao ; Ming Gu

【Abstract】: Kernel methods represent some of the most popular machine learning tools for data analysis. Since exact kernel methods can be prohibitively expensive for large problems, reliable low-rank matrix approximations and high-performance implementations have become indispensable for practical applications of kernel methods. In this work, we introduce spectrum-revealing Cholesky factorization, a reliable low-rank matrix factorization, for kernel matrix approximation. We also develop an efficient and effective randomized algorithm for computing this factorization. Our numerical experiments demonstrate that this algorithm is as effective as other Cholesky factorization based kernel methods on machine learning problems, but significantly faster.
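
As background, the vanilla greedy pivoted (partial) Cholesky factorization that spectrum-revealing Cholesky strengthens with randomized pivot selection looks like this (a sketch without numerical safeguards):

```python
import numpy as np

def pivoted_cholesky(K, rank):
    """Rank-`rank` partial Cholesky K ~= L @ L.T, greedily pivoting on the
    largest remaining diagonal of the Schur complement."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()
    L = np.zeros((n, rank))
    for j in range(rank):
        p = np.argmax(d)                             # pivot selection
        L[:, j] = (K[:, p] - L @ L[p]) / np.sqrt(d[p])
        d -= L[:, j] ** 2                            # update Schur diagonal
    return L
```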

【Keywords】: Kernel; Approximation algorithms; Reliability; Algorithm design and analysis; Machine learning algorithms; Complexity theory; Prediction algorithms

166. Topic Discovery for Short Texts Using Word Embeddings.

Paper Link】 【Pages】:1299-1304

【Authors】: Guangxu Xun ; Vishrawas Gopalakrishnan ; Fenglong Ma ; Yaliang Li ; Jing Gao ; Aidong Zhang

【Abstract】: Discovering topics in short texts, such as news titles and tweets, has become an important task for many content analysis applications. However, due to the lack of rich context information in short texts, the performance of conventional topic models on short texts is usually unsatisfying. In this paper, we propose a novel topic model for short text corpora using word embeddings. Continuous-space word embeddings, which are proven effective at capturing regularities in language, are incorporated into our model to provide additional semantics. Thus we model each short document as a Gaussian topic over word embeddings in the vector space. In addition, considering that background words in a short text are usually not semantically related, we introduce a discrete background model over word types to complement the continuous Gaussian topics. We evaluate our model on news titles from data sources like abcnews, showing that our model is able to extract more coherent topics from short texts than the baseline methods and learns better topic representations for each short document.
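
The core modeling move can be stated in one line (a sketch in our notation):

```latex
% A topic k is a Gaussian over the embedding space rather than a discrete
% distribution over the vocabulary, so a word w with embedding v_w has
p(w \mid z = k) = \mathcal{N}\bigl(v_w \mid \mu_k, \Sigma_k\bigr).
```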

【Keywords】: Semantics; Gaussian distribution; Internet; Encyclopedias; Electronic publishing; Context

167. Dynamic Contextual Multi Arm Bandits in Display Advertisement.

Paper Link】 【Pages】:1305-1310

【Authors】: Hongxia Yang ; Quan Lu

【Abstract】: We model the ad selection task as a multi-armed bandit problem. Standard assumptions in the multi-armed bandit (MAB) setting are that samples drawn from each arm are independent and identically distributed, rewards (or conversion rates in our scenario) are stationary, and reward feedback is immediate. Although the payoff function of an arm is allowed to evolve over time, the evolution is assumed to be slow. Display ads, on the other hand, are regularly created while others are removed from circulation. This can occur when budgets run out, campaign goals change, the holiday season ends, or due to many other latent factors beyond the control of the ad selection system. Another big challenge is that the set of available ads is often extremely large, but standard multi-armed bandit strategies converge with linear time complexity and cannot accommodate such dynamic changes. Due to the above challenges and the restrictions of the original MAB, we propose a novel dynamic contextual MAB which tightly integrates components for dynamic conversion rate prediction, contextual learning and arm overlap modeling in a principled framework. In addition, we propose an accompanying meta-analysis framework that allows us to draw conclusions from experiments in a more statistically robust manner. We demonstrate on a world-leading demand side platform (DSP) that our framework can effectively discriminate premium arms and significantly outperform some standard variations of MAB in these settings.
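
As a baseline illustration of the non-stationary setting described here (not the paper's framework, which further adds contextual learning and arm-overlap modeling), a discounted Thompson sampler decays old evidence so drifting conversion rates can be tracked:

```python
import numpy as np

class DiscountedTS:
    """Bernoulli Thompson sampling with exponential forgetting, so the
    posterior tracks conversion rates that drift over time."""
    def __init__(self, n_arms, gamma=0.99):
        self.a = np.ones(n_arms)   # Beta successes + 1
        self.b = np.ones(n_arms)   # Beta failures + 1
        self.gamma = gamma         # forgetting factor

    def select(self):
        return int(np.argmax(np.random.beta(self.a, self.b)))

    def update(self, arm, reward):
        self.a = 1 + self.gamma * (self.a - 1)   # decay old evidence
        self.b = 1 + self.gamma * (self.b - 1)
        self.a[arm] += reward                    # reward in {0, 1}
        self.b[arm] += 1 - reward
```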

【Keywords】: Standards; Predictive models; Digital signal processing; Stochastic processes; Heuristic algorithms; Context; Algorithm design and analysis

168. Functional Regression with Mode-Sparsity Constraint.

Paper Link】 【Pages】:1311-1316

【Authors】: Pei Yang ; Jingrui He

【Abstract】: Functional data is ubiquitous in many domains such as healthcare, social media, manufacturing process, sensor networks, etc. Functional data analysis involves the analysis of data which is treated as infinite-dimensional continuous functions rather than discrete, finite-dimensional vectors. In this paper, we propose a novel function-on-function regression model based on mode-sparsity regularization. The main idea is to represent the regression coefficient function between predictor and response as the double expansion of basis functions, and then use mode-sparsity constraint to automatically filter out the irrelevant basis functions for both predictors and responses. The mode-sparsity regularization covers a wide spectrum of sparse models for function-on-function regression. The resulting optimization problem is challenging due to the non-smooth property of the mode-sparsity. We develop an efficient and convergence-guaranteed algorithm to solve the problem. The effectiveness of the proposed approach is verified on benchmark functional data sets in various domains.

【Keywords】: Predictive models; Optimization; Linear programming; Data models; Convergence; Additives; Data mining

169. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets.

Paper Link】 【Pages】:1317-1322

【Authors】: Chin-Chia Michael Yeh ; Yan Zhu ; Liudmila Ulanova ; Nurjahan Begum ; Yifei Ding ; Hoang Anh Dau ; Diego Furtado Silva ; Abdullah Mueen ; Eamonn J. Keogh

【Abstract】: The all-pairs-similarity-search (or similarity join) problem has been extensively studied for text and a handful of other datatypes. However, surprisingly little progress has been made on similarity joins for time series subsequences. The lack of progress probably stems from the daunting nature of the problem. For even modest sized datasets the obvious nested-loop algorithm can take months, and the typical speed-up techniques in this domain (i.e., indexing, lower-bounding, triangular-inequality pruning and early abandoning) at best produce one or two orders of magnitude speedup. In this work we introduce a novel scalable algorithm for time series subsequence all-pairs-similarity-search. For exceptionally large datasets, the algorithm can be trivially cast as an anytime algorithm and produce high-quality approximate solutions in reasonable time. The exact similarity join algorithm computes the answer to the time series motif and time series discord problem as a side-effect, and our algorithm incidentally provides the fastest known algorithm for both these extensively-studied problems. We demonstrate the utility of our ideas for two time series data mining problems, including motif discovery and novelty discovery.
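
A definition-level sketch of the matrix profile, the object the paper shows how to compute far faster: for each length-m subsequence, the z-normalized distance to its nearest non-trivial match, whose minima are motifs and maxima are discords:

```python
import numpy as np

def matrix_profile_naive(T, m):
    """O(n^2 m) reference implementation of the matrix profile of series T
    for subsequence length m (the paper's algorithm is far faster)."""
    n = len(T) - m + 1
    S = np.array([T[i:i + m] for i in range(n)], dtype=float)
    S = (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-12)
    P = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(S - S[i], axis=1)            # distances to all others
        d[max(0, i - m // 2): i + m // 2 + 1] = np.inf  # exclude trivial matches
        P[i] = d.min()
    return P  # low values: motifs; high values: discords
```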

【Keywords】: Time series analysis; Approximation algorithms; Euclidean distance; Data mining; Indexes; Clustering algorithms; Text processing

170. Deep Convolutional Factor Analyser for Multivariate Time Series Modeling.

Paper Link】 【Pages】:1323-1328

【Authors】: Chao Yuan ; Amit Chakraborty

【Abstract】: Deep generative models can perform dramatically better than traditional graphical models in a number of machine learning tasks. However, training such models remains challenging because their latent variables typically do not have an analytical posterior distribution, largely due to the nonlinear activation nodes. We present a deep convolutional factor analyser (DCFA) for multivariate time series modeling. Our network is constructed in a way that bottom layer nodes are independent. Through a process of up-sampling and convolution, higher layer nodes gain more temporal dependency. Our model can thus give a time series different representations at different depths. DCFA only consists of linear Gaussian nodes. Therefore, the posterior distributions of latent variables are also Gaussian and can be estimated easily using standard variational Bayes algorithm. We show that even without nonlinearity the proposed deep model can achieve state-of-the-art results in anomaly detection, classification and clustering using both synthetic and real-world datasets.

【Keywords】: Time series analysis; Hidden Markov models; Training; Analytical models; Convolution; Gaussian distribution; Data models

171. Incorporating Pre-Training in Long Short-Term Memory Networks for Tweets Classification.

Paper Link】 【Pages】:1329-1334

【Authors】: Shuhan Yuan ; Xintao Wu ; Yang Xiang

【Abstract】: The paper presents deep learning models for binary classification of tweets. Our approach is based on the Long Short-Term Memory (LSTM) recurrent neural network and is hence expected to capture long-term dependencies among words. We develop two models for tweets classification. The basic model, called LSTM-TC, takes word embeddings as input, uses the LSTM layer to derive a semantic tweet representation, and applies logistic regression to predict the tweet label. The basic LSTM-TC model, like other deep learning models, requires a large amount of well-labeled training data to achieve good performance. To address this challenge, we further develop an improved model, called LSTM-TC, that incorporates a large amount of weakly-labeled data for classifying tweets. We present two approaches for constructing the weakly-labeled data. One is based on hashtag information and the other is based on the prediction output of a traditional classifier that does not need a large amount of well-labeled training data. Our LSTM-TC model first learns a tweet representation based on the weakly-labeled data, and then trains the logistic regression classifier based on the small amount of well-labeled data. Experimental results show that: (1) the proposed method can be successfully used for tweets classification and outperforms existing state-of-the-art methods, and (2) pre-training the tweet representation, which utilizes weakly-labeled tweets, can significantly improve the accuracy of tweets classification.

【Keywords】: Training; Logistics; Data models; Twitter; Semantics; Tagging; Logic gates

172. Low-Rank Sparse Feature Selection for Patient Similarity Learning.

Paper Link】 【Pages】:1335-1340

【Authors】: Mengting Zhan ; Shilei Cao ; Buyue Qian ; Shiyu Chang ; Jishang Wei

【Abstract】: Comparing and identifying similar patients is a fundamental task in medical domains - an efficient technique can, for example, help doctors to track patient cohorts, compare the effectiveness of treatments, or predict medical outcomes. The goal of patient similarity learning is to derive a clinically meaningful measure to evaluate the similarity amongst patients represented by their key clinical indicators. However, it is challenging to learn such similarity, as medical data are usually high dimensional, heterogeneous, and complex. In addition, a desirable patient similarity measure is dependent on the particular clinical setting, which implies that a supervised learning scheme is more useful in medical domains. To address these challenges, in this paper we present a novel similarity learning approach formulated as a generalized Mahalanobis similarity function with pairwise constraints. Considering that there always exist features that are non-discriminative or contain redundant information, we encode a low-rank structure into our similarity function to perform feature selection. We evaluate the proposed model on both UCI benchmarks and a real clinical dataset for several medical tasks, including patient retrieval, classification, and cohort discovery. The results show that our similarity model significantly outperforms many state-of-the-art baselines, and is effective at removing noisy or redundant features.
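
The similarity family being learned can be written compactly (a sketch in our notation):

```latex
% Generalized Mahalanobis similarity with a low-rank factorization
% M = L^T L; the rank r <= d acts as implicit feature selection, and
% pairwise constraints (similar / dissimilar patient pairs) supervise
% the learning.
s_M(x_i, x_j) = x_i^{\top} M \, x_j,
\qquad M = L^{\top} L \succeq 0, \; L \in \mathbb{R}^{r \times d}.
```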

【Keywords】: Medical diagnostic imaging; Measurement; Medical services; Linear programming; Algorithm design and analysis; Sparse matrices; Electronic mail

173. ROM: A Robust Online Multi-task Learning Approach.

Paper Link】 【Pages】:1341-1346

【Authors】: Chi Zhang ; Peilin Zhao ; Shuji Hao ; Yeng Chai Soh ; Bu-Sung Lee

【Abstract】: A series of online multi-task learning (OMTL) algorithms have been proposed in recent years to avoid the expensive training cost and poor adaptability of traditional batch multi-task learning (MTL) algorithms. However, these OMTL algorithms usually assume that all tasks are closely related, which may not hold in practical scenarios. More importantly, their theoretical reliability is weakened by the lack of proofs on cumulative regret. To overcome these limitations, we present a robust online multi-task classification framework (ROM) and its two optimization algorithms (ROM-PGD, ROM-RDA). The proposed algorithms can not only automatically capture the common features among all tasks and the individual features of each task, but also identify the potential existence of an outlier task. Theoretically, we prove that the regret bounds of these two algorithms are sub-linear compared with the best separating algorithm in hindsight. Empirical studies on both synthetic and real-world datasets also demonstrate the effectiveness of our proposed algorithms when compared with state-of-the-art OMTL algorithms.

【Keywords】: Read only memory; Optimization; Robustness; Prediction algorithms; Matrix decomposition; Algorithm design and analysis; Electronic mail

174. Scalable Discrete Supervised Hash Learning with Asymmetric Matrix Factorization.

Paper Link】 【Pages】:1347-1352

【Authors】: Shifeng Zhang ; Jianmin Li ; Jinma Guo ; Bo Zhang

【Abstract】: Hashing methods map similar data to binary hashcodes with smaller Hamming distances, and have received broad attention due to their low storage cost and fast retrieval speed. However, existing limitations make it difficult for present algorithms to deal with large-scale datasets: (1) discrete constraints are involved in the learning of the hash function, and (2) pairwise or triplet similarity is adopted to generate efficient hashcodes, so that both time and space complexity are greater than O(n²). To address these issues, we propose a novel discrete supervised hash learning framework which can scale to large-scale datasets. First, the learning procedure is decomposed into a binary classifier learning scheme and a hashcode learning scheme. Second, we adopt Asymmetric Low-rank Matrix Factorization and propose the Fast Clustering-based Batch Coordinate Descent method, such that the time and space complexity is reduced to O(n). The proposed framework also provides a flexible paradigm to incorporate arbitrary hash functions, including deep neural networks. Experiments on large-scale datasets demonstrate that the proposed method is superior or comparable to state-of-the-art hashing algorithms.

【Keywords】: Binary codes; Training data; Training; Neural networks; Matrix decomposition; Semantics; Quantization (signal)

175. Multi-type Co-clustering of General Heterogeneous Information Networks via Nonnegative Matrix Tri-Factorization.

Paper Link】 【Pages】:1353-1358

【Authors】: Xianchao Zhang ; Haixin Li ; Wenxin Liang ; Jiebo Luo

【Abstract】: Many kinds of real-world data can be modeled by a heterogeneous information network (HIN) which consists of multiple types of objects. Clustering plays an important role in mining knowledge from HINs. Several HIN clustering algorithms have been proposed in recent years. However, these algorithms suffer from one or more of the following problems: (1) inability to model general HINs, (2) inability to simultaneously generate clusters for all types of objects, (3) inability to use similarity information of objects of the same type. In this paper, we propose a powerful HIN clustering algorithm which can handle general HINs, simultaneously generate clusters for all types of objects, and use the similarity information of objects of the same type. First, we transform a general HIN into a meta-path-encoded relationship set. Second, we propose a nonnegative matrix tri-factorization multi-type co-clustering method, HMFClus, to cluster all types of objects in a HIN simultaneously. Third, we integrate the information between objects of the same type into HMFClus by using a similarity regularization. Extensive experiments on real-world datasets show that the proposed algorithm outperforms the state-of-the-art methods.

【Keywords】: Clustering algorithms; Data mining; Transforms; Software; Electronic mail; Semantics; Conferences

176. Dynamic Poisson Factor Analysis.

Paper Link】 【Pages】:1359-1364

【Authors】: Yizhe Zhang ; Yue Zhao ; Lawrence David ; Ricardo Henao ; Lawrence Carin

【Abstract】: We introduce a novel dynamic model for discrete time-series data, in which the temporal sampling may be nonuniform. The model is specified by constructing a hierarchy of Poisson factor analysis blocks, one for the transitions between latent states and the other for the emissions between latent states and observations. Latent variables are binary and linked to Poisson factor analysis via Bernoulli-Poisson specifications. The model is derived for count data but can be readily modified for binary observations. We derive efficient inference via Markov chain Monte Carlo, that scales with the number of non-zeros in the data and latent binary states, yielding significant acceleration compared to related models. Experimental results on benchmark data show the proposed model achieves state-of-the-art predictive performance. Additional experiments on microbiome data demonstrate applicability of the proposed model to interesting problems in computational biology where interpretability is of utmost importance.

【Keywords】: Hidden Markov models; Data models; Analytical models; Biological system modeling; Computational modeling; Load modeling; Correlation

177. Gaussian Component Based Index for GMMs.

Paper Link】 【Pages】:1365-1370

【Authors】: Linfei Zhou ; Bianca Wackersreuther ; Frank Fiedler ; Claudia Plant ; Christian Böhm

【Abstract】: Efficient similarity search for uncertain data is a challenging task in many modern data mining applications such as image retrieval, speaker recognition and stock market analysis. A common way to model the uncertainty of data objects is using probability density functions in the form of Gaussian Mixture Models (GMMs), which have the ability to approximate arbitrary distributions. However, due to the possibly unequal lengths of mixture models, existing index techniques have serious problems with objects modeled by GMMs: either the techniques cannot handle GMMs, or they impose too many limitations. Hence, we propose a dynamic index structure, the Gaussian Component based Index (GCI), for GMMs. GCI decomposes GMMs into singles, pairs, or n-lets of Gaussian components, stores these components in well-studied index trees such as the U-tree and the Gauss-Tree, and refines the corresponding GMMs in a conservative but tight way. GCI supports both k-most-likely queries and probability threshold queries by means of Matching Probability. Extensive experimental evaluations of GCI demonstrate a considerable speed-up of similarity search on both synthetic and real-world data sets.

【Keywords】: Indexes; Probability density function; Gaussian mixture model; Gaussian distribution; Covariance matrices; Mixture models

178. Multi-label Learning with Emerging New Labels.

Paper Link】 【Pages】:1371-1376

【Authors】: Yue Zhu ; Kai Ming Ting ; Zhi-Hua Zhou

【Abstract】: Multi-label learning is widely applied in many tasks, where an object possesses multiple concepts with each represented by a class label. Previous studies on multi-label learning have focused on a fixed set of class labels, i.e., the class label set of test data is the same as that in the training set. In many applications, however, the environment is open and new concepts may emerge with previously unseen instances. In order to maintain good predictive performance in this environment, a multi-label learning method must have the ability to detect and classify those instances with emerging new labels. To this end, we propose a new approach called Multi-label learning with Emerging New Labels (MuENL). It builds models with three functions: classify instances on currently known labels, detect the emergence of a new label in new instances, and construct a new classifier for each new label that works collaboratively with the classifier for known labels. Our empirical evaluation shows the effectiveness of MuENL.

【Keywords】: Detectors; Predictive models; Robustness; Data models; Training; Learning systems; Correlation