Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014. SIAM 【DBLP Link】
【Paper Link】 【Pages】:1-9
【Authors】: Hong Cao ; Chunyu Bao ; Xiaoli Li ; David Yew-Kwong Woon
【Abstract】: Traditional active learning encounters a cold start issue when very few labelled examples are present for learning a decent initial classifier. Its poor quality subsequently affects selection of the next query and stability of the iterative learning process, resulting in more annotation effort from a domain expert. To address this issue, this paper presents a novel class augmentation technique, which enhances each class's representation which initially consists of only limited set of labelled examples. Our augmentation employs a connectivity-based influence computation algorithm with an incorporated decaying mechanism for the unlabelled samples. Besides augmentation, our method also introduces structure preserving oversampling to correct class imbalance. Extensive experiments on ten publicly available data sets demonstrate the effectiveness of our proposed method over existing state-of-the-art methods. Moreover, our proposed modules perform at the fundamental data level without any requirement to modify the well-established standard machine learning tools.
【Keywords】:
【Paper Link】 【Pages】:10-18
【Authors】: Yu Cheng ; Zhengzhang Chen ; Hongliang Fei ; Fei Wang ; Alok N. Choudhary
【Abstract】: We consider the problem of active learning when the categories are represented as a tree with leaf nodes as outputs and internal nodes as clusters of the outputs at multiple granularity. Recent work has improved the traditional techniques by moving beyond “flat” structure through incorporation of the label hierarchy into the uncertainty measure. However, these methods have two major limitations when used. First, these methods roughly use the information in the label structure but do not take into account the training samples, which may lead to a sampling bias due to their crude approximation of the class relations. Second, none of these methods can work in a batch mode to reduce the computational time of training. We propose a batch mode active learning scheme that exploits both the hierarchical structure of the labels and the characteristics of the training data to select the most informative data for human labeling. We achieve this goal by first using an approach based on graph embedding that embeds the relationships between the labels and data points in a transformed low-dimensional space. Then, we compute uncertainty by calculating the variance among the points and the labels in the embedding space. Finally, the selection criterion is designed to construct batches and incorporate a diversity measure. Experimental results indicate that our technique achieves a notable improvement in performance over the state-of-the-art approaches.
【Keywords】:
【Paper Link】 【Pages】:19-27
【Authors】: Ou Wu ; Shuxiao Li ; Honghui Dong ; Ying Chen ; Weiming Hu
【Abstract】: Mining the data source from a crowd of people has elicited increasing attention in recent years. In existing studies, multiple users are utilized, in which each user is generally required to annotate only one attribute for each sample. However, there are cases in numerous annotation tasks wherein despite of the presence of multiple users, each user should classify or rate multiple attributes for each sample. This situation is referred to as multi-user multi-attribute annotations in this paper. This work deals with the learning problem under multi-user multi-attribute annotations. A generative model is introduced to describe the human labeling process for multi-user multi-attribute annotations. Subsequently, a maximum likelihood approach is leveraged to infer the parameters in the generative model, namely, ground-truth labels, user expertise, and annotation difficulties. The classifiers for each attribute are also learned simultaneously. Furthermore, the correlations among attributes are taken into account during inference and learning using conditional random field. The experimental results reveal that compared with existing methods that ignore the characteristics of multi-user multi-attribute annotations, our approach can obtain better estimation of the ground truth labels, user experts, annotation difficulties as well as attribute classifiers.
【Keywords】:
【Paper Link】 【Pages】:28-36
【Authors】: Peng Peng ; Raymond Chi-Wing Wong
【Abstract】: In the literature of supervised learning, most existing studies assume that the labels provided by the labelers are deterministic, which may introduce noise easily in many real-world applications. In many applications like crowdsourcing, however, many labelers may simultaneously label the same group of instances and thus the label of each instance is associated with a probability. Motivated by this observation, we propose a new framework where each label is enriched with a probability. In this paper, we study an interactive sampling strategy, namely, selective sampling, in which each selected instance is labeled with a probability. Specifically, we flip a coin every time when we read a new instance and decide whether it should be labeled according to the flipping result. We prove that in our setting the label complexity can be reduced dramatically. Finally, we conducted comprehensive experiments in order to verify the effectiveness of our proposed labeling framework.
【Keywords】:
【Paper Link】 【Pages】:37-45
【Authors】: Min-Ling Zhang
【Abstract】: Partial label learning deals with the problem where each training example is associated with a set of candidate labels, among which only one is correct. The common strategy is to try to disambiguate their candidate labels, such as by identifying the ground-truth label iteratively or by treating each candidate label equally. Nevertheless, the above disambiguation strategy is prone to be misled by the false positive label(s) within candidate label set. In this paper, a new disambiguation-free approach to partial label learning is proposed by employing the well-known error-correcting output codes (ECOC) techniques. Specifically, to build the binary classifier with respect to each column coding, any partially labeled example will be regarded as a positive or negative training example only if its candidate label set entirely falls into the coding dichotomy. Experiments on controlled and real-world data sets clearly validate the effectiveness of the proposed approach.
【Keywords】:
【Paper Link】 【Pages】:46-54
【Authors】: Yao Zhang ; B. Aditya Prakash
【Abstract】: Given a graph, like a social/computer network or the blogosphere, in which an infection (or meme or virus) has been spreading for some time, how to select the k best nodes for immunization/quarantining immediately? Most previous works for controlling propagation (say via immunization) have concentrated on developing strategies for vaccination pre-emptively before the start of the epidemic. While very useful to provide insights in to which baseline policies can best control an infection, they may not be ideal to make real-time decisions as the infection is progressing. In this paper, we study how to immunize healthy nodes, in presence of already infected nodes. Efficient algorithms for such a problem can help public-health experts make more informed choices. First we formulate the Data-Aware Vaccination problem, and prove it is NP-hard and also that it is hard to approximate. Secondly, we propose two effective polynomial-time heuristics DAVA and DAVA-fast. Finally, we also demonstrate the scalability and effectiveness of our algorithms through extensive experiments on multiple real networks including epidemiology datasets, which show substantial gains of up to 10 times more healthy nodes at the end.
【Keywords】:
【Paper Link】 【Pages】:55-63
【Authors】: Nicola Barbieri ; Francesco Bonchi
【Abstract】: Product design and viral marketing are two popular concepts in the marketing literature that, although following different paths, aim at the same goal: maximizing the adoption of a new product. While the effect of the social network is nowadays kept in great consideration in any marketing-related activity, the interplay between product design and social influence is surprisingly still largely unexplored. In this paper we move a first step in this direction and study the problem of designing the features of a novel product such that its adoption, fueled by peer influence and “word-of-mouth” effect, is maximized. We model the viral process of product adoption on the basis of social influence and the features of the product, and devise an improved iterative scaling procedure to learn the parameters that maximize the likelihood of our novel feature-aware propagation model. In order to design an effective algorithm for our problem, we study the property of the underlying propagation model. In particular we show that the expected spread, i.e., the objective function to maximize, is monotone and submodular when we fix the features of the product and seek for the set of users to target in the viral marketing campaign. Instead, when we fix the set of users and try to find the optimal features for the product, then the expected spread is neither submodular nor monotone (as it is the case, in general, for product design). Therefore, we develop an algorithm based on an alternating optimization between selecting the features of the product, and the set of users to target in the campaign. Our experimental evaluation on real-world data from the domain of social music consumption (LastFM) and social movie consumption (Flixster) confirms the effectiveness of the proposed framework in integrating product design in viral marketing.
【Keywords】:
【Paper Link】 【Pages】:64-72
【Authors】: Ruichao Shi ; Qingyao Wu ; Yunming Ye ; Shen-Shyang Ho
【Abstract】: In recent years much effort has been devoted to Collective Classification (CC) techniques for predicting labels of linked instances. Given a large number of labeled data, conventional CC algorithms make use of local labeled neighbours to increase accuracy. However, in many real-world applications, labeled data are limited and very expensive to obtain. In this situation, most of the data have no connection to labeled data, and supervision knowledge cannot be obtained from the local connections. Recently, Semi-Supervised Collective Classification (SSCC) has been examined to leverage unlabeled data for enhancing the classification performance of CC. In this paper we propose a probabilistic generative model with network regularization (GMNR) for SSCC. Our main idea is to compute label probability distributions for unlabeled instances by maximizing both the log-likelihood in the generative model and the label smoothness on the network topology of data. The proposed generative model is based on the Probabilistic Latent Semantic Analysis (PLSA) method using attribute features of all instances. A network regularizer is employed to smooth the label probability distributions on the network topology of data. Finally, we develop an effective EM algorithm to compute the label probability distributions for label prediction. Experimental results on three real sparsely-labeled network datasets show that the proposed model GMNR outperforms state-of-the-art CC algorithms and other SSCC algorithms.
【Keywords】:
【Paper Link】 【Pages】:73-81
【Authors】: Manish Gupta ; Arun Mallya ; Subhro Roy ; Jason H. D. Cho ; Jiawei Han
【Abstract】: In the real world, various systems can be modeled using entity-relationship graphs. Given such a graph, one may be interested in identifying suspicious or anomalous subgraphs. Specifically, a user may want to identify suspicious subgraphs matching a query template. A subgraph can be defined as anomalous based on the connectivity structure within itself as well as with its neighborhood. For example for a co-authorship network, given a subgraph containing three authors, one expects all three authors to be say data mining authors. Also, one expects the neighborhood to mostly consist of data mining authors. But a 3-author clique of data mining authors with all theory authors in the neighborhood clearly seems interesting. Similarly, having one of the authors in the clique as a theory author when all other authors (both in the clique and neighborhood) are data mining authors, is also suspicious. Thus, existence of low-probability links and absence of high-probability links can be a good indicator of subgraph outlierness. The probability of an edge can in turn be modeled based on the weighted similarity between the attribute values of the nodes linked by the edge. We claim that the attribute weights must be learned locally for accurate link existence probability computations. In this paper, we design a system that finds subgraph outliers given a graph and a query by modeling the problem as a linear optimization. Experimental results on several synthetic and real datasets show the effectiveness of the proposed approach in computing interesting outliers.
【Keywords】:
【Paper Link】 【Pages】:82-90
【Authors】: Nan Li ; Huan Sun ; Kyle C. Chipman ; Jemin George ; Xifeng Yan
【Abstract】: Uncovering subgraphs with an abnormal distribution of attributes reveals much insight into network behaviors. For example in social or communication networks, diseases or intrusions usually do not propagate uniformly, which makes it critical to find anomalous regions with high concentrations of a specific disease or intrusion. In this paper, we introduce a probabilistic model to identify anomalous subgraphs containing a significantly different percentage of a certain vertex attribute, such as a specific disease or an intrusion, compared to the rest of the graph. Our framework, gAnomaly, models generative processes of vertex attributes and divides the graph into regions that are governed by background and anomaly processes. Two types of regularizers are employed to smoothen the regions and to facilitate vertex assignment. We utilize deterministic annealing EM to learn the model parameters, which is less initialization-dependent and better at avoiding local optima. In order to find fine-grained anomalies, an iterative procedure is further proposed. Experiments show gAnomaly outperforms a state-of-the-art algorithm at uncovering anomalous subgraphs in attributed graphs.
【Keywords】:
【Paper Link】 【Pages】:91-99
【Authors】: Danai Koutra ; U. Kang ; Jilles Vreeken ; Christos Faloutsos
【Abstract】: How can we succinctly describe a million-node graph with a few simple sentences? How can we measure the ‘importance’ of a set of discovered subgraphs in a large graph? These are exactly the problems we focus on. Our main ideas are to construct a ‘vocabulary’ of subgraph-types that often occur in real graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the most succinct description of a graph in terms of this vocabulary. We measure success in a well-founded way by means of the Minimum Description Length (MDL) principle: a subgraph is included in the summary if it decreases the total description length of the graph. Our contributions are three-fold: (a) formulation: we provide a principled encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop VOG, an efficient method to minimize the description cost, and (c) applicability: we report experimental results on multi-million-edge real graphs, including Flickr and the Notre Dame web graph.
【Keywords】:
【Paper Link】 【Pages】:100-108
【Authors】: Lianhua Chi ; Bin Li ; Xingquan Zhu
【Abstract】: There have been a number of approximate algorithms for text similarity computation, such as min-wise hashing, random projection, and feature hashing, which are based on the bag-of-words representation. A limitation of their “flat-set” representation is that context information and semantic hierarchy cannot be preserved. In this paper, we aim to fast compute similarities between texts while also preserving context information. To take into account semantic hierarchy, we consider a notion of “multi-level exchangeability” which can be applied at word-level, sentence-level, paragraph-level, etc. We employ a nested-set to represent a multi-level exchangeable object. To fingerprint nested-sets for fast comparison, we propose a Recursive Min-wise Hashing (RMH) algorithm at the same computational cost of the standard min-wise hashing algorithm. Theoretical study and bound analysis confirm that RMH is a highly-concentrated estimator. The empirical studies show that the proposed context-preserving hashing method can significantly outperform min-wise hashing and feature hashing in accuracy at the same (or less) computational cost.
【Keywords】:
【Paper Link】 【Pages】:109-117
【Authors】: Alex Beutel ; Partha Pratim Talukdar ; Abhimanu Kumar ; Christos Faloutsos ; Evangelos E. Papalexakis ; Eric P. Xing
【Abstract】: Given multiple data sets of relational data that share a number of dimensions, how can we efficiently decompose our data into the latent factors? Factorization of a single matrix or tensor has attracted much attention, as, e.g., in the Netflix challenge, with users rating movies. However, we often have additional, side, information, like, e.g., demographic data about the users, in the Netflix example above. Incorporating the additional information leads to the coupled factorization problem. So far, it has been solved for relatively small datasets. We provide a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions. We offer the following contributions: (1) Versatility: Our algorithm can perform matrix, tensor, and coupled factorization, with flexible objective functions including the Frobenius norm, Frobenius norm with an ℒ1 induced sparsity, and non-negative factorization. (2) Scalability: FlexiFaCT scales to unprecedented sizes in both the data and model, with up to billions of parameters. FlexiFaCT runs on standard Hadoop. (3) Convergence proofs showing that Flexi-FaCT converges on the variety of objective functions, even with projections.
【Keywords】:
【Paper Link】 【Pages】:118-126
【Authors】: Evangelos E. Papalexakis ; Christos Faloutsos ; Tom M. Mitchell ; Partha Pratim Talukdar ; Nicholas D. Sidiropoulos ; Brian Murphy
【Abstract】: How can we correlate the neural activity in the human brain as it responds to typed words, with properties of these terms (like ‘edible’, ‘fits in hand’)? In short, we want to find latent variables, that jointly explain both the brain activity, as well as the behavioral responses. This is one of many settings of the Coupled Matrix-Tensor Factorization (CMTF) problem. Can we accelerate any CMTF solver, so that it runs within a few minutes instead of tens of hours to a day, while maintaining good accuracy? We introduce Turbo-SMT, a meta-method capable of doing exactly that: it boosts the performance of any CMTF algorithm, by up to 200x, along with an up to 65 fold increase in sparsity, with comparable accuracy to the baseline. We apply Turbo-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. Turbo-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy.
【Keywords】:
【Paper Link】 【Pages】:127-135
【Authors】: Lifang He ; Xiangnan Kong ; Philip S. Yu ; Xiaowei Yang ; Ann B. Ragin ; Zhifeng Hao
【Abstract】: With advances in data collection technologies, tensor data is assuming increasing prominence in many applications and the problem of supervised tensor learning has emerged as a topic of critical significance in the data mining and machine learning community. Conventional methods for supervised tensor learning mainly focus on learning kernels by flattening the tensor into vectors or matrices, however structural information within the tensors will be lost. In this paper, we introduce a new scheme to design structure-preserving kernels for supervised tensor learning. Specifically, we demonstrate how to leverage the naturally available structure within the tensorial representation to encode prior knowledge in the kernel. We proposed a tensor kernel that can preserve tensor structures based upon dual-tensorial mapping. The dual-tensorial mapping function can map each tensor instance in the input space to another tensor in the feature space while preserving the tensorial structure. Theoretically, our approach is an extension of the conventional kernels in the vector space to tensor space. We applied our novel kernel in conjunction with SVM to real-world tensor classification problems including brain fMRI classification for three different diseases (i.e., Alzheimer's disease, ADHD and brain damage by HIV). Extensive empirical studies demonstrate that our proposed approach can effectively boost tensor classification performances, particularly with small sample sizes.
【Keywords】:
【Paper Link】 【Pages】:136-144
【Authors】: Juhua Hu ; Jian Pei ; Jie Tang
【Abstract】: Given a large photo collection without domain knowledge (e.g., tourism photos, conference photos, event photos, images wrapped from webpages), it is not easy for human beings to organize or only view them within a reasonable time. In this paper, we propose to automatically extract meaningful semantics from a photo collection named “dimensions” to help people view, search and organize photos conveniently and efficiently. However, due to the lack of additional domain knowledge or content information, existing image retrieval techniques are not applicable. To tackle the problem, we first propose a simple strategy to extract all meaningful semantics from original photos/images as candidate dimensions, and then propose an efficient unsupervised feature/dimension selection method to select a sufficient dimension subset to uniquely index each photo within this collection. Our experiments on several real-world photo/image collections validate both the efficiency and effectiveness of our proposed method.
【Keywords】:
【Paper Link】 【Pages】:145-153
【Authors】: Behzad Golshan ; Evimaria Terzi
【Abstract】: Motivated by emerging applications including vehicular traffic networks and crowdsourcing systems, we introduce the Unveil problem defined as follows: given a system of linear equations with less equations than unknowns (i.e., an underdetermined system), identify a set of k variables to query their values so that the number of variables that can be uniquely deduced in the new system is maximized. In this paper, we study the complexity of the Unveil problem and several of its variants and we obtain the following results. In Unveil, when the variables of the linear system take real values, then the problem is NP-hard to solve or even approximate. These inapproximability results carry over to the case where the queries may involve linear combinations of variables. When the variables in the linear equations are binary, then the corresponding binary version of the UNVEIL problem becomes coNP-hard to solve or even approximate. In order to explore the practical significance of our formulation we also develop a set of heuristics for the basic Unveil problem and demonstrate that choosing the queried variables using heuristic algorithms is much more effective than random querying.
【Keywords】:
【Paper Link】 【Pages】:154-162
【Authors】: Dan He ; Irina Rish ; Laxmi Parida
【Abstract】: Sparse regression methods such as l1-regularized linear regression, or Lasso [18], are commonly used for analysis of high-dimensional, small-sample datasets, due to their good generalization and feature-selection properties. However, predictive accuracy of sparse regression can be further improved by incorporating more realistic data-modeling assumptions (e.g., non-linearity) and by fully exploiting all available data, as suggested by the transductive approach [6], which makes instance-specific predictions based on both labeled (training) data and unlabeled (test) data, instead of learning a single fixed model from training data. Based on these ideas, we develop a novel method, called Transductive HSIC Lasso, that incorporates transduction into a nonlinear sparse regression approach known as HSIC Lasso [19]. Unlike the existing transductive Lasso algorithm of [1], our approach does not rely on imputation, i.e., on estimation of unknown labels using a predictor built on training data; the latter may sometimes result in poor overall performance due to unreliable label estimates. Instead, our method exploits the structure of the HSIC Lasso, which maximizes the relevance between the selected features and the label, while minimizing the redundancy between the selected features; transduction is achieved by including unlabeled samples into the redundancy computation. Our experiments demonstrate advantages of the proposed method over the state-of-the-art approaches, both on simulated and real-life data, such as prediction of phenotypic traits from genomic data, and prediction of subject's pain level from his/her functional MRI data.
【Keywords】:
【Paper Link】 【Pages】:163-171
【Abstract】: Subspace learning is a popular approach for feature extraction and classification. However, its performance would be heavily degraded when data are corrupted by large amounts of noise. Inspired by recent work in matrix recovery, we tackle this problem by exploiting a subspace that is robust to noise and large variability for classification. Specifically, we propose a novel Supervised Regularization based Robust Subspace (SRRS) approach via low-rank learning. Unlike existing subspace methods, our approach jointly learns low-rank representations and a robust subspace from noisy observations. At the same time, to improve the classification performance, class label information is incorporated as supervised regularization. The problem can then be formulated as a constrained rank minimization objective function, which can be effectively solved by the inexact augmented Lagrange multiplier (ALM) algorithm. Our approach differs from current sparse representation and low-rank learning methods in that it explicitly learns a low-dimensional subspace where the supervised information is incorporated. Extensive experimental results on four datasets demonstrate that our approach outperforms the state-of-the-art subspace and low-rank learning methods in almost all cases, especially when the data contain large variations or are heavily corrupted by noise.
【Keywords】:
【Paper Link】 【Pages】:172-180
【Authors】: Caiming Xiong ; Wei Chen ; Gang Chen ; David M. Johnson ; Jason J. Corso
【Abstract】: Large-scale data mining and retrieval applications have increasingly turned to compact binary data representations as a way to achieve both fast queries and efficient data storage; many algorithms have been proposed for learning effective binary encodings. Most of these algorithms focus on learning a set of projection hyperplanes for the data and simply binarizing the result from each hyperplane, but this neglects the fact that informativeness may not be uniformly distributed across the projections. In this paper, we address this issue by proposing a novel adaptive quantization (AQ) strategy that adaptively assigns varying numbers of bits to different hyperplanes based on their information content. Our method provides an information-based schema that preserves the neighborhood structure of data points, and we jointly find the globally optimal bit-allocation for all hyperplanes. In our experiments, we compare with state-of-the-art methods on four large-scale datasets and find that our adaptive quantization approach significantly improves on traditional hashing methods.
【Keywords】:
【Paper Link】 【Pages】:181-189
【Authors】: Jingrui He ; Yan Liu ; Qiang Yang
【Abstract】: Most existing works on multi-task learning (MTL) assume the same input space for different tasks. In this paper, we address a general setting where different tasks have heterogeneous input spaces. This setting has a lot of potential applications, yet it poses new algorithmic challenges - how can we link seemingly uncorrelated tasks to mutually boost their learning performance? Our key observation is that in many real applications, there might exist some correspondence among the inputs of different tasks, which is referred to as pivots. For such applications, we first propose a learning scheme for multiple tasks and analyze its generalization performance. Then we focus on the problems where only a limited number of the pivots are available, and propose a general framework to leverage the pivot information. The idea is to map the heterogeneous input spaces to a common space, and construct a single prediction model in this space for all the tasks. We further propose an effective optimization algorithm to find both the mappings and the prediction model. Experimental results demonstrate its effectiveness, especially with very limited number of pivots.
【Keywords】:
【Paper Link】 【Pages】:190-198
【Authors】: Ayan Acharya ; Raymond J. Mooney ; Joydeep Ghosh
【Abstract】: Multitask learning (MTL) via a shared representation has been adopted to alleviate problems with sparsity of labeled data across different learning tasks. Active learning, on the other hand, reduces the cost of labeling examples by making informative queries over an unlabeled pool of data. Therefore, a unification of both of these approaches can potentially be useful in settings where labeled information is expensive to obtain but the learning tasks or domains have some common characteristics. This paper introduces two such models – Active Doubly Supervised Latent Dirichlet Allocation (Act-DSLDA) and its non-parametric variation (Act-NPDSLDA) that integrate MTL and active learning in the same framework. These models make use of both latent and supervised shared topics to accomplish multitask learning. Experimental results on both document and image classification show that integrating MTL and active learning along with shared latent and supervised topics is superior to other methods which do not employ all of these components.
【Keywords】:
【Paper Link】 【Pages】:199-207
【Authors】: Mahito Sugiyama ; Chloé-Agathe Azencott ; Dominik Grimm ; Yoshinobu Kawahara ; Karsten M. Borgwardt
【Abstract】: We propose a new formulation of multi-task feature selection coupled with multiple network regularizers, and show that the problem can be exactly and efficiently solved by maximum flow algorithms. This method contributes to one of the central topics in data mining: How to exploit structural information in multivariate data analysis, which has numerous applications, such as gene regulatory and social network analysis. On simulated data, we show that the proposed method leads to higher accuracy in discovering causal features by solving multiple tasks simultaneously using networks over features. Moreover, we apply the method to multi-locus association mapping with Arabidopsis thaliana genotypes and flowering time phenotypes, and demonstrate its ability to recover more known phenotype-related genes than other state-of-the-art methods.
【Keywords】:
【Paper Link】 【Pages】:208-216
【Authors】: Ben Tan ; ErHeng Zhong ; Michael K. Ng ; Qiang Yang
【Abstract】: Heterogeneous transfer learning has been proposed as a new learning strategy to improve performance in a target domain by leveraging data from other heterogeneous source domains where feature spaces can be different across different domains. In order to connect two different spaces, one common technique is to bridge feature spaces by using some co-occurrence data. For example, annotated images can be used to build feature mapping from words to image features, and then applied on text-to-image knowledge transfer. However, in practice, such co-occurrence data are often from Web, e.g. Flickr, and generated by users. That means these data can be sparse and contain personal biases. Directly building models based on them may fail to provide reliable bridge. To solve these aforementioned problems, in this paper, we propose a novel algorithm named Mixed-Transfer. It is composed of three components, that is, a cross domain harmonic function to avoid personal biases, a joint transition probability graph of mixed instances and features to model the heterogeneous transfer learning problem, a random walk process to simulate the label propagation on the graph and avoid the data sparsity problem. We conduct experiments on 171 real-world tasks, showing that the proposed approach outperforms four state-of-the-art heterogeneous transfer learning algorithms.
【Keywords】:
【Paper Link】 【Pages】:217-225
【Authors】: Jia Wu ; Zhibin Hong ; Shirui Pan ; Xingquan Zhu ; Chengqi Zhang ; Zhihua Cai
【Abstract】: In this paper, we formulate a new multi-graph learning task with only positive and unlabeled bags, where labels are only available for bags but not for individual graphs inside the bag. This problem setting raises significant challenges because bag-of-graph setting does not have features to directly represent graph data, and no negative bags exits for deriving discriminative classification models. To solve the challenge, we propose a puMGL learning framework which relies on two iteratively combined processes for multigraph learning: (1) deriving features to represent graphs for learning; and (2) deriving discriminative models with only positive and unlabeled graph bags. For the former, we derive a subgraph scoring criterion to select a set of informative subgraphs to convert each graph into a feature space. To handle unlabeled bags, we assign a weight value to each bag and use the adjusted weight values to select most promising unlabeled bags as negative bags. A margin graph pool (MGP), which contains some representative graphs from positive bags and identified negative bags, is used for selecting subgraphs and training graph classifiers. The iterative subgraph scoring, bag weight updating, and MGP based graph classification forms a closed loop to find optimal subgraphs and most suitable unlabeled bags for multi-graph learning. Experiments and comparisons on real-world multigraph data demonstrate the algorithm performance.
【Keywords】:
【Paper Link】 【Pages】:226-234
【Authors】: Tri Kurniawan Wijaya ; Tanuja Ganu ; Dipanjan Chakraborty ; Karl Aberer ; Deva P. Seetharam
【Abstract】: Many electricity suppliers around the world are deploying smart meters to gather fine-grained spatiotemporal consumption data and to effectively manage the collective demand of their consumer base. In this paper, we introduce a structured framework and a discriminative index that can be used to segment the consumption data along multiple contextual dimensions such as locations, communities, seasons, weather patterns, holidays, etc. The generated segments can enable various higher-level applications such as usage-specific tariff structures, theft detection, consumer-specific demand response programs, etc. Our framework is also able to track consumers’ behavioral changes, evaluate different temporal aggregations, and identify main characteristics which define a cluster.
【Keywords】:
【Paper Link】 【Pages】:235-243
【Authors】: Sunil Kumar Gupta ; Santu Rana ; Dinh Q. Phung ; Svetha Venkatesh
【Abstract】: Medical outcomes are inexorably linked to patient illness and clinical interventions. Interventions change the course of disease, crucially determining outcome. Traditional outcome prediction models build a single classifier by augmenting interventions with disease information. Interventions, however, differentially affect prognosis, thus a single prediction rule may not suffice to capture variations. Interventions also evolve over time as more advanced interventions replace older ones. To this end, we propose a Bayesian nonparametric, supervised framework that models a set of intervention groups through a mixture distribution building a separate prediction rule for each group, and allows the mixture distribution to change with time. This is achieved by using a hierarchical Dirichlet process mixture model over the interventions. The outcome is then modeled as conditional on both the latent grouping and the disease information through a Bayesian logistic regression. Experiments on synthetic and medical cohorts for 30-day readmission prediction demonstrate the superiority of the proposed model over clinical and data mining baselines.
【Keywords】:
【Paper Link】 【Pages】:244-252
【Authors】: Mahmudur Rahman ; Bogdan Carbunar ; Jaime Ballesteros ; George Burri ; Duen Horng (Polo) Chau
【Abstract】: The popularity and influence of reviews, make sites like Yelp ideal targets for malicious behaviors. We present Marco, a novel system that exploits the unique combination of social, spatial and temporal signals gleaned from Yelp, to detect venues whose ratings are impacted by fraudulent reviews. Marco increases the cost and complexity of attacks, by imposing a tradeoff on fraudsters, between their ability to impact venue ratings and their ability to remain undetected. We contribute a new dataset to the community, which consists of both ground truth and gold standard data. We show that Marco significantly outperforms state-of-the-art approaches, by achieving 94% accuracy in classifying reviews as fraudulent or genuine, and 95.8% accuracy in classifying venues as deceptive or legitimate. Marco successfully flagged 244 deceptive venues from our large dataset with 7,435 venues, 270,121 reviews and 195,417 users. Among the San Francisco car repair and moving companies that we analyzed, almost 10% exhibit fraudulent behaviors.
【Keywords】:
【Paper Link】 【Pages】:253-261
【Authors】: Anuj Karpatne ; Ankush Khandelwal ; Shyam Boriah ; Vipin Kumar
【Abstract】: A large number of real-world domains possess heterogeneity in their data, which implies that different partitions of the data show different relationships between explanatory and response variables. This increases the overall model complexity of predictive learning in the presence of heterogeneity. Additionally, a number of real-world domains lack sufficient training data, making the learning algorithm prone to over-fitting, especially when the model complexity is large. However, there often exists a structure among the data instances and their partitions which can be appropriately leveraged for reducing the model complexity along with addressing heterogeneity. In this paper, we present a framework for learning robust predictive models in real-world heterogeneous datasets which lack sufficient number of training samples. We demonstrate the usefulness of our framework in the domain of remote sensing for forest cover estimation. Through a series of comparative experiments with baseline approaches, we are able to show that our framework: (a) captures meaningful information about heterogeneity in the data, (b) improves prediction performance by addressing data heterogeneity, (c) is robust to over-fitting in the presence of limited training data, and (d) is robust to the choice of the number of partitions used for representing heterogeneity.
【Keywords】:
【Paper Link】 【Pages】:262-270
【Authors】: Prithwish Chakraborty ; Pejman Khadivi ; Bryan Lewis ; Aravindan Mahendiran ; Jiangzhuo Chen ; Patrick Butler ; Elaine O. Nsoesie ; Sumiko R. Mekaru ; John S. Brownstein ; Madhav V. Marathe ; Naren Ramakrishnan
【Abstract】: Modern epidemiological forecasts of common illnesses, such as the flu, rely on both traditional surveillance sources as well as digital surveillance data. However, most published studies have been retrospective. Concurrently, the reports about flu activity generally lags by several weeks and even when published are revised for several weeks more. We posit that effectively handling this uncertainty is one of the key challenges for a real-time prediction system in this sphere. In this paper, we present a detailed prospective analysis on the generation of robust quantitative predictions about temporal trends of flu activity, using several surrogate data sources for 15 Latin American countries. We present our findings about the limitations and possible advantages of correcting the uncertainty associated with official flu estimates. We also compare the prediction accuracy between model-level fusion of different surrogate data sources against data-level fusion. Finally, we present a novel matrix factorization approach using neighborhood embedding to predict flu case counts. Comparing our proposed ensemble method against several baseline methods helps us demarcate the importance of different data sources for the countries under consideration.
【Keywords】:
【Paper Link】 【Pages】:271-279
【Authors】: Priyanka Agrawal ; Vikas C. Raykar ; Amrita Saha
【Abstract】: Ensemble methods are widely used in practice with the hope of obtaining better predictive performance than could be obtained from any of the constituent classifiers in the ensemble. Most of the existing literature is concerned with learning ensembles in a supervised setting. In this paper we propose an unsupervised iterative algorithm to combine the discriminant scores from different binary classifiers. We prove that (under certain assumptions) the Area Under the ROC Curve (AUC) of the resulting ensemble is greater than or equal to the AUC of the best classifier (with maximum AUC). We also experimentally validate this claim on a number of datasets and also show that the performance is better than the supervised ensembles.
【Keywords】:
【Paper Link】 【Pages】:280-288
【Authors】: Sunho Park ; TaeHyun Hwang ; Seungjin Choi
【Abstract】: Multiclass problems are often decomposed into multiple binary problems that are solved by individual binary classifiers whose results are integrated into a final answer. Various methods, including all-pairs (APs), one-versus-all (OVA), and error correcting output code (ECOC), have been studied, to decompose multiclass problems into binary problems. However, little study has been made to optimally aggregate binary problems to determine a final answer to the multiclass problem. In this paper we present a convex optimization method for an optimal aggregation of binary classifiers to estimate class membership probabilities in multiclass problems. We model the class membership probability as a softmax function which takes a conic combination of discrepancies induced by individual binary classifiers, as an input. With this model, we formulate the regularized maximum likelihood estimation as a convex optimization problem, which is solved by the primal-dual interior point method. Connections of our method to large margin classifiers are presented, showing that the large margin formulation can be considered as a limiting case of our convex formulation. In the experiments on human disease classification, we demonstrate that our method outperforms existing aggregation methods as well as direct methods, in terms of the classification accuracy and F-score.
【Keywords】:
【Paper Link】 【Pages】:289-297
【Authors】: Xiaoyi Li ; Nan Du ; Hui Li ; Kang Li ; Jing Gao ; Aidong Zhang
【Abstract】: Time varying problems usually have complex underlying structures represented as dynamic networks where entities and relationships appear and disappear over time. The problem of efficiently performing dynamic link inference is extremely challenging due to the dynamic nature in massive evolving networks especially when there exist sparse connectivities and nonlinear transitional patterns. In this paper, we propose a novel deep learning framework, i.e., Conditional Temporal Restricted Boltzmann Machine (ctRBM), which predicts links based on individual transition variance as well as influence introduced by local neighbors. The proposed model is robust to noise and have the exponential capability to capture nonlinear variance. We tackle the computational challenges by developing an efficient algorithm for learning and inference of the proposed model. To improve the efficiency of the approach, we give a faster approximated implementation based on a proposed Neighbor Influence Clustering algorithm. Extensive experiments on simulated as well as real-world dynamic networks show that the proposed method outperforms existing algorithms in link inference on dynamic networks.
【Keywords】:
【Paper Link】 【Pages】:298-306
【Authors】: Zheng Chen ; Weixiong Zhang
【Abstract】: Missing information is ubiquitous in relational datasets. Imputation of missing relations, a.k.a. link prediction, has become an increasingly crucial problem in relational data analysis as a huge amount of data has been accumulated in various fields. Recent advances in the latent variable models have greatly improved the state-of-the-art in the link prediction accuracy, however it comes at the price of increasing complexity. In this paper, we propose a novel link prediction algorithm, marginalized denoising model (MDM), where the problem of predicting unobserved or missing links in a given relational matrix is cast as a problem of matrix denoising. The method learns a mapping function that models the embedded topological structures of the relational network by capturing the so-called indirect affinities among entities. We train the mapping function by recovering the originally observed matrix from a conceptually “infinite” number of corrupted matrices where some links are randomly masked from the observed matrix. By re-applying the learned function to the observed relational matrix, we aim to “denoise” the observed matrix and thus to recover the unobserved links. Experimental results on several benchmarks demonstrate the superior performance of the new method over several state-of-the-art link prediction methods.
【Keywords】:
【Paper Link】 【Pages】:307-315
【Authors】: Peng Peng ; Raymond Chi-Wing Wong ; Philip S. Yu
【Abstract】: Classification is a fundamental topic in the literature of data mining and all recent hot topics like active learning and transfer learning all rely on the concept of classification. Probabilistic information becomes more prevalent nowadays and can be found easily in many applications like crowdsourcing and pattern recognition. In this paper, we focus on a dataset which contains probabilistic information for classification. Based on this probabilistic dataset, we propose a classifier and give a theoretical bound linking the error rate of our classifier and the number of instances needed for training. Interestingly, we find that our theoretical bound is asymptotically at least no worse than the previously best-known bounds developed based on the traditional dataset. Furthermore, our classifier guarantees a fast rate of convergence compared with traditional classifiers. Experimental results show that our proposed algorithm has a higher accuracy than the traditional algorithm. We believe that this work is influential since it opens a new topic on the probabilistic dataset, allowing researchers to study all topics related to classification like active learning and transfer learning under this new probabilistic setting.
【Keywords】:
【Paper Link】 【Pages】:316-324
【Authors】: ErHeng Zhong ; Wei Fan ; Qiang Yang
【Abstract】: One important challenge in social network analysis is how to model users’ distance as a single measure. We propose to model this distance by simultaneously exploring users’ profile attributes and local network structures. Due to the sparsity of data, where each user may interact with just a few people and only a few users provide their profile information, it is typically difficult to learn effective distance measures for any individual network. One important observation is that, people nowadays engage in multiple social networks, such as Facebook, Twitter, etc., where auxiliary knowledge from related networks can help alleviate the data sparsity problem. Nonetheless, due to the network differences, borrowing knowledge directly does not work well. Instead, we propose an adaptive metric learning framework. The basic idea is to exploit knowledge from related networks collectively through embedding and employ boosting-based techniques to eliminate irrelevant attributes. We evaluate the adaptive user distance measure on link prediction problem - an important social modeling task. Empirical studies demonstrate that the proposed approach significantly improves the link-prediction precision over state-of-the-art metric learning and link prediction approaches on two large-scale social networking datasets significantly.
【Keywords】:
【Paper Link】 【Pages】:325-333
【Authors】: Hau Chan ; Leman Akoglu ; Hanghang Tong
【Abstract】: The function and performance of networks rely on their robustness, defined as their ability to continue functioning in the face of damage (targeted attacks or random failures) to parts of the network. Prior research has proposed a variety of measures to quantify robustness and various manipulation strategies to alter it. In this paper, our contributions are twofold. First, we critically analyze various robustness measures and identify their strengths and weaknesses. Our analysis suggests natural connectivity, based on the weighted count of loops in a network, to be a reliable measure. Second, we propose the first principled manipulation algorithms that directly optimize this robustness measure, which lead to significant performance improvement over existing, ad-hoc heuristic solutions. Extensive experiments on real-world datasets demonstrate the effectiveness and scalability of our methods against a long list of competitor strategies.
【Keywords】:
【Paper Link】 【Pages】:334-342
【Authors】: Peifeng Yin ; Qi He ; Xingjie Liu ; Wang-Chien Lee
【Abstract】: Understanding social tie development among users is crucial for user engagement in social networking services. In this paper, we analyze the social interactions, both online and offline, of users and investigate the development of their social ties using data trail of “how social ties grow” left in mobile and social networking services. To the best of our knowledge, this is the first research attempt on studying social tie development by considering both online and offline interactions in a heterogeneous yet realistic relationship. In this study, we aim to answer three key questions: 1) is there a correlation between online and offline interactions? 2) how is the social tie developed via heterogeneous interaction channels? 3) would the development of social tie between two users be affected by their common friends? To achieve our goal, we develop a Social-aware Hidden Markov Model (SaHMM) that explicitly takes into account the factor of common friends in measure of the social tie development. Our experiments show that, comparing with results obtained using HMM and other heuristic methods, the social tie development captured by our SaHMM is significantly more consistent to lifetime profiles of users.
【Keywords】:
【Paper Link】 【Pages】:343-351
【Authors】: Zhengwei Wu ; Victor M. Preciado
【Abstract】: The Laplacian eigenvalues of a network play an important role in the analysis of many structural and dynamical network problems. In this paper, we study the relationship between the eigenvalue spectrum of the normalized Laplacian matrix and the structure of ‘local’ subgraphs of the network. We call a subgraph local when it is induced by the set of nodes obtained from a breath-first search (BFS) of radius r around a node. In this paper, we propose techniques to estimate spectral properties of the normalized Laplacian matrix from a random collection of induced local subgraphs. In particular, we provide an algorithm to estimate the spectral moments of the normalized Laplacian matrix (the power-sums of its eigenvalues). Moreover, we propose a technique, based on convex optimization, to compute upper and lower bounds on the spectral radius of the normalized Laplacian matrix from local subgraphs. We illustrate our results studying the normalized Laplacian spectrum of a large-scale e-mail network.
【Keywords】:
【Paper Link】 【Pages】:352-360
【Authors】: David García-Soriano ; Konstantin Kutzkov
【Abstract】: We present a new randomized algorithm for estimating the number of triangles in massive graphs revealed as a stream of edges in arbitrary order. It exploits the fact that graphs arising from various domains often have small vertex covers, which enables us to reduce the space usage and sample complexity of triangle counting algorithms. The algorithm runs in four passes over the edge set and uses constant processing time per edge. We obtain precise bounds on the complexity and the approximation guarantee of our algorithm. For graphs where even the minimum vertex cover is prohibitively large, we extend the approach to achieve a trade-off between the number of passes and the space usage. Experiments on real-world graphs validate our theoretical analysis and show that the new algorithm yields more accurate estimates than state-of-the-art approaches using the same amount of space.
【Keywords】:
【Paper Link】 【Pages】:361-369
【Authors】: Parikshit Sondhi ; ChengXiang Zhai
【Abstract】: In this paper we present a constrained hidden markov model based approach for extracting non-explicit citing sentences in research articles. Our method involves first independently training a separate HMM for each citation in the article being processed and then performing a constrained joint inference to label non-explicit citing sentences. Results on a standard test collection show that our method significantly outperforms the baselines and is comparable to the state of the art approaches.
【Keywords】:
【Paper Link】 【Pages】:370-378
【Authors】: Subhabrata Mukherjee ; Gaurab Basu ; Sachindra Joshi
【Abstract】: Traditional works in sentiment analysis and aspect rating prediction do not take author preferences and writing style into account during rating prediction of reviews. In this work, we introduce Joint Author Sentiment Topic Model (JAST), a generative process of writing a review by an author. Authors have different topic preferences, ‘emotional’ attachment to topics, writing style based on the distribution of semantic (topic) and syntactic (background) words and their tendency to switch topics. JAST uses Latent Dirichlet Allocation to learn the distribution of author-specific topic preferences and emotional attachment to topics. It uses a Hidden Markov Model to capture short range syntactic and long range semantic dependencies in reviews to capture coherence in author writing style. JAST jointly discovers the topics in a review, author preferences for the topics, topic ratings as well as the overall review rating from the point of view of an author. To the best of our knowledge, this is the first work in Natural Language Processing to bring all these dimensions together to have an author-specific generative model of a review.
【Keywords】:
【Paper Link】 【Pages】:379-387
【Authors】: Ziqi Liu ; Qinghua Zheng ; Fei Wang ; Zhenhua Tian ; Bo Li
【Abstract】: Latent variable models have proven to be a useful tool for discovering latent structures from observational data. However, the data in social networks often come as streams, i.e., both text content (e.g., emails, user postings) and network structure (e.g., user friendship) evolve over time. To capture the time-evolving latent structures in such social streams, we propose a fully nonparametric Dynamic Topical Community Model (nDTCM), where infinite latent community variables coupled with infinite latent topic variables in each epoch, and the temporal dependencies between variables across epochs are modeled via the rich-gets-richer scheme. We focus on characterizing three dynamic aspects in social streams: the number of communities or topics changes (e.g., new communities or topics are born and old ones die out); the popularity of communities or topics evolves; the semantics such as community topic distribution, community participant distribution and topic word distribution drift. Furthermore, we develop an effective online posterior inference algorithm for nDTCM, which is concordant with the online nature of social streams. Experiments using real-world data show the effectiveness of our model at discovering the dynamic topical communities in social streams.
【Keywords】:
【Paper Link】 【Pages】:388-397
【Authors】: Qiming Diao ; Jing Jiang
【Abstract】: Due to the fast development of social media on the Web, Twitter has become one of the major platforms for people to express themselves. Because of the wide adoption of Twitter, events like breaking news and release of popular videos can easily catch people's attention and spread rapidly on Twitter, and the number of relevant tweets approximately reflects the impact of an event. Event identification and analysis on Twitter has thus become an important task. Recently the Recurrent Chinese Restaurant Process (RCRP) has been successfully used for event identification from news streams and news-centric social media streams. However, these models cannot be directly applied to Twitter based on our preliminary experiments mainly for two reasons: (1) Events emerge and die out fast on Twitter, while existing models ignore this burstiness property. (2) Most Twitter posts are personal interest oriented while only a small fraction is event related. Motivated by these challenges, we propose a new nonparametric model which considers burstiness. We further combine this model with traditional topic models to identify both events and topics simultaneously. Our quantitative evaluation provides sufficient evidence that our model can accurately detect meaningful events. Our qualitative evaluation also shows interesting analysis for events on Twitter.
【Keywords】:
【Paper Link】 【Pages】:398-406
【Authors】: Marina Danilevsky ; Chi Wang ; Nihit Desai ; Xiang Ren ; Jingyi Guo ; Jiawei Han
【Abstract】: We introduce a framework for topical keyphrase generation and ranking, based on the output of a topic model run on a collection of short documents. By shifting from the unigramcentric traditional methods of keyphrase extraction and ranking to a phrase-centric approach, we are able to directly compare and rank phrases of different lengths. Our method defines a function to rank topical keyphrases so that more highly ranked keyphrases are considered to be more representative phrases for that topic. We study the performance of our framework on multiple real world document collections, and also show that it is more scalable than comparable phrase-generating models.
【Keywords】:
【Paper Link】 【Pages】:407-415
【Authors】: Johannes Schneider ; Michail Vlachos
【Abstract】: Hierarchical clustering (HC) algorithms are generally limited to small data instances due to their runtime costs. Here we mitigate this shortcoming and explore fast HC algorithms based on random projections for single (SLC) and average (ALC) linkage clustering as well as for the minimum spanning tree problem (MST). We present a thorough adaptive analysis of our algorithms that improve prior work from O(N2) by up to a factor of N/(log N)2 for a dataset of N points in Euclidean space. The algorithms maintain, with arbitrary high probability, the outcome of hierarchical clustering as well as the worst-case running-time guarantees. We also present parameter-free instances of our algorithms.
【Keywords】:
【Paper Link】 【Pages】:416-424
【Authors】: Xiang Wang ; Jun Wang ; Buyue Qian ; Fei Wang ; Ian Davidson
【Abstract】: Although constrained spectral clustering has been used extensively for the past few years, all work assumes the guidance (constraints) are given by humans. Original formulations of the problem assumed the constraints are given passively whilst later work allowed actively polling an Oracle (human experts). In this paper, for the first time to our knowledge, we explore the problem of augmenting the given constraint set for constrained spectral clustering algorithms. This moves spectral clustering towards the direction of self-teaching as has occurred in the supervised learning literature. We present a formulation for self-taught spectral clustering and show that the self-teaching process can drastically improve performance without further human guidance.
【Keywords】:
【Paper Link】 【Pages】:425-433
【Authors】: Ahmed Elgohary ; Ahmed K. Farahat ; Mohamed S. Kamel ; Fakhri Karray
【Abstract】: The kernel k-means is an effective method for data clustering which extends the commonly-used k-means algorithm to work on a similarity matrix over complex data structures. It is, however, computationally very complex as it requires the complete kernel matrix to be calculated and stored. Further, its kernelized nature hinders the parallelization of its computations on modern scalable infrastructures for distributed computing. In this paper, we are defining a family of kernelbased low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. Afterwards, we propose two practical methods for low-dimensional embedding that adhere to our definition of the embeddings family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark datasets.
【Keywords】:
【Paper Link】 【Pages】:434-442
【Authors】: Vladimir Vapnik ; Igor Braga ; Rauf Izmailov
【Abstract】: We introduce a constructive setting for the problem of density ratio estimation through the solution of a multidimensional integral equation. In this equation, not only its right hand side is approximately known, but also the integral operator is approximately defined. We show that this ill-posed problem has a rigorous solution and obtain the solution in a closed form. The key element of this solution is the novel V-matrix, which captures the geometry of the observed samples. We compare our method with previously proposed ones, using both synthetic and real data. Our experimental results demonstrate the good potential of the new approach.
【Keywords】:
【Paper Link】 【Pages】:443-451
【Authors】: Benjamin Peherstorfer ; Dirk Pflüger ; Hans-Joachim Bungartz
【Abstract】: Nonparametric density estimation is a fundamental problem of statistics and data mining. Even though kernel density estimation is the most widely used method, its performance highly depends on the choice of the kernel bandwidth, and it can become computationally expensive for large data sets. We present an adaptive sparse-grid-based density estimation method which discretizes the estimated density function on basis functions centered at grid points rather than on kernels centered at the data points. Thus, the costs of evaluating the estimated density function are independent from the number of data points. We give details on how to estimate density functions on sparse grids and develop a cross validation technique for the parameter selection. We show numerical results to confirm that our sparse-grid-based method is well-suited for large data sets, and, finally, employ our method for the classification of astronomical objects to demonstrate that it is competitive to current kernel-based density estimation approaches with respect to classification accuracy and runtime.
【Keywords】:
【Paper Link】 【Pages】:452-460
【Authors】: Chenyi Zhang ; Ke Wang ; Hongkun Yu ; Jianling Sun ; Ee-Peng Lim
【Abstract】: User preferences change over time and capturing such changes is essential for developing accurate recommender systems. Despite its importance, only a few works in collaborative filtering have addressed this issue. In this paper, we consider evolving preferences and we model user dynamics by introducing and learning a transition matrix for each user's latent vectors between consecutive time windows. Intuitively, the transition matrix for a user summarizes the time-invariant pattern of the evolution for the user. We first extend the conventional probabilistic matrix factorization and then improve upon this solution through its fully Bayesian model. These solutions take advantage of the model complexity and scalability of conventional Bayesian matrix factorization, yet adapt dynamically to user's evolving preferences. We evaluate the effectiveness of these solutions through empirical studies on six large-scale real life data sets.
【Keywords】:
【Paper Link】 【Pages】:461-469
【Authors】: Lijing Qin ; Shouyuan Chen ; Xiaoyan Zhu
【Abstract】: Recommender systems are faced with new challenges that are beyond traditional techniques. For example, most traditional techniques are based on similarity or overlap among existing data, however, there may not exist sufficient historical records for some new users to predict their preference, or users can hold diverse interest, but the similarity based methods may probably over-narrow it. To address the above challenges, we develop a principled approach called contextual combinatorial bandit in which a learning algorithm can dynamically identify diverse items that interest a new user. Specifically, each item is represented as a feature vector, and each user is represented as an unknown preference vector. On each of n rounds, the bandit algorithm sequentially selects a set of items according to the item-selection strategy that balances exploration and exploitation, and collects the user feedback on these selected items. A reward function is further designed to measure the quality (e.g. relevance or diversity) of the selected set based on observed feedback, and the goal of the algorithm is to maximize the total rewards of n rounds. The reward function only needs to satisfy two mild assumptions that is general enough to accommodate a large class of nonlinear functions. To solve this bandit problem, we provide algorithm that achieves Õ(√n) regret after playing n rounds. Experiments conducted on real-wold movie recommendation dataset demonstrate that our approach can effectively address the above challenges and hence improve the performance of recommendation task.
【Keywords】:
【Paper Link】 【Pages】:470-478
【Authors】: Yanjie Fu ; Bin Liu ; Yong Ge ; Zijun Yao ; Hui Xiong
【Abstract】: If properly analyzed, the multi-aspect rating data could be a source of rich intelligence for providing personalized restaurant recommendations. Indeed, while recommender systems have been studied for various applications and many recommendation techniques have been developed for general or specific recommendation tasks, there are few studies for restaurant recommendation by addressing the unique challenges of the multi-aspect restaurant reviews. As we know, traditional collaborative filtering methods are typically developed for single aspect ratings. However, multi-aspect ratings are often collected from the restaurant customers. These ratings can reflect multiple aspects of the service quality of the restaurant. Also, geographic factors play an important role in restaurant recommendation. To this end, in this paper, we develop a generative probabilistic model to exploit the multi-aspect ratings of restaurants for restaurant recommendation. Also, the geographic proximity is integrated into the probabilistic model to capture the geographic influence. Moreover, the profile information, which contains customer/restaurant-independent features and the shared features, is also integrated into the model. Finally, we conduct a comprehensive experimental study on a real-world data set. The experimental results clearly demonstrate the benefit of exploiting multi-aspect ratings and the improvement of the developed generative probabilistic model.
【Keywords】:
【Paper Link】 【Pages】:479-487
【Authors】: Tuan-Anh Hoang ; William W. Cohen ; Ee-Peng Lim
【Abstract】: In this paper, we propose the CBS topic model, a probabilistic graphical model, to derive the user communities in microblogging networks based on the sentiments they express on their generated content and behaviors they adopt. As a topic model, CBS can uncover hidden topics and derive user topic distribution. In addition, our model associates topic-specific sentiments and behaviors with each user community. Notably, CBS has a general framework that accommodates multiple types of behaviors simultaneously. Our experiments on two Twitter datasets show that the CBS model can effectively mine the representative behaviors and emotional topics for each community. We also demonstrate that CBS model perform as well as other state-of-the-art models in modeling topics, but outperforms the rest in mining user communities.
【Keywords】:
【Paper Link】 【Pages】:488-496
【Authors】: Nana Xu ; Hongyan Liu ; Jiawei Chen ; Jun He ; Xiaoyong Du
【Abstract】: Online user reviews are important information for both consumers and vendors. More and more people make their purchase decisions based on online reviews. Vendors also pay more and more attention to online reviews. However, as the number of reviews increases rapidly, the information overload problem prevents us making full use of online reviews. In this paper, we study how to find a representative set of high quality reviews to cover diversified aspects of user opinions. Existing work cannot solve this problem well. To overcome the drawbacks of existing methods, we define a new problem of finding the minimum set of reviews to cover all of features with different sentiment polarity and high quality without user-defined parameters. To solve the problem efficiently, we define potential objective function and develop greedy algorithm to find the solution in polynomial-time with approximation guarantee. We also propose two strategies to further reduce the number of reviews and to prune the search space respectively. Comprehensive experiments conducted on real review sets show that the proposed methods are effective and outperform existing methods.
【Keywords】:
【Paper Link】 【Pages】:497-505
【Authors】: Matteo Riondato ; Fabio Vandin
【Abstract】: Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires to identify all itemsets appearing in at least a fraction θ of a transactional dataset D. Often though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in D one attempts to infer itemsets that are frequently (i.e., with probability at least θ) generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold such that the collection of itemsets with frequency at least in contains only TFIs with probability at least 1 – δ, for some user-specified δ. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of at frequency θ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.
【Keywords】:
【Paper Link】 【Pages】:506-514
【Authors】: Jonathan H. Huggins ; Cynthia Rudin
【Abstract】: This paper formalizes a latent variable inference problem we call supervised pattern discovery, the goal of which is to find sets of observations that belong to a single “pattern.” We discuss two versions of the problem and prove uniform risk bounds for both. In the first version, collections of patterns can be generated in an arbitrary manner and the data consist of multiple labeled collections. In the second version, the patterns are assumed to be generated independently by identically distributed processes. These processes are allowed to take an arbitrary form, so observations within a pattern are not in general independent of each other. The bounds for the second version of the problem are stated in terms of a new complexity measure, the quasi-Rademacher complexity.
【Keywords】:
【Paper Link】 【Pages】:515-523
【Authors】: Ning Yang ; Xiangnan Kong ; Fengjiao Wang ; Philip S. Yu
【Abstract】: Predicting both the time and the location of human movements is valuable but challenging for a variety of applications. To address this problem, we propose an approach considering both the periodicity and the sociality of human movements. We first define a new concept, Social Spatial-Temporal Event (SSTE), to represent social interactions among people. For the time prediction, we characterise the temporal dynamics of SSTEs with an ARMA (AutoRegressive Moving Average) model. To dynamically capture the SSTE kinetics, we propose a Kalman Filter based learning algorithm to learn and incrementally update the ARMA model as a new observation becomes available. For the location prediction, we propose a ranking model the periodicity and the sociality of human movements are simultaneously taken into consideration for improving the prediction accuracy. Extensive experiments conducted on real data sets validate our proposed approach.
【Keywords】:
【Paper Link】 【Pages】:524-532
【Authors】: Jason Lines ; Anthony Bagnall
【Abstract】: Recently, several alternative distance measures for comparing time series have been proposed and evaluated on Time Series Classification (TSC) problems. These include variants of Dynamic Time Warping (DTW), such as weighted and derivative DTW, and edit distance-based measures, including Longest Common Subsequence and Time Warp Edit Distance. These measures have the common characteristic that they operate in the time domain and compensate for potential localised misalignment through some elastic adjustment. Our aim is to experimentally test two hypotheses related to these distance measures. Firstly, we test whether there is any significant difference in accuracy for TSC problems between nearest neighbour classifiers using these distance measures. Secondly, we test whether combining these elastic distance measures through simple ensemble schemes gives significantly better accuracy. We test these hypotheses by carrying out one of the largest experimental studies ever conducted into time series classification. Our first key finding is that there is no significant difference between the elastic distance measures in terms of classification accuracy on our data sets. Our second finding, and the major contribution of this work, is to define an ensemble classifier that significantly outperforms the individual classifiers. Nearly all TSC papers in the data mining literature cite DTW (with warping window set through cross validation) as the benchmark for comparison. We believe that our ensemble is the first ever classifier to significantly outperform DTW and as such raises the bar for future work in this area.
【Keywords】:
【Paper Link】 【Pages】:533-541
【Authors】: Zhongyi Hu ; Hongan Wang ; Jiaqi Zhu ; Maozhen Li ; Ying Qiao ; Changzhi Deng
【Abstract】: Plain text documents created and distributed on the Internet are ever changing in various forms. Mining topics of these documents has significant applications in many domains. Most of the literature is devoted to topic modeling, while sequential patterns of topics in document streams are ignored. Moreover, traditional sequential pattern mining algorithms mainly focused on frequent patterns for deterministic data sets, and thus not suitable for document streams with topic uncertainty and rare patterns. In this paper, we formulate and handle the mining problem of rare Sequential Topic Patterns (STPs) for Internet document streams, which are rare on the whole but relatively often for specific users, so also interesting. Since this type of rare STPs reflects users’ specific behaviors, our work can be applied in many fields, such as personalized context-aware recommendation and real-time monitoring on abnormal user behaviors on the Internet. We propose a novel approach to discovering user-related rare STPs based on the temporal and probabilistic information of concerned topics. After extracting topics from documents by LDA and sorting the document stream into sessions for different users during different time periods, the proposed algorithms discover rare STPs by (1) mining STP candidates for each user through an efficient algorithm based on pattern-growth, and (2) generating user-related rare STPs by pattern rarity analysis. Experiments on both synthetic and real data sets show that our approach can discover interesting rare STPs very effectively and efficiently.
【Keywords】:
【Paper Link】 【Pages】:542-550
【Authors】: Erich Schubert ; Arthur Zimek ; Hans-Peter Kriegel
【Abstract】: We analyse the interplay of density estimation and outlier detection in density-based outlier detection. By clear and principled decoupling of both steps, we formulate a generalization of density-based outlier detection methods based on kernel density estimation. Embedded in a broader framework for outlier detection, the resulting method can be easily adapted to detect novel types of outliers: while common outlier detection methods are designed for detecting objects in sparse areas of the data set, our method can be modified to also detect unusual local concentrations or trends in the data set if desired. It allows for the integration of domain knowledge and specific requirements. We demonstrate the flexible applicability and scalability of the method on large real world data sets.
【Keywords】:
【Paper Link】 【Pages】:551-559
【Authors】: Shannon M. Bell ; Stephen W. Edwards
【Abstract】: Health effects are unknown for the vast majority of the >83,000 chemicals in commerce, creating challenges in balancing industrial needs against the complex landscape of health susceptibilities and exposures. The National Health and Nutritional Examination Survey (NHANES), a large-scale survey aimed at determining the prevalence and risk factors of major diseases, is increasingly used to postulate relationships between chemicals and adverse health effects in the U.S. population. The interpretation of these studies is complicated, however, by the ad hoc data mining approaches typically employed. Here we describe the use of frequent itemset mining for identifying exposure ⇒ health associations in data from the NHANES 2005–2006 cycle. From 9,440 dichotomized samples, 983 two-itemset rules were generated describing associations between markers of health and environmental exposure (lift >1, response threshold >97.5th quantile). A case study using parathyroid hormone levels to develop exposure-health effect hypotheses is presented. This case study demonstrates how association rules can be used in data mining to facilitate hypothesis development and improve traditional regression models by identification of potentially confounding variables even in the presence of missing information. Our approach is designed to enable more effective knowledge discovery of potential health impacts of environmental chemicals by facilitating comprehensive data mining and meta-analysis of the NHANES dataset. Long-term, our representation of the information allows for integration with other disparate data, such as known biological pathways, to address the current data gaps.
【Keywords】:
【Paper Link】 【Pages】:560-568
【Authors】: Frank Höppner
【Abstract】: When comparing time series, z-normalization preprocessing and dynamic time warping (DTW) distance became almost standard procedure. This paper makes a point against carelessly using this setup by discussing implications and alternatives. A (conceptually) simpler distance measure is proposed that allows for a linear transformation of amplitude and time only, but is also open for other normalizations (unachievable by z-normalization preprocessing). Lower bounding techniques are presented for this measure that apply directly to raw series.
【Keywords】:
【Paper Link】 【Pages】:569-577
【Authors】: Peder A. Olsen ; Ramesh Natarajan ; Sholom M. Weiss
【Abstract】: We describe graphical model based methods for analyzing prescription and medical claims data in order to identify fraud and waste. Our approach draws on ideas from speech recognition and language modeling to identify patients, doctors and pharmacies whose prescription encounters show significant departure from normative behavior. We have analyzed claims data from a large healthcare provider, consisting of over 53 million individual prescription claims in the calendar year 2011.
【Keywords】:
【Paper Link】 【Pages】:578-586
【Authors】: Lisa Friedland ; Amanda Gentzel ; David Jensen
【Abstract】: Density estimation methods are often regarded as unsuitable for anomaly detection in high-dimensional data due to the difficulty of estimating multivariate probability distributions. Instead, the scores from popular distance- and local-density-based methods, such as local outlier factor (LOF), are used as surrogates for probability densities. We question this infeasibility assumption and explore a family of simple statistically-based density estimates constructed by combining a probabilistic classifier with a naive density estimate. Across a number of semi-supervised and unsupervised problems formed from real-world data sets, we show that these methods are competitive with LOF and that even simple density estimates that assume attribute independence can perform strongly. We show that these density estimation methods scale well to data with high dimensionality and that they are robust to the problem of irrelevant attributes that plagues methods based on local estimates.
【Keywords】:
【Paper Link】 【Pages】:587-595
【Authors】: Xiaojian Zhang ; Rui Chen ; Jianliang Xu ; Xiaofeng Meng ; Yingtao Xie
【Abstract】: Histograms are the workhorse of data mining and analysis. This paper considers the problem of publishing histograms under differential privacy, one of the strongest privacy models. Existing differentially private histogram publication schemes have shown that clustering (or grouping) is a promising idea to improve the accuracy of sanitized histograms. However, none of them fully exploits the benefit of clustering. In this paper, we introduce a new clustering framework. It features a sophisticated evaluation of the trade-off between the approximation error due to clustering and the Laplace error due to Laplace noise injected, which is normally overlooked in prior work. In particular, we propose three clustering strategies with different orders of run-time complexities. We prove the superiority of our approach by theoretical utility comparisons with the competitors. Our extensive experiments over various standard real-life and synthetic datasets confirm that our technique consistently outperforms existing competitors.
【Keywords】:
【Paper Link】 【Pages】:596-604
【Abstract】: This paper presents WCAMiner, a system focusing on detecting how concepts are associated by incorporating Wikipedia knowledge. We propose to combine content analysis and link analysis techniques over Wikipedia resources, and define various association mining models to interpret such queries. Specifically, our algorithm can automatically build a Concept Association Graph (CAG) from Wikipedia for two given topics of interest, and generate a ranked list of concept chains as potential associations between the two given topics. In comparison to traditional cross-document mining models where documents are usually domain-specific, the system proposed here is capable of handling different query scenarios across domains without being limited to the given documents. We highlight the importance of this problem in various domains, present experiments on different datasets and compare the mining results with two competitive baseline models to demonstrate the improved performance of our system.
【Keywords】:
【Paper Link】 【Pages】:605-613
【Authors】: Yexi Jiang ; Chang-Shing Perng ; Tao Li
【Abstract】: Event summarization is an effective process that mines and organizes event patterns to represent the original events. It allows the analysts to quickly gain the general idea of the events. In recent years, several event summarization algorithms have been proposed, but they all focus on how to find out the optimal summarization results, and are designed for one-time analysis. As event summarization is a comprehensive analysis work, merely handling this problem with a single optimal algorithm is not enough. In the absence of an integrated summarization solution, we propose an extensible framework – META – to enable analysts to easily and selectively extract and summarize events from different views with different resolutions. In this framework, we store the original events in a carefully-designed data structure that enables an efficient storage and multiresolution analysis. On top of the data model, we define a summarization language that includes a set of atomic operators to manipulate the meta-data. Furthermore, we present 5 commonly used summarization tasks, and show that all these tasks can be easily expressed by the language. Experimental evaluation on both real and synthetic datasets demonstrates the efficiency and effectiveness of our framework.
【Keywords】:
【Paper Link】 【Pages】:614-622
【Authors】: Jeffrey W. Lockhart ; Gary M. Weiss
【Abstract】: Activity recognition allows ubiquitous mobile devices like smartphones to be context-aware and also enables new applications, such as mobile health applications that track a user's activities over time. However, it is difficult for smartphone-based activity recognition models to perform well, since only a single body location is instrumented. Most research focuses on universal/impersonal activity recognition models, where the model is trained using data from a panel of representative users. In this paper we compare the performance of these impersonal models with those of personal models, which are trained using labeled data from the intended user, and hybrid models, which combine aspects of both types of models. Our analysis indicates that personal training data is required for high accuracy but that only a very small amount of training data is necessary. This conclusion led us to implement a self-training capability into our Actitracker smartphone-based activity recognition system[1], and we believe personal models can also benefit other activity recognition systems as well.
【Keywords】:
【Paper Link】 【Pages】:623-631
【Authors】: Charanpal Dhanjal ; Romaric Gaudel ; Stéphan Clémençon
【Abstract】: It is the main goal of this paper to propose a novel method to perform matrix completion on-line. Motivated by a wide variety of applications, ranging from the design of recommender systems to sensor network localization through seismic data reconstruction, we consider the matrix completion problem when entries of the matrix of interest are observed gradually. Precisely, we place ourselves in the situation where the predictive rule should be refined incrementally, rather than recomputed from scratch each time the sample of observed entries increases. The extension of existing matrix completion methods to the sequential prediction context is indeed a major issue in the Big Data era, and yet little addressed in the literature. The algorithm promoted in this article builds upon the SOFT IMPUTE approach introduced in [1]. The major novelty essentially arises from the use of a randomised technique for both computing and updating the Singular Value Decomposition (SVD) involved in the algorithm. Though of disarming simplicity, the method proposed turns out to be very efficient, while requiring reduced computations. Several numerical experiments based on real datasets illustrating its performance are displayed, together with preliminary results giving it a theoretical basis.
【Keywords】:
【Paper Link】 【Pages】:632-640
【Authors】: Xi C. Chen ; Abdullah Mueen ; Vijay K. Narayanan ; Nikos Karampatziakis ; Gagan Bansal ; Vipin Kumar
【Abstract】: Recent advances in high throughput data collection and storage technologies have led to a dramatic increase in the availability of high-resolution time series data sets in various domains. These time series reflect the dynamics of the underlying physical processes in these domains. Detecting changes in a time series over time or changes in the relationships among the time series in a data set containing multiple contemporaneous time series can be useful to detect changes in these physical processes. Contextual events detection algorithms detect changes in the relationships between multiple related time series. In this work, we introduce a new type of contextual events, called group level contextual change events. In contrast to individual contextual change events that reflect the change in behavior of one target time series against a context, group level events reflect the change in behavior of a target group of time series relative to a context group of time series. We propose an online framework to detect two types of group level contextual change events: (i) group formation (i.e., detecting when a set of multiple unrelated timeseries or groups of time series with little prior relationship in their behavior forms a new group of related time series) and (ii) group disbanding (i.e., detecting when one stable set of related time series disbands into two or more subgroups with little relationship in their behavior). We demonstrate this framework using 2 real world datasets and show that the framework detects group level contextual change events that can be explained by plausible causes.
【Keywords】:
【Paper Link】 【Pages】:641-649
【Authors】: Michael Kamp ; Mario Boley ; Thomas Gärtner
【Abstract】: Corporate earnings are a crucial indicator for investment and business valuation. Despite their importance and the fact that classic econometric approaches fail to match analyst forecasts by orders of magnitude, the automatic prediction of corporate earnings from public data is not in the focus of current machine learning research. In this paper, we present for the first time a fully automatized machine learning method for earnings prediction that at the same time a) only relies on publicly available data and b) can outperform human analysts. The latter is shown empirically in an experiment involving all S&P 100 companies in a test period from 2008 to 2012. The approach employs a simple linear regression model based on a novel feature space of stock market prices and their pairwise correlations. With this work we follow the recent trend of nowcasting, i.e., of creating accurate contemporary forecasts of undisclosed target values based on publicly observable proxy variables.1
【Keywords】:
【Paper Link】 【Pages】:650-658
【Authors】: Fabrizio Costa ; Mathias Verbeke ; Luc De Raedt
【Abstract】: Regularization is one of the key concepts in machine learning, but so far it has received only little attention in the logical and relational learning setting. Here we propose a regularization and feature selection technique for such setting, in which one commonly represents the structure of the domain using an entity-relationship model. To this end, we introduce a notion of locality that ties together features according to their proximity in a transformed representation of the relational learning problem obtained via a procedure that we call “graphicalization”. We present two techniques, a wrapper and an efficient embedded approach, to identify the most relevant sets of predicates which yields more readily interpretable results than selecting low-level propositionalized features. The proposed techniques are implemented in the kernel-based relational learner kLog, although the ideas presented here can also be adapted to other relational learning frameworks. We evaluate our approach on classification tasks in the natural language processing and bioinformatics domain.
【Keywords】:
【Paper Link】 【Pages】:659-667
【Authors】: Pieter Soons ; Ad Feelders
【Abstract】: We consider ordinal classification and instance ranking problems where each attribute is known to have an increasing or decreasing relation with the class label or rank. For example, it stands to reason that the number of query terms occurring in a document has a positive influence on its relevance to the query. We aim to exploit such monotonicity constraints by using labeled attribute vectors to draw conclusions about the class labels of order related unlabeled ones. Assuming we have a pool of unlabeled attribute vectors, and an oracle that can be queried for class labels, the central problem is to choose a query point whose label is expected to provide the most information. We evaluate different query strategies by comparing the number of inferred labels after some limited number of queries, as well as by comparing the prediction errors of models trained on the points whose labels have been determined so far. We present an efficient algorithm to determine the query point preferred by the well-known active learning strategy generalized binary search. This algorithm can be applied to binary classification on incomplete matrix orders. For non-binary classification, we propose to include attribute vectors in the training set whose class labels have not been uniquely determined yet. We perform experiments on artificial and real data.
【Keywords】:
【Paper Link】 【Pages】:668-676
【Authors】: Christos Giatsidis ; Bogdan Cautis ; Silviu Maniu ; Dimitrios M. Thilikos ; Michalis Vazirgiannis
【Abstract】: Lately, there has been an increased interest in signed networks with applications in trust, security, or social computing. This paper focuses on the issue of defining models and metrics for reciprocity in signed graphs. In unsigned directed networks, reciprocity quantifies the predisposition of network members in creating mutual connections. On the other hand, this concept has not yet been investigated in the case of signed graphs. We capitalize on the graph degeneracy concept to identify subgraphs of the signed network in which reciprocity is more likely to occur. This enables us to assess reciprocity at a global level, rather than at an exclusively local one as in existing approaches. The large scale experiments we perform on real world data sets of trust networks lead to both interesting and intuitive results. We believe these reciprocity measures can be used in various social applications such as trust management, community detection and evaluation of individual nodes. The global reciprocity we define in this paper is closely correlated to the clustering structure of the graph, more than the local reciprocity as it is indicated by the experimental evaluation we conducted.
【Keywords】:
【Paper Link】 【Pages】:677-685
【Authors】: Alejandro Correa Bahnsen ; Aleksandar Stojanovic ; Djamila Aouada ; Björn E. Ottersten
【Abstract】: Previous analysis has shown that applying Bayes minimum risk to detect credit card fraud leads to better results measured by monetary savings, as compared with traditional methodologies. Nevertheless, this approach requires good probability estimates that not only separate well between positive and negative examples, but also assess the real probability of the event. Unfortunately, not all classification algorithms satisfy this restriction. In this paper, two different methods for calibrating probabilities are evaluated and analyzed in the context of credit card fraud detection, with the objective of finding the model that minimizes the real losses due to fraud. Even though under-sampling is often used in the context of classification with unbalanced datasets, it is shown that when probabilistic models are used to make decisions based on minimizing risk, using the full dataset provides significantly better results. In order to test the algorithms, a real dataset provided by a large European card processing company is used. It is shown that by calibrating the probabilities and then using Bayes minimum Risk the losses due to fraud are reduced. Furthermore, because of the good overall results, the aforementioned card processing company is currently incorporating the methodology proposed in this paper into their fraud detection system. Finally, the methodology has been tested on a different application, namely, direct marketing.
【Keywords】:
【Paper Link】 【Pages】:686-694
【Authors】: Gerhard Sageder ; Maia Zaharieva ; Matthias Zeppelzauer
【Abstract】: Feature selection is applied to identify relevant and complementary features from a given high-dimensional feature set. In general, existing filter-based approaches operate on single (scalar) feature components and ignore the relationships among components of multidimensional features. As a result, generated feature subsets lack in interpretability and hardly provide insights into the underlying data. We propose an unsupervised, filter-based feature selection approach that preserves the natural assignment of feature components to semantically meaningful features. Experiments on different tasks in the audio domain show that the proposed approach outperforms well-established feature selection methods in terms of retrieval performance and runtime. Results achieved on different audio datasets for the same retrieval task indicate that the proposed method is more robust in selecting consistent feature sets across different datasets than compared approaches.
【Keywords】:
【Paper Link】 【Pages】:695-703
【Authors】: Ting Wang ; Fei Wang ; Reiner Sailer ; Douglas Lee Schales
【Abstract】: Network traffic attribution, namely, inferring users responsible for activities observed on network interfaces, is one fundamental yet challenging task in network security forensics. Compared with other user-system interaction records, network traces are inherently coarsegrained, context-sensitive, and detached from user ends. This paper presents Kaleido, a new network traffic attribution tool with a series of key features: a) it adopts a new class of inductive discriminant models to capture user- and context-specific patterns (“footprints”) from different aspects of network traffic; b) it applies efficient learning methods to extracting and aggregating such footprints from noisy historical traces; c) with the help of novel indexing structures, it is able to perform efficient, runtime traffic attribution over high-volume network traces. The efficacy of Kaleido is evaluated with extensive experimental studies using the real network traces collected over three months in a large enterprise network.
【Keywords】:
【Paper Link】 【Pages】:704-712
【Authors】: Marvin Meeng ; Wouter Duivesteijn ; Arno J. Knobbe
【Abstract】: Subgroup Discovery (SD) aims to find coherent, easy-to-interpret subsets of the dataset at hand, where something exceptional is going on. Since the resulting subgroups are defined in terms of conditions on attributes of the dataset, this data mining task is ideally suited to be used by non-expert analysts. The typical SD approach uses a heuristic beam search, involving parameters that strongly influence the outcome. Unfortunately, these parameters are often hard to set properly for someone who is not a data mining expert; correct settings depend on properties of the dataset, and on the resulting search landscape. To remove this potential obstacle for casual SD users, we introduce ROCsearch, a new ROC-based beam search variant for Subgroup Discovery. On each search level of the beam search, ROCsearch analyzes the intermediate results in ROC space to automatically determine a sensible search width for the next search level. Thus, beam search parameter setting is taken out of the domain expert's hands, lowering the threshold for using Subgroup Discovery. Also, ROCsearch automatically adapts its search behavior to the properties and resulting search landscape of the dataset at hand. Aside form these advantages, we also show that ROCsearch is an order of magnitude more efficient than traditional beam search, while its results are equivalent and on large datasets even better than traditional beam search results.
【Keywords】:
【Paper Link】 【Pages】:713-721
【Authors】: Ruilin Liu ; Wendy Hui Wang ; Changhe Yuan
【Abstract】: There has been considerable recent interest in the data-mining-as-a-service paradigm: the client that lacks computational resources outsources his/her data and data mining needs to a third-party service provider. One of the security issues of this outsourcing paradigm is how the client can verify that the service provider indeed has returned correct data mining results. In this paper, we focus on the problem of result verification of outsourced Bayesian network (BN) structure learning. We consider the untrusted service provider that intends to return wrong BN structures. We develop three efficient probabilistic verification approaches to catch the incorrect BN structure with high probability and cheap overhead. Our experimental results demonstrate that our verification methods can capture wrong BN structure effectively and efficiently.
【Keywords】:
【Paper Link】 【Pages】:722-730
【Authors】: Ke Wu ; Andrea Edwards ; Wei Fan ; Jing Gao ; Kun Zhang
【Abstract】: Data stream classification and imbalanced data learning are two important areas of data mining research. Each has been well studied to date with many interesting algorithms developed. However, only a few approaches reported in literature address the intersection of these two fields due to their complex interplay. In this work, we proposed an importance sampling driven, dynamic feature group weighting framework (DFGW-IS) for classifying data streams of imbalanced distribution. Two components are tightly incorporated into the proposed approach to address the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concepts are tackled by a weighted ensemble trained on a set of feature groups with each sub-classifier (i.e. a single classifier or an ensemble) weighed by its discriminative power and stable level. The un-even class distribution, on the other hand, is typically battled by the sub-classifier built in a specific feature group with the underlying distribution rebalanced by the importance sampling technique. We derived the theoretical upper bound for the generalization error of the proposed algorithm. We also studied the empirical performance of our method on a set of benchmark synthetic and real world data, and significant improvement has been achieved over the competing algorithms in terms of standard evaluation metrics and parallel running time. Algorithm implementations and datasets are available upon request.
【Keywords】:
【Paper Link】 【Pages】:731-739
【Authors】: Hessam Zakerzadeh ; Charu C. Aggarwal ; Ken Barker
【Abstract】: The curse of dimensionality has remained a challenge for a wide variety of algorithms in data mining, clustering, classification and privacy. Recently, it was shown that an increasing dimensionality makes the data resistant to effective privacy. The theoretical results seem to suggest that the dimensionality curse is a fundamental barrier to privacy preservation. However, in practice, we show that some of the common properties of real data can be leveraged in order to greatly ameliorate the negative effects of the curse of dimensionality. In real data sets, many dimensions contain high levels of inter-attribute correlations. Such correlations enable the use of a process known as vertical fragmentation in order to decompose the data into vertical subsets of smaller dimensionality. An information-theoretic criterion of mutual information is used in the vertical decomposition process. This allows the use of an anonymization process, which is based on combining results from multiple independent fragments. We present a general approach which can be applied to the k-anonymity, l-diversity, and t-closeness models. In the presence of inter-attribute correlations, such an approach continues to be much more robust in higher dimensionality, without losing accuracy. We present experimental results illustrating the effectiveness of the approach. This approach is resilient enough to prevent identity, attribute, and membership disclosure attack.
【Keywords】:
【Paper Link】 【Pages】:740-748
【Authors】: Sujatha Das Gollapalli ; Yanjun Qi ; Prasenjit Mitra ; C. Lee Giles
【Abstract】: Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation) supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with few fully-labeled examples. We study two types of labeled features: (1) Dictionary features provide unigram hints related to specific metadata fields, whereas, (2) Proximity features capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
【Keywords】:
【Paper Link】 【Pages】:749-757
【Authors】: Senzhang Wang ; Sihong Xie ; Xiaoming Zhang ; Zhoujun Li ; Philip S. Yu ; Xinyu Shu
【Abstract】: Researchers or students entering a emerging research area are particularly interested in what newly published papers will be most cited and which young researchers will become influential in the future, so that they can catch the most recent advances and find valuable research directions. However, predicting the future importance of scientific articles and authors is extremely hard due to the dynamic nature of literature networks and evolving research topics. Different from most previous studies aiming to rank the current importance of literature and authors, we focus on ranking the future popularity of new publications and young researchers by proposing a unified ranking model to combine various available information. Specifically, we first propose to use two kinds of text features, words and words co-occurrence to characterize innovative papers and authors. Then, instead of using static and un-weighted graphs, we construct time-aware weighted graphs to distinguish the various importance of links established at different time. Finally, by leveraging both the constructed text features and graphs, we propose a mutual reinforcement ranking framework called MRFRank to rank the future importance of papers and authors simultaneously. Experimental results on the ArnetMiner dataset show that the proposed approach significantly outperforms the baselines on the metric recommendation intensity.
【Keywords】:
【Paper Link】 【Pages】:758-766
【Authors】: Calvin Newport ; Lisa Singh ; Yiqing Ren
【Abstract】: More and more companies are providing data mining and analytics solutions to customers using social media data. The general approach taken by these companies is to continually collect data from social media sites and then use the collected snapshot of the content for a data mining or analytics task. Unfortunately, given the exponential increase in the volume of social media data, building local database snapshots and running computationally expensive algorithms is not always plausible. As an alternative to the centralized approach, in this paper, we study the feasibility of cooperative algorithms where data never leaves the mined social media network, and instead the network users themselves work together, using only the communication primitives provided by the social media site, to solve data mining problems. While cooperative algorithms can be built for many different data mining tasks, to show the viability of this approach, we focus on a task fundamental to many different social mining applications - membership detection (an individual using the social media site wants to efficiently get a request to a member of a known group with unknown membership). Using Twitter as our specific social graph, we seek cooperative algorithms that solve this problem with high probability even when we assume only a small fraction of the Twitter network participates and we enforce a bound on the number of tweets generated. After validating the potential of cooperative solutions on Twitter, we empirically evaluate a collection of cooperative strategies on a snapshot of the Twitter network containing over 50 million users. Our best solution, which we call brokered token passing, can reliably and efficiently detect group membership while requiring only a small number of tweets be sent and a small percentage of users participate.
【Keywords】:
【Paper Link】 【Pages】:767-775
【Authors】: Yuxuan Li ; James Bailey ; Lars Kulik ; Jian Pei
【Abstract】: Substring matching is fundamental to data mining methods for sequential data. It involves checking the existence of a short subsequence within a longer sequence, ensuring no gaps within a match. Whilst a large amount of existing work has focused on substring matching and mining techniques for certain sequences, there are only a few results for uncertain sequences. Uncertain sequences provide powerful representations for modelling sequence behavioural characteristics in emerging domains, such as bioinformatics, sensor streams and trajectory analysis. In this paper, we focus on the core problem of computing substring matching probability in uncertain sequences and propose an efficient dynamic programming algorithm for this task. We demonstrate our approach is both competitive theoretically, as well as effective and scalable experimentally. Our results contribute towards a foundation for adapting classic sequence mining methods to deal with uncertain data.
【Keywords】:
【Paper Link】 【Pages】:776-784
【Authors】: Wei Wei ; Junfu Yin ; Jinyan Li ; Longbing Cao
【Abstract】: Modeling high-dimensional dependence is widely studied to explore deep relations in multiple variables particularly useful for financial risk assessment. Very often, strong restrictions are applied on a dependence structure by existing high-dimensional dependence models. These restrictions disabled the detection of sophisticated structures such as asymmetry, upper and lower tail dependence between multiple variables. The paper proposes a partial regular vine copula model to relax these restrictions. The new model employs partial correlation to construct the regular vine structure, which is algebraically independent. This model is also able to capture the asymmetric characteristics among multiple variables by using two-parametric copula with flexible lower and upper tail dependence. Our method is tested on a cross-country stock market data set to analyse the asymmetry and tail dependence. The high prediction performance is examined by the Value at Risk, which is a commonly adopted evaluation measure in financial market.
【Keywords】:
【Paper Link】 【Pages】:785-793
【Authors】: Samir Al-Stouhi ; Chandan K. Reddy
【Abstract】: Researchers have attempted to improve the quality of clustering solutions through various mechanisms. A promising new approach to improve clustering quality is to combine data from multiple related datasets (tasks) and apply multi-task clustering. In this paper, we present a novel framework that can simultaneously cluster multiple tasks through balanced Intra-Task (within-task) and Inter-Task (between-task) knowledge sharing. We propose an effective and flexible geometric affine transformation (contraction or expansion) of the distances between Inter-Task and Intra-Task instances. This transformation allows for an improved Intra-Task clustering without overwhelming the individual tasks with the bias accumulated from other tasks. A constrained low-rank decomposition of this multi-task transformation will allow us to maintain the class distribution of the clusters within each individual task. We impose an Intra-Task soft orthogonality constraint to a Symmetric Non-Negative Matrix Factorization (NMF) based formulation to generate basis vectors that are near orthogonal within each task. Inducing orthogonal basis vectors within each task imposes the prior knowledge that a task should have orthogonal (independent) clusters. Using several real-world experiments, we demonstrate that the proposed framework produces improves clustering quality compared to the state-of-the-art methods proposed in literature.
【Keywords】:
【Paper Link】 【Pages】:794-802
【Authors】: Ya-Zhou Ren ; Carlotta Domeniconi ; Guoji Zhang ; Guo-Xian Yu
【Abstract】: The mean shift algorithm is a nonparametric clustering technique that does not make assumptions on the number of clusters and on their shapes. It achieves this goal by performing kernel density estimation, and iteratively locating the local maxima of the kernel mixture. The set of points that converge to the same mode defines a cluster. While appealing, the performance of the mean shift algorithm significantly deteriorates with high dimensional data due to the sparsity of the input space. In addition, noisy features can create challenges for the mean shift procedure. In this paper we extend the mean shift algorithm to overcome these limitations, while maintaining its desirable properties. To achieve this goal, we first estimate the relevant subspace for each data point, and then embed such information within the mean shift algorithm, thus avoiding computing distances in the full dimensional input space. The resulting approach achieves the best-of-two-worlds: effective management of high dimensional data and noisy features, while preserving a nonparametric nature. Our approach can also be combined with random sampling to speedup the clustering process with large scale data, without sacrificing accuracy. Extensive experimental results on both synthetic and real-world data demonstrate the effectiveness of the proposed method.
【Keywords】:
【Paper Link】 【Pages】:803-811
【Authors】: Lei Meng ; Ah-Hwee Tan
【Abstract】: Discovering social communities of web users through clustering analysis of heterogeneous link associations has drawn much attention. However, existing approaches typically require the number of clusters a prior, do not address the weighting problem for fusing heterogeneous types of links and have a heavy computational cost. In this paper, we explore the feasibility of a newly proposed heterogeneous data clustering algorithm, called Generalized Heterogeneous Fusion Adaptive Resonance Theory (GHF-ART), for discovering user communities in social networks. Different from existing algorithms, GHF-ART performs real-time matching of patterns and one-pass learning which guarantee its low computational cost. With a vigilance parameter to restrain the intra-cluster similarity, GHF-ART does not need the number of clusters a prior. To achieve a better fusion of multiple types of links, GHF-ART employs a weighting function to incrementally assess the importance of all the feature channels. Extensive experiments have been conducted to analyze the performance of GHF-ART on two heterogeneous social network data sets. The promising results comparing with existing methods demonstrate the effectiveness and efficiency of GHF-ART.
【Keywords】:
【Paper Link】 【Pages】:812-820
【Authors】: Tzu-Ming Kuo ; Ching-Pei Lee ; Chih-Jen Lin
【Abstract】: Learning to rank is an important task for recommendation systems, online advertisement and web search. Among those learning to rank methods, rankSVM is a widely used model. Both linear and nonlinear (kernel) rankSVM have been extensively studied, but the lengthy training time of kernel rankSVM remains a challenging issue. In this paper, after discussing difficulties of training kernel rankSVM, we propose an efficient method to handle these problems. The idea is to reduce the number of variables from quadratic to linear with respect to the number of training instances, and efficiently evaluate the pairwise losses. Our setting is applicable to a variety of loss functions. Further, general optimization methods can be easily applied to solve the reformulated problem. Implementation issues are also carefully considered. Experiments show that our method is faster than state-of-the-art methods for training kernel rankSVM.
【Keywords】:
【Paper Link】 【Pages】:821-829
【Authors】: Chuan Hu ; Huiping Cao ; Chaomin Ke
【Abstract】: Graphs have been widely used to represent objects and object connections in applications such as the Web, social networks, and citation networks. Mining influence relationships from graphs has gained interests in recent years because providing influence information about the object connections in graphs can facilitate graph exploration, graph search, and connection recommendations. In this paper, we study the problem of detecting influence aspects, on which objects are connected, and influence degree (or influence strength), with which one graph node influences another graph node on a given aspect. Existing techniques focus on inferring either the influence degrees or influence types from graphs. We propose two generative Aspect Influence Models, OAIM and LAIM, to detect both influence aspects and influence degrees. These models utilize the topological structure of the graphs, the text content associated with objects, and the context in which the objects are connected. We compare these two models with one baseline approach which considers only the text content associated with objects. The empirical studies on citation graphs and networks of users from Twitter show that our models can discover more effective results than the baseline approach.
【Keywords】:
【Paper Link】 【Pages】:830-838
【Authors】: Yun-Qian Miao ; Ahmed K. Farahat ; Mohamed S. Kamel
【Abstract】: Covariate shift is a challenging problem in supervised learning that results from the discrepancy between the training and test distributions. An effective approach which recently drew a considerable attention in the research community is to reweight the training samples to minimize that discrepancy. In specific, many methods are based on developing Density-ratio (DR) estimation techniques that apply to both regression and classification problems. Although these methods work well for regression problems, their performance on classification problems is not satisfactory. This is due to a key observation that these methods focus on matching the sample marginal distributions without paying attention to preserving the separation between classes in the reweighted space. In this paper, we propose a novel method for Discriminative Density-ratio (DDR) estimation that addresses the aforementioned problem and aims at estimating the density ratio of joint distributions in a class-wise manner. The proposed algorithm is an iterative procedure that alternates between estimating the class information for the test data and estimating a new density ratio for each class. To incorporate the estimated class information of the test data, a soft matching technique is proposed. In addition, we employ an effective criterion which adopts mutual information as an indicator to stop the iterative procedure while resulting in a decision boundary that lies in a sparse region. Experiments on synthetic and benchmark datasets demonstrate the superiority of the proposed method.
【Keywords】:
【Paper Link】 【Pages】:839-847
【Authors】: Davoud Moulavi ; Pablo A. Jaskowiak ; Ricardo J. G. B. Campello ; Arthur Zimek ; Jörg Sander
【Abstract】: One of the most challenging aspects of clustering is validation, which is the objective and quantitative assessment of clustering results. A number of different relative validity criteria have been proposed for the validation of globular, clusters. Not all data, however, are composed of globular clusters. Density-based clustering algorithms seek partitions with high density areas of points (clusters, not necessarily globular) separated by low density areas, possibly containing noise objects. In these cases relative validity indices proposed for globular cluster validation may fail. In this paper we propose a relative validation index for density-based, arbitrarily shaped clusters. The index assesses clustering quality based on the relative density connection between pairs of objects. Our index is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results. Experiments on synthetic and real world data show the effectiveness of our approach for the evaluation and selection of clustering algorithms and their respective appropriate parameters.
【Keywords】:
【Paper Link】 【Pages】:848-856
【Authors】: Chia-Tung Kuo ; Ian Davidson
【Abstract】: Tensors or multiway arrays are useful constructs capable of representing complex graphs, spatio-temporal data and multi-source data. Analyzing such complex data requires enforcing simplicity (such as sparsity) so as to give meaningful insights. Sparse tensor decomposition typically requires adding a penalty term in addition to the reconstruction error term to the optimization problem which provides a number of challenges. Not only is tuning the weights on the terms time consuming, but the resulting sparse decomposition typically has a nonuniform distribution of sparsity. Such decomposition is undesirable in some datasets (such as fMRI scans) where we want each factor to have similar sparsity for easier interpretation. In this paper, we propose an alternative method that can directly obtain a sparse, nonnegative and interpretable decomposition without the need for tuning. We allow the user to specify the exact amount of sparsity and where the sparsity should be. This is achieved by augmenting the alternating nonnegative least squares (ANLS) algorithm with a projection step rather than imposing a penalty term on the objective function. We demonstrate our works usefulness in finding interpretable features for real world problems in fMRI scan data and face image analysis without the need to pre-process the data.
【Keywords】:
【Paper Link】 【Pages】:857-865
【Authors】: Qingyang Hu ; Kevin Chiew ; Hao Huang ; Qinming He
【Abstract】: Data sets collected from crowdsourcing platforms are well known for their cheap costs. But cheap costs may lead to low quality, i.e., labels may be incorrect or missing. Most of the existing work focuses on modeling the labeling errors of crowd workers, but missing labels can also cause problems when modeling the data. In this paper, we present an algorithm to predict the missing labels of crowd workers, in which we adopt thoughts from semi-supervised learning and utilize the particular consistency between crowd workers. We also define the consistency between workers by crowd labels and develop an algorithm to learn them from the data automatically. Experiments on both benchmark and real data show that our algorithm outperforms traditional semi-supervised learning algorithms in predicting missing labels, and the recovered crowd labels are capable of predicting the ground truth and reflecting real properties of crowd workers.
【Keywords】:
【Paper Link】 【Pages】:866-874
【Authors】: Yuanyuan Liu ; Fanhua Shang ; Hong Cheng ; James Cheng ; Hanghang Tong
【Abstract】: Most existing low-n-rank minimization algorithms for tensor completion suffer from high computational cost due to involving multiple singular value decompositions (SVDs) at each iteration. To address this issue, we propose a novel factor matrix trace norm minimization method for tensor completion problems. Based on the CANDECOMP/PARAFAC (CP) decomposition, we first formulate a factor matrix rank minimization model by deducing the relation between the rank of each factor matrix and the mode-n rank of a tensor. Then, we introduce a tractable relaxation of our rank function, which leads to a convex combination problem of much smaller scale matrix nuclear norm minimization. Finally, we develop an efficient alternating direction method of multipliers (ADMM) scheme to solve the proposed problem. Experimental results on both synthetic and real-world data validate the effectiveness of our approach. Moreover, our method is significantly faster than the state-of-the-art approaches and scales well to handle large datasets.
【Keywords】:
【Paper Link】 【Pages】:875-883
【Authors】: Jinsong Lan ; Cheng Long ; Raymond Chi-Wing Wong ; Youyang Chen ; Yanjie Fu ; Danhuai Guo ; Shuguang Liu ; Yong Ge ; Yuanchun Zhou ; Jianhui Li
【Abstract】: Trajectory data is becoming more and more popular nowadays and extensive studies have been conducted on trajectory data. One important research direction about trajectory data is the anomaly detection which is to find all anomalies based on trajectory patterns in a road network. In this paper, we introduce a road segment-based anomaly detection problem, which is to detect the abnormal road segments each of which has its “real” traffic deviating from its “expected” traffic and to infer the major causes of anomalies on the road network. First, a deviation-based method is proposed to quantify the anomaly of reach road segment. Second, based on the observation that one anomaly from a road segment can trigger other anomalies from the road segments nearby, a diffusion-based method based on a heat diffusion model is proposed to infer the major causes of anomalies on the whole road network. To validate our methods, we conduct intensive experiments on a large real-world GPS dataset of about 23,000 taxis in Shenzhen, China to demonstrate the performance of our algorithms.
【Keywords】:
【Paper Link】 【Pages】:884-892
【Authors】: Roel Bertens ; Arno Siebes
【Abstract】: When a seismologist analyses a new seismogram it is often useful to have access to a set of similar seismograms. For example if she tries to determine the event, if any, that caused the particular readings on her seismogram. So, the question is: when are two seismograms similar? To define such a notion of similarity, we first preprocess the seismogram by a wavelet decomposition, followed by a discretisation of the wavelet coefficients. Next we introduce a new type of patterns on the resulting set of aligned symbolic time series. These patterns, called block patterns, satisfy an Apriori property and can thus be found with a levelwise search. Next we use MDL to define when a set of such patterns is characteristic for the data. We introduce the MuLTi-Krimp algorithm to find such code sets. In experiments we show that these code sets are both good at distinguishing between dissimilar seismograms and good at recognising similar seismograms. Moreover, we show how such a code set can be used to generate a synthetic seismogram that shows what all seismograms in a cluster have in common.
【Keywords】:
【Paper Link】 【Pages】:893-901
【Authors】: Avradeep Bhowmik ; Vivek S. Borkar ; Dinesh Garg ; Madhavan Pallan
【Abstract】: We consider the team formation problem where the goal is to find a team of experts for a specific project. In the past, several attempts have been made to formulate this problem and each formulation focuses only on a subset of design criteria such as skill coverage, social compatibility, economy, skill redundancy, etc. In this paper, for the first time, we show that most of the important design criteria for this problem can be fully modeled within one single formulation as an unconstrained submodular function maximization problem. In our formulation, the submodular function turns out to be non-negative and non-monotone. The maximization of this class of submodular function is much less explored than its monotone constrained counterpart. A few recent works [7] [4] have come up with a simulated annealing based randomized approximation scheme for this problem with an approximation ratio of 0.41. In this paper, we customize this algorithm to our formulation and conduct an extensive set of experiments to show its efficacy. Our proposed formulation offers several advantageous features over the existing formulations including skill cover softening, better team communication, and connectivity relaxation. Unlike previous formulations, the skill cover softening feature allows a designer to specify Must Have and Should Have skills. Similarly, through better team communication, we avoid the restriction that all the communication among team members should pass through only the team members and not the outsiders. Finally, connectivity relaxation feature alleviates the constraint of whole team being connected and thereby, lowering the cost.
【Keywords】:
【Paper Link】 【Pages】:902-910
【Authors】: Yunlong He ; Koray Kavukcuoglu ; Yun Wang ; Arthur Szlam ; Yanjun Qi
【Abstract】: In this paper, we propose a new unsupervised feature learning framework, namely Deep Sparse Coding (DeepSC), that extends sparse coding to a multi-layer architecture for visual object recognition tasks. The main innovation of the framework is that it connects the sparse-encoders from different layers by a sparse-to-dense module. The sparse-to-dense module is a composition of a local spatial pooling step and a low-dimensional embedding process, which takes advantage of the spatial smoothness information in the image. As a result, the new method is able to learn multiple layers of sparse representations of the image which capture features at a variety of abstraction levels and simultaneously preserve the spatial smoothness between the neighboring image patches. Combining the feature representations from multiple layers, DeepSC achieves the state-of-the-art performance on multiple object recognition tasks.
【Keywords】:
【Paper Link】 【Pages】:911-919
【Authors】: Yun Gao ; Wei Zhou ; Zhang Zhang ; Jizhong Han ; Dan Meng ; Zhiyong Xu
【Abstract】: Nowadays, log sequences mining techniques are widely used in detecting anomalies for Internet services. The state-of-the-art anomaly detection methods either need significant computational costs, or require specific assumptions that the test logs are holding certain data distribution patterns in order to be effective. Therefore, it is very difficult to achieve real time responses and it greatly reduces the effectiveness of these mechanisms in reality. To address these issues, we propose an innovative anomaly detection strategy called CADM. In CADM, the relative entropy between test logs and normal logs is exploited to discover the anomalous levels. Instead of calculating the relative entropy based on certain predefined data distribution models, our solution inspects the relationship between relative entropy and compression size with an improved grammar-based compression method. No assumptions are needed. In addition, our mechanism has excellent scalability with only O(n) computational complexity. It can generate the detection results on the fly. Experimental analysis with both synthetic and real world logs proves that CADM is superior to the other methods. It can achieve very high anomaly detection accuracy with the minimal computational overhead. It is suitable for log mining tasks and can be applied on a broad variety of application fields.
【Keywords】:
【Paper Link】 【Pages】:920-928
【Authors】: Xiangnan Kong ; Zhaoming Wu ; Li-Jia Li ; Ruofei Zhang ; Philip S. Yu ; Hang Wu ; Wei Fan
【Abstract】: Multi-label learning deals with the classification problems where each instance can be assigned with multiple labels simultaneously. Conventional multi-label learning approaches mainly focus on exploiting label correlations. It is usually assumed, explicitly or implicitly, that the label sets for training instances are fully labeled without any missing labels. However, in many real-world multi-label datasets, the label assignments for training instances can be incomplete. Some ground-truth labels can be missed by the labeler from the label set. This problem is especially typical when the number instances is very large, and the labeling cost is very high, which makes it almost impossible to get a fully labeled training set. In this paper, we study the problem of large-scale multi-label learning with incomplete label assignments. We propose an approach, called Mpu, based upon positive and unlabeled stochastic gradient descent and stacked models. Unlike prior works, our method can effectively and efficiently consider missing labels and label correlations simultaneously, and is very scalable, that has linear time complexities over the size of the data. Extensive experiments on two real-world multi-label datasets show that our Mpu model consistently outperform other commonly-used baselines.
【Keywords】:
【Paper Link】 【Pages】:929-937
【Authors】: Yan Zhou ; Murat Kantarcioglu
【Abstract】: Many data mining applications operate in adversarial environment, for example, webpage ranking in the presence of web spam. A growing number of adversarial data mining techniques are recently developed, providing robust solutions under specific defense-attack models. Existing techniques are tied to distributional assumptions geared towards minimizing the undesirable impact of given attack models. However, the large variety of attack strategies renders the adversarial learning problem multimodal. Therefore, it calls for a more flexible modeling ideology for equivocal input. In this paper we present a Bayesian hierarchical mixtures of experts for adversarial learning. The technique groups data into soft partitions and fits simple function approximators, referred to as “experts”, within each. Experts are ranked using gating functions for each input. Ambiguous input is predicted competitively by multiple experts, while unambiguous input is effectively predicted by a single expert. Optimal attacks minimizing the likelihood of malicious data are modeled interactively at both expert and gating levels in the learning hierarchy. We demonstrate that our adversarial hierarchical-mixtures-of-experts learning model is robust against adversarial attacks on both artificial and real data.
【Keywords】:
【Paper Link】 【Pages】:938-946
【Authors】: Jiliang Tang ; Xia Hu ; Huiji Gao ; Huan Liu
【Abstract】: Feature selection has been proven to be efficient in preparing high dimensional data for data mining and machine learning. As most data is unlabeled, unsupervised feature selection has attracted more and more attention in recent years. Discriminant analysis has been proven to be a powerful technique to select discriminative features for supervised feature selection. To apply discriminant analysis, we usually need label information which is absent for unlabeled data. This gap makes it challenging to apply discriminant analysis for unsupervised feature selection. In this paper, we investigate how to exploit discriminant analysis in unsupervised scenarios to select discriminative features. We introduce the concept of pseudo labels, which enable discriminant analysis on unlabeled data, propose a novel unsupervised feature selection framework DisUFS which incorporates learning discriminative features with generating pseudo labels, and develop an effective algorithm for DisUFS. Experimental results on different types of real-world data demonstrate the effectiveness of the proposed framework DisUFS.
【Keywords】:
【Paper Link】 【Pages】:947-955
【Authors】: Reza Zafarani ; Huan Liu
【Abstract】: With the emergence of numerous social media sites, individuals, with their limited time, often face a dilemma of choosing a few sites over others. Users prefer more engaging sites, where they can find familiar faces such as friends, relatives, or colleagues. Link prediction methods help find friends using link or content information. Unfortunately, whenever users join any site, they have no friends or any content generated. In this case, sites have no chance other than recommending random influential users to individuals hoping that users by befriending them create sufficient information for link prediction techniques to recommend meaningful friends. In this study, by considering social forces that form friendships, namely, influence, homophily, and confounding, and by employing minimum information available for users, we demonstrate how one can significantly improve random predictions without link or content information. In addition, contrary to the common belief that similarity between individuals is the essence of forming friendships, we show that it is the similarity that one exhibits to the friends of another individual that plays a more decisive role in predicting their future friendship.
【Keywords】:
【Paper Link】 【Pages】:956-964
【Authors】: Wei Gong ; Ee-Peng Lim ; Feida Zhu ; Palakorn Achananuparp ; David Lo
【Abstract】: Gaming expertise is usually accumulated through playing or watching many game instances, and identifying critical moments in these game instances called turning points. Turning point rules (shorten as TPRs) are game patterns that almost always lead to some irreversible outcomes. In this paper, we formulate the notion of irreversible outcome property which can be combined with pattern mining so as to automatically extract TPRs from any given game datasets. We specifically extend the well-known PrefixSpan sequence mining algorithm by incorporating the irreversible outcome property. To show the usefulness of TPRs, we apply them to Tetris, a popular game. We mine TPRs from Tetris games and generate challenging game sequences so as to help training an intelligent Tetris algorithm. Our experiment results show that 1) TPRs can be found from historical game data automatically with reasonable scalability, 2) our TPRs are able to help Tetris algorithm perform better when it is trained with challenging game sequences.
【Keywords】:
【Paper Link】 【Pages】:965-973
【Authors】: Yijun Zhao ; Carla E. Brodley ; Tanuja Chitnis ; Brian C. Healy
【Abstract】: Predicting disease course is critical in chronic progressive diseases such as multiple sclerosis (MS). In our work we are applying machine learning methods to longitudinal records of MS patients to build a classifier that predicts whether a patient will have a significant increase in disability at the five year mark using information from the first two years of clinical visits. This prediction is key for choosing among the available treatments as some have more troubling side-effect profiles. Two challenges arise while learning with this data. First, patient data involves the physician's (possibly subjective) evaluation. Because a patient's data may come from one doctor on the first visit and different doctors on subsequent visits, it can be difficult to form an accurate predictor of disease outcome. In particular, some physicians may be biased in one direction, scoring each patient as more severe than would other physicians, while others may be biased in the opposite direction. Another challenge is that it is much easier to classify the cases with low future disability compared to cases with high future disability at early stages of the disease. This asymmetric property is due to the nature of the disease rather than class imbalance. In this paper we introduce a new transfer learning approach to handle these challenges. The algorithm builds a single SVM classifier for each doctor by dividing the entire dataset into primary (instances from the doctor) and auxiliary (instances from other doctors) sets. When applied to our dataset of MS patients our new approach is able to realize a significant increase in prediction performance over approaches that form a single SVM classifier for the entire dataset or for the physician's dataset alone.
【Keywords】:
【Paper Link】 【Pages】:974-982
【Authors】: Chris S. Magnano ; Ameet Soni ; Sriraam Natarajan ; Gautam Kunapuli
【Abstract】: Improvements in medical imaging techniques have provided clinicians the ability to obtain detailed brain images of patients at lower costs. This increased availability of rich data opens up new avenues of research that promise better understanding of common brain ailments such as Alzheimer's Disease and dementia. Improved data mining techniques, however, are required to leverage these new data sets to identify intermediate disease states (e.g., mild cognitive impairment) and perform early diagnosis. We propose a graphical model framework based on conditional random fields (CRFs) to mine MRI brain images. As a proof-of-concept, we apply CRFs to the problem of brain tissue segmentation. Experimental results show robust and accurate performance on tissue segmentation comparable to other state-of-the-art segmentation methods. In addition, results show that our algorithm generalizes well across data sets and is less susceptible to outliers. Our method relies on minimal prior knowledge unlike atlas-based techniques, which assume images map to a normal template. Our results show that CRFs are a promising model for tissue segmentation, as well as other MRI data mining problems such as anatomical segmentation and disease diagnosis where atlas assumptions are unreliable in abnormal brain images.
【Keywords】:
【Paper Link】 【Pages】:983-991
【Authors】: Prithwish Basu ; Feng Yu ; Amotz Bar-Noy ; Dror Rawitz
【Abstract】: Time-varying graphs (T-graph) consist of a time-evolving set of graph snapshots (or graphlets). A T-graph property with potential applications in both computer and social network forensics is T-reachability, which identifies the nodes reachable from a source node using the T-graph edges over time period T. In this paper, we consider the problem of estimating the T-reachable set of a source node in two different settings - when a time-evolution of a T-graph is specified by a probabilistic model, and when the actual T-graph snapshots are known and given to us offline (“data aware” setting). Since the value of T could be large in many applications, we propose two simple techniques, namely T-graph sampling and T-graph smashing for significantly reducing the complexity of this computation, while minimizing the estimation error. We show that for the data-aware case, both T-graph sampling and smashing problems are NP-hard, but they are amenable to reasonably good approximations. We also show that for the probabilistic setting where each graphlet in a T-graph is an Erdos-Renyi random graph, sampling yields a loose lower bound for the T-reachable set, while different styles of smashing yield more useful upper and lower bounds. Finally, we show that our algorithms (both data-aware and data-oblivious) can estimate the T-reachable set in real world time-varying networks within reasonable accuracy using less than 0.5% of the number of graphlets.
【Keywords】:
【Paper Link】 【Pages】:992-1000
【Authors】: Mahdi Pakdaman Naeini ; Iyad Batal ; Zitao Liu ; Charmgil Hong ; Milos Hauskrecht
【Abstract】: This paper studies multi-label classification problem in which data instances are associated with multiple, possibly high-dimensional, label vectors. This problem is especially challenging when labels are dependent and one cannot decompose the problem into a set of independent classification problems. To address the problem and properly represent label dependencies we propose and study a pairwise conditional random Field (CRF) model. We develop a new approach for learning the structure and parameters of the CRF from data. The approach maximizes the pseudo likelihood of observed labels and relies on the fast proximal gradient descend for learning the structure and limited memory BFGS for learning the parameters of the model. Empirical results on several datasets show that our approach outperforms several multi-label classification baselines, including recently published state-of-the-art methods.
【Keywords】:
【Paper Link】 【Pages】:1001-1009
【Authors】: Gowtham Atluri ; Michael Steinbach ; Kelvin O. Lim ; Angus W. MacDonald III ; Vipin Kumar
【Abstract】: The focus of this paper is to address the problem of discovering groups of time series that share similar behavior in multiple small intervals of time. This problem has two characteristics: i) There are exponentially many combinations of time series that needs to be explored to find these groups, ii) The groups of time series of interest need to have similar behavior only in some subsets of the time dimension. We present an Apriori based approach to address this problem. We evaluate it on a synthetic dataset and demonstrate that our approach can directly find all groups of intermittently correlated time series without finding spurious groups unlike other alternative approaches that find many spurious groups. We also demonstrate, using a neuroimaging dataset, that groups of intermittently coherent time series discovered by our approach are reproducible on independent sets of time series data. In addition, we demonstrate the utility of our approach on an S&P 500 stocks data set.
【Keywords】:
【Paper Link】 【Pages】:1010-1018
【Authors】: Steve Harenberg ; Ramona G. Seay ; Stephen Ranshous ; Kanchana Padmanabhan ; Jitendra K. Harlalka ; Eric R. Schendel ; Michael P. O'Brien ; Rada Chirkova ; William Hendrix ; Alok N. Choudhary ; Vipin Kumar ; Murali Doraiswamy ; Nagiza F. Samatova
【Abstract】: Community detection in real-world graphs presents a number of challenges. First, even if the number of detected communities grows linearly with the graph size, it becomes impossible to manually inspect each community for value added to the application knowledge base. Mining for communities with query nodes as knowledge priors could allow for filtering out irrelevant information and for enriching end-users knowledge associated with the problem of interest, such as discovery of genes functionally associated with the Alzheimer's (AD) biomarker genes. Second, the data-intensive nature of community enumeration challenges current approaches that often assume that the input graph and the detected communities fit in memory. As computer systems scale, DRAM memory sizes are not expected to increase linearly, while technologies such as SSD memories have the potential to provide much higher capacities at a lower power-cost point, and have a much lower latency than disks. Out-of-core algorithms and/or database-inspired indexing could provide an opportunity for different design optimizations for query-driven community detection algorithms tuned for emerging architectures. Therefore, this work addresses the need for query-driven and memory-efficient community detection. Using maximal cliques as the community definition, due to their high signal-to-noise ratio, we propose and systematically compare two contrasting methods: indexed-based and out-of-core. Both methods improve peak memory efficiency as much as 1000X compared to the state-of-the-art. However, the index-based method, which also has a 10-to-100-fold run time reduction, outperforms the out-of-core algorithm in most cases. The achieved scalability enables the discovery of diseases that are known to be or likely associated with Alzheimer's when the genome-scale network is mined with AD biomarker genes as knowledge priors.
【Keywords】:
【Paper Link】 【Pages】:1019-1027
【Authors】: Vladimir Coric ; Nemanja Djuric ; Slobodan Vucetic
【Abstract】: As mobile devices are becoming pervasive, participatory sensing is becoming an attractive way of collecting large quantities of valuable location-based data. An important participatory sensing application is traffic monitoring, where GPS-enabled smartphones can provide invaluable information about traffic conditions. In this paper we propose a strategy for frugal sensing in which the participants send only a fraction of the observed traffic information to reduce costs while achieving high accuracy. The strategy is based on autonomous sensing, in which participants make decisions to send traffic information without guidance from the central server, thus reducing the communication overhead and improving privacy. We propose to use traffic flow theory in deciding whether or not to send an observation to the server. To provide accurate and computationally efficient estimation of the current traffic, we propose to use a budgeted version of the Gaussian Process model on the server side. The model is tuned to be robust to missing observations, which occur quite often in frugal sensing. The experiments on real-life traffic data set indicate that the proposed approach can use up to two orders of magnitude less samples than a baseline approach when estimating traffic speed on a highway network, with only a negligible loss in accuracy.
【Keywords】:
【Paper Link】 【Pages】:1028-1036
【Authors】: Raed I. Seetan ; Anne M. Denton ; Omar Al Azzam ; Ajay Kumar ; Jazarai Sturdivant ; Shahryar F. Kianian
【Abstract】: Building comprehensive radiation hybrid maps for large sets of markers is a computationally expensive process, since the basic mapping problem is equivalent to the traveling salesman problem. The mapping problem is also susceptible to noise, and as a result, it is often beneficial to remove markers that are not trustworthy. The resulting framework maps are typically more reliable but don't provide information about as many markers. We present an approach to mapping most markers by first creating a framework map and then incrementally adding the remaining markers. We consider chromosomes of the human genome, for which the correct ordering is known, and compare the performance of our two-stage algorithm with the Carthagene radiation hybrid mapping software. We show that our approach is not only much faster than mapping the complete genome in one step, but that the quality of the resulting maps is also much higher.
【Keywords】:
【Paper Link】 【Pages】:1037-1045
【Authors】: Sucheta Soundarajan ; Tina Eliassi-Rad ; Brian Gallagher
【Abstract】: We consider the problem of determining how similar two networks (without known node-correspondences) are. This problem occurs frequently in real-world applications such as transfer learning and change detection. Many network-similarity methods exist; and it is unclear how one should select from amongst them. We provide the first empirical study on the relationships between different network-similarity methods. Specifically, we present (1) an approach for identifying groups of comparable network-similarity methods and (2) an approach for computing the consensus among a given set of network-similarity methods. We compare and contrast twenty network-similarity methods by applying our approaches to a variety of real datasets spanning multiple domains. Our experiments demonstrate that (1) different network-similarity methods are surprisingly well correlated, (2) some complex network-similarity methods can be closely approximated by a much simpler method, and (3) a few network-similarity methods produce rankings that are very close to the consensus ranking.
【Keywords】:
【Paper Link】 【Pages】:1046-1054
【Authors】: Huiwen Yu ; Dayu Yuan
【Abstract】: The problem of subgraph search in large graphs has wide applications in both nature and social science. The subgraph search results are typically ordered based on graph similarity score. In this paper, we study the problem of ranking the subgraph search results based on diversification. We design two ranking measures based on both similarity and diversity, and formalize the problem as an optimization problem. We give two efficient algorithms, the greedy selection and the swapping selection with provable performance guarantee. We also propose a novel local search heuristic with at least 100 times speedup and a similar solution quality. We demonstrate the efficiency and effectiveness of our approaches via extensive experiments.
【Keywords】:
【Paper Link】 【Pages】:1055-1063
【Authors】: Sanjoy Dey ; György J. Simon ; Bonnie L. Westra ; Michael Steinbach ; Vipin Kumar
【Abstract】: Mining patterns from electronic health-care records (EHR) can potentially lead to better and more cost-effective treatments. We aim to find the groups of ICD-9 diagnosis codes from EHRs that can predict the improvement of urinary incontinence of home health care (HHC) patients and also are interpretable to domain experts. In this paper, we propose two approaches for increasing the interpretability of the obtained groups of ICD-9 codes. First, we incorporate prior information available from clinical domain knowledge using the clinical classification system (CCS). Second, we incorporate additional types of clinical information for the same patients, such as demographic, behavioral, physiological, and psycho-social variables available from survey questions during the hospital visits. Finally, we develop a hybrid framework that can combine both prior information and the data-driven clinical information in the predictive model framework. Our results obtained from a large-scale EHR data set show that the hybrid framework enhances clinical interpretability as compared to the baseline model obtained from ICD-9 codes only, while achieving almost the same predictive capability.
【Keywords】:
【Paper Link】 【Pages】:1064-1072
【Authors】: Vignesh Veppur Sankaranarayanan ; Junaed Sattar ; Laks V. S. Lakshmanan
【Abstract】: Cricket is a popular sport played by 16 countries, is the second most watched sport in the world after soccer, and enjoys a multi-million dollar industry. There is tremendous interest in simulating cricket and more importantly in predicting the outcome of games, particularly in their one-day international format. The complex rules governing the game, along with the numerous natural parameters affecting the outcome of a cricket match present significant challenges for accurate prediction. Multiple diverse parameters, including but not limited to cricketing skills and performances, match venues and even weather conditions can significantly affect the outcome of a game. The sheer number of parameters, along with their interdependence and variance create a non-trivial challenge to create an accurate quantitative model of a game Unlike other sports such as basketball and baseball which are well researched from a sports analytics perspective, for cricket, these tasks have yet to be investigated in depth. In this paper, we build a prediction system that takes in historical match data as well as the instantaneous state of a match, and predicts future match events culminating in a victory or loss. We model the game using a subset of match parameters, using a combination of linear regression and nearest-neighbor clustering algorithms. We describe our model and algorithms and finally present quantitative results, demonstrating the performance of our algorithms in predicting the number of runs scored, one of the most important determinants of match outcome.
【Keywords】:
【Paper Link】 【Pages】:1073-1081
【Authors】: Masaaki Nishino ; Norihito Yasuda ; Shin-ichi Minato ; Masaaki Nagata
【Abstract】: We propose a method for accelerating matrix multiplications that are iteratively performed with a sparse adjacency matrix. These operations appear in a wide range of data analyses and data mining situations, which include the computation of Personalized PageRank (PPR) and Nonnegative Matrix Factorization (NMF). We exploit the fact that the intermediate computational results for the matrix multiplication of equivalent partial row vectors of a matrix are the same. Our new data structure, the adjacency forest, uses this property and represents an adjacency matrix as a rooted tree that is made by sharing the common suffixes of the row vectors of the matrix. By exploiting the structure of the tree, we can perform a matrix multiplication while sharing intermediate computational results to reduce the number of required operations. We also show that we can further accelerate computation by dividing a matrix into several sub-matrices and representing the original matrix as a forest. We confirm experimentally that our approach can speed up the computation of Personalized PageRank and NMF up to 300%.
【Keywords】: