Proceedings of the Eleventh SIAM International Conference on Data Mining, SDM 2011, April 28-30, 2011, Mesa, Arizona, USA. SIAM / Omnipress 【DBLP Link】
【Paper Link】 【Pages】:1-12
【Authors】: Shin Ando ; Einoshin Suzuki ; Yoichi Seki ; Theerasak Thanongphongphan ; Daisuke Hoshino
【Abstract】: This paper addresses an application of anomaly detection from subsequences of time series (STS) to autonomous robots' behaviors. An important aspect of mining sequential data is selecting the temporal parameters, such as the subsequence length and the degree of smoothing. For example, in the task at hand, the patterns of the robot's velocity, one of its fundamental features, vary significantly with the interval used to measure displacement. Selecting the time scale and resolution is difficult in unsupervised settings, and is often more critical than the choice of the method. In this paper, we propose an ensemble framework for aggregating anomaly detection from different perspectives, i.e., settings of user-defined temporal parameters. In the proposed framework, each behavior is labeled as anomalous or not under multiple settings. The set of labels is used as meta-features of the respective behaviors. Cluster analysis in the meta-feature space partitions anomalous behaviors pertaining to a specific range of parameters. The framework also includes a scalable implementation of instance-based anomaly detection. We evaluate the proposed framework by ROC analysis, in comparison to conventional ensemble methods for anomaly detection.
【Keywords】:
【Paper Link】 【Pages】:13-24
【Authors】: Hans-Peter Kriegel ; Peer Kröger ; Erich Schubert ; Arthur Zimek
【Abstract】: Outlier scores provided by different outlier models differ widely in their meaning, range, and contrast, and hence are not easily comparable or interpretable. We propose a unification of outlier scores provided by various outlier models and a translation of the arbitrary “outlier factors” to values in the range [0, 1], interpretable as the probability of a data object being an outlier. As an application, we show that this unification facilitates enhanced ensembles for outlier detection.
【Keywords】:
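The unification idea in this abstract can be sketched in a few lines. The sketch below uses a simple Gaussian scaling; the paper derives model-specific transformations, so treat this only as an illustration of the general shape, not the proposed method:

```python
import math

def unify_scores(scores):
    """Map arbitrary raw outlier scores onto [0, 1].

    Sketch of the unification idea: standardize the raw scores, then
    push each z-score through the Gaussian error function so that
    larger raw scores map to values near 1, interpretable as the
    probability of being an outlier. Illustrative only.
    """
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n) or 1.0
    # Clamp negative values to 0 so inliers get probability 0 rather
    # than a negative number.
    return [max(0.0, math.erf((s - mean) / (std * math.sqrt(2.0))))
            for s in scores]

raw = [0.21, 0.25, 0.23, 0.22, 5.0]   # one clearly extreme raw score
unified = unify_scores(raw)
```

Scores unified this way from several detectors live on a common [0, 1] scale and can then be averaged, which is the sense in which the unification enables outlier ensembles.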
【Paper Link】 【Pages】:25-34
【Authors】: Roberto J. Bayardo ; Biswanath Panda
【Abstract】: Identifying the extremal (minimal and maximal) sets from a collection of sets is an important subproblem in the areas of data mining and satisfiability checking. For example, extremal set finding algorithms are used in the context of mining maximal frequent itemsets, and for simplifying large propositional satisfiability instances derived from real-world tasks such as circuit routing and verification. In this paper, we describe two new algorithms for the task and detail their performance on real and synthetic data. Each algorithm leverages an entirely different principle: one primarily exploits set cardinality constraints, the other lexicographic constraints. Despite the inherent difficulty of this problem (the best known worst-case bounds are nearly quadratic), we show that both algorithms provide excellent performance in practice, and can identify all extremal sets from multi-gigabyte itemset data using only a single processor core. Both algorithms are concise and can be implemented in no more than a few hundred lines of code. Our reference C++ implementations are open source and available for download.
【Keywords】:
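The cardinality principle mentioned in the abstract can be conveyed by a naive quadratic baseline: sort sets by size, then keep a set only if no already-kept (smaller) set is contained in it. This is a toy sketch for the definition of minimality, not the paper's optimized algorithms:

```python
def minimal_sets(collection):
    """Return the minimal sets of a collection (those with no proper
    subset in the collection). Sorting by cardinality guarantees that
    any potential subset of `s` has already been examined when `s` is
    considered. Naive O(n^2) subset testing; illustrative only.
    """
    result = []
    for s in sorted(collection, key=len):
        if not any(m <= s for m in result):   # m <= s is the subset test
            result.append(s)
    return result

sets = [frozenset({1, 2}), frozenset({1, 2, 3}), frozenset({3, 4})]
mins = minimal_sets(sets)
```

Maximal sets are found symmetrically by reversing the sort order and the direction of the subset test.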
【Paper Link】 【Pages】:35-46
【Authors】: Chih-Hua Tai ; Philip S. Yu ; De-Nian Yang ; Ming-Syan Chen
【Abstract】: How to protect individual privacy in public data is always a concern. For social networks, the challenge is that the structure of the social network graph can be utilized to infer private and sensitive information about users. Existing anonymity schemes mostly focus on the anonymity of vertex identities, so that a malicious attacker cannot associate a user with a specific vertex. In real social networks, however, each vertex is usually associated with not only a vertex identity but also a community identity, which can represent information that is private and sensitive to the user, such as a political party affiliation or disease status. In this paper, we first show that an attacker can still infer the community identity of a user even when the graph is protected by previous anonymity schemes. We then propose structural diversity, which ensures that, for every vertex in the graph, at least k communities contain vertices with the same degree, to provide anonymity of the community identities. Specifically, we formulate a new problem, k-Structural Diversity Anonymization (k-SDA), which protects the community identity of each individual when publishing social networks. We propose an Integer Programming formulation to find optimal solutions to k-SDA. Moreover, we devise three scalable heuristics, each with a different perspective, to solve large instances of k-SDA. Experiments on real data sets demonstrate the practical utility of our privacy model and our approaches.
【Keywords】:
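The structural diversity condition stated in the abstract is easy to check directly from vertex degrees and community labels. A small sketch (the dictionary representation and names are illustrative, not the paper's):

```python
from collections import defaultdict

def is_k_structurally_diverse(degree, community, k):
    """Check the k-SDA condition from the abstract: for every vertex,
    at least k distinct communities must contain some vertex of the
    same degree. `degree` and `community` map vertex -> value.
    """
    comms_of_degree = defaultdict(set)
    for v, d in degree.items():
        comms_of_degree[d].add(community[v])
    return all(len(comms_of_degree[d]) >= k for d in degree.values())

# Four vertices, two degrees, two communities: each degree value
# occurs in both communities, so the graph is 2-structurally diverse.
degree = {"a": 1, "b": 1, "c": 2, "d": 2}
community = {"a": 0, "b": 1, "c": 0, "d": 1}
```

The hard part the paper addresses is not this check but modifying the graph minimally so the condition holds.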
【Paper Link】 【Pages】:47-58
【Authors】: Myunghwan Kim ; Jure Leskovec
【Abstract】: While social and information networks have become ubiquitous, the challenge of collecting complete network data still persists. Often the collected network data is incomplete, with nodes and edges missing. Commonly, only a part of the network can be observed, and we would like to infer the unobserved part. We address this issue by studying the Network Completion Problem: given a network with missing nodes and edges, can we complete the missing part? We cast the problem in the Expectation Maximization (EM) framework, where we use the observed part of the network to fit a model of network structure, estimate the missing part of the network using the model, re-estimate the parameters, and so on. We combine the EM algorithm with the Kronecker graphs model and design a scalable Metropolized Gibbs sampling approach that allows for the estimation of the model parameters as well as inference about the missing nodes and edges of the network. Experiments on synthetic and several real-world networks show that our approach can effectively recover the network even when about half of the nodes are missing. Our algorithm outperforms not only classical link-prediction approaches but also the state-of-the-art stochastic block modeling approach. Furthermore, our algorithm easily scales to networks with tens of thousands of nodes.
【Keywords】:
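The Kronecker graphs model at the core of the EM loop builds an edge-probability matrix by repeatedly taking the Kronecker product of a small initiator matrix with itself. A sketch of that construction (the initiator values below are illustrative, not fitted parameters):

```python
def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    q = len(b)
    n = len(a) * q
    return [[a[i // q][j // q] * b[i % q][j % q] for j in range(n)]
            for i in range(n)]

def kronecker_edge_probs(theta, k):
    """k-th Kronecker power of the initiator `theta`; entry (i, j)
    gives the probability of edge i -> j in a Kronecker graph with
    len(theta)**k nodes."""
    result = theta
    for _ in range(k - 1):
        result = kron(result, theta)
    return result

theta = [[0.9, 0.5],
         [0.5, 0.1]]                       # illustrative 2x2 initiator
probs = kronecker_edge_probs(theta, 2)     # 4x4 edge probabilities
```

Because the whole graph is parameterized by the handful of initiator entries, the E-step only has to reason about node orderings and missing entries, which is what makes the approach scalable.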
【Paper Link】 【Pages】:59-70
【Authors】: Fei Wang ; Jimeng Sun ; Shahram Ebadollahi
【Abstract】: Patient similarity assessment is an important task in patient cohort identification for comparative effectiveness studies and clinical decision support applications. The goal is to derive a clinically meaningful distance metric that measures the similarity between patients represented by their key clinical indicators. It is desirable to learn the distance metric from experts' knowledge of clinical similarity among subjects. However, different physicians often have different understandings of patient similarity based on the specifics of the cases. The distance metric learned for each individual physician often gives a limited view of the true underlying metric. The key challenge is how to integrate the individual distance metrics obtained from a group of physicians into a globally consistent unified metric. In this paper, we propose the Composite Distance Integration (Comdi) approach. We first construct discriminative neighborhoods from each individual metric, then combine them into a single optimal distance metric. We formulate Comdi as a quadratic optimization problem and propose an efficient alternating strategy to find the optimal solution. Besides learning a globally consistent metric, Comdi provides an elegant way to share knowledge across multiple experts (physicians) without sharing the underlying data, which enables privacy-preserving collaboration. Our experiments on several benchmark data sets show approximately 10% improvement in classification accuracy over the baseline, suggesting that Comdi is an effective and general metric learning approach. We also present an application of our approach to real patient data.
【Keywords】:
【Paper Link】 【Pages】:71-82
【Authors】: Ghim-Eng Yap ; Ah-Hwee Tan ; HweeHwa Pang
【Abstract】: The presence of noise or errors in the stated feature values of biomedical data can lead to incorrect predictions. We introduce a Bayesian Network-based Noise Correction framework named BN-NC. After data preprocessing, a Bayesian Network (BN) is learned to capture the feature dependencies. Using the BN to predict each feature in turn, BN-NC estimates a feature's error rate as the deviation between its predicted and stated values in the training data, and allocates the appropriate uncertainty to its subsequent findings during prediction. BN-NC automatically generates a probabilistic rule to explain the BN prediction on the class variable using the feature values in its Markov blanket, and this is reapplied as necessary to explain the noise correction on those features. Using three real-life benchmark biomedical data sets (on HIV-1 drug resistance prediction and leukemia subtype classification), we demonstrate that BN-NC (1) accurately detects errors in biomedical feature values, (2) automatically corrects the errors to maintain higher prediction accuracy than competing methods including decision trees, naive Bayes, and support vector machines, and (3) generates probabilistic rules that concisely explain the prediction and noise correction decisions. In addition to achieving more robust biomedical prediction in the presence of feature noise, BN-NC highlights erroneous features and explains their corrections, providing medical researchers with high-utility insights into biomedical data that other methods do not offer.
【Keywords】:
【Paper Link】 【Pages】:83-94
【Authors】: Elisa Boari de Lima ; Raquel Cardoso de Melo Minardi ; Wagner Meira Jr. ; Mohammed Javeed Zaki
【Abstract】: When multiple data sources are available for clustering, an a priori data integration process is usually required. This process may be costly and may not lead to good clusterings, since important information is likely to be discarded. In this paper we propose constrained clustering as a strategy for integrating data sources without losing any information. It basically consists of adding the complementary data sources as constraints that the algorithm must satisfy. As a concrete application of our approach, we focus on the problem of enzyme function prediction, which is a hard task usually performed by intensive experimental work. We use constrained clustering as a means of integrating information from diverse sources as constraints, and analyze how this additional information impacts clustering quality in an enzyme clustering application scenario. Our results show that constraints generally improve the clustering quality when compared to an unconstrained clustering algorithm.
【Keywords】:
【Paper Link】 【Pages】:95-106
【Authors】: Varun Chandola ; Ranga Raju Vatsavai
【Abstract】: Online time series change detection is a critical component of many monitoring systems, such as space and airborne remote sensing instruments, cardiac monitors, and network traffic profilers, which continuously analyze observations recorded by sensors. Data collected by such sensors typically has a periodic component. Most existing time series change detection methods are not directly applicable to such data, either because they are not designed to handle periodic time series or because they cannot operate in an online mode. We propose an online change detection algorithm which can handle periodic time series. The algorithm uses a Gaussian process based non-parametric time series prediction model and monitors the difference between the predictions and actual observations within a statistical control chart framework to identify changes. A key challenge in using a Gaussian process in an online mode is the need to solve a large system of equations involving the associated covariance matrix, which grows with every time step. The proposed algorithm exploits the special structure of the covariance matrix and can analyze a time series of length T in O(T^2) time while maintaining an O(T) memory footprint, compared to the O(T^4) time and O(T^2) memory requirements of standard matrix manipulation methods. We experimentally demonstrate the superiority of the proposed algorithm over several existing time series change detection algorithms on a set of synthetic and real time series. Finally, we illustrate the effectiveness of the proposed algorithm for identifying land use and land cover changes using Normalized Difference Vegetation Index (NDVI) data collected for an agricultural region in the state of Iowa, USA. Our algorithm is able to detect different types of changes in an NDVI validation data set (with ≈ 80% accuracy) which occur due to crop type changes as well as natural disasters.
【Keywords】:
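The predict-then-monitor pattern described in the abstract can be sketched with a much simpler predictor: a running per-phase mean stands in for the paper's Gaussian process model, and a 3-sigma rule on past residuals stands in for the control chart. Everything here is an illustrative simplification:

```python
import math

def detect_changes(series, period, threshold=3.0):
    """Online change detection for a periodic series: predict each
    observation from the running mean of the same phase in earlier
    periods, and raise an alarm when the residual deviates from the
    historical residuals by more than `threshold` standard deviations
    (a basic control-chart rule). Toy stand-in for the GP model.
    """
    phase_hist = [[] for _ in range(period)]
    residuals = []
    alarms = []
    for t, x in enumerate(series):
        hist = phase_hist[t % period]
        if hist:
            pred = sum(hist) / len(hist)
            r = x - pred
            if len(residuals) >= period:       # need some history first
                mu = sum(residuals) / len(residuals)
                sd = math.sqrt(sum((e - mu) ** 2 for e in residuals)
                               / len(residuals)) or 1e-9
                if abs(r - mu) > threshold * sd:
                    alarms.append(t)
            residuals.append(r)
        hist.append(x)
    return alarms

series = [0, 10] * 8 + [0, 30]   # regular period-2 signal, then a spike
alarms = detect_changes(series, period=2)
```

The paper's contribution is doing the prediction step with a Gaussian process while keeping the per-step cost low by exploiting the covariance matrix structure; the monitoring logic has the shape shown above.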
【Paper Link】 【Pages】:107-118
【Authors】: Jaya Kawale ; Michael Steinbach ; Vipin Kumar
【Abstract】: Pressure dipoles are important long-distance climate phenomena (teleconnections) characterized by pressure anomalies of opposite polarity appearing at two different locations at the same time. Such dipoles have proven important for understanding and explaining the variability in climate in many regions of the world; e.g., the El Niño climate phenomenon is known to be responsible for precipitation and temperature anomalies worldwide. This paper presents a novel approach for dipole discovery that outperforms existing state-of-the-art algorithms. Our approach is based on a climate anomaly network constructed using the correlations among time series of climate variables at all locations on the Earth. One novel aspect of our approach is a careful treatment of negative correlations, whose proper consideration is critical for finding dipoles. Another key insight provided by our work is the importance of modeling the time-dependent patterns of the dipoles in order to better capture the impact of important climate phenomena on land. The results presented in this paper show that these innovations allow our approach to produce better results than previous approaches in terms of matching existing climate indices with high correlation and capturing the impact of climate indices on land.
【Keywords】:
【Paper Link】 【Pages】:119-130
【Authors】: U. Kang ; Spiros Papadimitriou ; Jimeng Sun ; Hanghang Tong
【Abstract】: Node centrality measures are important in a large number of graph applications, from search and ranking to social and biological network analysis. In this paper we study node centrality for very large graphs, up to billions of nodes and edges. Various definitions for centrality have been proposed, ranging from very simple (e.g., node degree) to more elaborate. However, measuring centrality in billion-scale graphs poses several challenges. Many of the “traditional” definitions such as closeness and betweenness were not designed with scalability in mind. Therefore, it is very difficult, if not impossible, to compute them both accurately and efficiently. In this paper, we propose centrality measures suitable for very large graphs, as well as scalable methods to effectively compute them. More specifically, we propose effective closeness and LineRank which are designed for billion-scale graphs. We also develop algorithms to compute the proposed centrality measures in MapReduce, a modern paradigm for large-scale, distributed data processing. We present extensive experimental results on both synthetic and real datasets, which demonstrate the scalability of our approach to very large graphs, as well as interesting findings and anomalies.
【Keywords】:
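For contrast with the scalable measures the abstract proposes, the "traditional" closeness it calls impractical can be computed exactly by breadth-first search; this is feasible only on small graphs, which is exactly the paper's point:

```python
from collections import deque

def closeness(adj, source):
    """Exact closeness of `source`: the number of reachable nodes
    divided by the sum of shortest-path distances to them. Running one
    BFS per node makes all-nodes closeness O(n * (n + m)), which does
    not scale to billion-node graphs."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    reached = len(dist) - 1
    return reached / total if total else 0.0

path = {0: [1], 1: [0, 2], 2: [1]}   # a 3-node path graph
```

The paper's "effective closeness" replaces the exact distance sums with approximations that can be computed in a constant number of MapReduce passes.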
【Paper Link】 【Pages】:131-142
【Authors】: Duen Horng Chau ; Carey Nachenberg ; Jeffrey Wilhelm ; Adam Wright ; Christos Faloutsos
【Abstract】: We present Polonium, a novel Symantec technology that detects malware through large-scale graph inference. Based on the scalable Belief Propagation algorithm, Polonium infers every file's reputation, flagging files with low reputation as malware. We evaluated Polonium with a billion-node graph constructed from the largest file submissions dataset ever published (60 terabytes). Polonium attained a high true positive rate of 87% in detecting malware; in the field, Polonium lifted the detection rate of existing methods by 10 absolute percentage points. We detail Polonium's design and implementation features instrumental to its success. Polonium has served 120 million people and helped answer more than one trillion queries for file reputation.
【Keywords】:
【Paper Link】 【Pages】:143-153
【Authors】: Hanghang Tong ; Ching-Yung Lin
【Abstract】: Given an IP source-destination traffic network, how do we spot misbehaving IP sources (e.g., port scanners)? How do we find strange users in a user-movie rating graph? Moreover, how can we present the results intuitively so that they are easier for data analysts to interpret? We propose NrMF, a non-negative residual matrix factorization framework, to address such challenges. We present an optimization formulation as well as an effective algorithm to solve it. Our method naturally captures abnormal behaviors on graphs. In addition, the proposed algorithm is linear in the size of the graph and is therefore suitable for large graphs. Experimental results on several data sets validate its effectiveness as well as its efficiency.
【Keywords】:
【Paper Link】 【Pages】:154-163
【Authors】: Yasuo Tabei ; Koji Tsuda
【Abstract】: Similarity search in databases of labeled graphs is a fundamental task in managing graph data such as XML, chemical compounds, and social networks. Typically, a graph is decomposed into a set of substructures (e.g., paths, trees, and subgraphs) and a similarity measure is defined via the number of common substructures. With this representation, graphs can be stored in a document database by regarding graphs as documents and substructures as words. A graph similarity query then translates to a semi-conjunctive query that retrieves graphs sharing at least k substructures with the query graph. We argue that this kind of query cannot be solved efficiently by conventional inverted indexes, and develop a novel recursive search algorithm on wavelet trees (Grossi et al., SODA'03). Unlike gIndex, it does not require frequent subgraph mining for indexing. In experiments, our method was successfully applied to 25 million chemical compounds.
【Keywords】:
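The semantics of the semi-conjunctive query is easy to pin down with the conventional inverted-index counting baseline, i.e., the approach the paper argues does not scale and replaces with wavelet trees. A sketch of the semantics only:

```python
from collections import Counter

def semi_conjunctive(index, query_words, k):
    """Return the ids of documents (graphs) sharing at least k words
    (substructures) with the query. `index` maps each word to the list
    of document ids containing it. This is the inverted-index counting
    baseline: it must touch every posting of every query word.
    """
    counts = Counter()
    for w in query_words:
        for doc in index.get(w, ()):
            counts[doc] += 1
    return {doc for doc, c in counts.items() if c >= k}

index = {"a": [1, 2], "b": [2, 3], "c": [2]}   # substructure -> graph ids
hits = semi_conjunctive(index, ["a", "b", "c"], k=2)
```

The wavelet-tree algorithm answers the same query while pruning whole groups of documents whose common-word count can no longer reach k.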
【Paper Link】 【Pages】:164-175
【Authors】: Berkant Savas ; Inderjit S. Dhillon
【Abstract】: In this paper we present a fast and accurate procedure called clustered low rank matrix approximation for massive graphs. The procedure involves a fast clustering of the graph, after which each cluster is approximated separately using existing methods, e.g., the singular value decomposition or stochastic algorithms. The cluster-wise approximations are then extended to approximate the entire graph. This approach has several benefits: (1) important community structure of the graph is preserved due to the clustering; (2) highly accurate low rank approximations are achieved; (3) the procedure is efficient in terms of both computational speed and memory usage; (4) it delivers better performance than standard low rank approximation in problems from various applications. Further, we generalize stochastic algorithms to the clustered low rank approximation framework and present theoretical bounds for the approximation error. Finally, a set of experiments using large-scale, real-world graphs shows that our methods outperform standard low rank matrix approximation algorithms.
【Keywords】:
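The cluster-then-approximate idea can be sketched end to end on a toy matrix: approximate each diagonal block (one per cluster) by a rank-1 factor obtained by power iteration, which stands in for the truncated SVD. This illustrates only the block structure, not the paper's extension step or error bounds:

```python
def rank1_approx(M, iters=100):
    """Dominant rank-1 approximation of a small matrix via power
    iteration (a stand-in for a truncated SVD)."""
    n, m = len(M), len(M[0])
    v = [1.0] * m
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(m)) for i in range(n)]
        norm = sum(x * x for x in u) ** 0.5 or 1.0
        u = [x / norm for x in u]
        v = [sum(M[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * v[j] for j in range(m)] for i in range(n)]

def clustered_low_rank(M, clusters):
    """Approximate each diagonal block (rows/columns of one cluster)
    separately and assemble the block-diagonal result."""
    n = len(M)
    A = [[0.0] * n for _ in range(n)]
    for idx in clusters:
        block = [[M[i][j] for j in idx] for i in idx]
        approx = rank1_approx(block)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                A[i][j] = approx[a][b]
    return A

# Two communities, each an exactly rank-1 block.
M = [[1, 2, 0, 0],
     [2, 4, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
A = clustered_low_rank(M, [(0, 1), (2, 3)])
```

Because each cluster gets its own factors, community structure survives in the approximation, which a single global low-rank factorization tends to blur.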
【Paper Link】 【Pages】:176-187
【Authors】: Byron C. Wallace ; Kevin Small ; Carla E. Brodley ; Thomas A. Trikalinos
【Abstract】: The active learning (AL) framework is an increasingly popular strategy for reducing the amount of human labeling effort required to induce a predictive model. Most work in AL has assumed that a single, infallible oracle provides labels requested by the learner at a fixed cost. However, real-world applications suitable for AL often include multiple domain experts who provide labels of varying cost and quality. We explore this multiple expert active learning (MEAL) scenario and develop a novel algorithm for instance allocation that exploits the meta-cognitive abilities of novice (cheap) experts in order to make the best use of the experienced (expensive) annotators. We demonstrate that this strategy outperforms strong baseline approaches to MEAL on both a sentiment analysis dataset and two datasets from our motivating application of biomedical citation screening. Furthermore, we provide evidence that novice labelers are often aware of which instances they are likely to mislabel.
【Keywords】:
【Paper Link】 【Pages】:188-198
【Authors】: Wei Liu ; Sanjay Chawla
【Abstract】: In this paper, we study the problem of data skewness. A data set is skewed/imbalanced if its dependent variable is asymmetrically distributed. Dealing with skewed data sets has been identified as one of the ten most challenging problems in data mining research. We address the problem of class skewness for supervised learning models that are based on optimizing a regularized empirical risk function; these include both classification and regression models for discrete and continuous dependent variables. Classical empirical risk minimization is akin to minimizing the arithmetic mean of prediction errors, an approach in which the induction process is biased towards the majority class on skewed data. To overcome this drawback, we propose a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness. We note that minimizing the quadratic mean is a convex optimization problem and hence can be solved efficiently for large, high-dimensional data. Comprehensive experiments demonstrate that the QMLearn model significantly outperforms existing statistical learners, including logistic regression, support vector machines, linear regression, support vector regression, and quantile regression.
【Keywords】:
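The contrast at the heart of the abstract, the arithmetic mean of pooled errors versus the quadratic mean of class-wise mean errors, shows up clearly on a toy skewed case. QMLearn itself minimizes a regularized risk, so this is only the bare statistic:

```python
import math

def arithmetic_mean_loss(errors_pos, errors_neg):
    """Pooled arithmetic mean of per-example errors (classical ERM)."""
    pooled = errors_pos + errors_neg
    return sum(pooled) / len(pooled)

def quadratic_mean_loss(errors_pos, errors_neg):
    """Quadratic mean of the two class-wise mean errors, the quantity
    QMLearn minimizes instead. It cannot be driven down by getting the
    majority class right while sacrificing the minority class."""
    m_pos = sum(errors_pos) / len(errors_pos)
    m_neg = sum(errors_neg) / len(errors_neg)
    return math.sqrt((m_pos ** 2 + m_neg ** 2) / 2.0)

majority = [0.0] * 99   # 99 majority-class examples, all correct
minority = [1.0]        # 1 minority-class example, misclassified
am = arithmetic_mean_loss(majority, minority)
qm = quadratic_mean_loss(majority, minority)
```

The pooled arithmetic mean barely registers the misclassified minority example, while the quadratic mean stays large until both class-wise errors are small, which is why minimizing it resists majority-class bias.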
【Paper Link】 【Pages】:199-210
【Authors】: Hao Xia ; Steven C. H. Hoi
【Abstract】: Multiple kernel learning (MKL) has been shown to be a promising machine learning technique for data mining tasks, as it integrates multiple diverse kernel functions. Traditional MKL methods often formulate the problem as an optimization task of learning both the optimal combination of kernels and the classifier, and attempt to resolve this challenging optimization task by various techniques. Unlike existing MKL methods, in this paper we investigate a boosting framework for multiple kernel learning in classification tasks. In particular, we present a novel framework, Multiple Kernel Boosting (MKBoost), which applies boosting techniques to learning kernel-based classifiers with multiple kernels. Based on the proposed framework, we develop several variants of MKBoost algorithms and examine their empirical performance in comparison to several state-of-the-art MKL algorithms on classification tasks. Experimental results show that the proposed method is more effective and efficient than existing MKL techniques.
【Keywords】:
【Paper Link】 【Pages】:211-222
【Authors】: Ming-Syan Chen ; Keng-Pei Lin
【Abstract】: Training support vector machines (SVMs) with nonlinear kernel functions on large-scale data is usually very time-consuming. In contrast, there exist faster solvers for training linear SVMs. We propose a technique that closely approximates the infinite-dimensional implicit feature mapping of the Gaussian kernel function by a low-dimensional feature mapping. By explicitly mapping data to the low-dimensional features, efficient linear SVM solvers can be applied to train the Gaussian kernel SVM, which leverages the efficiency of linear SVM solvers to train a nonlinear SVM. Experimental results show that the proposed technique is very efficient and achieves classification accuracy comparable to that of a standard nonlinear SVM solver.
【Keywords】:
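One well-known way to build an explicit low-dimensional map whose inner products approximate the Gaussian kernel is the random Fourier feature construction of Rahimi and Recht; the paper's approximation is its own construction, so the sketch below only illustrates the kind of object being built:

```python
import math
import random

def random_fourier_features(x, omegas, offsets):
    """Explicit feature map z(x) with z(x).z(y) ~ exp(-|x-y|^2/(2s^2)),
    the random Fourier feature construction (Rahimi & Recht), shown as
    one concrete realization of a low-dimensional Gaussian-kernel map.
    """
    d = len(omegas)
    scale = math.sqrt(2.0 / d)
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(omegas, offsets)]

random.seed(0)
d, dim, sigma = 4000, 2, 1.0
# For kernel exp(-|x - y|^2 / (2 sigma^2)), draw omega ~ N(0, 1/sigma^2).
omegas = [[random.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
          for _ in range(d)]
offsets = [random.uniform(0.0, 2.0 * math.pi) for _ in range(d)]

x, y = [0.5, 0.2], [0.1, -0.3]
zx = random_fourier_features(x, omegas, offsets)
zy = random_fourier_features(y, omegas, offsets)
approx = sum(a * b for a, b in zip(zx, zy))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y))
                 / (2.0 * sigma ** 2))
```

Training a linear SVM on z(x) then approximates the Gaussian-kernel SVM, which is exactly the leverage the abstract describes.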
【Paper Link】 【Pages】:223-234
【Authors】: Shirish Krishnaj Shevade ; Balamurugan P. ; S. Sundararajan ; S. Sathiya Keerthi
【Abstract】: In many real-world prediction problems the output is a structured object such as a sequence, a tree, or a graph. Such problems range from natural language processing to computational biology and computer vision, and have been tackled with what are referred to as structured output learning algorithms. We consider the problem of structured classification. In the last few years, large margin classifiers like support vector machines (SVMs) have shown much promise for structured output learning. The related optimization problem is a convex quadratic program (QP) with a large number of constraints, which makes the problem intractable for large data sets. This paper proposes a fast sequential dual method (SDM) for structural SVMs. The method makes repeated passes over the training set and optimizes the dual variables associated with one example at a time. Additional heuristics make the proposed method more efficient. We present an extensive empirical evaluation of the proposed method on several sequence learning problems. Our experiments on large data sets demonstrate that the proposed method is an order of magnitude faster than state-of-the-art methods such as the cutting-plane method and stochastic gradient descent (SGD). Further, SDM reaches steady-state generalization performance faster than SGD. The proposed SDM is thus a useful alternative for large scale structured output learning.
【Keywords】:
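The "optimize the dual variables of one example at a time" pattern is easiest to see on a plain binary linear SVM, where it is the classical dual coordinate descent update; the paper's SDM extends this per-example idea to structural SVMs, so treat this as background, not the proposed method:

```python
def dual_coordinate_descent(X, y, C=1.0, epochs=50):
    """Dual coordinate descent for a binary linear SVM with hinge loss:
    sweep the training set repeatedly, optimizing the single dual
    variable alpha_i of one example at a time while keeping
    w = sum_i alpha_i * y_i * x_i up to date incrementally.
    """
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    for _ in range(epochs):
        for i in range(n):
            # Gradient of the dual objective w.r.t. alpha_i.
            g = y[i] * sum(w[j] * X[i][j] for j in range(d)) - 1.0
            q = sum(v * v for v in X[i])
            if q == 0.0:
                continue
            old = alpha[i]
            alpha[i] = min(max(old - g / q, 0.0), C)  # project onto [0, C]
            delta = (alpha[i] - old) * y[i]
            for j in range(d):
                w[j] += delta * X[i][j]
    return w

# Toy separable data; the second feature acts as a bias term.
X = [[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [1, 1, -1, -1]
w = dual_coordinate_descent(X, y)
```

Each update touches only one example and costs O(d), which is why per-example dual methods make repeated cheap passes instead of solving the full QP.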
【Paper Link】 【Pages】:235-246
【Authors】: Esa Junttila ; Petteri Kaski
【Abstract】: A binary matrix is fully nested if its columns form a chain of subsets; that is, any two columns are ordered by the subset relation, where we view each column as a subset of the rows indicated by the 1-entries. A binary matrix is k-nested if its columns can be partitioned into k pairwise disjoint blocks, each of which is fully nested. Such nested patterns are encountered, for example, in presence/absence patterns of species in ecological data. We study the automated discovery of k-nestedness on synthetic data and real ecological data. First, we show that k-nestedness can be efficiently discovered in a noise-free setting using a polynomial-time algorithm. Second, we show that it is NP-hard to find a k-nested matrix that minimizes the Hamming distance to a given dataset. Thus, it is likely that in the presence of noise no efficient algorithm exists for discovering k-nestedness in the general case. Third, we develop and evaluate multiple heuristic algorithms for discovering k-nestedness on noisy synthetic data. The methods based on a combination of singular value decomposition and k-means++ give the best performance in terms of structure discovery and noise tolerance. Fourth, we develop an MDL-based model selection technique for assessing nestedness, and discover k-nested structure in (a) paleontological data, and (b) geographical occurrence data for mammal species in Europe.
【Keywords】:
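The definition of full nestedness in this abstract translates directly into a short check: view each column as the set of rows holding a 1, and verify the columns form a chain under the subset relation. A noise-free sketch only; the paper's algorithms handle partitioning into k blocks and noisy data:

```python
def is_fully_nested(matrix):
    """Check that the columns of a binary matrix (list of rows) form a
    chain of subsets. Sorting the column-sets by size means a chain
    exists iff each set is contained in the next one.
    """
    cols = [{i for i, row in enumerate(matrix) if row[j]}
            for j in range(len(matrix[0]))]
    cols.sort(key=len)
    return all(a <= b for a, b in zip(cols, cols[1:]))

nested = [[1, 1, 1],
          [0, 1, 1],
          [0, 0, 1]]     # columns {0} <= {0,1} <= {0,1,2}
not_nested = [[1, 0],
              [0, 1]]    # columns {0} and {1} are incomparable
```

A matrix is then k-nested when its columns can be partitioned into k blocks each passing this check, which is where the hardness under noise comes in.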
【Paper Link】 【Pages】:247-258
【Authors】: Zhengzheng Xing ; Jian Pei ; Philip S. Yu ; Ke Wang
【Abstract】: Early classification of time series data has been found highly useful in several important applications, such as medical and health informatics, industrial production management, and safety and security management. While some classifiers have been proposed to achieve good earliness in classification, the interpretability of early classification remains largely an open problem. Without interpretable features, application domain experts such as medical doctors may be reluctant to adopt early classification. In this paper, we tackle the problem of extracting interpretable features from time series for early classification. Specifically, we advocate local shapelets as features: segments of time series that remain in the same space as the input data and thus are highly interpretable. We extract local shapelets that distinctly manifest a target class locally and early, so that they are effective for early classification. Our experimental results on seven benchmark real data sets clearly show that the local shapelets extracted by our methods are highly interpretable and can achieve effective early classification.
【Keywords】:
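The sense in which a shapelet enables *early* classification can be sketched with a left-to-right scan that fires as soon as the shapelet matches within a distance threshold. This illustrates only the matching step, not the paper's shapelet extraction:

```python
def earliest_match(series, shapelet, threshold):
    """Scan the series left to right and return the first position
    where the shapelet matches within `threshold` Euclidean distance,
    i.e. the earliest time step at which a decision could be made.
    Returns None when the shapelet never matches.
    """
    m = len(shapelet)
    for t in range(len(series) - m + 1):
        dist = sum((series[t + i] - shapelet[i]) ** 2
                   for i in range(m)) ** 0.5
        if dist <= threshold:
            return t
    return None

series = [0.0, 0.0, 5.0, 6.0, 5.0, 0.0, 0.0]
shapelet = [5.0, 6.0, 5.0]
pos = earliest_match(series, shapelet, threshold=0.5)
```

Because the shapelet is itself a segment in the original data space, the matched positions can be shown to a domain expert as the evidence behind the early decision.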
【Paper Link】 【Pages】:259-270
【Authors】: Hao Shao ; Einoshin Suzuki
【Abstract】: This paper proposes an Extended Minimum Description Length Principle (EMDLP) for feature-based inductive transfer learning, in which both the source and the target data sets contain class labels and relevant features are transferred from the source domain to the target one. Despite numerous works on this topic, few have a solid theoretical framework and are parameter-free. Our EMDLP overcomes these flaws and allows us to evaluate the results of transfer learning with the sum of the code lengths of five components: the two corresponding hypotheses, the two data sets encoded with the help of the hypotheses, and the set of transferred features. We design a code book to build the connections between the source and the target tasks. Extensive experiments using both real and artificial data sets show that EMDLP is robust against noise and outperforms state-of-the-art methods in classification accuracy.
【Keywords】:
【Paper Link】 【Pages】:271-282
【Authors】: Jing Peng ; Guna Seetharaman ; Stefan A. Robila ; Aparna S. Varde ; Wei Fan
【Abstract】: Well-known linear discriminant analysis (LDA) based on the Fisher criterion is incapable of dealing with heteroscedasticity in data. However, in many practical applications we often encounter heteroscedastic data, i.e., within-class scatter matrices that cannot be expected to be equal. A technique based on the Chernoff criterion for linear dimensionality reduction has been proposed recently. The technique extends well-known Fisher LDA and is capable of exploiting information about heteroscedasticity in the data. While the Chernoff criterion has been shown to outperform the Fisher criterion, a clear understanding of its exact behavior is lacking. In addition, the criterion, as introduced, is rather complex, making it difficult to clearly state its relationship to other linear dimensionality reduction techniques. In this paper, we show precisely what can be expected from the Chernoff criterion and its relations to the Fisher criterion and the Fukunaga-Koontz transform. Furthermore, we show that a recently proposed decomposition of the data space into four subspaces is incomplete. We provide arguments on how best to enrich the decomposition of the data space in order to account for heteroscedasticity in the data. Finally, we provide experimental results validating our theoretical analysis.
【Keywords】:
【Paper Link】 【Pages】:283-294
【Authors】: Haibing Lu ; Jaideep Vaidya ; Vijayalakshmi Atluri ; Heechang Shin ; Lili Jiang
【Abstract】: Mining discrete patterns in binary data is important for many data analysis tasks, such as data sampling, compression, and clustering. For example, replacing individual records with their patterns would greatly reduce data size and simplify subsequent data analysis tasks. As a straightforward approach, rank-one binary matrix approximation has been actively studied recently for mining discrete patterns from binary data. It factorizes a binary matrix into the product of one binary pattern vector and one binary presence vector, while minimizing the number of mismatched entries. However, this approach suffers from two serious problems. First, if all records are replaced with their respective patterns, noise can account for as much as 50% of the resulting approximate data, because the approach simply assumes that a pattern is present in a record as long as their matching entries outnumber their mismatching entries. Second, the two error types, 1-becoming-0 and 0-becoming-1, are treated equally, while in many application domains they should be distinguished. To address these two issues, we propose weighted rank-one binary matrix approximation. It enables a tradeoff between accuracy and succinctness in the approximate data and allows users to impose their own preferences on the importance of the different error types. The decision problem, however, is NP-complete, as proved in the paper. To solve it, several different mathematical programming formulations are provided, from which 2-approximation algorithms are derived for some special cases. An adaptive tabu search heuristic is presented for solving the general problem, and our experimental study shows the effectiveness of the heuristic.
【Keywords】:
【Paper Link】 【Pages】:295-306
【Authors】: Vineet Chaoji ; Geng Li ; Hilmi Yildirim ; Mohammed Javeed Zaki
【Abstract】: A wide variety of clustering algorithms exist that cater to applications based on certain special characteristics of the data. Our focus is on methods that capture arbitrarily shaped clusters in data, the so-called spatial clustering algorithms. With the growing size of spatial datasets from diverse sources, the need for scalable algorithms is paramount. We propose a shape-based clustering algorithm, ABACUS, that scales to large datasets. ABACUS is based on the idea of identifying the intrinsic structure of each cluster, which we also refer to as the backbone of that cluster. The backbone comprises a much smaller set of points, thus giving the method the desired ability to scale to larger datasets. ABACUS operates in two stages. In the first stage, we identify the backbone of each cluster via an iterative process made up of globbing (or point merging) and point movement operations. The backbone enables easy identification of the true clusters in a subsequent stage. Experiments on a range of real (images from geospatial satellites, etc.) and synthetic datasets demonstrate the efficiency and effectiveness of our approach. In particular, ABACUS is over an order of magnitude faster than existing shape-based clustering methods, while providing comparable or better clustering quality.
【Keywords】:
【Paper Link】 【Pages】:307-318
【Authors】: Parasaran Raman ; Jeff M. Phillips ; Suresh Venkatasubramanian
【Abstract】: This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques.
【Keywords】:
【Paper Link】 【Pages】:319-330
【Authors】: Claudia Plant
【Abstract】: Clustering is one of the most fundamental challenges in data mining. We identified three core problems that turn finding a natural grouping of a data set into a difficult task: First, clusters may exist in arbitrarily oriented subspaces of various dimensionality (also known as correlation clusters). Secondly, the cluster structure may be hidden by noise and outliers. Finally, the number, size and density of the clusters are usually unknown, which makes the parametrization of existing approaches very difficult. In this paper, we address these three problems by combining ideas from information theory and blind signal source separation. Our algorithm is inspired by the idea of an active sonar that reveals hidden objects by sending echo pings with various frequencies and from different directions. Analogously, our algorithm SONAR very efficiently generates primitive pre-clusters and considers exactly these pre-clusters as echo pings. Each echo of a ping is a mixture of the signals of the true clusters. Independent component analysis (ICA) allows us to decompose the mixed signals into statistically independent response patterns. We combine the idea of signal de-mixing with the Minimum Description Length (MDL) principle to allow an outlier-robust and parameter-free detection of the true clusters. Extensive experiments demonstrate the following assets of SONAR: outlier-robust detection of correlation clusters of various density and subspace orientation, requiring no difficult input parameters, and scalability to large data sets.
【Keywords】:
【Paper Link】 【Pages】:331-342
【Authors】: Pu Wang ; Kathryn B. Laskey ; Carlotta Domeniconi ; Michael I. Jordan
【Abstract】: A nonparametric Bayesian approach to co-clustering ensembles is presented. Similar to clustering ensembles, co-clustering ensembles combine various base co-clustering results to obtain a more robust consensus co-clustering. To avoid pre-specifying the number of co-clusters, we specify independent Dirichlet process priors for the row and column clusters. Thus, the numbers of row- and column-clusters are unbounded a priori; the actual numbers of clusters can be learned a posteriori from observations. Next, to model non-independence of row- and column-clusters, we employ a Mondrian Process as a prior distribution over partitions of the data matrix. As a result, the co-clusters are not restricted to a regular grid partition, but form nested partitions with varying resolutions. The empirical evaluation demonstrates the effectiveness of nonparametric Bayesian co-clustering ensembles and their advantages over traditional co-clustering methods.
【Keywords】:
【Paper Link】 【Pages】:343-354
【Authors】: Omar Odibat ; Chandan K. Reddy
【Abstract】: The goal of co-clustering is to simultaneously cluster both rows and columns in a given matrix. Motivated by several applications in text mining, recommendation systems and bioinformatics, different methods have been developed to discover local patterns that cannot be identified by traditional clustering algorithms. In spite of much research in this domain, existing co-clustering algorithms have some critical limitations in terms of identifying co-clusters with different types of correlations in the data and the ability to capture overlapping co-clusters in the data matrix. In this paper, we present a new deterministic co-clustering algorithm, POsitive and NEgative correlation based Overlapping Co-Clustering (PONEOCC), which can be used to find significant co-clusters efficiently. Our algorithm uses a novel ranking-based objective function that is optimized to simultaneously find large co-clusters with minimum residual errors. It allows positively and negatively correlated objects to be members of the same co-clusters and can extract overlapping co-clusters. In addition, the co-clusters can be arbitrarily positioned in the data matrix. We evaluated our algorithm on several synthetic and real-world gene expression datasets, and the experimental results showed that PONEOCC is able to find biologically significant co-clusters and also outperformed some of the well-known existing co-clustering algorithms in terms of the quality, size and biological significance of the co-clusters.
【Keywords】:
【Paper Link】 【Pages】:355-366
【Authors】: Charu C. Aggarwal ; Nan Li
【Abstract】: In recent years, a large amount of information has become available online in the form of web documents, social networks, blogs, or other kinds of social entities. Such networks are large, heterogeneous, and often contain a huge number of links. This linkage structure encodes rich structural information about the underlying topical behavior of the network. Such networks are often dynamic and evolve rapidly over time. Much of the work in the literature has focused either on the problem of classification with purely text behavior, or on the problem of classification with purely the linkage behavior of the underlying graph. Furthermore, the work in the literature is mostly designed for static networks. However, a given network may be quite diverse, and the use of either content or structure could be more or less effective in different parts of the network. In this paper, we examine the problem of node classification in dynamic information networks with both text content and links. Our techniques use a random walk approach in conjunction with the content of the network in order to facilitate an effective classification process. This results in an effective approach which is more robust to variations in content and linkage structure. Our approach is dynamic, and can be applied to networks which are updated incrementally. Our results suggest that an approach which is based on a combination of content and links is extremely robust and effective. We present experimental results illustrating the effectiveness and efficiency of our approach.
【Keywords】:
【Paper Link】 【Pages】:367-378
【Authors】: Freddy Chong Tat Chua ; Hady Wirawan Lauw ; Ee-Peng Lim
【Abstract】: Users face a dazzling array of choices on the Web when it comes to choosing which product to buy, which video to watch, etc. The trend of social information processing means users increasingly rely not only on their own preferences, but also on friends when making various adoption decisions. In this paper, we investigate the effects of social correlation on users' adoption of items. Given a user-user social graph and an item-user adoption graph, we seek to answer the following questions: 1) whether the items adopted by a user correlate to items adopted by her friends, and 2) how to incorporate social correlation in order to improve prediction of unobserved item adoptions. We propose the Social Correlation model based on Latent Dirichlet Allocation (LDA) that decomposes the adoption graph into a set of latent factors reflecting user preferences, and a social correlation matrix reflecting the degree of correlation from one user to another. This matrix is learned (rather than pre-assigned), has probabilistic interpretation, and preserves the underlying social network structure. We further devise a Hybrid model that combines a user's own latent factors with her friends' for adoption prediction. Experiments on Epinions and LiveJournal data sets show that our proposed models outperform the approach based on latent factors only (LDA).
【Keywords】:
【Paper Link】 【Pages】:379-390
【Authors】: Wei Chen ; Alex Collins ; Rachel Cummings ; Te Ke ; Zhenming Liu ; David Rincón ; Xiaorui Sun ; Yajun Wang ; Wei Wei ; Yifei Yuan
【Abstract】: Influence maximization, defined by Kempe, Kleinberg, and Tardos (2003), is the problem of finding a small set of seed nodes in a social network that maximizes the spread of influence under certain influence cascade models. In this paper, we propose an extension to the independent cascade model that incorporates the emergence and propagation of negative opinions. The new model has an explicit parameter called the quality factor to model the natural behavior of people turning negative towards a product due to product defects. Our model incorporates negativity bias (negative opinions usually dominate over positive opinions), commonly acknowledged in the social psychology literature. The model maintains some nice properties such as submodularity, which allows a greedy approximation algorithm for maximizing positive influence within a ratio of 1 – 1/e. We define a quality sensitivity ratio (qs-ratio) of influence graphs and show a tight bound on the qs-ratio, where n is the number of nodes in the network and k is the number of seeds selected, which indicates that seed selection is sensitive to the quality factor for general graphs. We design an efficient algorithm to compute influence in tree structures, which is nontrivial due to the negativity bias in the model. We use this algorithm as the core to build a heuristic algorithm for influence maximization for general graphs. Through simulations, we show that our heuristic algorithm matches the influence of a standard greedy approximation algorithm while being orders of magnitude faster.
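The 1 – 1/e guarantee mentioned above comes from greedily maximizing a monotone submodular spread function. The sketch below shows only the generic greedy routine on the plain independent cascade model with Monte Carlo estimation; the paper's quality factor and negative-opinion propagation are not modelled, and the example graph and activation probability are hypothetical:

```python
import random

def simulate_ic(graph, seeds, prob, rng):
    """One Monte Carlo run of the independent cascade model; returns the
    set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < prob:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

def expected_spread(graph, seeds, prob=0.5, runs=200):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(0)   # fixed seed keeps the estimate deterministic
    return sum(len(simulate_ic(graph, seeds, prob, rng))
               for _ in range(runs)) / runs

def greedy_seeds(graph, k, prob=0.5, runs=200):
    """Greedy (1 - 1/e)-approximation for a monotone submodular spread:
    repeatedly add the node with the largest estimated marginal gain."""
    seeds = set()
    for _ in range(k):
        best = max((v for v in graph if v not in seeds),
                   key=lambda v: expected_spread(graph, seeds | {v},
                                                 prob, runs))
        seeds.add(best)
    return seeds
```

The greedy loop is quadratic in evaluations of the spread, which is why the paper replaces the standard greedy algorithm with a faster tree-based heuristic.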
【Keywords】:
【Paper Link】 【Pages】:391-402
【Authors】: Charu C. Aggarwal ; Yan Xie ; Philip S. Yu
【Abstract】: In recent years, the size of many social networks such as Facebook, MySpace, and LinkedIn has exploded at a rapid pace, because the internet makes it convenient to connect geographically disparate users. This has led to considerable interest in many graph-theoretical aspects of social networks, such as the underlying communities, the graph diameter, and other structural information which can be used to mine useful information from the social network. The graph structure of social networks is influenced by the underlying social behavior, which can vary considerably over different groups of individuals. One of the disadvantages of existing schemes is that they attempt to determine global communities, which (implicitly) assume uniform behavior over the network. This is not very well suited to the differences in the underlying density in different regions of the social network. As a result, a global analysis of social community structure can result in either very small communities (in sparse regions), or communities which are too large and incoherent (in dense regions). In order to handle the challenge of local heterogeneity, we will explore a simple property of social networks, which we refer to as the local succinctness property. We will use this property in order to extract compressed descriptions of the underlying community representation of the social network with the use of a min-hash approach. We will show that this approach creates balanced communities across a heterogeneous network in an effective way. We apply the approach to a variety of data sets, and illustrate its effectiveness over competing techniques.
【Keywords】:
【Paper Link】 【Pages】:403-413
【Authors】: Santhanakrishnan Anand ; Rajarathnam Chandramouli ; Koduvayur P. Subbalakshmi
【Abstract】: We study the dynamics of social networks in terms of population growth and control of user behavior. Most of the current research on social networks focuses on static analysis through graph-theoretic models that represent the networks, or on modeling the traffic. Here, we study the cost of collaborative vs. individualistic behavior of users in order to grow their network size in a social network. Each user incurs a cost (monetary or emotional) for collaboration. We formulate the behavior of the users as a non-linear optimization problem with a cost. The objective function of the optimization problem is obtained using a stochastic analysis of population growth in social networks, based on the first-passage time of a birth-death process. The stochastic model is validated by comparison with real data obtained from Twitter. Results indicate that a homogeneous social network (in which users have similar characteristics) will be individualistic. However, heterogeneous social networks (users with different characteristics) exhibit a threshold effect, i.e., there is a minimum cost, below which the network is as collaborative as desired, and a maximum cost, above which the network is as individualistic as required. To the best of our knowledge, this is one of the first analyses of the dynamics of user behavior and temporal population growth in social networks.
【Keywords】:
【Paper Link】 【Pages】:414-425
【Authors】: Nikolaj Tatti
【Abstract】: Items in many datasets can be arranged into a natural order. Such orders are useful since they can provide new knowledge about the data and may ease further data exploration and visualization. Our goal in this paper is to define a statistically well-founded and objective score measuring the quality of an order. Such a measure can be used for determining whether the current order carries any valuable information or whether it can be discarded. Intuitively, we say that the order is good if dependent attributes are close to each other. To define the order score, we fit an order-sensitive model to the dataset. Our model resembles a Markov chain model, that is, the attributes depend only on their immediate neighbors. The score of the order is the BIC score of the best model. For computing the measure we introduce a fast dynamic program. The score is then compared against random orders: if it is better than the scores of the random orders, we say that the order is good. We also show the asymptotic connection between the score function and the number of free parameters of the model. In addition, we introduce a simple greedy approach for finding an order with a good score. We evaluate the score on synthetic and real datasets using different spectral orders and the orders obtained with the greedy method.
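The idea of scoring an order by the BIC of a neighbour-dependent model can be illustrated with a toy chain over binary attributes. This is a simplified stand-in, not the paper's model or its dynamic program: here each binary attribute is modelled conditionally on its immediate left neighbour only, and the function names are hypothetical:

```python
import numpy as np

def _bern_ll(x, p):
    """Log-likelihood of binary observations x under Bernoulli(p)."""
    eps = 1e-12
    k = x.sum()
    return k * np.log(p + eps) + (len(x) - k) * np.log(1 - p + eps)

def chain_bic(D, order):
    """BIC-style score of an attribute order: fit a first-order chain in
    which each binary attribute depends only on its left neighbour."""
    D = np.asarray(D)[:, order]
    n, m = D.shape
    ll = _bern_ll(D[:, 0], D[:, 0].mean())   # first attribute: marginal
    params = 1
    for j in range(1, m):
        for v in (0, 1):                     # condition on the neighbour
            mask = D[:, j - 1] == v
            if mask.any():
                ll += _bern_ll(D[mask, j], D[mask, j].mean())
                params += 1
    return ll - 0.5 * params * np.log(n)     # BIC penalty on parameters
```

An order that places dependent attributes next to each other scores higher than one that separates them, matching the intuition stated in the abstract.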
【Keywords】:
【Paper Link】 【Pages】:426-437
【Authors】: Truyen Tran ; Dinh Q. Phung ; Svetha Venkatesh
【Abstract】: Ranking is an important task for handling a large amount of content. Ideally, training data for supervised ranking would include a complete rank of documents (or other objects such as images or videos) for a particular query. However, this is only possible for small sets of documents. In practice, one often resorts to document rating, in that a subset of documents is assigned a small number indicating the degree of relevance. This poses a general problem of modelling and learning rank data with ties. In this paper, we propose a probabilistic generative model that models the process as permutations over partitions. This results in a super-exponential combinatorial state space with an unknown number of partitions and unknown ordering among them. We approach the problem from discrete choice theory, where subsets are chosen in a stagewise manner, significantly reducing the state space at each stage. Further, we show that with suitable parameterisation, we can still learn the models in linear time. We evaluate the proposed models on two application areas: (i) document ranking with data from the recently held Yahoo! challenge, and (ii) collaborative filtering with movie data. The results demonstrate that the models are competitive against well-known rivals.
【Keywords】:
【Paper Link】 【Pages】:438-449
【Authors】: Kanishka Bhaduri ; Kamalika Das ; Chris Giannella
【Abstract】: The problem of monitoring a multivariate linear regression model is relevant in studying the evolving relationship between a set of input variables (features) and one or more dependent target variables. This problem becomes challenging for large scale data in a distributed computing environment when only a subset of instances is available at individual nodes and the local data changes frequently. Data centralization and periodic model recomputation can add high overhead to tasks like anomaly detection in such dynamic settings. Therefore, the goal is to develop techniques for monitoring and updating the model over the union of all nodes' data in a communication-efficient fashion. Correctness guarantees on such techniques are also often highly desirable, especially in safety-critical application scenarios. In this paper we develop DReMo, a distributed algorithm with very low resource overhead, for monitoring the quality of a regression model in terms of its coefficient of determination (R2 statistic). When the nodes collectively determine that R2 has dropped below a fixed threshold, the linear regression model is recomputed via a network-wide convergecast and the updated model is broadcast back to all nodes. We show empirically, using both synthetic and real data, that our proposed method is highly communication-efficient and scalable, and also provide theoretical guarantees on correctness.
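The trigger at the heart of this scheme is easy to state in a centralized sketch: track the R2 of the current coefficients and refit only when it falls below a threshold. The distributed, communication-efficient protocol (convergecast and broadcast) is the paper's contribution and is not reproduced here; the function names below are hypothetical:

```python
import numpy as np

def r_squared(X, y, coef):
    """Coefficient of determination R^2 of a fitted linear model."""
    residual = y - X @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def monitor_and_refit(X, y, coef, threshold=0.8):
    """Recompute the least-squares coefficients only when R^2 drops below
    the threshold; otherwise keep the current model untouched."""
    if r_squared(X, y, coef) < threshold:
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

Refitting only on threshold violation is what saves communication in the distributed setting: as long as R2 stays high, no model recomputation is needed.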
【Keywords】:
【Paper Link】 【Pages】:450-461
【Authors】: Ramnath Balasubramanyan ; William W. Cohen
【Abstract】: Identifying latent groups of entities from observed interactions between pairs of entities is a frequently encountered problem in areas like analysis of protein interactions and social networks. We present a model that combines aspects of mixed membership stochastic block models and topic models to improve entity-entity link modeling by jointly modeling links and text about the entities that are linked. We apply the model to two datasets: a protein-protein interaction (PPI) dataset supplemented with a corpus of abstracts of scientific publications annotated with the proteins in the PPI dataset, and an Enron email corpus. The model is evaluated by inspecting induced topics to understand the nature of the data, and by quantitative methods such as functional category prediction of proteins and perplexity, which show improvements when joint modeling is used over baselines that use only link or text information.
【Keywords】:
【Paper Link】 【Pages】:462-473
【Authors】: Oliver Schulte
【Abstract】: Bayes nets (BNs) for relational databases are a major research topic in machine learning and artificial intelligence. When the database exhibits cyclic probabilistic dependencies, measuring the fit of a BN model to relational data with a likelihood function is a challenge [5, 36, 28, 9]. A common approach to difficulties in defining a likelihood function is to employ a pseudo-likelihood; a prominent example is the pseudo likelihood defined for Markov Logic Networks (MLNs). This paper proposes a new pseudo likelihood P∗ for Parametrized Bayes Nets (PBNs) [32] and other relational versions of Bayes nets. The pseudo log-likelihood L∗ = ln(P∗) is similar to the single-table BN log-likelihood, where row counts in the data table are replaced by frequencies in the database. We introduce a new type of semantics based on the concept of random instantiations (groundings) from classic AI research [12, 1]: The measure L∗ is the expected log-likelihood of a random instantiation of the 1st-order variables in the PBN. The standard moralization method for converting a PBN to an MLN provides another interpretation of L∗: the measure is closely related to the log-likelihood and to the pseudo log-likelihood of the moralized PBN. For parameter learning, the L∗-maximizing estimates are the empirical conditional frequencies in the databases. For structure learning, we show that the state of the art learn-and-join method of Khosravi et al. [18] implicitly maximizes the L∗ measure. The measure provides a theoretical foundation for this algorithm, while the algorithm's empirical success provides experimental validation for its usefulness.
【Keywords】:
【Paper Link】 【Pages】:474-485
【Authors】: Xi Chen ; Yanjun Qi ; Bing Bai ; Qihang Lin ; Jaime G. Carbonell
【Abstract】: Latent semantic analysis (LSA), one of the most popular unsupervised dimension reduction tools, has a wide range of applications in text mining and information retrieval. The key idea of LSA is to learn a projection matrix that maps the high-dimensional vector space representations of documents to a lower-dimensional latent space, i.e., the so-called latent topic space. In this paper, we propose a new model called Sparse LSA, which produces a sparse projection matrix via ℓ1 regularization. Compared to traditional LSA, Sparse LSA selects only a small number of relevant words for each topic and hence provides a compact representation of topic-word relationships. Moreover, Sparse LSA is computationally very efficient, with much less memory usage for storing the projection matrix. Furthermore, we propose two important extensions of Sparse LSA: group structured Sparse LSA and non-negative Sparse LSA. We conduct experiments on several benchmark datasets and compare Sparse LSA and its extensions with several widely used methods, e.g., LSA, Sparse Coding and LDA. Empirical results suggest that Sparse LSA achieves performance similar to that of LSA, but is more efficient in projection computation and storage, and also better explains the topic-word relationships.
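A crude way to see what a sparse projection matrix buys: take the usual LSA projection (truncated SVD) and soft-threshold its entries, the proximal step associated with ℓ1 regularization. This is only an illustration of the sparsity effect, not the paper's optimization procedure, and the function name is hypothetical:

```python
import numpy as np

def sparse_lsa_sketch(X, k, threshold=0.2):
    """Illustrative sparse projection: truncated SVD of the document-term
    matrix X followed by soft-thresholding of the k x n projection matrix."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k]                                               # topic-word matrix
    A = np.sign(A) * np.maximum(np.abs(A) - threshold, 0.0)  # soft-threshold
    return X @ A.T, A          # document-topic scores, sparse projection
```

Zeroed entries mean a topic simply ignores those words, which is what makes the topic-word relationships compact and cheap to store.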
【Keywords】:
【Paper Link】 【Pages】:486-497
【Authors】: Xuan Li ; Liang Du ; Yi-Dong Shen
【Abstract】: Update summarization is the task of summarizing a document collection B given that users have already read another document collection A, which has a time stamp prior to that of B. An important and challenging issue in update summarization is that contents in B already covered by A should be excluded from the update summary. In this paper, we propose a graph-based regularization framework, MarginRank, for update summarization. MarginRank extends the cost function of Zhou's Manifold Ranking with suppression terms (suppression of A on B) to fulfil the assumption that users have read A. MarginRank ranks sentences in B such that the top-ranked sentences are most important and at the same time cover contents different from A. Experiments on the benchmark data sets TAC 2008 and 2009 show the effectiveness of the proposed method.
【Keywords】:
【Paper Link】 【Pages】:498-509
【Authors】: Himabindu Lakkaraju ; Chiranjib Bhattacharyya ; Indrajit Bhattacharya ; Srujana Merugu
【Abstract】: Facet-based sentiment analysis involves discovering the latent facets, sentiments and their associations. Traditional facet-based sentiment analysis algorithms typically perform the various tasks in sequence, and fail to take advantage of the mutual reinforcement of the tasks. Additionally, inferring sentiment levels typically requires domain knowledge or human intervention. In this paper, we propose a series of probabilistic models that jointly discover latent facets and sentiment topics, and also order the sentiment topics with respect to a multi-point scale, in a language and domain independent manner. This is achieved by simultaneously capturing both short-range syntactic structure and long range semantic dependencies between the sentiment and facet words. The models further incorporate coherence in reviews, where reviewers dwell on one facet or sentiment level before moving on, for more accurate facet and sentiment discovery. For reviews which are supplemented with ratings, our models automatically order the latent sentiment topics, without requiring seed-words or domain-knowledge. To the best of our knowledge, our work is the first attempt to combine the notions of syntactic and semantic dependencies in the domain of review mining. Further, the concept of facet and sentiment coherence has not been explored earlier either. Extensive experimental results on real world review data show that the proposed models outperform various state of the art baselines for facet-based sentiment analysis.
【Keywords】:
【Paper Link】 【Pages】:510-521
【Authors】: Xia Ning ; Yanjun Qi
【Abstract】: Extracting semantic relations between entities is an important step towards automatic text understanding. In this paper, we propose a novel Semi-supervised Convolution Graph Kernel (SCGK) method for semantic Relation Extraction (RE) from natural language. By encoding English sentences as dependence graphs among words, SCGK computes kernels (similarities) between sentences using a convolution strategy, i.e., calculating similarities over all possible short single paths from two dependence graphs. Furthermore, SCGK adds three semi-supervised strategies in the kernel calculation to incorporate soft-matches between (1) words, (2) grammatical dependencies, and (3) entire sentences, respectively. From a large unannotated corpus, these semi-supervision steps learn to capture contextual semantic patterns of elements in natural sentences, which therefore alleviate the lack of annotated examples in most RE corpora. Through convolutions and multi-level semi-supervisions, SCGK provides a powerful model to encode both syntactic and semantic evidence existing in natural English sentences, which effectively recovers the target relational patterns of interest. We perform extensive experiments on five RE benchmark datasets which aim to identify interaction relations from biomedical literature. Our results demonstrate that SCGK achieves the state-of-the-art performance on the task of semantic relation extraction.
【Keywords】:
【Paper Link】 【Pages】:522-533
【Authors】: Charu C. Aggarwal ; Arijit Khan ; Xifeng Yan
【Abstract】: A central characteristic of social networks is that they facilitate rapid dissemination of information between large groups of individuals. This paper will examine the problem of determining information flow representatives: a small group of authoritative representatives to whom the dissemination of a piece of information leads to the maximum spread. Clearly, information flow is affected by a number of different structural factors such as the node degree, connectivity, intensity of information flow interaction and the global structural behavior of the underlying network. We will propose a stochastic information flow model, and use it to determine the authoritative representatives in the underlying social network. We will first design an accurate RankedReplace algorithm, and then use a Bayes probabilistic model in order to approximate the effectiveness of this algorithm with the use of a fast algorithm. We will examine the results on a number of real social network data sets, and show that the method is more effective than state-of-the-art methods.
【Keywords】:
【Paper Link】 【Pages】:534-545
【Authors】: Francesco Bonchi ; Matthijs van Leeuwen ; Antti Ukkonen
【Abstract】: Motivated by sensor networks, mobility data, biology and life sciences, the area of mining uncertain data has recently received a great deal of attention. While various papers have focused on efficiently mining frequent patterns from uncertain data, the problem of discovering a small set of interesting patterns that provide an accurate and condensed description of a probabilistic database is still unexplored. In this paper we study the problem of discovering characteristic patterns in uncertain data through information theoretic lenses. Adopting the possible worlds interpretation of probabilistic data and a compression scheme based on the MDL principle, we formalize the problem of mining patterns that compress the database well in expectation. Despite its huge search space, we show that this problem can be accurately approximated. In particular, we devise a sequence of three methods where each new method improves the memory requirements by orders of magnitude compared to its predecessor, while giving up only a little in terms of approximation accuracy. We empirically compare our methods on both synthetic data and real data from life science. Results show that from a probabilistic matrix with more than one million rows and columns, we can extract a small set of meaningful patterns that accurately characterize the data distribution of any probable world.
【Keywords】:
【Paper Link】 【Pages】:546-557
【Authors】: Esther Galbrun ; Pauli Miettinen
【Abstract】: Redescription mining is a powerful data analysis tool that is used to find multiple descriptions of the same entities. Consider geographical regions as an example. They can be characterized by the fauna that inhabits them on one hand and by their meteorological conditions on the other hand. Finding such redescriptors, a task known as niche-finding, is of much importance in biology. But current redescription mining methods cannot handle data other than Boolean. This restricts the range of possible applications or makes discretization a prerequisite, entailing a possibly harmful loss of information. In niche-finding, while the fauna can be naturally represented using Boolean presence/absence data, the weather cannot. In this paper, we extend redescription mining to real-valued data using a surprisingly simple and efficient approach. We provide extensive experimental evaluation to study the behaviour of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations in randomization methods.
【Keywords】:
【Paper Link】 【Pages】:558-569
【Authors】: Arno Siebes ; René Kersten
【Abstract】: The ultimate goal of descriptive data mining – in fact of descriptive data analysis in general – is to gain insight into the structure of the data. While the best model may reflect all important structure of D, this is not true for a good model, and algorithms often return only a good model rather than the best. Data sets, however, have many models. Different models of the same data set D highlight different aspects of the structure of D. Hence, it makes sense to consider multiple good models of D. The question is: which good models? In this paper we propose a solution for the case where the data is a transaction database [2] and the models are code tables [8]. More specifically, we introduce a structure function, based on the Minimum Description Length (MDL) principle [6]. This is a partial function from the set of natural numbers to the set of code tables for D; higher natural numbers are mapped to more complex code tables. Computing the structure function exactly is, unfortunately, too complex. Therefore we introduce the heuristic Groei algorithm, which approximates the true structure function. Through experiments we show that Groei produces a set of good code tables that together provide more insight than any of them alone.
【Keywords】:
【Paper Link】 【Pages】:570-581
【Authors】: Kim-Ngan Nguyen ; Loïc Cerf ; Marc Plantevit ; Jean-François Boulicaut
【Abstract】: Popular data mining methods support knowledge discovery from patterns that hold in binary relations. We study the generalization of association rule mining within arbitrary n-ary relations and thus Boolean tensors instead of Boolean matrices. Indeed, many datasets of interest correspond to relations whose number of dimensions is greater than or equal to 3. However, just a few proposals deal with rule discovery when both the head and the body can involve subsets of any dimensions. A challenging problem is to provide a semantics to such generalized rules by means of objective interestingness measures that have to be carefully designed. Therefore, we discuss the need for different generalizations of the classical confidence measure. We also present the first algorithm that computes, in such a general framework, every rule that satisfies both a minimal frequency constraint and minimal confidence constraints. The approach is tested on real datasets (ternary and 4-ary relations). We report on a case study that deals with analyzing a dynamic graph by means of such rules.
【Keywords】:
【Paper Link】 【Pages】:582-593
【Authors】: Mahashweta Das ; Deepak Padmanabhan ; Prasad Deshpande ; Ramakrishnan Kannan
【Abstract】: Association rule mining is an indispensable tool for discovering insights from large databases and data warehouses. The data in a warehouse being multi-dimensional, it is often useful to mine rules over subsets of data defined by selections over the dimensions. Such interactive rule mining over multi-dimensional query windows is difficult since rule mining is computationally expensive. Current methods using pre-computation of frequent itemsets require counting of some itemsets by revisiting the transaction database at query time, which is very expensive. We develop a method (RMW) that identifies the minimal set of itemsets to compute and store for each cell, so that rule mining over any query window may be performed without going back to the transaction database. We give formal proofs that the set of itemsets chosen by RMW is sufficient to answer any query and also prove that it is the optimal set to be computed for 1 dimensional queries. We demonstrate through an extensive empirical evaluation that RMW achieves extremely fast query response time compared to existing methods, with only moderate overhead in pre-computation and storage.
【Keywords】:
【Paper Link】 【Pages】:594-605
【Authors】: Wei Ping ; Ye Xu ; Jianyong Wang ; Xian-Sheng Hua
【Abstract】: Kernel methods are a powerful tool in multi-instance learning. However, many typical kernel methods for multi-instance learning ignore the correspondence information of instances between two bags or co-occurrence information, resulting in poor performance. Additionally, most current multi-instance kernels unreasonably assign all instances in each bag an equal weight, which neglects the significance of some “key” instances in multi-instance learning. Last but not least, almost all multi-instance kernels incur a heavy computational load, which can make them impractical on large datasets. To cope with these shortcomings, we propose a FAst kernel for Multi-instancE leaRning named FAMER. FAMER constructs a Locality-Sensitive Hashing (LSH) based similarity measure for the multi-instance framework, and represents each bag as a histogram by embedding instances within the bag into an auxiliary space, which captures the correspondence information between two bags. By designing a bin-dependent weighting scheme, we not only impose different weights on instances according to their discriminative powers, but also exploit co-occurrence relations according to the joint statistics of instances. Without directly computing in a pairwise manner, the time complexity of FAMER is much lower than that of other typical multi-instance kernels. The experiments demonstrate the effectiveness and efficiency of the proposed method.
【Keywords】:
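The bag-as-histogram representation described in the abstract above is concrete enough to sketch. Below is a small, hypothetical illustration of the general scheme, not FAMER itself: random-hyperplane LSH assigns each instance to a bin, a bag becomes a normalised bin histogram, and two bags are compared by histogram intersection. All names and parameters are our own, and FAMER's bin-dependent weighting is deliberately omitted.

```python
import numpy as np

def lsh_codes(instances, planes):
    # Sign of the projection onto each random hyperplane gives one bit per plane.
    bits = ((instances @ planes.T) >= 0).astype(int)
    # Pack the bits into a single integer bin index per instance.
    return bits @ (1 << np.arange(planes.shape[0]))

def bag_histogram(bag, planes):
    codes = lsh_codes(np.asarray(bag, dtype=float), planes)
    hist = np.bincount(codes, minlength=1 << planes.shape[0]).astype(float)
    return hist / hist.sum()  # normalise so bags of different sizes compare

def histogram_intersection(h1, h2):
    return float(np.minimum(h1, h2).sum())

rng = np.random.default_rng(0)
planes = rng.standard_normal((4, 2))      # 4 hyperplanes -> 16 bins
bag_a = rng.standard_normal((10, 2))
bag_b = bag_a + 0.01                      # nearly identical bag
bag_c = rng.standard_normal((10, 2)) + 5  # bag from a different region
sim_ab = histogram_intersection(bag_histogram(bag_a, planes), bag_histogram(bag_b, planes))
sim_ac = histogram_intersection(bag_histogram(bag_a, planes), bag_histogram(bag_c, planes))
```

With this construction, near-identical bags share most of their histogram mass, while a bag drawn from a distant region of the instance space lands in different bins, so its intersection score is lower.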
【Paper Link】 【Pages】:606-617
【Authors】: James R. Foulds ; Padhraic Smyth
【Abstract】: Multi-instance (MI) learning is a variant of supervised learning where labeled examples consist of bags (i.e. multi-sets) of feature vectors instead of just a single feature vector. Under standard assumptions, MI learning can be understood as a type of semi-supervised learning (SSL). The difference between MI learning and SSL is that positive bag labels provide weak label information for the instances that they contain. MI learning tasks can be approximated as SSL tasks by disregarding this weak label information, allowing the direct application of existing SSL techniques. To give insight into this connection we first introduce multi-instance mixture models (MIMMs), an adaptation of mixture model classifiers for multi-instance data. We show how to learn such models using an Expectation-Maximization algorithm in the case where the instance-level class distributions are members of an exponential family. The cost of the semi-supervised approximation to multi-instance learning is explored, both theoretically and empirically, by analyzing the properties of MIMMs relative to semi-supervised mixture models.
【Keywords】:
【Paper Link】 【Pages】:618-629
【Authors】: Xiangnan Kong ; Xiaoxiao Shi ; Philip S. Yu
【Abstract】: Collective classification in relational data has become an important and active research topic in the last decade, where class labels for a group of linked instances are correlated and need to be predicted simultaneously. Collective classification has a wide variety of real world applications, e.g. hyperlinked document classification, social networks analysis and collaboration networks analysis. Current research on collective classification focuses on single-label settings, which assumes each instance can only be assigned with exactly one label among a finite set of candidate classes. However, in many real-world relational data, each instance can be assigned with a set of multiple labels simultaneously. In this paper, we study the problem of multi-label collective classification and propose a novel solution, called Icml (Iterative Classification of Multiple Labels), to effectively assign a set of multiple labels to each instance in the relational dataset. The proposed Icml model is able to capture the dependencies among the label sets for a group of related instances and the dependencies among the multiple labels within each label set simultaneously. Empirical studies on real-world tasks demonstrate that the proposed multi-label collective classification approach can effectively boost classification performances in multi-label relational datasets.
【Keywords】:
【Paper Link】 【Pages】:630-641
【Authors】: Nicola Barbieri ; Giuseppe Manco ; Ettore Ritacco
【Abstract】: This paper presents a hierarchical probabilistic approach to collaborative filtering which allows the discovery and analysis of both global patterns (i.e., tendency of some products of being ‘universally appreciated’) and local patterns (tendency of users within a community to express a common preference on the same group of items). We reformulate the collaborative filtering approach as a clustering problem in a high-dimensional setting, and propose a probabilistic approach to model the data. The core of our approach is a co-clustering strategy, arranged in a hierarchical fashion: first, user communities are discovered, and then the information provided by each user community is used to discover topics, grouping items into categories. The resulting probabilistic framework can be used for detecting interesting relationships between users and items within user communities. The experimental evaluation shows that the proposed model achieves a competitive prediction accuracy with respect to the state-of-the-art collaborative filtering approaches.
【Keywords】:
【Paper Link】 【Pages】:642-651
【Authors】: Tuyen N. Huynh ; Raymond J. Mooney
【Abstract】: Most of the existing weight-learning algorithms for Markov Logic Networks (MLNs) use batch training which becomes computationally expensive and even infeasible for very large datasets since the training examples may not fit in main memory. To overcome this problem, previous work has used online learning algorithms to learn weights for MLNs. However, this prior work has only applied existing online algorithms, and there is no comprehensive study of online weight learning for MLNs. In this paper, we derive a new online algorithm for structured prediction using the primal-dual framework, apply it to learn weights for MLNs, and compare against existing online algorithms on three large, real-world datasets. The experimental results show that our new algorithm generally achieves better accuracy than existing methods, especially on noisy datasets.
【Keywords】:
【Paper Link】 【Pages】:652-663
【Authors】: Charu C. Aggarwal
【Abstract】: In this paper, we will examine the problem of classification of massive graph streams. The problem of classification has been widely studied in the database and data mining community. The graph domain poses significant challenges because of the structural nature of the data. The stream scenario is even more challenging, and has not been very well studied in the literature. This is because the underlying graphs can be scanned only once in the stream setting, and it is difficult to extract the relevant structural information from the graphs. In many practical applications, the graphs may be defined over a very large set of nodes. The massive domain size of the underlying graph makes it difficult to learn summary structural information for the classification problem, and particularly so in the case of data streams. In order to address these challenges, we propose a probabilistic approach for constructing an “in-memory” summary of the underlying structural data. This summary determines the discriminative patterns in the underlying graph with the use of a 2-dimensional hashing scheme. We provide probabilistic bounds on the quality of the patterns determined by the process, and experimentally demonstrate the quality on a number of real data sets.
【Keywords】:
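To make the 2-dimensional hashing idea above tangible, here is a count-min-flavoured toy of our own construction (not the paper's summary structure): source and destination nodes are hashed independently onto the row and column of several small count tables, and an edge frequency is estimated as the minimum count across tables, which never underestimates the true count.

```python
import random
from collections import defaultdict

class EdgeSketch:
    """Toy 2-D hashing sketch for edge frequencies in a massive-domain
    graph stream: source and destination nodes are hashed independently."""
    def __init__(self, width=64, depth=4, seed=1):
        rng = random.Random(seed)
        self.width = width
        # One (row-seed, col-seed) pair per independent table.
        self.seeds = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(depth)]
        self.tables = [defaultdict(int) for _ in range(depth)]

    def _cell(self, u, v, rs, cs):
        return (hash((rs, u)) % self.width, hash((cs, v)) % self.width)

    def add(self, u, v, count=1):
        for (rs, cs), t in zip(self.seeds, self.tables):
            t[self._cell(u, v, rs, cs)] += count

    def estimate(self, u, v):
        # Point estimate: minimum over tables (an overestimate at worst).
        return min(t[self._cell(u, v, rs, cs)]
                   for (rs, cs), t in zip(self.seeds, self.tables))

sk = EdgeSketch()
for _ in range(50):
    sk.add("a", "b")
sk.add("c", "d")
```

Memory is bounded by `width * width * depth` cells regardless of how many distinct nodes appear, which is the point of sketching a massive node domain.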
【Paper Link】 【Pages】:664-675
【Authors】: Michael Hahsler ; Margaret H. Dunham
【Abstract】: This paper describes one of the first attempts to model the temporal structure of massive data streams in real-time using data stream clustering. Recently, many data stream clustering algorithms have been developed which efficiently find a partition of the data points in a data stream. However, these algorithms disregard the information represented by the temporal order of the data points in the stream which for many applications is an important part of the data stream. In this paper we propose a new framework called Temporal Relationships Among Clusters for Data Streams (TRACDS) which allows us to learn the temporal structure while clustering a data stream. We identify, organize and describe the clustering operations which are used by state-of-the-art data stream clustering algorithms. Then we show that by defining a set of new operations to transform Markov Chains with states representing clusters dynamically, we can efficiently capture temporal ordering information. This framework allows us to preserve temporal relationships among clusters for any state-of-the-art data stream clustering algorithm with only minimal overhead. To investigate the usefulness of TRACDS, we evaluate the improvement of TRACDS over pure data stream clustering for anomaly detection using several synthetic and real-world data sets. The experiments show that TRACDS is able to considerably improve the results even if we introduce a high rate of incorrect time stamps which is typical for real-world data streams.
【Keywords】:
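The core TRACDS idea, maintaining a Markov chain whose states are the clusters while the stream is clustered, can be sketched in miniature. The toy below (our own simplification: a static nearest-centroid clusterer on scalars, no centroid updates or cluster merging) records transition counts between consecutive cluster assignments and scores each arrival by the empirical probability of its transition, so temporally surprising jumps get low scores.

```python
from collections import defaultdict

class TracdsSketch:
    """Toy stream clusterer that also records, TRACDS-style, a Markov
    chain over the clusters: each arrival updates the transition count
    from the previously active cluster to the current one."""
    def __init__(self, radius=1.0):
        self.radius = radius
        self.centroids = []
        self.trans = defaultdict(int)   # (prev_cluster, cur_cluster) -> count
        self.out = defaultdict(int)     # prev_cluster -> total outgoing count
        self.prev = None

    def _assign(self, x):
        # Nearest existing centroid within `radius`, else open a new cluster.
        best, best_d = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = abs(x - c)
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > self.radius:
            self.centroids.append(x)
            return len(self.centroids) - 1
        return best

    def observe(self, x):
        cur = self._assign(x)
        score = 1.0  # the first observation carries no temporal evidence
        if self.prev is not None:
            if self.out[self.prev]:
                score = self.trans[(self.prev, cur)] / self.out[self.prev]
            self.trans[(self.prev, cur)] += 1
            self.out[self.prev] += 1
        self.prev = cur
        return score  # low score = temporally surprising transition

m = TracdsSketch(radius=1.0)
scores = [m.observe(x) for x in [0.0, 0.1, 10.0, 0.2, 10.1, 0.0, 10.2, 0.1, 99.0]]
```

In the demo stream, the first jump between the two regimes and the final jump into an entirely new cluster both score 0.0, while repetitions of an already-seen transition score progressively higher.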
【Paper Link】 【Pages】:676-686
【Authors】: Niko Vuokko ; Petteri Kaski
【Abstract】: Time series are a class of data whose complexity and rich structure make it difficult for data mining tools to extract meaningful patterns from them, and in particular to prune away the false positive patterns. Wavelet-based methods have recently become the preferred way for significance testing of time series and time series collections, but these methods are still often based on fairly ad hoc bootstrapping techniques in the wavelet domain without a disciplined null model analysis. We propose a new well-grounded null model for time series collections that also sets minimum requirements for realistic resampling methods. We compare it to the null models of common resampling methods and introduce a new randomization method that is compatible with the proposed null model. We conduct experiments on real and synthetic datasets to compare the behavior of the various methods and relate the results to the differences in their null models. Compared with the other methods, our experiments suggest that the proposed method gives fewer Type I and Type II errors across a range of statistics.
【Keywords】:
【Paper Link】 【Pages】:687-698
【Authors】: Nuno Castro ; Paulo J. Azevedo
【Abstract】: Time series motif discovery is the task of extracting previously unknown recurrent patterns from time series data. It is an important problem within applications that range from finance to health. Many algorithms have been proposed for the task of efficiently finding motifs. Surprisingly, most of these proposals do not focus on how to evaluate the discovered motifs. They are typically evaluated by human experts. This is unfeasible even for moderately sized datasets, since the number of discovered motifs tends to be prohibitively large. Statistical significance tests are widely used in the bioinformatics and association rule mining communities to evaluate the extracted patterns. In this work we present an approach to calculating the statistical significance of time series motifs. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif's p-value. We estimate the expected frequency of a motif by using Markov Chain models. The p-value is then assessed by comparing the actual frequency to the estimated one using statistical hypothesis tests. Our contribution enables the application of a powerful technique – statistical tests – in the time series setting. This provides researchers and practitioners with an important tool to automatically evaluate the degree of relevance of each extracted motif.
【Keywords】:
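The pipeline the abstract above describes (fit a Markov chain to the symbolic series, derive the motif's expected frequency, then test the observed frequency) can be sketched directly. The sketch below is our own simplification: a first-order chain and a one-sided binomial tail that treats overlapping windows as independent trials, which the paper's test need not do.

```python
import math
from collections import Counter, defaultdict

def markov_motif_pvalue(sequence, motif):
    """Expected frequency of `motif` under a first-order Markov chain
    fitted to `sequence`, and a one-sided binomial p-value for the
    observed count (independence of windows is an approximation)."""
    n = len(sequence)
    # Fit the chain: symbol marginal and transition probabilities.
    trans = defaultdict(Counter)
    for a, b in zip(sequence, sequence[1:]):
        trans[a][b] += 1
    marginal = Counter(sequence)
    # Probability of the motif string under the fitted chain.
    p = marginal[motif[0]] / n
    for a, b in zip(motif, motif[1:]):
        total = sum(trans[a].values())
        p *= trans[a][b] / total if total else 0.0
    windows = n - len(motif) + 1
    observed = sum(sequence[i:i + len(motif)] == motif for i in range(windows))
    expected = windows * p
    # One-sided tail P(X >= observed) of a Binomial(windows, p).
    pvalue = sum(math.comb(windows, k) * p**k * (1 - p)**(windows - k)
                 for k in range(observed, windows + 1))
    return expected, pvalue

expected, pvalue = markov_motif_pvalue("abc" * 30, "abc")
```

Note that a motif the chain itself explains well (like "abc" in a pure "abc" repetition) gets an unremarkable p-value, which is exactly the intended behaviour of testing against a Markov null model.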
【Paper Link】 【Pages】:699-710
【Authors】: Gustavo E. A. P. A. Batista ; Xiaoyue Wang ; Eamonn J. Keogh
【Abstract】: The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While there is a plethora of classification algorithms that can be applied to time series, all of the current empirical evidence suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping. In this work we make a surprising claim. There is an invariance that the community has missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where complex objects are incorrectly assigned to a simpler class. We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangle inequality, thus making use of most existing indexing and data mining algorithms. We evaluate our ideas with the largest and most comprehensive set of time series classification experiments ever attempted, and show that complexity-invariant distance measures can produce improvements in accuracy in the vast majority of cases.
【Keywords】:
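The measure the abstract above introduces is compact enough to sketch. Following the published complexity-invariant distance (CID): Euclidean distance rescaled by the ratio of two complexity estimates, where a series' complexity is estimated as the length of the line obtained by "stretching it out". The zero-complexity guard below is our own addition for flat series.

```python
import math

def complexity(ts):
    # Complexity estimate: length of the series when "stretched out" flat.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ts, ts[1:])))

def euclidean(q, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def cid(q, c):
    # Euclidean distance scaled by the ratio of the two complexity
    # estimates; pairs with very different complexities are pushed apart.
    ce_q, ce_c = complexity(q), complexity(c)
    lo, hi = min(ce_q, ce_c), max(ce_q, ce_c)
    correction = hi / lo if lo > 0 else 1.0  # guard: flat series (our choice)
    return euclidean(q, c) * correction
```

Two equally jagged series keep their plain Euclidean distance, while a jagged/smooth pair is penalised by the complexity ratio, which is what makes complex objects less likely to be absorbed into a simpler class.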
【Paper Link】 【Pages】:711-722
【Authors】: Chun Li ; Charu C. Aggarwal ; Jianyong Wang
【Abstract】: The problem of privacy-preserving data mining has attracted considerable attention in recent years because of increasing concerns about the privacy of the underlying data. In recent years, an important data domain which has emerged is that of graphs and structured data. Many data sets such as XML data, transportation networks, traffic in IP networks, social networks and hierarchically structured data are naturally represented as graphs. Existing work on graph privacy has focussed on the problem of anonymizing nodes or edges of a single graph, in which the identity is assumed to be associated with individual nodes. In this paper, we examine the more complex case, where we have a collection of graphs, and the identity is associated with individual graphs rather than nodes or edges. In such cases, the problem of identity anonymization is extremely difficult, since we need to not only anonymize the labels on the nodes, but also the underlying global structural information. In such cases, both the global and local structural information can be a challenge to the anonymization process, since any combination of such information can be used in order to de-identify the underlying graphs. In order to achieve this goal, we will create synthesized representations of the underlying graphs based on aggregate structural analytics of the collection of graphs. The synthesized graphs retain the properties of the original data while satisfying the k-anonymity requirement. Our experimental results show that the synthesized graphs maintain a high level of structural information and achieve classification accuracies comparable to those of the original data.
【Keywords】:
【Paper Link】 【Pages】:723-734
【Authors】: Frank Eichinger ; Christopher Oßner ; Klemens Böhm
【Abstract】: The localisation of defects in computer programmes is essential in software engineering and is important in domain-specific data mining. Existing techniques which build on call-graph mining localise defects well, but do not scale for large software projects. This paper presents a hierarchical approach with good scalability characteristics. It makes use of novel call-graph representations, frequent subgraph mining and feature selection. It first analyses call graphs of a coarse granularity, before it zooms in on more fine-grained graphs. We evaluate our approach with defects in the Mozilla Rhino project: In our setup, it narrows down the code a developer has to examine to only about 6%.
【Keywords】:
【Paper Link】 【Pages】:735-746
【Authors】: Chih-Chun Chia ; Zeeshan Syed
【Abstract】: Heart disease is the leading cause of death in the United States, claiming over 830,000 lives each year (34% of all deaths or roughly one death every 38 seconds). A similar situation exists in other parts of the world, where an estimated 40% of all deaths in the developing world by the year 2020 are expected to be due to heart disease. The risk of death in these patients can be substantially lowered through the delivery of appropriate treatments (e.g., pharmacological and surgical interventions). However, matching patients to treatments that are appropriate for their risk remains challenging. In this paper, we aim to address this challenge by developing novel computational biomarkers that can be used to risk stratify patients. Our focus is on identifying high risk behavior in large datasets of electrocardiographic (ECG) signals from patients who experienced mortality following coronary attacks and those that remained event free. We frame the problem of finding risk markers as the discovery of approximately conserved heart rate sequences that are significantly overrepresented in either high or low risk patients. We propose a randomized hashing- and greedy centroid selection-based algorithm to efficiently discover such heart rate patterns in large high-resolution ECG datasets captured continuously over long periods from thousands of patients. When evaluated on data from 3,067 patients in two separate cohorts, our pattern discovery algorithm was able to correctly identify patients at high risk of death, even after adjusting for information in existing heart rate-based risk stratification metrics. Moreover, our approach can be easily extended to other clinical and non-clinical applications focused on approximate sequential pattern discovery in massive time-series datasets.
【Keywords】:
【Paper Link】 【Pages】:747-758
【Authors】: Hyungsul Kim ; Manish Marwah ; Martin F. Arlitt ; Geoff Lyon ; Jiawei Han
【Abstract】: Fear of increasing prices and concern about climate change are motivating residential power conservation efforts. We investigate the effectiveness of several unsupervised disaggregation methods on low frequency power measurements collected in real homes. Specifically, we consider variants of the factorial hidden Markov model. Our results indicate that a conditional factorial hidden semi-Markov model, which integrates additional features related to when and how appliances are used in the home and more accurately represents the power use of individual appliances, outperforms the other unsupervised disaggregation methods. Our results show that unsupervised techniques can provide per-appliance power usage information in a non-invasive manner, which is ideal for enabling power conservation efforts.
【Keywords】:
【Paper Link】 【Pages】:759-770
【Authors】: Yasushi Sakurai ; Lei Li ; Yasuko Matsubara ; Christos Faloutsos
【Abstract】: Given a large stream of users clicking on web sites, how can we find trends, patterns and anomalies? We have developed a novel method, WindMine, and its fine-tuning sibling, WindMine-part, to find patterns and anomalies in such datasets. Our approach has the following advantages: (a) it is effective in discovering meaningful “building blocks” and patterns such as the lunch-break trend and anomalies, (b) it automatically determines suitable window sizes, and (c) it is fast, with its wall clock time linear in the duration of the sequences. Moreover, it can be made sub-quadratic in the number of sequences (WindMine-part), with little loss of accuracy. We examine the effectiveness and scalability by performing experiments on 67 GB of real data (one billion clicks for 30 days). Our proposed WindMine does produce concise, informative and interesting patterns. We also show that WindMine-part can be easily implemented in a parallel or distributed setting, and that, even in a single-machine setting, it can be an order of magnitude faster (up to 70 times) than the plain version.
【Keywords】:
【Paper Link】 【Pages】:771-782
【Authors】: Yunlong He ; Renato D. C. Monteiro ; Haesun Park
【Abstract】: Sparse principal component analysis (PCA) imposes extra constraints or penalty terms to the standard PCA to achieve sparsity. In this paper, we first introduce an efficient algorithm for finding a single sparse principal component (PC) with a specified cardinality. Experiments on synthetic data, randomly generated data and real-world data sets show that our algorithm is very fast, especially on large and sparse data sets, while the numerical quality of the solution is comparable to the state-of-the-art algorithm. Moreover, combining our algorithm for computing a single sparse PC with the Schur complement deflation scheme, we develop an algorithm which sequentially computes multiple PCs by greedily maximizing the adjusted variance explained by them. On the other hand, to address the difficulty of choosing the proper sparsity and parameter in various sparse PCA algorithms, we propose a new PCA formulation whose aim is to minimize the sparsity of the PCs while requiring that their relative adjusted variance is larger than a given fraction. We also show that a slight modification of the aforementioned multiple component PCA algorithm can also find sharp solutions of the latter formulation.
【Keywords】:
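The single-component subproblem from the abstract above, one sparse principal component with a specified cardinality, can be illustrated with truncated power iteration. This is a generic scheme for cardinality-constrained sparse PCA, not necessarily the algorithm of this paper; names and parameters are our own.

```python
import numpy as np

def sparse_pc(data, k, iters=200, seed=0):
    """One sparse principal component with cardinality k via truncated
    power iteration (a generic sketch, not the paper's algorithm)."""
    X = data - data.mean(axis=0)
    S = X.T @ X                      # covariance, up to scaling
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = S @ v
        # Keep only the k largest-magnitude loadings, zero the rest.
        idx = np.argsort(np.abs(v))[:-k]
        v[idx] = 0.0
        v /= np.linalg.norm(v)
    return v

# Demo: only the first two of six features carry the latent signal.
rng = np.random.default_rng(1)
t = rng.standard_normal(200)
X = np.outer(t, [3.0, 3.0, 0.0, 0.0, 0.0, 0.0]) + 0.01 * rng.standard_normal((200, 6))
v = sparse_pc(X, k=2)
```

Subsequent components would then be obtained by deflating the covariance (the abstract's Schur complement deflation) and repeating, which the sketch omits.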
【Paper Link】 【Pages】:783-794
【Authors】: Bin Tong ; Junbin Gao ; Thach Huy Nguyen ; Einoshin Suzuki
【Abstract】: Dimensionality reduction has been considered as one of the most significant tools for data analysis. In general, supervised information is helpful for dimensionality reduction. However, in typical real applications, supervised information in multiple source tasks may be available, while the data of the target task are unlabeled. An interesting problem of how to guide the dimensionality reduction for the unlabeled target data by exploiting useful knowledge, such as label information, from multiple source tasks arises in such a scenario. In this paper, we propose a new method for dimensionality reduction in the transfer learning setting. Unlike traditional paradigms where the useful knowledge from multiple source tasks is transferred through distance metric, our proposal first converts the dimensionality reduction problem into integral regression problems in parallel. A Gaussian process is then employed to learn the underlying relationship between the original data and the reduced data. Such a relationship can be appropriately transferred to the target task by exploiting the prediction ability of the Gaussian process model and inventing different kinds of regularizers. Extensive experiments on both synthetic and real data sets show the effectiveness of our method.
【Keywords】:
【Paper Link】 【Pages】:795-803
【Authors】: Evgeni Tsivtsivadze ; Josef Urban ; Herman Geuvers ; Tom Heskes
【Abstract】: Learning reasoning techniques from previous knowledge is a largely underdeveloped area of automated reasoning. As large bodies of formal knowledge are becoming available to automated reasoners, state-of-the-art machine learning methods can provide powerful heuristics for problem-specific detection of relevant knowledge contained in the libraries. In this paper we develop a semantic graph kernel suitable for learning in structured mathematical domains. Our kernel incorporates contextual information about the features and unlike “random walk”-based graph kernels it is also applicable to sparse graphs. We evaluate the proposed semantic graph kernel on a subset of the large formal Mizar mathematical library. Our empirical evaluation demonstrates that graph kernels in general are particularly suitable for the automated reasoning domain and that in many cases our semantic graph kernel leads to improvement in performance compared to linear, Gaussian, latent semantic, and geometric graph kernels.
【Keywords】:
【Paper Link】 【Pages】:804-815
【Authors】: Koen Smets ; Jilles Vreeken
【Abstract】: In many situations there exists an abundance of positive examples, but only a handful of negatives. In this paper we show how in binary or transaction data such rare cases can be identified and characterised. Our approach uses the Minimum Description Length principle to decide whether an instance is drawn from the training distribution or not. By using frequent itemsets to construct this compressor, we can easily and thoroughly characterise the decisions, and explain what changes in an example would lead to a different verdict. Furthermore, we give a technique through which, given only a few negative examples, the decision landscape and optimal boundary can be predicted—making the approach parameter-free. Experimentation on benchmark and real data shows our method provides very high classification accuracy, thorough and insightful characterisation of decisions, predicts the decision landscape reliably, and can pinpoint observation errors. Moreover, a case study on real MCADD data shows we provide an interpretable approach with state-of-the-art performance for screening newborn babies for rare diseases.
【Keywords】:
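The MDL decision rule in the abstract above, an instance is "normal" if it compresses well against the training distribution, can be illustrated with any off-the-shelf compressor. The paper builds its compressor from frequent itemsets via MDL code tables; in this sketch, general-purpose zlib merely stands in, and the data and names are our own.

```python
import zlib

def compressed_len(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def mdl_outlier_score(train_rows, row):
    """Extra bytes needed to encode `row` on top of the training data.
    A high cost means the row compresses badly against the 'normal'
    distribution, i.e. it looks like a rare case."""
    corpus = b"\n".join(train_rows)
    return compressed_len(corpus + b"\n" + row) - compressed_len(corpus)

train = [b"milk bread butter"] * 30 + [b"milk bread jam"] * 30
cost_normal = mdl_outlier_score(train, b"milk bread butter")
cost_rare = mdl_outlier_score(train, b"engine oil filter")
```

Because the compressor's dictionary is inspectable (in the paper, the itemsets that code an instance), the same machinery also explains a verdict: the instance is costly exactly where it deviates from the patterns the compressor learned.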
【Paper Link】 【Pages】:816-827
【Authors】: Santanu Das ; Nikunj C. Oza
【Abstract】: In this paper we propose an innovative learning algorithm – a variation of the one-class ν Support Vector Machine (SVM) learning algorithm – that produces sparser solutions with a much reduced computational complexity. The proposed technique returns an approximate solution, nearly as good as the solution obtained by the classical approach, by minimizing the original risk function along with a regularization term. We introduce a bi-criterion optimization that helps guide the search towards the optimal set in reduced time in comparison to the classical approach. The outcome of the proposed learning technique was compared with the benchmark one-class Support Vector Machine algorithm, which more often leads to solutions with redundant support vectors. Throughout the analysis, the problem size for both optimization routines was kept consistent. We have tested the proposed algorithm on a variety of data sources under different conditions to demonstrate its effectiveness. In all cases the proposed algorithm closely preserves the accuracy of standard one-class ν SVMs while reducing both training time and test time by several factors.
【Keywords】:
【Paper Link】 【Pages】:828-838
【Authors】: Pratik Jawanpuria ; Jagarlapudi Saketha Nath
【Abstract】: This paper presents two novel formulations for learning shared feature representations across multiple tasks. The idea is to pose the problem as that of learning a shared kernel, which is constructed from a given set of base kernels, leading to improved generalization in all the tasks. The first formulation employs a (l1, lp), p ≥ 2 mixed norm regularizer promoting sparse combinations of the base kernels and unequal weightings across tasks—enabling the formulation to work with unequally reliable tasks. While this convex formulation can be solved using a suitable mirror-descent algorithm, it may not learn shared feature representations which are sparse. The second formulation extends these ideas for learning sparse feature representations constructed from multiple base kernels and shared across multiple tasks. The sparse feature representation learnt by this formulation is essentially a direct product of low-dimensional subspaces lying in the induced feature spaces of few base kernels. The formulation is posed as a (l1, lq), q ≥ 1 mixed Schatten-norm regularized problem. One main contribution of this paper is a novel mirror-descent based algorithm for solving this problem which is not a standard set-up studied in the optimization literature. The proposed formulations can also be understood as generalizations of the framework of multiple kernel learning to the case of multiple tasks and hence are suitable for various learning applications. Simulation results on real-world datasets show that the proposed formulations generalize better than state-of-the-art. The results also illustrate the efficacy of the proposed mirror-descent based algorithms.
【Keywords】:
【Paper Link】 【Pages】:839-850
【Authors】: Shivani Agarwal
【Abstract】: Ranking problems have become increasingly important in machine learning and data mining in recent years, with applications ranging from information retrieval and recommender systems to computational biology and drug discovery. In this paper, we describe a new ranking algorithm that directly maximizes the number of relevant objects retrieved at the absolute top of the list. The algorithm is a support vector style algorithm, but due to the different objective, it no longer leads to a quadratic programming problem. Instead, the dual optimization problem involves l1,∞ constraints; we solve this dual problem using the recent l1,∞ projection method of Quattoni et al (2009). Our algorithm can be viewed as an l∞-norm extreme of the lp-norm based algorithm of Rudin (2009) (albeit in a support vector setting rather than a boosting setting); thus we refer to the algorithm as the ‘Infinite Push’. Experiments on real-world data sets confirm the algorithm's focus on accuracy at the absolute top of the list.
【Keywords】:
【Paper Link】 【Pages】:851-861
【Authors】: Jun Du ; Charles X. Ling
【Abstract】: We propose and study a new intelligent teaching paradigm called active teaching in this paper. In contrast to active learning, we assume that the learner can only passively conduct inductive learning from the given examples, but the teacher (oracle) can actively provide “good” examples to the learner in order to speed up the teaching (learning) process. We establish the framework with four specific paradigms of active teaching and develop the corresponding teaching strategies. Empirical study shows that the proposed teaching strategies can indeed outperform traditional active learning and basic random sampling strategies, thus making the teaching (learning) process more efficient.
【Keywords】:
【Paper Link】 【Pages】:862-871
【Authors】: Ling Chen ; Chengqi Zhang
【Abstract】: Semi-supervised learning, which uses a small amount of labeled data in conjunction with a large amount of unlabeled data for training, has recently attracted considerable research attention due to the improvement in learning accuracy it can provide. In this work, we focus on semi-supervised variable weighting for clustering, a critical step, as it is known that interesting clustering structure usually occurs in a subspace defined by a subset of variables. Besides exploiting both labeled and unlabeled data to effectively identify the real importance of variables, our method embeds variable weighting in the process of semi-supervised clustering, rather than calculating variable weights separately, to ensure computational efficiency. Our experiments, carried out on both synthetic and real data, demonstrate that semi-supervised variable weighting significantly improves the clustering accuracy of existing semi-supervised k-means without variable weighting or with unsupervised variable weighting.
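For intuition, the unsupervised variable-weighting baseline mentioned above can be sketched in a W-k-means style (this is our own illustration with made-up data, not the paper's semi-supervised method): cluster assignment alternates with a closed-form weight update that penalises features with large within-cluster dispersion.

```python
import numpy as np

def weighted_kmeans(X, k, beta=2.0, n_iter=20, seed=0):
    """Unsupervised variable-weighted k-means sketch: a feature f with
    large within-cluster dispersion D_f receives a small weight via
    w_f ∝ D_f^(-1/(beta-1)), so noisy dimensions stop dominating the
    distance computation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, k, replace=False)]
    w = np.full(d, 1.0 / d)
    for _ in range(n_iter):
        # assignment under the weighted squared Euclidean distance
        dist = (((X[:, None, :] - centers[None]) ** 2) * w ** beta).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):                       # center update
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
        # per-feature within-cluster dispersion drives the weights
        D = ((X - centers[labels]) ** 2).sum(0) + 1e-12
        w = D ** (-1.0 / (beta - 1))
        w /= w.sum()
    return labels, w

# two informative dimensions plus one pure-noise dimension
rng = np.random.default_rng(1)
A = np.c_[rng.normal(0, 0.1, (50, 2)), rng.uniform(-10, 10, (50, 1))]
B = np.c_[rng.normal(3, 0.1, (50, 2)), rng.uniform(-10, 10, (50, 1))]
labels, w = weighted_kmeans(np.vstack([A, B]), k=2)
print(np.round(w, 4))   # the noisy third feature gets the smallest weight
```

The paper's point of departure is that labeled data can correct such weights when the unlabeled dispersion signal is misleading; the sketch above shows only the unsupervised half.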
【Keywords】:
【Paper Link】 【Pages】:872-883
【Authors】: Vinayaka Pandit ; Sreyash Kenkre ; Arindam Khan
【Abstract】: The problem of ordering a set of entities that contain inherent ties among them arises in many applications. The notion of a “bucket order” has emerged as a popular mechanism of ranking in such settings. A bucket order is an ordered partition of the set of entities into “buckets”. There is a total order on the buckets, but the entities within a bucket are treated as tied. In this paper, we focus on discovering a bucket order from data captured in the form of user preferences. We consider two settings: one in which the discrepancies in the input preferences are “local” (when collected from experts) and another in which the discrepancies could be arbitrary (when collected from a large population). We present a formal model to capture the setting of local discrepancies and consider the following question: “how many experts need to be queried to discover the underlying bucket order on n entities?”. We prove an upper bound of . In the case of arbitrary discrepancies, we model the task as the bucket order problem: discovering the bucket order that best fits the data (captured as pairwise preference statistics). We present a new approach which exploits a connection between the discovery of buckets and the correlation clustering problem. We present an empirical evaluation of our algorithms on real and artificially generated datasets.
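To make the “best fit to pairwise preference statistics” setting concrete, here is a small pivot-based heuristic sketch with made-up data (our own illustration; the paper's actual algorithm works via a reduction to correlation clustering): entities whose preference against a pivot is close to 1/2 fall into the pivot's bucket, and the rest recurse to either side.

```python
def bucket_order(items, P, tau=0.1):
    """Pivot heuristic for bucket-order discovery. P[x][y] is the
    fraction of users preferring x over y (P[x][y] + P[y][x] = 1);
    entities within tau of a 0.5 preference vs. the pivot are tied."""
    if not items:
        return []
    pivot, before, tied, after = items[0], [], [items[0]], []
    for x in items[1:]:
        p = P[x][pivot]
        if p > 0.5 + tau:
            before.append(x)      # x clearly preferred over the pivot
        elif p < 0.5 - tau:
            after.append(x)       # pivot clearly preferred over x
        else:
            tied.append(x)        # no clear preference: same bucket
    return bucket_order(before, P, tau) + [sorted(tied)] + bucket_order(after, P, tau)

# made-up statistics encoding d > {b, c} > a, with b and c tied
items = ['a', 'b', 'c', 'd']
wins = {('d', 'a'): 0.9, ('d', 'b'): 0.8, ('d', 'c'): 0.8,
        ('b', 'a'): 0.85, ('c', 'a'): 0.85, ('b', 'c'): 0.5}
P = {x: {} for x in items}
for (x, y), p in wins.items():
    P[x][y], P[y][x] = p, 1 - p
print(bucket_order(items, P))   # [['d'], ['b', 'c'], ['a']]
```

The threshold tau plays the role of deciding when a pairwise discrepancy counts as a tie rather than a strict preference, which is precisely the modelling question the two settings in the abstract address.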
【Keywords】:
【Paper Link】 【Pages】:884-895
【Authors】: Kewei Tu ; Xixiu Ouyang ; Dingyi Han ; Vasant Honavar
【Abstract】: The biclustering, co-clustering, or subspace clustering problem involves simultaneously grouping the rows and columns of a data matrix to uncover biclusters or sub-matrices of the data matrix that optimize a desired objective function. In coherent biclustering, the objective function contains a coherence measure of the biclusters. We introduce a novel formulation of the coherent biclustering problem and use it to derive two algorithms. The first algorithm is based on loopy message passing; and the second relies on a greedy strategy yielding an algorithm that is significantly faster than the first. A distinguishing feature of these algorithms is that they identify an exemplar or a prototypical member of each bicluster. We note the interference from background elements in biclustering, and offer a means to circumvent such interference using additional regularization. Our experiments with synthetic as well as real-world datasets show that our algorithms are competitive with the current state-of-the-art algorithms for finding coherent biclusters.
【Keywords】:
【Paper Link】 【Pages】:896-907
【Authors】: Rikiya Takahashi
【Abstract】: Computing the global, rather than a local, optimum of a cluster assignment is an important aspect of clustering. Convex clustering is an approach to acquiring the global optimum, assuming some fixed centers and bandwidths of the clusters. The essence of convex clustering is a convex optimization of the mixture weights whose optimum becomes sparse. One limitation of convex clustering has been the computational inefficiency of the Expectation-Maximization algorithm, which requires an extremely large number of iterations to converge. This paper proposes a more efficient optimization algorithm for convex clustering that significantly reduces the required number of iterations. The key ideas in the proposed algorithm are accurate pruning while choosing a pair of kernels and an element-wise Newton-Raphson method for fast convergence of the non-zero mixture weights. The proposed algorithm is further accelerated when incorporating locally adaptive bandwidths of the clusters, which are primarily introduced to improve the predictive capability.
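The flavor of the baseline being accelerated can be seen in a short sketch (generic convex clustering with fixed kernels at the data points and hypothetical data; the paper's contribution is precisely making this kind of iteration converge in far fewer steps):

```python
import numpy as np

def convex_clustering_weights(X, sigma=0.5, n_iter=2000, tol=1e-10):
    """Convex clustering sketch: every data point hosts a fixed
    Gaussian kernel; only the mixture weights w are optimized (a
    convex problem) via the plain EM fixed point. The optimum is
    sparse -- components with surviving weight act as cluster
    centers -- but plain EM needs very many iterations, which is
    the inefficiency the paper targets."""
    n = len(X)
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    F = np.exp(-d2 / (2 * sigma ** 2))               # F[j, i] = f_j(x_i)
    w = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        resp = (w[:, None] * F) / (w @ F)[None, :]   # E-step: responsibilities
        w_new = resp.mean(axis=1)                    # M-step on weights only
        if np.abs(w_new - w).max() < tol:
            return w_new
        w = w_new
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
w = convex_clustering_weights(X)
print((w > 1e-3).sum())   # weight mass concentrates on a subset of the 60 candidates
```

Because only the weights are optimized and the kernels stay fixed, every stationary point is a global optimum; the slow per-iteration progress of the small weights toward zero is what the paper's pruning and element-wise Newton-Raphson steps are designed to short-circuit.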
【Keywords】:
【Paper Link】 【Pages】:908-919
【Authors】: Fei Wang ; Ping Li ; Arnd Christian König
【Abstract】: In recent years, Nonnegative Matrix Factorization (NMF) has received considerable interest from the data mining and information retrieval fields. NMF has been successfully applied in document clustering, image representation, and other domains. This study proposes an online NMF (ONMF) algorithm to efficiently handle very large-scale and/or streaming datasets. Unlike conventional NMF solutions, which require the entire data matrix to reside in memory, our ONMF algorithm proceeds with one data point or one chunk of data points at a time. Experiments with one-pass and multi-pass ONMF on real datasets are presented.
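A minimal sketch of the chunk-at-a-time idea, using multiplicative updates and accumulated sufficient statistics (our assumption for illustration; the paper's actual update rules and convergence treatment may differ):

```python
import numpy as np

def online_nmf(chunks, d, r, inner=50, eps=1e-10, seed=0):
    """ONMF sketch: for each incoming chunk X (d x m), fit its
    coefficient matrix H with W fixed, then refresh the basis W from
    the accumulated statistics A = sum(H H^T), B = sum(X H^T), so the
    full data matrix never needs to reside in memory at once."""
    rng = np.random.default_rng(seed)
    W = rng.random((d, r)) + eps
    A, B = np.zeros((r, r)), np.zeros((d, r))
    for X in chunks:
        H = rng.random((r, X.shape[1])) + eps
        for _ in range(inner):                   # solve H for this chunk
            H *= (W.T @ X) / (W.T @ W @ H + eps)
        A += H @ H.T                             # only O(r^2 + d*r) state kept
        B += X @ H.T
        for _ in range(inner):                   # refresh W from A, B
            W *= B / (W @ A + eps)
    return W

# stream of 30 chunks drawn from an exact nonnegative rank-3 model
rng = np.random.default_rng(1)
W_true = rng.random((20, 3))
stream = [W_true @ rng.random((3, 10)) for _ in range(30)]
W = online_nmf(stream, d=20, r=3)

# sanity check: refit coefficients for the whole stream against W
X = np.hstack(stream)
H = rng.random((3, X.shape[1])) + 1e-10
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-10)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(err)   # small relative reconstruction error
```

The point of the A, B accumulators is that the W update depends on the stream only through r-by-r and d-by-r matrices, so memory is independent of the number of data points seen.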
【Keywords】:
【Paper Link】 【Pages】:920-931
【Abstract】: Consensus clustering has emerged as an important extension of the classical clustering problem. Given a set of input clusterings of a given dataset, consensus clustering aims to find a single final clustering which is a better fit in some sense than the existing clusterings. Generating a single consensus clustering has a significant drawback, however, since the different input clusterings could differ significantly. In this paper, we develop a new framework, called Multiple Consensus Clustering (MCC), to explore multiple clustering views of a given dataset from a set of input clusterings. Instead of generating a single consensus, MCC organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions. A dynamic programming algorithm is proposed to obtain a flat partition from the hierarchical tree using the modularity measure. Multiple consensuses are finally obtained by applying consensus clustering algorithms to each cluster of the partition. Extensive experimental results on 11 real-world data sets and a case study on a Protein-Protein Interaction (PPI) data set demonstrate the effectiveness of our proposed method.
【Keywords】:
【Paper Link】 【Pages】:932-943
【Authors】: Nikola Mueller ; Katrin Haegler ; Junming Shao ; Claudia Plant ; Christian Böhm
【Abstract】: Object similarities are now more and more characterized by connectivity information available in the form of network or graph data. Complex graph data arises in various fields such as e-commerce, social networks, and high-throughput biological analysis. The generated interaction information for objects is often not simply binary but rather associated with interaction strengths, which are in turn represented as edge weights in graphs. The identification of groups of highly connected nodes is an important task and results in valuable knowledge about the data set as a whole. Many popular clustering techniques are designed for vector or unweighted graph data and thus cannot be directly applied to weighted graphs. In this paper, we propose a novel clustering algorithm for weighted graphs, called PaCCo (Parameter-free Clustering by Coding costs), which is based on the Minimum Description Length (MDL) principle in combination with a bisecting k-means strategy. MDL relates the clustering problem to the problem of data compression: a good cluster structure on graphs enables strong graph compression. The compression efficiency depends on the underlying edges which constitute the graph connectivity, and the compression rate serves as a similarity or distance metric for nodes. The MDL principle ensures that our algorithm is parameter-free (it automatically finds the number of clusters) and avoids restrictive assumptions: no prior information on the data is required. We systematically evaluate our clustering approach PaCCo on synthetic as well as real data to demonstrate the superiority of our developed algorithm over existing approaches.
【Keywords】:
【Paper Link】 【Pages】:944-955
【Authors】: Fei Wang ; Jimeng Sun ; Jianying Hu ; Shahram Ebadollahi
【Abstract】: Patient similarity assessment aims at providing a clinically meaningful distance measure for case retrieval in the context of clinical decision intelligence. Two of the key challenges are how to incorporate physician feedback with regard to the retrieval results and how to interactively update the underlying similarity measure based on that feedback. In this paper, we present the interactive Metric learning (iMet) method, which can incrementally adjust the underlying distance metric based on the latest supervision information. iMet is designed to scale linearly with the data set size based on matrix perturbation theory, which allows the derivation of sound theoretical guarantees. We show empirical results demonstrating that iMet outperforms the baseline by three orders of magnitude in speed while obtaining comparable accuracy on several benchmark datasets. We also describe the application of the algorithm in a real-world physician decision support system.
【Keywords】:
【Paper Link】 【Pages】:956-967
【Authors】: Charanpal Dhanjal ; Stéphan Clémençon
【Abstract】: In percolation theory, vertices within a graph have a binary state: either active or inactive. Furthermore, a percolation process decides how activation spreads within the graph. Firstly, we propose and analyse a simple data-driven percolation process in which percolations are preliminarily learnt from a graph with observed percolations. Secondly, we study a problem related to the one solved by Kempe et al. in [1]: given a percolation process, which k vertices should one choose in order to maximise the number of active vertices at the end of the process? This question is important in many areas, ranging from viral marketing to the study of epidemic spread. We generalise the problem by considering activations in [0, 1], measuring the “quality” of percolation, and by letting percolations decay along edges in the percolation graph. For a varying cost of activating each vertex, we maximise the total activation whilst keeping within a budget L. The problem can be solved with a greedy algorithm with a guaranteed approximation quality, and furthermore we show its connection to the maximal coverage problem. The resulting algorithm is analysed empirically over predicted percolation graphs on a synthetic dataset and on a real dataset modelling information diffusion within a social network.
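The greedy step has a compact illustration in the classic budgeted maximum coverage setting (binary activations and made-up reach sets; the paper's version handles [0, 1]-valued activations and decaying percolations):

```python
def greedy_budgeted(candidates, cost, budget):
    """Cost-benefit greedy for budgeted coverage: repeatedly pick the
    affordable seed vertex with the largest marginal-gain-to-cost
    ratio until no seed fits in the remaining budget."""
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for v, reach in candidates.items():
            if v in chosen or spent + cost[v] > budget:
                continue
            gain = len(reach - covered)          # marginal activation gain
            if gain > 0 and gain / cost[v] > best_ratio:
                best, best_ratio = v, gain / cost[v]
        if best is None:
            return chosen, covered
        chosen.append(best)
        covered |= candidates[best]
        spent += cost[best]

# hypothetical reach sets and vertex costs
candidates = {'a': {1, 2, 3, 4}, 'b': {3, 4, 5}, 'c': {6}, 'd': {1, 2, 5, 6}}
cost = {'a': 2.0, 'b': 1.0, 'c': 1.0, 'd': 3.0}
chosen, covered = greedy_budgeted(candidates, cost, budget=3.0)
print(chosen, covered)   # ['b', 'a'] {1, 2, 3, 4, 5}
```

For a constant-factor guarantee in the budgeted setting (Khuller, Moss, and Naor), one additionally compares this greedy solution against the best single affordable seed and returns the better of the two.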
【Keywords】:
【Paper Link】 【Pages】:968-979
【Authors】: Pedro Olmo S. Vaz de Melo ; Christos Faloutsos ; Antonio Alfredo Ferreira Loureiro
【Abstract】: How often do humans communicate with each other? What mechanisms explain how human actions are distributed over time? Here we answer these questions by studying the time interval between calls and SMS messages in an anonymized, large mobile network with 3.1 million users, over 200 million phone calls, and 300 million SMS messages, spanning 70 gigabytes. Our first contribution is the Truncated Autocatalytic Process (TAP) model, which explains the time between communication events (i.e., times between phone initiations) for a single individual. The novelty is that the model is ‘autocatalytic’, in the sense that the parameters of the model change depending on the latest inter-event time: long periods of inactivity in the past result in long periods of inactivity in the future, and vice versa. We show that the TAP model mimics the inter-event times of the users of our dataset extremely well, despite its parsimony and simplicity. Our second contribution is the TAP-classifier, a classification method based on the inter-event times in addition to other features. We show that the inferred sleep intervals and the reciprocity between outgoing and incoming calls are good features for classifying users. Finally, we analyze the network effects of each class of users and find surprising results. Moreover, all of our methods are fast and scale linearly with the number of customers.
【Keywords】:
【Paper Link】 【Pages】:980-991
【Authors】: Zhijun Yin ; Liangliang Cao ; Jiawei Han ; Jiebo Luo ; Thomas S. Huang
【Abstract】: Social media, such as the photos residing in popular photo-sharing websites, have attracted increasing attention in recent years. As a type of user-generated data, such social media embed the wisdom of the crowd. In particular, millions of users upload to Flickr their photos, many associated with temporal and geographical information. In this paper, we investigate how to rank the trajectory patterns mined from the uploaded photos with geotags and timestamps. The main objective is to reveal the collective wisdom recorded in the seemingly isolated photos and the individual travel sequences reflected by the geo-tagged photos. Instead of focusing on mining frequent trajectory patterns from geo-tagged social media, we put more effort into ranking the mined trajectory patterns and diversifying the ranking results. We rank the trajectory patterns by leveraging the relationships among users, locations, and trajectories. We then use an exemplar-based algorithm to diversify the results in order to discover the representative trajectory patterns. We have evaluated the proposed framework on 12 different cities using a Flickr dataset and demonstrated its effectiveness.
【Keywords】:
【Paper Link】 【Pages】:992-1003
【Authors】: Bo Liu ; Yanshan Xiao ; Longbing Cao ; Philip S. Yu
【Abstract】: This paper presents a novel approach to one-class-based uncertain data stream learning. Our proposed approach works in three steps. Firstly, we put forward a local kernel-density-based method to generate a bound score for each instance, which refines the location of the corresponding instance. Secondly, we construct an uncertain one-class classifier by incorporating the generated bound score into a one-class SVM-based learning phase. Thirdly, we devise an ensemble classifier, integrated from uncertain one-class classifiers built on the current and historical chunks, to cope with the concept drift involved in the uncertain data stream environment. Our proposed method explicitly handles the uncertainty of the input data and enhances the ability of one-class learning to reduce sensitivity to noise. Extensive experiments on uncertain data streams demonstrate that our proposed approach achieves better performance and is highly robust to noise in comparison with state-of-the-art one-class learning methods.
【Keywords】:
【Paper Link】 【Pages】:1004-1015
【Authors】: Hoang Thanh Lam ; Toon Calders ; Ninh Pham
【Abstract】: A motif is a pair of non-overlapping subsequences with very similar shapes in a time series. We study the online top-k most similar motif discovery problem. The special case of this problem with k = 1 was investigated in the literature by Mueen and Keogh [2]. We generalize the problem to any k and propose space-efficient algorithms for solving it. We show that our algorithms are optimal in terms of space. In the particular case k = 1, our algorithms achieve better performance, in terms of both space and time consumption, than the algorithm of Mueen and Keogh. We demonstrate our results by both theoretical analysis and extensive experiments with synthetic and real-life data. We also show a possible application of the top-k similar motif discovery problem.
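As a point of reference only (a naive quadratic offline baseline with a made-up series, nothing like the space-optimal online algorithms the paper proposes), the top-k motif problem can be stated in a few lines:

```python
import heapq
import math

def znorm(s):
    """z-normalise a subsequence so motifs match on shape, not scale."""
    m = sum(s) / len(s)
    sd = math.sqrt(sum((x - m) ** 2 for x in s) / len(s)) or 1.0
    return [(x - m) / sd for x in s]

def top_k_motifs(series, w, k):
    """Naive baseline: score every pair of non-overlapping length-w
    subsequences by z-normalised Euclidean distance and return the
    k most similar pairs as (distance, i, j) triples."""
    subs = [znorm(series[i:i + w]) for i in range(len(series) - w + 1)]
    pairs = []
    for i in range(len(subs)):
        for j in range(i + w, len(subs)):        # j >= i + w: no overlap
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(subs[i], subs[j])))
            pairs.append((d, i, j))
    return heapq.nsmallest(k, pairs)

series = [0, 1, 2, 1, 0, 5, 6, 0, 2, 4, 2, 0, 7, 1]
top = top_k_motifs(series, w=5, k=2)
for d, i, j in top:
    print(f"positions {i} and {j}, distance {d:.3f}")
# the hills at positions 0 and 7 are the same shape after z-normalisation
```

The baseline needs all subsequences in memory and quadratic time per window, which is exactly the cost profile the paper's online, space-optimal algorithms avoid.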
【Keywords】: