Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013. Austin, Texas, USA. SIAM 【DBLP Link】
【Paper Link】 【Pages】:1-9
【Authors】: Ryan R. Curtin ; Alexander G. Gray ; Parikshit Ram
【Abstract】: The wide applicability of kernels makes the problem of max-kernel search ubiquitous and more general than the usual similarity search in metric spaces. We focus on solving this problem efficiently. We begin by characterizing the inherent hardness of the max-kernel search problem with a novel notion of directional concentration. Following that, we present a method to use an O(n log n) algorithm to index any set of objects (points in RD or abstract objects) directly in the Hilbert space without any explicit feature representations of the objects in this space. We present the first provably O(log n) algorithm for exact max-kernel search using this index. Empirical results for a variety of data sets as well as abstract objects demonstrate up to 4 orders of magnitude speedup in some cases. Extensions for approximate max-kernel search are also presented.
【Keywords】:
【Paper Link】 【Pages】:10-18
【Authors】: Tamara G. Kolda ; Ali Pinar ; C. Seshadhri
【Abstract】: Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of a graph. Some of the most useful graph metrics, especially those measuring social cohesion, are based on triangles. Despite the importance of these triadic measures, associated algorithms can be extremely expensive. We discuss the method of wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state-of-the-art, while providing nearly the accuracy of full enumeration. Our results will enable more wide-scale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several real-world examples.
【Keywords】:
【Paper Link】 【Pages】:19-27
【Authors】: Stéphan Clémençon ; Sylvain Robbiano ; Jessica Tressou
【Abstract】: It is the goal of this paper to extend the Empirical Risk Minimization (ERM) paradigm, from a practical perspective, to the situation where a natural estimate of the risk is of the form of a K-sample U-statistics, as it is the case in the K-partite ranking problem for instance. Indeed, the numerical computation of the empirical risk is hardly feasible if not infeasible, even for moderate samples sizes. Precisely, it involves averaging O(nd1+…+dK) terms, when considering a U-statistic of degrees (d1, …, dK) based on samples of sizes proportional to n. We propose here to consider a drastically simpler Monte-Carlo version of the empirical risk based on O(n) terms solely, which can be viewed as an incomplete generalized U-statistic, and prove that, remarkably, the approximation stage does not damage the ERM procedure and yields a learning rate of order Oℙ(1/√n). Beyond a theoretical analysis guaranteeing the validity of this approach, numerical experiments are displayed for illustrative purpose.
【Keywords】:
【Paper Link】 【Pages】:28-36
【Authors】: Petko Bogdanov ; Christos Faloutsos ; Misael Mongiovì ; Evangelos E. Papalexakis ; Razvan Ranca ; Ambuj K. Singh
【Abstract】: How to spot and summarize anomalies in dynamic networks such as road networks, communication networks and social networks? An anomalous event, such as a traffic accident, a denial of service attack or a chemical spill, can affect several near-by edges and make them behave abnormally, over several consecutive time-ticks. We focus on spotting and summarizing such significant anomalous regions, spanning space (i.e. nearby edges), as well as time. Our first contribution is the problem formulation, namely finding all such Significant Anomalous Regions (SAR). The next contribution is the design of novel algorithms: an expensive, exhaustive algorithm, as well as an efficient approximation, called NETSPOT. Compared to the exhaustive algorithm, NETSPOT is up to one order of magnitude faster in real data, while achieving less than 4% average relative error rate. In synthetic datasets, it is more than 30 times faster and solves large problem instances that are otherwise infeasible. The final contribution is the validation on real data: we demonstrate the utility of NETSPOT for inferring accidents on road networks and detecting patterns of anomalous access to subnetworks of Wikipedia. We also study NETSPOT'S scalability in large social, transportation and synthetic evolving networks, spanning in total up to 50 million edges.
【Keywords】:
【Paper Link】 【Pages】:37-45
【Authors】: Leman Akoglu ; Duen Horng Chau ; Christos Faloutsos ; Nikolaj Tatti ; Hanghang Tong ; Jilles Vreeken
【Abstract】: Suppose we are given a large graph in which, by some external process, a handful of nodes are marked. What can we say about these nodes? Are they close together in the graph? or, if segregated, how many groups do they form? We approach this problem by trying to find sets of simple connection pathways between sets of marked nodes. We formalize the problem in terms of the Minimum Description Length principle: a pathway is simple when we need only few bits to tell which edges to follow, such that we visit all nodes in a group. Then, the best partitioning is the one that requires the least number of bits to describe the paths that visit all the marked nodes. We prove that solving this problem is NP-hard, and introduce DOT2DOT, an efficient algorithm for partitioning marked nodes by finding simple pathways between nodes. Experimentation shows that DOT2DOT correctly groups nodes for which good connection paths can be constructed, while separating distant nodes.
【Keywords】:
【Paper Link】 【Pages】:46-54
【Authors】: Lei Liu ; Prakash Mandayam Comar ; Antonio Nucci ; Sabyasachi Saha ; Pang-Ning Tan
【Abstract】: Real-world data are often riddled with data quality problems such as noise, outliers and missing values, which present significant challenges for supervised learning algorithms to effectively classify them. This paper explores the ill-effects of inapplicable features on the performance of supervised learning algorithms. In particular, we highlight the difference between missing and inapplicable feature values. We argue that the current approaches for dealing with missing values, which are mostly based on single or multiple imputation methods, are insufficient to handle inapplicable features, especially those that are continuous valued. We also illustrate how current tree-based and kernel-based classifiers can be adversely affected by the presence of such features if not handled appropriately. Finally, we propose methods to extend existing tree-based and kernel-based classifiers to deal with the inapplicable continuous-valued features.
【Keywords】:
【Paper Link】 【Pages】:55-63
【Authors】: Jianying Hu ; Yashu Liu ; Jimeng Sun ; Jieping Ye ; Jiayu Zhou
【Abstract】: The patient risk prediction model aims at assessing the risk of a patient in developing a target disease based on his/her health profile. As electronic health records (EHRs) become more prevalent, a large number of features can be constructed in order to characterize patient profiles. This wealth of data provides unprecedented opportunities for data mining researchers to address important biomedical questions. Practical data mining challenges include: How to correctly select and rank those features based on their prediction power? What predictive model performs the best in predicting a target disease using those features? In this paper, we propose top-k stability selection, which generalizes a powerful sparse learning method for feature selection by overcoming its limitation on parameter selection. In particular, our proposed top-k stability selection includes the original stability selection method as a special case given k = 1. Moreover, we show that the top-k stability selection is more robust by utilizing more information from selection probabilities than the original stability selection, and provides stronger theoretical properties. In a large set of real clinical prediction datasets, the top-k stability selection methods outperform many existing feature selection methods including the original stability selection. We also compare three competitive classification methods (SVM, logistic regression and random forest) to demonstrate the effectiveness of selected features by our proposed method in the context of clinical prediction applications. Finally, through several clinical applications on predicting heart failure related symptoms, we show that top-k stability selection can successfully identify important features that are clinically meaningful.
【Keywords】:
【Paper Link】 【Pages】:64-72
【Authors】: Chris H. Q. Ding ; Deguang Kong ; Miao Zhang
【Abstract】: Kernels are similarity functions, and play important roles in machine learning. Traditional kernels are built directly from the feature vectors of data instances xi, xj. However, data could be noisy, and there are missing values or corrupted values in feature vectors. In this paper, we propose a new approach to build kernel - Collective Kernel, especially from noisy data. We also derive an efficient algorithm to solve the L1-norm based optimization. Extensive experiments on face data, hand written characters and image scene datasets show improved performance for clustering and semi-supervised classification tasks on our collective kernel comparing with the traditional gaussian kernel.
【Keywords】:
【Paper Link】 【Pages】:73-81
【Authors】: Ling Chen ; Chunyang Liu ; Chengqi Zhang
【Abstract】: Probabilistic frequent pattern mining over uncertain data has received a great deal of attention recently due to the wide applications of uncertain data. Similar to its counterpart in deterministic databases, however, probabilistic frequent pattern mining suffers from the same problem of generating an exponential number of result patterns. The large number of discovered patterns hinders further evaluation and analysis, and calls for the need to find a small number of representative patterns to approximate all other patterns. This paper formally defines the problem of probabilistic representative frequent pattern (P-RFP) mining, which aims to find the minimal set of patterns with sufficiently high probability to represent all other patterns. The problem's bottleneck turns out to be checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of supports of two patterns. To address the problem, we propose a novel and efficient dynamic programming-based approach. Moreover, we have devised a set of effective optimization strategies to further improve the computation efficiency. Our experimental results demonstrate that the proposed P-RFP mining effectively reduces the size of probabilistic frequent patterns. Our proposed approach not only discovers the set of P-RFPs efficiently, but also restores the frequency probability information of patterns with an error guarantee.
【Keywords】:
【Paper Link】 【Pages】:82-93
【Authors】: Xiangnan Kong ; Ann B. Ragin ; Xue Wang ; Philip S. Yu
【Abstract】: Mining discriminative features for graph data has attracted much attention in recent years due to its important role in constructing graph classifiers, generating graph indices, etc. Most measurement of interestingness of discriminative subgraph features are defined on certain graphs, where the structure of graph objects are certain, and the binary edges within each graph represent the “presence” of linkages among the nodes. In many real-world applications, however, the linkage structure of the graphs is inherently uncertain. Therefore, existing measurements of interestingness based upon certain graphs are unable to capture the structural uncertainty in these applications effectively. In this paper, we study the problem of discriminative subgraph feature selection from uncertain graphs. This problem is challenging and different from conventional subgraph mining problems because both the structure of the graph objects and the discrimination score of each subgraph feature are uncertain. To address these challenges, we propose a novel discriminative subgraph feature selection method, DUG, which can find discriminative subgraph features in uncertain graphs based upon different statistical measures including expectation, median, mode and φ-probability. We first compute the probability distribution of the discrimination scores for each subgraph feature based on dynamic programming. Then a branch-and-bound algorithm is proposed to search for discriminative subgraphs efficiently. Extensive experiments on various neuroimaging applications (i.e., Alzheimers Disease, ADHD and HIV) have been performed to analyze the gain in performance by taking into account structural uncertainties in identifying discriminative subgraph features for graph classification.
【Keywords】:
【Paper Link】 【Pages】:94-102
【Authors】: Carl Dean Meyer ; Shaina Race ; Kevin Valakuzhy
【Abstract】: We use a cluster ensemble to determine the number of clusters, k, in a group of data. A consensus similarity matrix is formed from the ensemble using multiple algorithms and several values for k. A random walk is induced on the graph defined by the consensus matrix and the eigenvalues of the associated transition probability matrix are used to determine the number of clusters. For noisy or high-dimensional data, an iterative technique is presented to refine this consensus matrix in way that encourages a block-diagonal form. It is shown that the resulting consensus matrix is generally superior to existing similarity matrices for this type of spectral analysis.
【Keywords】:
【Paper Link】 【Pages】:103-111
【Authors】: Jaya Kawale ; Daniel Boley
【Abstract】: Constrained spectral clustering is a semi-supervised learning problem that aims at incorporating user-defined constraints in spectral clustering. Typically, there are two kinds of constraints: (i) must-link, and (ii) cannot-link. These constraints represent prior knowledge indicating whether two data objects should be in the same cluster or not; thereby aiding in clustering. In this paper, we propose a novel approach that uses convex subproblems to incorporate constraints in spectral clustering and co-clustering. In comparison to the prior state-of-art approaches, our approach presents a more natural way to incorporate constraints in the spectral methods and allows us to make a trade off between the number of satisfied constraints and the quality of partitions on the original graph. We use an L1 regularizer analogous to LASSO, often used in literature to induce sparsity, in order to control the number of constraints satisfied. Our approach can handle both must-link and cannot-link constraints, unlike a large number of previous approaches that mainly work on the former. Further, our formulation is based on the reduction to a convex subproblem which is relatively easy to solve using existing solvers. We test our proposed approach on real world datasets and show its effectiveness for both spectral clustering and co-clustering over the prior state-of-art.
【Keywords】:
【Paper Link】 【Pages】:112-120
【Authors】: Christian Böhm ; Jing Feng ; Xiao He ; Son T. Mai
【Abstract】: Many clustering algorithms suffer from scalability problems on massive datasets and do not support any user interaction during runtime. To tackle these problems, anytime clustering algorithms are proposed. They produce a fast approximate result which is continuously refined during the further run. Also, they can be stopped or suspended anytime and provide an answer. In this paper, we propose a novel anytime clustering algorithm based on the density-based clustering paradigm. Our algorithm called A-DBSCAN is applicable to very high dimensional databases such as time series, trajectory, medical data, etc. The general idea of our algorithm is to use a sequence of lower-bounding functions (LBs) of the true similarity measure to produce multiple approximate results of the true density-based clusters. A-DBSCAN operates in multiple levels w.r.t. the LBs and is mainly based on two algorithmic schemes: (1) an efficient distance upgrade scheme which restricts distance calculations to core-objects at each level of the LBs; (2) a local re-clustering scheme which restricts update operations to the relevant objects only. Extensive experiments demonstrate that A-DBSCAN acquires very good clustering results at very early stages of execution thus saves a large amount of computational time. Even if it runs to the end, A-DBSCAN is still orders of magnitude faster than DBSCAN.
【Keywords】:
【Paper Link】 【Pages】:121-129
【Authors】: Shuiwang Ji ; Wenlu Zhang ; Rui Zhang
【Abstract】: We consider the mining of hidden block structures from time-varying data using evolutionary co-clustering. Existing methods are based on the spectral learning framework, thus lacking a probabilistic interpretation. To overcome this limitation, we develop a probabilistic model for evolutionary co-clustering in this paper. The proposed model assumes that the observed data are generated via a two-step process that depends on the historic co-clusters, thereby capturing the temporal smoothness in a probabilistically principled manner. We develop an EM algorithm to perform maximum likelihood parameter estimation. An appealing feature of the proposed probabilistic model is that it leads to soft co-clustering assignments naturally. To the best of our knowledge, our work represents the first attempt to perform evolutionary soft co-clustering. We evaluate the proposed method on both synthetic and real data sets. Experimental results show that our method consistently outperforms prior approaches based on spectral method.
【Keywords】:
【Paper Link】 【Pages】:130-138
【Authors】: Duc-Son Pham ; Dinh Q. Phung ; Budhaditya Saha ; Svetha Venkatesh
【Abstract】: We propose in this paper a novel sparse subspace clustering method that regularizes sparse subspace representation by exploiting the structural sharing between tasks and data points via group sparse coding. We derive simple, provably convergent, and computationally efficient algorithms for solving the proposed group formulations. We demonstrate the advantage of the framework on three challenging benchmark datasets ranging from medical record data to image and text clustering and show that they consistently outperforms rival methods.
【Keywords】:
【Paper Link】 【Pages】:139-150
【Authors】: Philip S. Yu ; Yuchen Zhao
【Abstract】: Graph clustering becomes an important problem due to emerging applications involving the web, social networks and bio-informatics. Recently, many such applications generate data in the form of streams. Clustering massive, dynamic graph streams is significantly challenging because of the complex structures of graphs and computational difficulties of continuous data. Meanwhile, a large volume of side information is associated with graphs, which can be of various types. The examples include the properties of users in social network activities, the meta attributes associated with web click graph streams and the location information in mobile communication networks. Such attributes contain extremely useful information and have the potential to improve the clustering process, but are neglected by most recent graph stream mining techniques. In this paper, we define a unified distance measure on both link structures and side attributes for clustering. In addition, we propose a novel optimization framework DMO, which can dynamically optimize the distance metric and make it adapt to the newly received stream data. We further introduce a carefully designed statistics SGS(C) which consume constant storage spaces with the progression of streams. We demonstrate that the statistics maintained are sufficient for the clustering process as well as the distance optimization and can be scalable to massive graphs with side attributes. We will present experiment results to show the advantages of the approach in graph stream clustering with both links and side information over the baselines.
【Keywords】:
【Paper Link】 【Pages】:151-161
【Authors】: Jian-Huang Lai ; Chang-Dong Wang ; Philip S. Yu
【Abstract】: In this paper, we aim to tackle the problem of discovering dynamic communities in weighted graph streams, especially when the underlying social behavior of individuals varies considerably over different graph regions. To tackle this problem, a novel structure termed Local Weighted-Edge-based Pattern (LWEP) Summary is proposed to describe a local homogeneous region. To efficiently compute LWEPs, some statistics need to be maintained according to the principle of preserving maximum weighted neighbor information with limited memory storage. To this end, the proposed approach is divided into online and offline components. During the online phase, we introduce some statistics, termed top-k neighbor lists and top-k candidate lists, to track. The key is to maintain only the top-k neighbors with the largest link weights for each node. To allow for less active neighbors to transition into top-k neighbors, an auxiliary data structure termed top-k candidate list is used to identify emerging active neighbors. The statistics can be efficiently maintained in the online component. In the offline component, these statistics are used at each snapshot to efficiently compute LWEPs. Clustering is then performed to consolidate LWEPs into high level clusters. Finally, mapping is made between clusters of consecutive snapshots to generate temporally smooth communities. Experimental results are presented to illustrate the effectiveness and efficiency of the proposed approach.
【Keywords】:
【Paper Link】 【Pages】:162-170
【Authors】: Christos Faloutsos ; Danai Koutra ; Joshua T. Vogelstein
【Abstract】: How much did a network change since yesterday? How different is the wiring between Bob's brain (a left-handed male) and Alice's brain (a right-handed female)? Graph similarity with known node correspondence, i.e. the detection of changes in the connectivity of graphs, arises in numerous settings. In this work, we formally state the axioms and desired properties of the graph similarity functions, and evaluate when state-of-the-art methods fail to detect crucial connectivity changes in graphs. We propose DeltaCon, a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes (e.g. employees of a company, customers of a mobile carrier). Experiments on various synthetic and real graphs showcase the advantages of our method over existing similarity measures. Finally, we employ DeltaCon to real applications: (a) we classify people to groups of high and low creativity based on their brain connectivity graphs, and (b) do temporal anomaly detection in the who-emails-whom Enron graph.
【Keywords】:
【Paper Link】 【Pages】:171-179
【Authors】: Hong Cheng ; Jihang Ye ; Zhe Zhu
【Abstract】: Location-based social networks have been gaining increasing popularity in recent years. To increase users’ engagement with location-based services, it is important to provide attractive features, one of which is geo-targeted ads and coupons. To make ads and coupon delivery more effective, it is essential to predict the location that is most likely to be visited by a user at the next step. However, an inherent challenge in location prediction is a huge prediction space, with millions of distinct check-in locations as prediction target. In this paper we exploit the check-in category information to model the underlying user movement pattern. We propose a framework which uses a mixed hidden Markov model to predict the category of user activity at the next step and then predict the most likely location given the estimated category distribution. The advantages of modeling the category level include a significantly reduced prediction space and a precise expression of the semantic meaning of user activities. Extensive experimental results show that, with the predicted category distribution, the number of location candidates for prediction is 5.45 times smaller, while the prediction accuracy is 13.21% higher.
【Keywords】:
【Paper Link】 【Pages】:180-188
【Authors】: Li Chen ; Weike Pan
【Abstract】: Collaborative filtering aims to make use of users’ feedbacks to improve the recommendation performance, which has been deployed in various industry recommender systems. Some recent works have switched from exploiting explicit feedbacks of numerical ratings to implicit feedbacks like browsing and shopping records, since such data are more abundant and easier to collect. One fundamental challenge of leveraging implicit feedbacks is the lack of negative feedbacks, because there are only some observed relatively “positive” feedbacks, making it difficult to learn a prediction model. Previous works address this challenge via proposing some pointwise or pairwise preference assumptions on items. However, such assumptions with respect to items may not always hold, for example, a user may dislike a bought item or like an item not bought yet. In this paper, we propose a new and relaxed assumption of pairwise preferences over item-sets, which defines a user's preference on a set of items (item-set) instead of on a single item. The relaxed assumption can give us more accurate pairwise preference relationships. With this assumption, we further develop a general algorithm called CoFiSet (collaborative filtering via learning pairwise preferences over item-sets). Experimental results show that CoFiSet performs better than several state-of-the-art methods on various ranking-oriented evaluation metrics on two real-world data sets. Furthermore, CoFiSet is very efficient as shown by both the time complexity and CPU time.
【Keywords】:
【Paper Link】 【Pages】:189-197
【Authors】: Sanjay Chawla ; Aristides Gionis
【Abstract】: We present a unified approach for simultaneously clustering and discovering outliers in data. Our approach is formalized as a generalization of the k-MEANS problem. We prove that the problem is NP-hard and then present a practical polynomial time algorithm, which is guaranteed to converge to a local optimum. Furthermore we extend our approach to all distance measures that can be expressed in the form of a Bregman divergence. Experiments on synthetic and real datasets demonstrate the effectiveness of our approach and the utility of carrying out both clustering and outlier detection in a concurrent manner. In particular on the famous KDD cup network-intrusion dataset, we were able to increase the precision of the outlier detection task by nearly 100% compared to the classical nearest-neighbor approach.
【Keywords】:
【Paper Link】 【Pages】:198-206
【Authors】: Klemens Böhm ; Fabian Keller ; Emmanuel Müller ; Hoang Vu Nguyen ; Jilles Vreeken
【Abstract】: In many real world applications data is collected in multi-dimensional spaces, with the knowledge hidden in subspaces (i.e., subsets of the dimensions). It is an open research issue to select meaningful subspaces without any prior knowledge about such hidden patterns. Standard approaches, such as pairwise correlation measures, or statistical approaches based on entropy, do not solve this problem; due to their restrictive pairwise analysis and loss of information in discretization they are bound to miss subspaces with potential clusters and outliers. In this paper, we focus on finding subspaces with strong mutual dependency in the selected dimension set. Chosen subspaces should provide a high discrepancy between clusters and outliers and enhance detection of these patterns. To measure this, we propose a novel contrast score that quantifies mutual correlations in subspaces by considering their cumulative distributions—without having to discretize the data. In our experiments, we show that these high contrast subspaces provide enhanced quality in cluster and outlier detection for both synthetic and real world data.
【Keywords】:
【Paper Link】 【Pages】:207-215
【Authors】: Steven C. H. Hoi ; Peilin Zhao
【Abstract】: Although both cost-sensitive classification and online learning have been well studied separately in data mining and machine learning, there was very few comprehensive study of cost-sensitive online classification in literature. In this paper, we formally investigate this problem by directly optimizing cost-sensitive measures for an online classification task. As the first comprehensive study, we propose the Cost-Sensitive Double Updating Online Learning (CSDUOL) algorithms, which explores a recent double updating technique to tackle the online optimization task of cost-sensitive classification by maximizing the weighted sum or minimizing the weighted misclassification cost. We theoretically analyze the cost-sensitive measure bounds of the proposed algorithms, extensively examine their empirical performance for cost-sensitive online classification tasks, and finally demonstrate the application of our technique to solve online anomaly detection tasks.
【Keywords】:
【Paper Link】 【Pages】:216-224
【Authors】: Longbing Cao ; Jinjiu Li ; Can Wang ; Philip S. Yu
【Abstract】: Rule-based anomaly and fraud detection systems often suffer from massive false alerts against a huge number of enterprise transactions. A crucial and challenging problem is to effectively select a globally optimal rule set which can capture very rare anomalies dispersed in large-scale background transactions. The existing rule selection methods which suffer significantly from complex rule interactions and overlapping in large imbalanced data, often lead to very high false positive rate. In this paper, we analyze the interactions and relationships between rules and their coverage on transactions, and propose a novel metric, Max Coverage Gain. Max Coverage Gain selects the optimal rule set by evaluating the contribution of each rule in terms of overall performance to cut out those locally significant but globally redundant rules, without any negative impact on the recall. An effective algorithm, MCGminer, is then designed with a series of built-in mechanisms and pruning strategies to handle complex rule interactions and reduce computational complexity towards identifying the globally optimal rule set. Substantial experiments on 13 UCI data sets and a real time online banking transactional database demonstrate that MCGminer achieves significant improvement on both accuracy, scalability, stability and efficiency on large imbalanced data compared to several state-of-the-art rule selection techniques.
【Keywords】:
【Paper Link】 【Pages】:225-233
【Authors】: Ira Assent ; Xuan Hong Dang ; Barbora Micenková ; Raymond T. Ng
【Abstract】: Detecting a small number of outliers from a set of data observations is always challenging. In this paper, we present an approach that exploits space transformation and uses spectral analysis in the newly transformed space for outlier detection. Unlike most existing techniques in the literature which rely on notions of distances or densities, this approach introduces a novel concept based on local quadratic entropy for evaluating the similarity of a data object with its neighbors. This information theoretic quantity is used to regularize the closeness amongst data instances and subsequently benefits the process of mapping data into a usually lower dimensional space. Outliers are then identified by spectral analysis of the eigenspace spanned by the set of leading eigenvectors derived from the mapping procedure. The proposed technique is purely data-driven and imposes no assumptions regarding the data distribution, making it particularly suitable for identification of outliers from irregular, non-convex shaped distributions and from data with diverse, varying densities.
【Keywords】:
【Paper Link】 【Pages】:234-242
【Authors】: Ian Davidson ; Buyue Qian ; Xiang Wang ; Jieping Ye
【Abstract】: Traditionally, spectral clustering is limited to a single objective: finding the normalized min-cut of a single graph. However, many real-world datasets, such as scientific data (fMRI scans of different individuals), social data (different types of connections between people), web data (multi-type data), are generated from multiple heterogeneous sources. How to optimally combine knowledge from multiple sources to improve spectral clustering remains a developing area. Previous work on multi-view clustering formulated the problem as a single objective function to optimize, typically by combining the views under a compatibility assumption and requiring the users to decide the importance of each view a priori. In this work, we propose a multi-objective formulation and show how to solve it using Pareto optimization. The Pareto frontier captures all possible good cuts without requiring the users to set the “correct” parameter. The effectiveness of our approach is justified by both theoretical analysis and empirical results. We also demonstrate a novel application of our approach: resting-state fMRI analysis.
【Keywords】:
【Paper Link】 【Pages】:243-251
【Authors】: Ben Tan ; Evan Wei Xiang ; Qiang Yang ; ErHeng Zhong
【Abstract】: Transfer learning, which aims to help the learning task in a target domain by leveraging knowledge from auxiliary domains, has been demonstrated to be effective in different applications, e.g., text mining, sentiment analysis, etc. In addition, in many real-world applications, auxiliary data are described from multiple perspectives and usually carried by multiple sources. For example, to help classify videos on Youtube, which include three views/perspectives: image, voice and subtitles, one may borrow data from Flickr, Last.FM and Google News. Although any single instance in these domains can only cover a part of the views available on Youtube, actually the piece of information carried by them may compensate with each other. In this paper, we define this transfer learning problem as Transfer Learning with Multiple Views and Multiple Sources. As different sources may have different probability distributions and different views may be compensate or inconsistent with each other, merging all data in a simplistic manner will not give optimal result. Thus, we propose a novel algorithm to leverage knowledge from different views and sources collaboratively, by letting different views from different sources complement each other through a co-training style framework, while revise the distribution differences in different domains. We conduct empirical studies on several real-world datasets to show that the proposed approach can improve the classification accuracy by up to 8% against different state-of-the-art baselines.
【Keywords】:
【Paper Link】 【Pages】:252-260
【Authors】: Jing Gao ; Jiawei Han ; Jialu Liu ; Chi Wang
【Abstract】: Many real-world datasets are comprised of different representations or views which often provide information complementary to each other. To integrate information from multiple views in the unsupervised setting, multi-view clustering algorithms have been developed to cluster multiple views simultaneously to derive a solution which uncovers the common latent structure shared by multiple views. In this paper, we propose a novel NMF-based multi-view clustering algorithm by searching for a factorization that gives compatible clustering solutions across multiple views. The key idea is to formulate a joint matrix factorization process with the constraint that pushes clustering solution of each view towards a common consensus instead of fixing it directly. The main challenge is how to keep clustering solutions across different views meaningful and comparable. To tackle this challenge, we design a novel and effective normalization strategy inspired by the connection between NMF and PLSA. Experimental results on synthetic and several real datasets demonstrate the effectiveness of our approach.
【Keywords】:
【Paper Link】 【Pages】:261-269
【Authors】: Jing Gao ; Liang Ge ; Kang Li ; Hung Q. Ngo ; Aidong Zhang
【Abstract】: Transfer learning has benefited many real-world applications where labeled data are abundant in source domains but scarce in the target domain. As there are usually multiple relevant domains where knowledge can be transferred, multiple source transfer learning (MSTL) has recently attracted much attention. However, we are facing two major challenges when applying MSTL. First, without knowledge about the difference between source and target domains, negative transfer occurs when knowledge is transferred from highly irrelevant sources. Second, existence of imbalanced distributions in classes, where examples in one class dominate, can lead to improper judgement on the source domains’ relevance to the target task. Since existing MSTL methods are usually designed to transfer from relevant sources with balanced distributions, they will fail in applications where these two challenges persist. In this paper, we propose a novel two-phase framework to effectively transfer knowledge from multiple sources even when there exist irrelevant sources and imbalanced class distributions. First, an effective Supervised Local Weight (SLW) scheme is proposed to assign a proper weight to each source domain's classifier based on its ability of predicting accurately on each local region of the target domain. The second phase then learns a classifier for the target domain by solving an optimization problem which concerns both training error minimization and consistency with weighted predictions gained from source domains. A theoretical analysis shows that as the number of source domains increases, the probability that the proposed approach has an error greater than a bound is becoming exponentially small. Extensive experiments on disease prediction, spam filtering and intrusion detection data sets demonstrate the significant improvement in classification performance gained by the proposed method over existing MSTL approaches.
【Keywords】:
【Paper Link】 【Pages】:270-278
【Authors】: Jiliang Tang ; Xia Hu ; Huiji Gao ; Huan Liu
【Abstract】: The explosive popularity of social media produces mountains of high-dimensional data and the nature of social media also determines that its data is often unla-belled, noisy and partial, presenting new challenges to feature selection. Social media data can be represented by heterogeneous feature spaces in the form of multiple views. In general, multiple views can be complementary and, when used together, can help handle noisy and partial data for any single-view feature selection. These unique challenges and properties motivate us to develop a novel feature selection framework to handle multi-view social media data. In this paper, we investigate how to exploit relations among views to help each other select relevant features, and propose a novel unsupervised feature selection framework, MVFS, for multiview social media data. We systematically evaluate the proposed framework in multi-view datasets from social media websites and the results demonstrate the effectiveness and potential of MVFS.
【Keywords】:
【Paper Link】 【Pages】:279-287
【Authors】: Andrew Horner ; Derek Hao Hu ; Bin Wu ; Qiang Yang ; ErHeng Zhong
【Abstract】: Music emotion recognition (MER) aims to recognize the affective content of a piece of music, which is important for applications such as automatic soundtrack generation and music recommendation. MER is commonly formulated as a supervised learning problem. In practice, except for Pop music, there is little labeled data in most genres. In addition, emotion is genre specific in music and thus the labeled data of Pop music cannot be used for other genres. In this paper, we aim to solve the genre-specific MER problem by exploiting two kinds of auxiliary data: unlabeled songs and social tags. However, using these two kinds of data effectively is a non-trivial task, e.g. tags are noisy and therefore cannot be treated as fully trustworthy. To build an accurate model with the help from the unlabeled songs and noisy tags, we present SMART, which stands for Semi-Supervised Music Affective Emotion Recognition with Social Tagging, combining of a graph-based semi-supervised learning algorithm with a novel tag refinement method. Experiments on the Million Song Dataset show that our proposed approach, trained with only 10 labeled instances, is as accurate as Support Vector Regression trained with 750 labeled songs.
【Keywords】:
【Paper Link】 【Pages】:288-296
【Authors】: Ayan Acharya ; Joydeep Ghosh ; Eduardo R. Hruschka ; Jean-David Ruvini ; Badrul Sarwar
【Abstract】: Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place. This paper describes a Bayesian framework that takes as input class labels from existing classifiers (designed based on labeled data from the source domain), as well as cluster labels from a cluster ensemble operating solely on the target data to be classified, and yields a consensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data. We also show that the proposed framework is privacy-aware and allows performing distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those provided by applying classifier ensembles only.
【Keywords】:
【Paper Link】 【Pages】:297-305
【Authors】: Ian Davidson ; Hongfei Li ; Buyue Qian ; Jun Wang ; Xiang Wang
【Abstract】: This paper investigates learning a ranking function using pairwise constraints in the context of human-machine interaction. As the performance of a learnt ranking model is predominantly determined by the quality and quantity of training data, in this work we explore an active learning to rank approach. Furthermore, since humans may not be able to confidently provide an order for a pair of similar instances we explore two types of pairwise supervision: (i) a set of “strongly” ordered pairs which contains confidently ranked instances, and (ii) a set of “weakly” ordered pairs which consists of similar or closely ranked instances. Our active knowledge injection is performed by querying domain experts on pairwise orderings, where informative pairs are located by considering both local and global uncertainties. Under this active scheme, querying of pairs which are uninformative or outliers instances would not occur. We evaluate the proposed approach on three real world datasets and compare with representative methods. The promising experimental results demonstrate the superior performance of our approach, and validate the effectiveness of actively using pairwise orderings to improve ranking performance.
【Keywords】:
【Paper Link】 【Pages】:306-314
【Authors】: Xia Hu ; Jiliang Tang ; Huiji Gao ; Huan Liu
【Abstract】: Supervised learning, e.g., classification, plays an important role in processing and organizing microblogging data. In microblogging, it is easy to mass vast quantities of unlabeled data, but would be costly to obtain labels, which are essential for supervised learning algorithms. In order to reduce the labeling cost, active learning is an effective way to select representative and informative instances to query for labels for improving the learned model. Different from traditional data in which the instances are assumed to be independent and identically distributed (i.i.d.), instances in microblogging are networked with each other. This presents both opportunities and challenges for applying active learning to microblogging data. Inspired by social correlation theories, we investigate whether social relations can help perform effective active learning on networked data. In this paper, we propose a novel Active learning framework for the classification of Networked Texts in microblogging (ActNeT). In particular, we study how to incorporate network information into text content modeling, and design strategies to select the most representative and informative instances from microblogging for labeling by taking advantage of social network structure. Experimental results on Twitter datasets show the benefit of incorporating network information in active learning and that the proposed framework outperforms existing state-of-the-art methods.
【Keywords】:
【Paper Link】 【Pages】:315-323
【Authors】: Meng Fang ; Jie Yin ; Chengqi Zhang ; Xingquan Zhu
【Abstract】: With the recent explosion of social network applications, active learning has increasingly become an important paradigm for classifying networked data. While existing research has shown promising results by exploiting network properties to improve the active learning performance, they are all based on a static setting where the number and the type of classes underlying the networked data remain stable and unchanged. For most social network applications, the dynamic change of users and their evolving relationships, along with the emergence of new social events, often result in new classes that need to be immediately discovered and labeled for classification. This paper proposes a novel approach called ADLNET for active class discovery and learning with networked data. Our proposed method uses the Dirichlet process defined over class distributions to enable active discovery of new classes, and explicitly models label correlations in the utility function of active learning. Experimental results on two real-world networked data sets demonstrate that our proposed approach outperforms other state-of-the-art methods.
【Keywords】:
【Paper Link】 【Pages】:324-332
【Authors】: Karthik Subbian ; Arindam Banerjee
【Abstract】: There are several Global Climate Models (GCM) reported by various countries to the Intergovernmental Panel on Climate Change (IPCC). Due to the varied nature of the GCM model assumptions, the future projections of the GCMs show high variability which makes it difficult to come up with confident projections into the future. Climate scientists combine these multiple GCM models to minimize the variability and the prediction error. Most of these model combinations are specifically for a location, or at a global scale. They do not consider regional or local smoothing (including the IPCC model). In this paper, we address this problem of combining multiple GCM model outputs with spatial smoothing as an important desired criterion. The problem formulation takes the form of multiple least squares regression for each geographic location with graph Laplacian based smoothing amongst the neighboring locations. Unlike the existing Laplacian regression frameworks, our formulation has both inner and outer products of the coefficient matrix, and has Sylvester equations as its special case. We discuss a few approaches to solve the problem, including a closed-form by solving a large linear system, as well as gradient descent methods which turn out to be more efficient. We establish the superiority of our approach in terms of model accuracy and smoothing compared to several popular baselines on real GCM climate datasets.
【Keywords】:
【Paper Link】 【Pages】:333-341
【Authors】: Zubin Abraham ; Malgorzata Liszewska ; Perdinan ; Pang-Ning Tan ; Julie Winkler ; Shiyuan Zhong
【Abstract】: Regression-based approaches are widely used in climate modeling to capture the relationship between a climate variable of interest and a set of predictor variables. These approaches are often designed to minimize the overall prediction errors. However, some climate modeling applications emphasize more on fitting the distribution properties of the observed data. For example, histogram equalization techniques such as quantile mapping have been successfully used to debias outputs from computer-simulated climate models to obtain more realistic projections of future climate scenarios. In this paper, we show the limitations of current regression-based approaches in terms of preserving the distribution of observed climate data and present a multi-objective regression framework that simultaneously fits the distribution properties and minimizes the prediction error. The framework is highly flexible and can be applied to linear, nonlinear, and conditional quantile models. The paper demonstrates the effectiveness of the framework in modeling the daily minimum and maximum temperature as well as precipitation for climate stations in the Great Lakes region. The framework showed marked improvement over traditional regression-based approaches in all 14 climate stations evaluated.
【Keywords】:
【Paper Link】 【Pages】:342-349
【Authors】: Robert W. Harrison ; Irene T. Weber ; Xiaxia Yu
【Abstract】: HIV rapidly evolves drug resistance in response to antiviral drugs used in AIDS therapy. Estimating the specific resistance of a given strain of HIV to individual drugs from sequence data has important benefits for both the therapy of individual patients and the development of novel drugs. We have developed an accurate classification method based on the sparse representation theory, and demonstrate that this method is highly effective with HIV-1 protease. The protease structure is represented using our newly proposed encoding method based on Delaunay triangulation, and combined with the mutated amino acid sequences of known drug-resistant strains to train a machine-learning algorithm both for classification and regression of drug-resistant mutations. An overall cross-validated classification accuracy of 97% is obtained when trained on a publically available data base of approximately 1.5 × 104 known sequences (Stanford HIV database: http://hivdb.stanford.edu/cgi-bin/GenoPhenoDS.cgi). Resistance to four FDA approved drugs is computed and comparisons with other algorithms demonstrate that our method shows significant improvements in classification accuracy.
【Keywords】:
【Paper Link】 【Pages】:350-358
【Authors】: Wei Fan ; Xiaoxiao Shi ; Philip S. Yu
【Abstract】: Finding the most influential entities as well as conducting causality analysis is an important topic in economics, healthcare, sensor networks, etc. One famous example in economics is the bankruptcy of Lehman Brothers that triggered the 2008 global financial crisis. In recent years, some works were proposed to infer the causal relationships among several entities, and subsequently find the most influential ones ( i.e., shakers). However, most of the previous works assume that the causal relationships and the shakers are static. In other words, they are assumed to be stable over time. This assumption may not necessarily be true, especially when we study volatile entities or long-term time series. In this paper, we propose a dynamic model called “DShaker” to capture the evolving causal relationships and dynamic shakers. The intuition is to model the causality propagation into a graph called “dynamic cascading graph”. We then find the optimal cascading graphs, by maximizing their likelihoods in a non-convex multi-objective optimization formulation. We solve it by mapping into a trace norm minimization problem. Experiments included three datasets in social sciences. The proposed method can effectively capture those entities with increasing impacts, while existing methods missed most of them. For example, in the experiment of studying the banks’ statistics from 1998 to 2007 (before the financial crisis), the proposed method successfully captures Lehman Brothers as one of the most precarious banks in subprime loans.
【Keywords】:
【Paper Link】 【Pages】:359-368
【Authors】: Leon Stenneth ; Philip S. Yu
【Abstract】: Users of mass transit systems such as those of buses and trains normally rely on accurate route maps, stop locations, and service schedules when traveling. If the route map, service schedule, or stop location has errors it can reduce the transit agency's ridership. In this paper, the problem of deriving transit systems by mining raw GPS data is studied. Specifically, we propose and evaluate novel classification features with spatial and temporal clustering techniques that derive bus stop locations, route geometries, and service schedules from GPS data. Subsequently, manual and expensive field visits to record and annotate the initial or updated route geometries, transit stop locations, or service schedules is no longer required by transit agencies. This facilitates a massive reduction in cost for transit agencies. The effectiveness of the proposed algorithms is validated on the third largest public transit system in the United States.
【Keywords】:
【Paper Link】 【Pages】:369-377
【Authors】: Ching-Yung Lin ; Zhen Wen
【Abstract】: Successful enterprises depend on high performing teams consisting of productive individuals, who can effective find valuable information. Predictive methods are highly desired to identify these inter-related, multi-typed entities that can potentially help to improve enterprise performance before they become productive. For such predictive methods, observation of their dynamic behavior in the organizational social networks may provide useful input. In this paper, we propose a novel approach to analyze and rank heterogeneous objects in multi-level networks by their value for improving productivity. Compared to existing approaches that either focus on static factors of productivity or single type of entities (e.g., individuals), our work offers two unique contributions. First, we propose a novel multi-level synchronicity network representation which allows us to exploit the structural characteristics of various entities’ dynamic behavior. Furthermore, based on the synchronicity networks, we propose a novel algorithm for Heterogeneous Multi-level networks Ranking, to simultaneously rank inter-related heterogeneous entities (e.g., topics, individuals and teams) by their value. Our experiments demonstrate that our approach significantly outperforms existing methods in both enterprise organizational social networks and public social media.
【Keywords】:
【Paper Link】 【Pages】:378-386
【Authors】: Hongbo Deng ; Jiawei Han ; Heng Ji ; Hao Li ; Yue Lu ; Hongning Wang
【Abstract】: With the development of social media and social networks, user-generated content, like forums, blogs and comments, are not only getting richer, but also ubiquitously interconnected with many other objects and entities, forming a heterogeneous information network between them. Sentiment analysis on such kinds of data can no longer ignore the information network, since it carries a lot of rich and valuable information, explicitly or implicitly, where some of them can be observed while others are not. In this paper, we propose a novel information network-based framework which can infer hidden similarity and dissimilarity between users by exploring similar and opposite opinions, so as to improve post-level and user-level sentiment classification in the same time. More specifically, we develop a new meta path-based measure for inferring pseudo-friendship as well as dissimilarity between users, and propose a semi-supervised refining model by encoding similarity and dissimilarity from both user-level and post-level relations. We extensively evaluate the proposed approach and compare with several state-of-the-art techniques on two real-world forum datasets. Experimental results show that our proposed model with 10.5% labeled samples can achieve better performance than a traditional supervised model trained on 61.7% data samples.
【Keywords】:
【Paper Link】 【Pages】:387-395
【Authors】: Aristides Gionis ; Evimaria Terzi ; Panayiotis Tsaparas
【Abstract】: The process of opinion formation through synthesis and contrast of different viewpoints has been the subject of many studies in economics and social sciences. Today, this process manifests itself also in online social networks and social media. The key characteristic of successful promotion campaigns is that they take into consideration such opinion-formation dynamics in order to create a overall favorable opinion about a specific information item, such as a person, a product, or an idea. In this paper, we adopt a well-established model for social-opinion dynamics and formalize the campaigndesign problem as the problem of identifying a set of target individuals whose positive opinion about an information item will maximize the overall positive opinion for the item in the social network. We call this problem CAMPAIGN. We study the complexity of the CAMPAIGN problem, and design algorithms for solving it. Our experiments on real data demonstrate the efficiency and practical utility of our algorithms.
【Keywords】:
【Paper Link】 【Pages】:396-404
【Authors】: Bin Liu ; Hui Xiong
【Abstract】: The wide spread use of location based social networks (LBSNs) has enabled the opportunities for better location based services through Point-of-Interest (POI) recommendation. Indeed, the problem of POI recommendation is to provide personalized recommendations of places of interest. Unlike traditional recommendation tasks, POI recommendation is personalized, location-aware, and context depended. In light of this difference, this paper proposes a topic and location aware POI recommender system by exploiting associated textual and context information. Specifically, we first exploit an aggregated latent Dirichlet allocation (LDA) model to learn the interest topics of users and to infer the interest POIs by mining textual information associated with POIs. Then, a Topic and Location-aware probabilistic matrix factorization (TL-PMF) method is proposed for POI recommendation. A unique perspective of TL-PMF is to consider both the extent to which a user interest matches the POI in terms of topic distribution and the word-of-mouth opinions of the POIs. Finally, experiments on real-world LBSNs data show that the proposed recommendation method outperforms state-of-the-art probabilistic latent factor models with a significant margin. Also, we have studied the impact of personalized interest topics and word-of-mouth opinions on POI recommendations.
【Keywords】:
【Paper Link】 【Pages】:405-413
【Authors】: Karthik Subbian ; Charu C. Aggarwal ; Jaideep Srivastava ; Philip S. Yu
【Abstract】: The problem of community detection is a challenging one because of the presence of hubs and noisy links, which tend to create highly imbalanced graph clusters. Often, these resulting clusters are not very intuitive and difficult to interpret. With the growing availability of network information, there is a significant amount of prior knowledge available about the communities in social, communication and several other networks. These community labels may be noisy and very limited, though they do help in community detection. In this paper, we explore the use of such noisy labeled information for finding high quality communities. We will present an adaptive density-based clustering which allows flexible incorporation of prior knowledge in to the community detection process. We use a random walk framework to compute the node densities and the level of supervision regulates the node densities and the quality of resulting density based clusters. Our framework is general enough to produce both overlapping and non-overlapping clusters. We empirically show that even with a tiny amount of supervision, our approach can produce superior communities compared to popular baselines.
【Keywords】:
【Paper Link】 【Pages】:414-422
【Authors】: Ramnath Balasubramanyan ; William W. Cohen
【Abstract】: We present a pseudo-observed variable based regularization technique for latent variable mixed-membership models that provides a mechanism to impose preferences on the characteristics of aggregate functions of latent and observed variables. The regularization framework is used to regularize topic models, which are latent variable mixed membership models for language modeling. In many domains, documents and words often exhibit only a slight degree of mixed-membership behavior that is inadequately modeled by topic models which are overly liberal in permitting mixed-membership behavior. The regularization introduced in the paper is used to control the degree of polysemy of words permitted by topic models and to prefer sparsity in topic distributions of documents in a manner that is much more flexible than permitted by modification of priors. The utility of the regularization in exploiting sentiment-indicative features is evaluated internally using document perplexity and externally by using the models to predict star counts in movie and product reviews based on the content of the reviews. Results of our experiments show that using the regularization to finely control the behavior of topic models leads to better perplexity and lower mean squared error rates in the star-prediction task.
【Keywords】:
【Paper Link】 【Pages】:423-431
【Authors】: Tanuja Ganu ; Sundararajan Sellamanickam ; Shirish Krishnaj Shevade
【Abstract】: We address the problems of sparse multiclass and multi-label classifier design and devise new algorithms using margin based ideas. Many online applications such as image classification or text categorization demand fast inference. State-of-the-art classifiers such as Support Vector Machines (SVM) are not preferred in such applications because of slow inference, which is mainly due to the large number of support vectors required to form the SVM classifier. We propose algorithms which solve primal problems directly by greedily adding the required number of basis functions into the classifier model. Experiments on various real-world data sets demonstrate that the proposed algorithms output significantly smaller number of basis functions, while achieving nearly the same generalization performance as that given by SVM and other state-of-the-art sparse classifiers. This enables the classifiers to perform faster inference, thereby making the proposed algorithms powerful alternatives to existing approaches.
【Keywords】:
【Paper Link】 【Pages】:432-440
【Authors】: Francisco Ocegueda-Hernandez ; Ricardo Vilalta
【Abstract】: The presence of sub-classes within a data sample suggests a class decomposition approach to classification, where each subclass is treated as a new class. Class decomposition can be effected using multiple linear classifiers in an attempt to outperform a single global linear classifier; the goal is to gain in model complexity while keeping error variance low. We describe a study aimed at understanding the conditions behind the success or failure of class decomposition when combined with linear classifiers. We identify two relevant data properties as indicators of the suitability of class decomposition: 1) linear separability; and 2) class overlap. We use well-known data complexity measures to evaluate the presence of these properties in a data sample. Our methodology indicates when to avoid performing class decomposition based on such data properties. In addition we conduct a similar analysis at a more granular level for data samples marked as suitable for class decomposition. This extra analysis shows how to improve in efficiency during class decomposition. From an empirical standpoint, we test our technique on several real-world classification problems; results validate our methodology.
【Keywords】:
【Paper Link】 【Pages】:441-449
【Authors】: Alexander G. Gray ; Hassan A. Kingravi ; Patricio A. Vela
【Abstract】: This paper presents a practical, and theoretically well-founded, approach to improve the speed of kernel manifold learning algorithms relying on spectral decomposition. Utilizing recent insights in kernel smoothing and learning with integral operators, we propose Reduced Set KPCA (RSKPCA), which also suggests an easy-to-implement method to remove or replace samples with minimal effect on the empirical operator. A simple data point selection procedure is given to generate a substitute density for the data, with accuracy that is governed by a user-tunable parameter ℓ. The effect of the approximation on the quality of the KPCA solution, in terms of spectral and operator errors, can be shown directly in terms of the density estimate error and as a function of the parameter ℓ. We show in experiments that RSKPCA can improve both training and evaluation time of KPCA by up to an order of magnitude, and compares favorably to the widely-used Nystrom and density-weighted Nystrom methods.
【Keywords】:
【Paper Link】 【Pages】:450-457
【Authors】: Noam Goldberg ; Sven Leyffer ; Todd Munson
【Abstract】: This paper proposes a convex relaxation of a sparse support vector machine (SVM) based on the perspective relaxation of mixed-integer nonlinear programs. We seek to minimize the zero-norm of the hyperplane normal vector with a standard SVM hinge-loss penalty and extend our approach to a zero-one loss penalty. The relaxation that we propose is a second-order cone formulation that can be efficiently solved by standard conic optimization solvers. We compare the optimization properties and classification performance of the second-order cone formulation with previous sparse SVM formulations suggested in the literature.
【Keywords】:
【Paper Link】 【Pages】:458-466
【Authors】: Shin Ando ; Einoshin Suzuki
【Abstract】: In this paper, we address a classification task under a time-sensitive setting, in which the amount of observation required to make a prediction is viewed as a practical cost. Such a setting is intrinsic in many systems where the potential reward of the action against the predicted event depends on the response time, e.g., surveillance/warning and diagnostic applications. Meanwhile, predictions are usually less reliable when based on fewer observations, i.e., there exists a trade-off between such temporal cost and the accuracy. We address the task as a classification of subsequences in a time series. The goal is to predict the occurrences of events from subsequent observations and to learn when to commit to the prediction considering the trade-off. We propose an ensemble of classifiers which respectively makes predictions based on subsequences of different lengths. The prediction of the ensemble is given by the earliest confident prediction among the individual classifiers. We propose a cutting-plane algorithm for jointly training an ensemble of linear classifiers considering their temporal dependence. We compare the proposed algorithm against conventional approaches over a collection of behavioral trajectory data.
【Keywords】:
【Paper Link】 【Pages】:467-475
【Authors】: Yan Liu ; Mohammad Taha Bahadori
【Abstract】: Learning temporal causal structures among multiple time series is one of the major tasks in mining time series data. Granger causality is one of the most popular techniques in uncovering the temporal dependencies among time series; however it faces two main challenges: (i) the spurious effect of unobserved time series and (ii) the computational challenges in high dimensional settings. In this paper, we utilize the confounder path delays to find a subset of time series that via conditioning on them we are able to cancel out the spurious confounder effects. After study of consistency of different Granger causality techniques, we propose Copula-Granger and show that while it is consistent in high dimensions, it can efficiently capture non-linearity in the data. Extensive experiments on a synthetic and a social networking dataset confirm our theoretical results.
【Keywords】:
【Paper Link】 【Pages】:476-484
【Authors】: Sreangsu Acharyya ; Arindam Banerjee ; Daniel Boley
【Abstract】: While Bregman divergences have been used for clustering and embedding problems in recent years, the facts that they are asymmetric and do not satisfy triangle inequality have been a major concern. In this paper, we investigate the relationship between two families of symmetrized Bregman divergences and metrics that satisfy the triangle inequality. The first family can be derived from any well-behaved convex function. The second family generalizes the Jensen-Shannon divergence, and can only be derived from convex functions with certain conditional positive definiteness structure. We interpret the required structure in terms of cumulants of infinitely divisible distributions, and related results in harmonic analysis. We investigate kmeans-type clustering problems using both families of symmetrized divergences, and give efficient algorithms for the same.
【Keywords】:
【Paper Link】 【Pages】:485-493
【Authors】: Samuel J. Blasiak ; Huzefa Rangwala ; Sithu Sudarsan
【Abstract】: In recent years, many private corporations and government organizations have digitized corpuses of legacy-paper documents. Often, these organizations hope to take advantage of digital representations to transform costly manual tasks associated with paper archives into less-costly computer-assisted tasks. The most common approach toward automated information extraction is through inverted indexing systems that allow fast keyword searches. Keyword-based indexing, however, is ineffective for tasks that require information from higher-level contexts. To allow for more effective information extraction from digital corpuses, we propose combining two common document processing tasks, (i) clustering and (ii) segmentation, into one process to simultaneously segment documents within a corpus and assign each segment to a category. We have developed a generative probabilistic model to accomplish this task, which we call the Joint Segmentation and Clustering (JSC) model. From experiments measuring segmentation and clustering ability, we show that our model can accurately partition documents and assign meaningful categories to each partition. In addition, experiments tracking predictive perplexity show that our JSC model outperforms basic topic modeling approaches in terms of conciseness of the induced representation.
【Keywords】:
【Paper Link】 【Pages】:494-502
【Authors】: Zhengzhang Chen ; Alok N. Choudhary ; John Jenkins ; Vipin Kumar ; Anatoli V. Melechko ; Jinfeng Rao ; Nagiza F. Samatova ; Fredrick H. M. Semazzi
【Abstract】: Real-world dynamic systems such as physical and atmosphere-ocean systems often exhibit a hierarchical system-subsystem structure. However, the paradigm of making this hierarchical/modular structure and the rich properties they encode a “first-class citizen” of machine learning algorithms is largely absent from the literature. Furthermore, traditional data mining approaches focus on designing new classifiers or ensembles of classifiers, while there is a lack of study on detecting and correcting prediction errors of existing forecasting (or classification) algorithms. In this paper, we propose DETECTOR, a hierarchical method for detecting and correcting forecast errors by employing the whole-part relationships between the target system and non-target systems. Experimental results show that DETECTOR can successfully detect and correct forecasting errors made by state-of-art classifier ensemble techniques and traditional single classifier methods at an average rate of 22%, corresponding to a 11% average forecasting accuracy increase, in seasonal forecasting of hurricanes and landfalling hurricanes in North Atlantic and North African rainfall.
【Keywords】:
【Paper Link】 【Pages】:503-511
【Authors】: Xi C. Chen ; Karsten Steinhaeuser ; Shyam Boriah ; Snigdhansu Chatterjee ; Vipin Kumar
【Abstract】: Time series data are common in a variety of fields ranging from economics to medicine and manufacturing. As a result, time series analysis and modeling has become an active research area in statistics and data mining. In this paper, we focus on a type of change we call contextual time series change (CTC) and propose a novel two-stage algorithm to address it. In contrast to traditional change detection methods, which consider each time series separately, CTC is defined as a change relative to the behavior of a group of related time series. As a result, our proposed method is able to identify novel types of changes not found by other algorithms. We demonstrate the unique capabilities of our approach with several case studies on real-world datasets from the financial and Earth science domains.
【Keywords】:
【Paper Link】 【Pages】:512-520
【Authors】: William W. Cohen ; Bhavana Bharat Dalvi
【Abstract】: In this paper, we propose a single low-dimensional representation for entities found in different datasets on the web. Our proposed PIC-D embeddings can represent large D-partite graphs using small number of dimensions enabling fast similarity queries. Our experiments show that this representation can be constructed in small amount of time (linear in number of dimensions). We demonstrate how it can be used for variety of similarity queries like set expansion, automatic set instance acquisition, and column classification. Our approach results in comparable precision with respect to task specific baselines and up to two orders of magnitude improvement in terms of query response time.
【Keywords】:
【Paper Link】 【Pages】:521-529
【Authors】: Anna Drummond ; Chris Jermaine ; Zografoula Vagena
【Abstract】: We investigate the idea of using a topic model such as the popular Latent Dirichlet Allocation model as a feature selection step for unsupervised document clustering, where documents are clustered using the proportion of the various topics that are present in each document. One concern with using “vanilla” LDA as a feature selection method for input to a clustering algorithm is that the Dirichlet prior on the topic mixing proportions is too smooth and well-behaved. It does not encourage a “bumpy” distribution of topic mixing proportion vectors, which is what one would desire as input to a clustering algorithm. As such, we propose two variant topic models that are designed to do a better job of producing topic mixing proportions that have a good clustering structure.
【Keywords】:
【Paper Link】 【Pages】:530-538
【Authors】: Avinava Dubey ; Ahmed Hefny ; Sinead Williamson ; Eric P. Xing
【Abstract】: A single, stationary topic model such as latent Dirichlet allocation is inappropriate for modeling corpora that span long time periods, as the popularity of topics is likely to change over time. A number of models that incorporate time have been proposed, but in general they either exhibit limited forms of temporal variation, or require computationally expensive inference methods. In this paper we propose nonparametric Topics over Time (npTOT), a model for time-varying topics that allows an unbounded number of topics and flexible distribution over the temporal variations in those topics’ popularity. We develop a collapsed Gibbs sampler for the proposed model and compare against existing models on synthetic and real document sets.
【Keywords】:
【Paper Link】 【Pages】:539-547
【Authors】: Zheng Fang ; Zhongfei Zhang
【Abstract】: Collective matrix factorization has achieved a remarkable success in document classification in the literature of transfer learning. However, the learned latent factors still suffer from the divergence between different domains and thus are usually not discriminative for an appropriate assignment of category labels. Based on these observations, we impose a discriminative regression model over the latent factors to enhance the capability of label prediction. Moreover, we propose to minimize the Maximum Mean Discrepancy in the latent manifold subspace, as opposed to typically in the original data space, to bridge the gap between different domains. Specifically, we formulate these objectives into a joint optimization framework with two matrix tri-factorizations for the source and target domains simultaneously. An iterative algorithm DTLM is developed and the theoretical analysis of its convergence is discussed. Empirical study on benchmark datasets validates that DTLM improves the classification accuracy consistently compared with the state-of-the-art transfer learning methods.
【Keywords】:
【Paper Link】 【Pages】:548-559
【Authors】: Dan He ; Douglas Stott Parker Jr.
【Abstract】: In this paper we consider the problem of mining influence in a network of topics, where we seek to model direct influences between topics over time, in the form of bursts of topic occurrence — so that influence is measured by topic co-occurrence in high-frequency intervals — and this influence propagates among topics, leading to frequent occurrences of the topics. Although it is clearly significant, this problem has gotten very little attention. To address the problem, we propose a novel model: SemInf. As topics can recur, influence in our model is not constant or single-timestamp (“one-shot”, as in social networks), but is instead multi-timestamp, with periods of influence that can span multiple time intervals. More specifically, this model of topic influence captures upward momentum in popularity over all time-stamps in a burst (a period of “elevated occurrence” of topics). A topic hierarchy is used to provide a distance measure among topics and characterize their semantic relatedness. Experiments on biomedical topics give some surprising results, showing both that our model is successful at identifying topics with high impact, and that it can be potentially used as an alternative model of impact in the scientific literature (which can be useful when citation information is not available). We also show that although semantic information helps boost performance of our model, it can work without such information. What's more, we show SemInf can be also generalized to other domains, such as topics in computer science research.
【Keywords】:
【Paper Link】 【Pages】:560-568
【Authors】: Douglas R. Heisterkamp ; Jesse Johnson
【Abstract】: This paper introduces an algorithm for determining data clusters called TILO/PRC (Topologically Intrinsic Lexicographic Ordering/Pinch Ratio Clustering). The theoretical foundation for this algorithm, developed in [14], uses ideas from topology (particularly knot theory) suggesting that it should be very flexible and robust with respect to noise. The TILO portion of the algorithm progressively improves a linear ordering of the points in a data set until the ordering satisfies a topological condition called strongly irreducible. The PRC algorithm then divides the data set based on this ordering and a heuristic metric called the pinch ratio. We demonstrate the effectiveness of TILO/PRC for finding clusters in a wide variety of real and synthetic data sets and compare the results to existing clustering methods. Moreover, because the output of TILO depends on the initial ordering, we consider the effects of different random orderings on the final clusters defined by PRC, and show that choosing an initial ordering based on a different clustering algorithm can improve the final clusters. These results verify that both the theoretical foundations of TILO and the heuristic notion of pinch ratio are reasonable.
【Keywords】:
【Paper Link】 【Pages】:569-577
【Authors】: Ee-Peng Lim ; Tuan-Anh Hoang
【Abstract】: When a user retweets, there are three behavioral factors that cause the actions. They are the topic virality, user virality and user susceptibility. Topic virality captures the degree to which a topic attracts retweets by users. For each topic, user virality and susceptibility refer to the likelihood that a user attracts retweets and performs retweeting respectively. To model a set of observed retweet data as a result of these three topic specific factors, we first represent the retweets as a three-dimensional tensor of the tweet authors, their followers, and the tweets themselves. We then propose the V2S model, a tensor factorization model, to simultaneously derive the three sets of behavioral factors. Our experiments on a real Twitter data set show that the V2S model can effectively mine the behavioral factors of users and tweet topics during an election event. We also demonstrate that the V2S model outperforms the other topic based models in retweet prediction.
【Keywords】:
【Paper Link】 【Pages】:578-586
【Authors】: Bing Hu ; Yanping Chen ; Eamonn J. Keogh
【Abstract】: Most literature on time series classification assumes that the beginning and ending points of the pattern of interest can be correctly identified, both during the training phase and later deployment. In this work, we argue that this assumption is unjustified, and this has in many cases led to unwarranted optimism about the performance of the proposed algorithms. As we shall show, the task of correctly extracting individual gait cycles, heartbeats, gestures, behaviors, etc., is generally much more difficult than the task of actually classifying those patterns. We propose to mitigate this problem by introducing an alignment-free time series classification framework. The framework requires only very weakly annotated data, such as “in this ten minutes of data, we see mostly normal heartbeats…,” and by generalizing the classic machine learning idea of data editing to streaming/continuous data, allows us to build robust, fast and accurate classifiers. We demonstrate on several diverse real-world problems that beyond removing unwarranted assumptions and requiring essentially no human intervention, our framework is both significantly faster and significantly more accurate than current state-of-the-art approaches.
【Keywords】:
【Paper Link】 【Pages】:587-595
【Authors】: Aijun An ; Mehdi Kargar ; Morteza ZiHayat
【Abstract】: Given an expert network, we tackle the problem of finding a team of experts that covers a set of required skills and also minimizes the communication cost as well as the personnel cost of the team. Since two costs need to be minimized, this is a bicriteria optimization problem. We show that the problem of minimizing these objectives is NP-hard. We use two approaches to solve this bicriteria optimization problem. In the first approach, we propose several (α, β)-approximation algorithms that receive a budget on one objective and minimizes the other objective within the budget with guaranteed performance bounds. In the second approach, an approximation algorithm is proposed to find a set of Pareto-optimal teams, in which each team is not dominated by other feasible teams in terms of the personnel and communication costs. The proposed approximation algorithms have provable performance bounds. Extensive experiments on real datasets demonstrate the effectiveness and scalability of the proposed algorithms.
【Keywords】:
【Paper Link】 【Pages】:596-604
【Authors】: Alexios Kotsifakos ; Panagiotis Papapetrou ; Vassilis Athitsos
【Abstract】: Sequences of event intervals appear in several application domains including sign language, sensor networks, medicine, human motion databases, and linguistics. Such sequences comprise events that occur at time intervals and are time stamped at their start and end time. In this paper, we propose a new method, called IBSM, for comparing such sequences. IBSM performs full sequence matching using a vector-based representation of the original sequence. At each time point an event vector is computed; hence, the original sequence is mapped to an ordered set of vectors, which we call event table. Given two sequences, their event tables are resized using bilinear interpolation, which ensures they are of the same size. The resulting event tables are then compared using the Euclidean distance. In addition, we propose two techniques for reducing the computational cost of IBSM when performing nearest neighbor search in a large database. Extensive experiments on eight real datasets show that IBSM outperforms existing state-of-the-art methods by up to a factor of two in terms of nearest neighbor classification accuracy, and by up to two orders of magnitude in terms of runtime.
【Keywords】:
【Paper Link】 【Pages】:605-613
【Authors】: San-Chuan Hung ; Perng-Hwa Kung ; Shou-De Lin ; Jing-Kai Lou ; Chin-Hua Tsai ; Fu-Min Wang
【Abstract】: Issues about information diffusion on social networks has been studied for decades. To simplify the analysis, most models consider the propagated information or media as single real values. Representing media as single values however would not suitable for certain situations such as the voter preference toward the candidates in an election. In such case, the representation would better be lists instead of single values as people sometimes can alter others’ preference through toward objects social inference. This paper studies the diffusion of preference on social networks, which is a novel problem to solve in this direction. First, we propose a preference propagation model that can handle the diffusion of vector-type information instead of only binary or numerical values. Furthermore, we theoretically prove the convergence of diffusion with the proposed model, and that a consensus among strongly connected nodes can eventually be reached with certain conditions. We further extract relevant information from a publicly available bibliography datasets to evaluate the proposed models, while such data can further serve as a benchmark for evaluating future models of the same purpose. Lastly, we exploit the extracted data to demonstrate the usefulness of our model and compare it with other well-known diffusion strategies such as independent cascade, linear threshold, and diffusion rank. We find that our model consistently outperforms other models.
【Keywords】:
【Paper Link】 【Pages】:614-622
【Authors】: James Bailey ; Jeffrey Chan ; Kotagiri Ramamohanarao ; Christopher Leckie ; Wei Liu
【Abstract】: Conventional non-negative tensor factorization (NTF) methods assume there is only one tensor that needs to be decomposed to low-rank factors. However, in practice data are usually generated from different time periods or by different class labels, which are represented by a sequence of multiple tensors associated with different labels. This raises the problem that when one needs to analyze and compare multiple tensors, existing NTF is unsuitable for discovering all potentially useful patterns: 1) if one factorizes each tensor separately, the common information shared by the tensors is lost in the factors, and 2) if one concatenates these tensors together and forms a larger tensor to factorize, the intrinsic discriminative subspaces that are unique to each tensor are not captured. The cause of such an issue is from the fact that conventional factorization methods handle data observations in an unsupervised way, which only considers features and not labels of the data. To tackle this problem, in this paper we design a novel factorization algorithm called CDNTF (common and discriminative subspace non-negative tensor factorization), which takes both features and class labels into account in the factorization process. CDNTF uses a set of labelled tensors as input and computes both their common and discriminative subspaces simultaneously as output. We design an iterative algorithm that solves the common and discriminative subspace factorization problem with a proof of convergence. Experiment results on solving graph classification problems demonstrate the power and the effectiveness of the subspaces discovered by our method.
【Keywords】:
【Paper Link】 【Pages】:623-631
【Authors】: Milos Hauskrecht ; Zitao Liu ; Lei Wu
【Abstract】: Development of accurate models of complex clinical time series data is critical for understanding the disease, its dynamics, and subsequently patient management and clinical decision making. Clinical time series differ from other time series applications mainly in that observations are often missing and made at irregular time intervals. In this work, we propose and test a new probabilistic approach for modeling clinical time series data that is optimized to handle irregularly sampled observations. Our model is defined by a sequence of Gaussian processes (GPs), each restricted to a window of a finite size, where dependencies among two consecutive Gaussian processes are represented using a linear dynamical system. We develop algorithms supporting both model learning and inference. Experiments on real-world clinical time series data show that our model is better for modeling clinical time series and that it outperforms or is close to alternative time series prediction models.
【Keywords】:
【Paper Link】 【Pages】:632-640
【Authors】: Ruilin Liu ; Philippos Mordohai ; Wendy Hui Wang ; Hui Xiong
【Abstract】: The Cloud-based infrastructure-as-a-service (IaaS) paradigm (e.g., Amazon EC2) enables a client who lacks computational resources to outsource her dataset and data mining tasks to the Cloud. However, as the Cloud may not be fully trusted, it raises serious concerns about the integrity of the mining results returned by the Cloud. To this end, in this paper, we provide a focused study about how to perform integrity verification of the k-means clustering task outsourced to an IaaS provider. We consider the untrusted sloppy IaaS service provider that intends to return wrong clustering results by terminating the iterations early to save computational cost. We develop both probabilistic and deterministic verification methods to catch the incorrect clustering result by the service provider. The deterministic method returns 100% integrity guarantee with cost that is much cheaper than executing k-means clustering locally, while the probabilistic method returns a probabilistic integrity guarantee with computational cost even cheaper than the deterministic approach. Our experimental results show that our verification methods can effectively and efficiently capture the sloppy service provider.
【Keywords】:
【Paper Link】 【Pages】:641-649
【Authors】: Zhongqi Lu ; Weike Pan ; Evan Wei Xiang ; Qiang Yang ; Lili Zhao ; ErHeng Zhong
【Abstract】: Collaborative filtering (CF) aims to predict users’ ratings on items according to historical user-item preference data. In many real-world applications, preference data are usually sparse, which would make models overfit and fail to give accurate predictions. Recently, several research works show that by transferring knowledge from some manually selected source domains, the data sparseness problem could be mitigated. However for most cases, parts of source domain data are not consistent with the observations in the target domain, which may misguide the target domain model building. In this paper, we propose a novel criterion based on empirical prediction error and its variance to better capture the consistency across domains in CF settings. Consequently, we embed this criterion into a boosting framework to perform selective knowledge transfer. Comparing to several state-of-the-art methods, we show that our proposed selective transfer learning framework can significantly improve the accuracy of rating prediction on several real-world recommendation tasks.
【Keywords】:
【Paper Link】 【Pages】:650-658
【Authors】: Shyam Boriah ; Ankush Khandelwal ; Vipin Kumar ; Varun Mithal ; Karsten Steinhaeuser
【Abstract】: Mapping land cover change is an important problem for the scientific community as well as policy makers. Traditionally, bi-temporal classification of satellite data is used to identify areas of land cover change. However, these classification products often have errors due to classifier inaccuracy or poor data, which poses significant issues when using them for land cover change detection. In this paper, we propose a generative model for land cover label sequences and use it to reassign a more accurate sequence of land cover labels to every pixel. Empirical evaluation on real and synthetic data suggests that the proposed approach is effective in capturing the characteristics of land cover classification and change processes, and produces significantly improved classification and change detection products.
【Keywords】:
【Paper Link】 【Pages】:659-667
【Authors】: Lada A. Adamic ; Christos Faloutsos ; Theodore J. Iwashyna ; B. Aditya Prakash ; Hanghang Tong
【Abstract】: Preventing contagion in networks is an important problem in public health and other domains. Targeting nodes to immunize based on their network interactions has been shown to be far more effective at stemming infection spread than immunizing random subsets of nodes. However, the assumption that selected nodes can be rendered completely immune does not hold for infections for which there is no vaccination or effective treatment. Instead, one can confer fractional immunity to some nodes by allocating variable amounts of infection-prevention resource to them. We formulate the problem to distribute a fixed amount of resource across nodes in a network such that the infection rate is minimized, prove that it is NP-complete and derive a highly effective and efficient linear-time algorithm. We demonstrate the efficiency and accuracy of our algorithm compared to several other methods using simulation on real-world network datasets including US-MEDICARE and state-level interhospital patient transfer data. We find that concentrating resources at a small subset of nodes using our algorithm is up to 6 times more effective than distributing them uniformly (as is current practice) or using network-based heuristics. To the best of our knowledge, we are the first to formulate the problem, use truly nation-scale network data and propose effective algorithms.
【Keywords】:
【Paper Link】 【Pages】:668-676
【Authors】: Eamonn J. Keogh ; Thanawin Rakthanmanon
【Abstract】: Time series shapelets are a recent promising concept in time series data mining. Shapelets are time series snippets that can be used to classify unlabeled time series. Shapelets not only provide interpretable results, which are useful for domain experts and developers alike, but shapelet-based classifiers have been shown by several independent research groups to have superior accuracy on many datasets. Moreover, shapelets can be seen as generalizing the lazy nearest neighbor classifier to an eager classifier. Thus, as a deployed classification tool, shapelets can be many orders of magnitude faster than any rival with comparable accuracy. Although shapelets are a useful concept, the current literature bemoans the fact that shapelet discovery is a time-consuming task. In spite of several efforts to speed up shapelet discovery algorithms, including the use of specialist hardware, the current state-of-the-art algorithms are still intractable on large datasets. In this work, we propose a fast shapelet discovery algorithm that outperforms the current state-of-the-art by two or three orders of magnitude, while producing models with accuracy that is not perceptibly different.
【Keywords】:
【Paper Link】 【Pages】:677-685
【Authors】: Huzefa Rangwala ; Zeehasham Rasheed
【Abstract】: Current bio-technologies allow sequencing of genomes from multiple organisms, that co-exist as communities within ecological environments. This collective genomic process (called metagenomics) has spurred the development of several computational tools for the quantification of abundance, diversity and role of different species within different communities. Unsupervised clustering algorithms (also called binning algorithms) have been developed to group similar metagenome sequences. We have developed an algorithm called MC-MinH that uses the min-wise hashing approach, along with a greedy clustering algorithm to group 16S and whole metagenomic sequences. We represent unequal length sequences using contiguous subsequences or k-mers, and then approximate the computation of pairwise similarity using independent min-wise hashing. The performance of our algorithm is evaluated on several real and simulated metagenomic benchmarks. We demonstrate that our approach is computationally efficient and produces accurate clustering results when evaluated using external ground truth.
【Keywords】:
【Paper Link】 【Pages】:686-694
【Authors】: Ümit V. Çatalyürek ; Kamer Kaya ; Ahmet Erdem Sariyüce ; Erik Saule
【Abstract】: The betweenness metric has always been intriguing and used in many analyses. Yet, it is one of the most computationally expensive kernels in graph mining. For that reason, making betweenness centrality computations faster is an important and well-studied problem. In this work, we propose the framework, BADIOS, which compresses a network and shatters it into pieces so that the centrality computation can be handled independently for each piece. Although BADIOS is designed and tuned for betweenness centrality, it can easily be adapted for other centrality metrics. Experimental results show that the proposed techniques can be a great arsenal to reduce the centrality computation time for various types and sizes of networks. In particular, it reduces the computation time of a 4.6 million edges graph from more than 5 days to less than 16 hours.
【Keywords】:
【Paper Link】 【Pages】:695-703
【Authors】: Jiliang Tang ; Huan Liu
【Abstract】: Feature selection is widely used in preparing high-dimensional data for effective data mining. Attribute-value data in traditional feature selection differs from social media data, although both can be large-scale. Social media data is inherently not independent and identically distributed (i.i.d.), but linked. Furthermore, there is a lot of noise. The quality of social media data can vary drastically. These unique properties present challenges as well as opportunities for feature selection. Motivated by these differences, we propose a novel feature selection framework, CoSelect, for social media data. In particular, CoSelect can exploit link information by applying social correlation theories, incorporate instance selection with feature selection, and select relevant instances and features simultaneously. Experimental results on real-world social media datasets demonstrate the effectiveness of our proposed framework and its potential in mining social media data.
【Keywords】:
【Paper Link】 【Pages】:704-712
【Authors】: Arnold P. Boedihardjo ; Feng Chen ; Haili Dong ; Chang-Tien Lu ; Bingsheng Wang
【Abstract】: Energy crisis and climate change have caused a global concern and motivated efforts to reduce energy consumption. Studies have shown that providing appliance-level consumption information can help users conserve a significant amount of energy. Existing methods focus on learning parallel signal signatures, but the inherent relationships between the signatures have not been well explored. This paper presents the Hierarchical Probabilistic Model for Energy Disaggregation (HPMED). We derive the discriminative features from low sample rate power readings to characterise device functional modes. The HPMED model bridges the discriminative features, working states, and aggregated consumption. To address the analytical intractable problem, an efficient algorithm is proposed to approximately infer the latent states for disaggregation task. Extensive experiments on a real-world dataset demonstrated the effectiveness of the proposed approach.
【Keywords】:
【Paper Link】 【Pages】:713-721
【Authors】: Yuguo Chen ; Jiawei Han ; Ming Ji ; Jialu Liu ; Lu Su ; Chi Wang ; Hongning Wang
【Abstract】: In typical studies of node grouping detection, the grouping is presumed to have a certain type of correlation with the network structure (e.g., densely connected groups of nodes that are loosely connected in between). People have defined different fitness measures (modularity, conductance, etc.) to quantify such correlation, and group the nodes by optimizing a certain fitness measure. However, a particular grouping with desired semantics, as the target of the detection, is not promised to be detectable by each measure. We study a fundamental problem in the process of node grouping discovery: Given a particular grouping in a network, whether and to what extent it can be discovered with a given fitness measure. We propose two approaches of testing the detectability, namely ranking-based and correlation-based randomization tests. Our methods are evaluated on both synthetic and real datasets, which shows the proposed methods can effectively predict the detectability of groupings of various types, and support explorative process of node grouping discovery.
【Keywords】:
【Paper Link】 【Pages】:722-730
【Authors】: Zhifeng Hao ; Bo Liu ; Yanshan Xiao ; Philip S. Yu
【Abstract】: This paper presents a novel approach, called MODS, to build an accurate time evolving classifier from multiple one-class data streams learning time evolving classifier. Our proposed MODS approach works in two steps. In the first step, we first construct local one-class classifiers on the labeled positive examples from each sub-data stream respectively. We then collect the informative examples (support vectors) around each local one-class classifier, which can support the decision boundary of the classifier. This is called support vector preservation principle. In the second step, we construct a global one-class classifier on the collected informative examples. By using the support vector preservation principle, our proposed MODS explicitly addresses the problem of building accurate classifier from multiple one-class data streams. Extensive experiments on real life data streams have demonstrated that our MODS approach can achieve high performance and efficiency for the multiple one-class data streams learning in comparison with other approaches.
【Keywords】:
【Paper Link】 【Pages】:731-739
【Authors】: Longbing Cao ; Zhifeng Hao ; Bo Liu ; Yanshan Xiao ; Philip S. Yu
【Abstract】: In textual data stream environment, concept drift can occur at any time, existing approaches partitioning streams into chunks can have problem if the chunk boundary does not coincide with the change point which is impossible to predict. Since concept drift can occur at any point of the streams, it will certainly occur within chunks, which is called random concept drift. The paper proposed an approach, which is called chunk level-based concept drift method (CLCD), that can overcome this chunking problem by continuously monitoring chunk characteristics to revise the classifier based on transfer learning in positive and unlabeled (PU) textual data stream environment. Our proposed approach works in three steps. In the first step, we propose core vocabulary-based criteria to justify and identify random concept drift. In the second step, we put forward the extension of LELC (PU learning by extracting likely positive and negative micro-clusters)[1], called soft-LELC, to extract representative examples from unlabeled data, and assign a confidence score to each extracted example. The assigned confidence score represents the degree of belongingness of an example towards its corresponding class. In the third step, we set up a transfer learning-based SVM to build an accurate classifier for the chunks where concept drift is identified in the first step. Extensive experiments have shown that CLCD can capture random concept drift, and outperforms state-of-the-art methods in positive and unlabeled textual data stream environments.
【Keywords】:
【Paper Link】 【Pages】:740-748
【Authors】: Ankit Agrawal ; Zhengzhang Chen ; Yu Cheng ; Alok N. Choudhary ; Haotian Liu ; Md. Mostofa Ali Patwary ; Yusheng Xie ; Kunpeng Zhang
【Abstract】: We investigate a class of emerging online marketing challenges in social networks; macro behavioral targeting (MBT) is introduced as non-personalized broadcasting efforts to massive populations. We propose a new probabilistic graphical model for MBT. Further, a linear-time approximation method is proposed to circumvent an intractable parametric representation of user behaviors. We compare the proposed model with the existing state-of-the-art method on real datasets from social networks. Our model outperforms in all categories by comfortable margins.
【Keywords】:
【Paper Link】 【Pages】:749-757
【Authors】: Xueqi Cheng ; Jiafeng Guo ; Shenghua Liu ; Yanfeng Wang ; Xiaohui Yan
【Abstract】: Nowadays, short texts are very prevalent in various web applications, such as microblogs, instant messages. The severe sparsity of short texts hinders existing topic models to learn reliable topics. In this paper, we propose a novel way to tackle this problem. The key idea is to learn topics by exploring term correlation data, rather than the high-dimensional and sparse term occurrence information in documents. Such term correlation data is less sparse and more stable with the increase of the collection size, and can well capture the necessary information for topic learning. To obtain reliable topics from term correlation data, we first introduce a novel way to compute term correlation in short texts by representing each term with its co-occurred terms. Then we formulated the topic learning problem as symmetric non-negative matrix factorization on the term correlation matrix. After learning the topics, we can easily infer the topics of documents. Experimental results on three data sets show that our method provides substantially better performance than the baseline methods.
【Keywords】:
【Paper Link】 【Pages】:758-766
【Authors】: Huiwen Yu ; Dayu Yuan
【Abstract】: Finding a maximum coverage by k sets from a given collection (Max-k-Cover), finding a minimum number of sets with a required coverage (Partial-Cover) are both important combinatorial optimization problems. Various problems from data mining, machine learning, social network analysis, operational research, etc. can be generalized as a set coverage problem. The standard greedy algorithm is efficient as an in-memory algorithm. However, when we are facing very large-scale dataset or in an online environment, we seek a new algorithm which makes only one pass through the entire dataset. Previous one-pass algorithms for the Max-k-Cover problem cannot be extended to the Partial-Cover problem and do not enjoy the prefix-optimal property. In this paper, we propose a novel one-pass streaming algorithm which produces a prefix-optimal ordering of sets, which can easily be used to solve the Max-k-Cover and the Partial-Cover problems. Our algorithm consumes space linear to the size of the universe of elements. The processing time for a set is linear to the size of this set. We also show with the aid of computer simulation that the approximation ratio of the Max-k-Cover problem is around 0.3. We conduct experiments on extensive datasets to compare our algorithm with existing one-pass algorithms on the Max-k-Cover problem, and with the standard greedy algorithm on the Partial-Cover problem. We demonstrate the efficiency and quality of our algorithm.
【Keywords】:
【Paper Link】 【Pages】:767-775
【Authors】: Zheng Chen ; Chengtao Li ; Jian-Tao Sun ; Jianwen Zhang
【Abstract】: This paper deals with the problem of jointly mining topics, sentiments, and the association between them from online reviews in an unsupervised way. Previous methods often treat a sentiment as a special topic and assume a word is generated from a flat mixture of topics, where the discriminative performance of sentiment analysis is not satisfied. A key reason is that providing rich priors on the polarity of a word for the flat mixture is difficult as the polarity often depends on the topic. To solve the problem we propose a novel model. We decompose the generative process of a word's sentiment polarity to a two-level hierarchy: the first level determines whether a word is used as a sentiment word or just an ordinary topic word, and the second level (if the word is used as a sentiment word) determines the polarity of it. With the decomposition, we provide separate prior for the the first level to encourage the discrimination between sentiment words and ordinary topic words. This prior is relatively easy to obtain compared to the concrete prior of the word polarities. We construct the prior based on part-of-speech tags of words and embed the prior into the model. Experiments on four real online review data sets show that our model consistently outperforms previous methods in the task of sentiment analysis, and simultaneously performs well in the sub-tasks of discovering ordinary topics, sentiment-specific topics, and extracting topic-specific sentiment words.
【Keywords】:
【Paper Link】 【Pages】:776-784
【Authors】: Tong Zhao ; Naiwen Bian ; Chunping Li ; Mengya Li
【Abstract】: Community Question Answering (CQA) services provide an open platform for people to share their knowledge and have attracted great attention for its rapidly increasing popularity. As more knowledge is shared in CQA, how to use the repository for solving new questions has become a crucial problem. In this paper, we tackle the problem by finding experts from the question answering history first and then recommending the appropriate experts to answer the new questions. We develop the Topic-level Expert Learning (TEL) model to find experts on topic level in CQA. Our proposed model incorporates link analysis into content analysis. The main difference between TEL and other generative models is that TEL can automatically adjust and update the sampling parameters during iterations in order to better model the experts on topic level. The experiments are conducted on datasets crawled from Yahoo! Answers and the results show that our method can effectively find experts to answer new questions and can better predict best responders for new questions. Our model achieves significant improvement over the baseline methods on multiple metrics. Beyond these metric performances, TEL converges fast within a few iterations.
【Keywords】:
【Paper Link】 【Pages】:785-793
【Authors】: John Canny ; Huasha Zhao
【Abstract】: Incremental model-update strategies are widely used in machine learning and data mining. By “incremental update” we refer to models that are updated many times using small subsets of the training data. Two well-known examples are stochastic gradient and MCMC. Both provide fast sequential performance and have generated many of the best-performing methods for particular problems (logistic regression, SVM, LDA etc.). But these methods are difficult to adapt to parallel or cluster settings because of the overhead of distributing model updates through the network. Updates can be locally batched to reduce communication overhead, but convergence typically suffers as the batch size increases. In this paper we introduce and analyze butterfly mixing, an approach which interleaves communication with computation. We evaluate butterfly mixing on stochastic gradient algorithms for logistic regression and SVM, on two datasets. Results show that butterfly mix steps are fast and failure-tolerant, and overall we achieved a 3.3x speedup over full mix (AllReduce) on an Amazon EC2 cluster.
【Keywords】:
【Paper Link】 【Pages】:794-802
【Authors】: Jing Jiang ; Minghui Qiu ; Feida Zhu
【Abstract】: Textual information exchanged among users on online social network platforms provides deep understanding into users’ interest and behavioral patterns. However, unlike traditional text-dominant settings such as offline publishing, one distinct feature for online social network is users’ rich interactions with the textual content, which, unfortunately, has not yet been well incorporated in the existing topic modeling frameworks. In this paper, we propose an LDA-based behavior-topic model (B-LDA) which jointly models user topic interests and behavioral patterns. We focus the study of the model on online social network settings such as microblogs like Twitter where the textual content is relatively short but user interactions on them are rich. We conduct experiments on real Twitter data to demonstrate that the topics obtained by our model are both informative and insightful. As an application of our B-LDA model, we also propose a Twitter followee recommendation algorithm combining B-LDA and LDA, which we show in a quantitative experiment outperforms LDA with a significant margin.
【Keywords】:
【Paper Link】 【Pages】:803-811
【Authors】: Wei Ding ; Xindong Wu ; Shichao Zhang ; Xiaofeng Zhu
【Abstract】: This paper takes manifold learning and regression simultaneously into account to perform unsupervised spectral feature selection. We first extract the bases of the data, and then represent the data sparsely using the extracted bases by proposing a novel joint graph sparse coding model, JGSC for short. We design a new algorithm TOSC to compute the resulting objective function of JGSC, and then theoretically prove that the proposed objective function converges to its global optimum via the proposed TOSC algorithm. We repeat the extraction and the TOSC calculation until the value of the objective function of JGSC satisfies pre-defined conditions. Eventually the derived new representation of the data may only have a few non-zero rows, and we delete the zero rows (a.k.a. zero-valued features) to conduct feature selection on the new representation of the data. Our empirical studies demonstrate that the proposed method outperforms several state-of-the-art algorithms on real datasets in term of the kNN classification performance.
【Keywords】: