Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, Texas, USA, April 27-29, 2017. SIAM 【DBLP Link】
【Paper Link】 【Pages】:1-10
【Authors】:
【Abstract】: Frontmatter includes welcome letters and table of contents.
【Keywords】:
【Paper Link】 【Pages】:1-9
【Authors】: Suhang Wang ; Charu C. Aggarwal ; Huan Liu
【Abstract】: Neural networks have become very popular in recent years because of the astonishing success of deep learning in various domains such as image and speech recognition. In many of these domains, specific architectures of neural networks, such as convolutional networks, seem to fit the particular structure of the problem domain very well, and can therefore perform in an astonishingly effective way. However, the success of neural networks is not universal across all domains. Indeed, for learning problems without any special structure, or in cases where the data is somewhat limited, neural networks are known not to perform well compared with traditional machine learning methods such as random forests. In this paper, we show that a carefully designed neural network with random forest structure can have better generalization ability. In fact, this architecture is more powerful than random forests, because the back-propagation algorithm reduces to a more powerful and generalized way of constructing a decision tree. Furthermore, the approach is efficient to train and requires a small constant factor of the number of training examples. This efficiency allows the training of multiple neural networks in order to improve the generalization accuracy. Experimental results on 10 real-world benchmark datasets demonstrate the effectiveness of the proposed enhancements.
【Keywords】:
【Paper Link】 【Pages】:10-18
【Authors】: Zhi Nie ; Binbin Lin ; Shuai Huang ; Naren Ramakrishnan ; Wei Fan ; Jieping Ye
【Abstract】: The decision tree model has gained great popularity both in academia and industry due to its capability of learning highly non-linear decision boundaries, and at the same time, still preserving interpretability that usually translates into transparency of decision-making. However, it has been a longstanding challenge to learn robust decision tree models, since the learning process is usually sensitive to data and many existing tree learning algorithms lead to overfitted tree structures due to their heuristic and greedy nature. Pruning is usually needed as an ad-hoc procedure to simplify the tree structure; however, it is guided not by a rigorous optimization formulation but by intuitive statistical justification. Motivated by recent developments in sparse learning, in this paper, we propose a novel formulation that recognizes an interesting connection between decision tree post-pruning and sparse learning, where the tree structure can be embedded as constraints in the sparse learning framework via the use of a max-heap constraint as well as a sparsity constraint. This novel formulation leads to a non-convex optimization problem which can be solved by an iterative shrinkage algorithm in which the proximal operator can be solved by an efficient max-heap projection algorithm. A stability selection method is further proposed to enable robust model selection in practice and guarantee that the selected nodes preserve the tree structure. Extensive experimental results demonstrate that our proposed method achieves better predictive performance than many existing benchmark methods across a wide range of real-world datasets.
【Keywords】:
【Paper Link】 【Pages】:19-27
【Authors】: Yi Ding ; Sheng-Jun Huang ; Chen Zu ; Daoqiang Zhang
【Abstract】: Linear classifiers are an essential part of machine learning, and improving their robustness has attracted much effort. Logistic regression (LR) is one of the most widely used linear classifiers for its simplicity and probabilistic output. To reduce the risk of overfitting, LR was enhanced by introducing a generalized logistic loss (GLL) with an L2-norm regularization, aiming to maximize the minimum margin. However, the strategy of maximizing the minimum margin is less robust to noisy data. In this paper, we incorporate GLL with margin distribution to exploit the statistical information from the training data, and propose a margin distribution logistic machine (MDLM) for better generalization performance and robustness. Furthermore, we extend MDLM to a multi-class version and learn different classes simultaneously by utilizing more information shared across these classes. Extensive experimental results validate the effectiveness of MDLM on both binary and multi-class classification.
【Keywords】:
【Paper Link】 【Pages】:28-35
【Authors】: Yanbing Xue ; Milos Hauskrecht
【Abstract】: Annotation of classification data by humans can be a time-consuming and tedious process. Finding ways of reducing the annotation effort is critical for building classification models in practice and for applying them to a variety of classification tasks. In this paper, we develop a new active learning framework that combines two strategies to reduce the annotation effort. First, it relies on label uncertainty information obtained from the human in terms of Likert-scale feedback. Second, it uses active learning to annotate examples with the greatest expected change. We propose a Bayesian approach to calculate the expectation and an incremental SVM solver to reduce the overall time complexity. We show that the combination of our active learning strategy and the Likert-scale feedback can learn classification models more rapidly and with a smaller number of labeled instances than methods that rely on either Likert-scale labels or active learning alone.
【Keywords】:
【Paper Link】 【Pages】:36-44
【Authors】: Ryan McBride ; Ke Wang ; Viswanadh Nekkanti ; Wenyuan Li
【Abstract】: In real-life applications, we often face the following risk clearance problem: given a set of instances with a known numeric outcome Y (e.g., a toxicity level), we want to learn a model to identify new instances that have a low risk, i.e., the probability of the Y value exceeding a certain maximum MAX is less than some threshold t. This problem guarantees that the cleared instances have a minimum precision of 1 – t for Y ≤ MAX. By clearing such low risk instances, we can allocate costly resources to the remaining high risk instances. In this work, we formulate this problem as Risk Clearance with a goal of maximizing the clearance of low risk instances. Existing classification models fail to solve Risk Clearance adequately, so we develop algorithms designed specifically for this problem. We then validate that our approach improves on existing work via experiments on an industrial case study.
【Keywords】:
【Paper Link】 【Pages】:45-53
【Authors】: Yashu Liu ; Shuang Qiu ; Ping Zhang ; Pinghua Gong ; Fei Wang ; Guoliang Xue ; Jieping Ye
【Abstract】: Computational Drug Discovery, which uses computational techniques to facilitate and improve the drug discovery process, has attracted considerable interest in recent years. Drug Repositioning (DR) and Drug-Drug Interaction (DDI) prediction are two key problems in drug discovery, and many computational techniques have been proposed for them in the last decade. Although these two problems have mostly been researched separately in the past, both DR and DDI can be formulated as the problem of detecting positive interactions between data entities (drug-disease pairs for DR, and drug-drug pairs for DDI). The challenge in both problems is that we can only observe a very small portion of positive interactions. In this paper, we propose a novel framework called Dyadic Positive-Unlabeled learning (DyPU) to solve the problem of detecting positive interactions. DyPU forces positive data pairs to rank higher than the average score of unlabeled data pairs. Moreover, we derive the dual formulation of the proposed method with the rectifier scoring function and show that the associated non-trivial proximal operator admits a closed form solution. Extensive experiments are conducted on real drug data sets and the results show that our method achieves superior performance compared with the state of the art.
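The ranking principle stated in the abstract, that each observed positive pair should score higher than the average of the unlabeled pairs, can be illustrated with a small toy loss. The sketch below is a hypothetical simplification with made-up names and dimensions, using a hinge penalty on a bilinear scorer; it is not the paper's DyPU objective, which uses a rectifier scoring function and a dual formulation.

```python
# Toy illustration (not the authors' DyPU formulation): penalize every observed
# positive pair whose score does not exceed the average score of unlabeled pairs
# by a margin, using a bilinear scorer s(x, z) = x^T W z.
import numpy as np

rng = np.random.default_rng(0)
n_drugs, n_diseases, dim = 20, 15, 8
X = rng.normal(size=(n_drugs, dim))         # hypothetical drug features
Z = rng.normal(size=(n_diseases, dim))      # hypothetical disease features
W = rng.normal(scale=0.1, size=(dim, dim))  # bilinear interaction weights

pos = [(0, 1), (3, 7), (5, 2)]              # observed positive (drug, disease) pairs
unl = [(i, j) for i in range(n_drugs) for j in range(n_diseases) if (i, j) not in pos]

def scores(pairs):
    return np.array([X[i] @ W @ Z[j] for i, j in pairs])

def toy_loss(margin=1.0):
    avg_unlabeled = scores(unl).mean()
    # hinge penalty when a positive pair fails to beat the unlabeled average by the margin
    return np.maximum(0.0, margin - (scores(pos) - avg_unlabeled)).sum()

print("toy loss at random initialization:", toy_loss())
```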
【Keywords】:
【Paper Link】 【Pages】:54-62
【Authors】: Muhammad Yousefnezhad ; Daoqiang Zhang
【Abstract】: Multivariate Pattern (MVP) classification holds enormous potential for decoding visual stimuli in the human brain by employing task-based fMRI data sets. There is a wide range of challenges in MVP techniques, i.e., decreasing noise and sparsity, defining effective regions of interest (ROIs), visualizing results, and the cost of brain studies. To overcome these challenges, this paper proposes a novel model of neural representation, which can automatically detect the active regions for each visual stimulus and then utilize these anatomical regions for visualizing and analyzing the functional activities. Therefore, this model provides an opportunity for neuroscientists to ask what the effect of a stimulus is on each of the detected regions, instead of just studying the fluctuation of voxels in manually selected ROIs. Moreover, our method introduces the analysis of snapshots of brain images to decrease sparsity, rather than using the whole fMRI time series. Further, a new Gaussian smoothing method is proposed for removing voxel noise at the level of ROIs. The proposed method enables us to combine different fMRI data sets to reduce the cost of brain studies. Experimental studies on 4 visual categories (words, consonants, objects and nonsense photos) confirm that the proposed method achieves superior performance to state-of-the-art methods.
【Keywords】:
【Paper Link】 【Pages】:63-71
【Authors】: Qiong Hu ; Tomasz Imielinski
【Abstract】: With increasing demand for efficient data analysis, the execution time of itemset mining becomes critical for many large-scale or time-sensitive applications. We propose a dynamic approach for itemset mining that allows us to achieve flexible trade-offs between efficiency and completeness. ALPINE is, to our knowledge, the first algorithm to progressively mine itemsets and closed itemsets “support-wise”. It guarantees that all itemsets with support exceeding the current checkpoint's support have been found before it proceeds further. Thus, it is very attractive for extremely long mining tasks on very high dimensional data because it can offer meaningful and complete intermediate results. This feature is the most important contribution of ALPINE, which is also fast but not necessarily the fastest algorithm around. Another critical advantage of ALPINE is that it does not require a minimum support threshold to be decided a priori.
【Keywords】:
【Paper Link】 【Pages】:72-80
【Authors】: Ioakeim Perros ; Fei Wang ; Ping Zhang ; Peter Walker ; Richard W. Vuduc ; Jyotishman Pathak ; Jimeng Sun
【Abstract】: We study the problem of Polyadic Prediction, where the input consists of an ordered tuple of objects, and the goal is to predict a measurement associated with them. Many tasks can be naturally framed as Polyadic Prediction problems. In drug discovery, for instance, it is important to estimate the treatment effect of a drug on various tissue-specific diseases, as it is expressed over the available genes. Thus, we essentially predict the expression value measurements for several (drug, gene, tissue) triads. To tackle Polyadic Prediction problems, we propose a general framework, called Polyadic Regression, predicting measurements associated with multiple objects. Our framework is inductive, in the sense of enabling predictions for new objects, unseen during training. Our model is expressive, exploring high-order, polyadic interactions in an efficient manner. An alternating Proximal Gradient Descent procedure is proposed to fit our model. We perform an extensive evaluation using real-world chemogenomics data, where we illustrate the superior performance of Polyadic Regression over the prior art. Our method achieves an increase of 0.06 and 0.1 in Spearman correlation between the predicted and the actual measurement vectors, for predicting missing polyadic data and predicting polyadic data for new drugs, respectively.
【Keywords】:
【Paper Link】 【Pages】:81-89
【Authors】: Honglei Liu ; Bian Wu
【Abstract】: Learning functional networks from spike trains is a fundamental problem with many critical applications in neuroscience. However, most existing works focus on inferring the functional network purely from observational data, which could lead to undiscovered or spurious connections. We demonstrate that by adopting experimental data with interventions applied, the accuracy of the inferred network can be significantly improved. Nevertheless, interventions in real experiments are often expensive and must be chosen with care. Hence, in this paper, we design an active learning framework to iteratively choose interventions and learn the functional network. In particular, we propose two models, the variance model and the validation model, to effectively select the most informative interventions. The variance model works best to reveal undiscovered connections, while the validation model has the advantage of eliminating spurious connections. Experimental results with both synthetic and real datasets show that when these two models are applied, we achieve substantially better accuracy than using the same amount of observational data or other baseline methods to choose interventions.
【Keywords】:
【Paper Link】 【Pages】:90-98
【Authors】: Jinghui Chen ; Saket Sathe ; Charu C. Aggarwal ; Deepak S. Turaga
【Abstract】: In this paper, we introduce autoencoder ensembles for unsupervised outlier detection. One problem with neural networks is that they are sensitive to noise and often require large data sets to work robustly, while increasing data size makes them slow. As a result, there are only a few existing works in the literature on the use of neural networks in outlier detection. This paper shows that neural networks can be very competitive with other existing methods. The basic idea is to randomly vary the connectivity architecture of the autoencoder to obtain significantly better performance. Furthermore, we combine this technique with an adaptive sampling method to make our approach more efficient and effective. Experimental results comparing the proposed approach with state-of-the-art detectors on several benchmark data sets show the accuracy of our approach.
【Keywords】:
【Paper Link】 【Pages】:99-107
【Authors】: Liang Wu ; Jundong Li ; Xia Hu ; Huan Liu
【Abstract】: The explosive use of social media in information dissemination and communication has also made it a popular platform for the spread of rumors. Rumors can be easily propagated and received by a large number of users in social media, resulting in catastrophic effects in the physical world within a very short period. It is a challenging task, if not impossible, to apply classical supervised learning methods to the early detection of rumors, since the labeling process is time-consuming and labor-intensive. Motivated by the fact that abundant label information of historical rumors is publicly available, in this paper, we propose to investigate whether knowledge learned from historical data could potentially help identify newly emerging rumors. In particular, since a disputed factual claim arouses certain reactions such as curiosity, skepticism, and astonishment, we identify and utilize patterns from prior labeled data to help reveal emergent rumors. Experimental results on real-world data sets demonstrate the effectiveness of the proposed approach. Further experiments are conducted to show how much earlier it can detect an emerging rumor compared with traditional approaches.
【Keywords】:
【Paper Link】 【Pages】:108-116
【Authors】: Daniel Y. T. Chino ; Alceu Ferraz Costa ; Agma J. M. Traina ; Christos Faloutsos
【Abstract】: Is it possible to spot review fraud and spamming on social media and online stores? In this paper we analyze the joint distribution of the inter-arrival times and volume of events such as comments and online reviews, and show that it is possible to accurately rank and detect suspicious users such as spammers, bots and fraudsters. We propose VolTime, a generative model that fits the inter-arrival time (IAT) distribution of real users well. Thus, VolTime automatically spots and ranks suspicious users. Experiments on several real datasets, ranging from Reddit comments and phone calls to Flipkart product reviews, show that VolTime is able to accurately fit the activity volume and IAT of real data. Additionally, we show that VolTime ranks suspicious users with a precision higher than 90% for a sensitivity of 70%.
【Keywords】:
【Paper Link】 【Pages】:117-125
【Authors】: Houping Xiao ; Jing Gao ; Long H. Vu ; Deepak S. Turaga
【Abstract】: The detection of malicious behavior, that is, judging whether a host/domain is malicious or benign (i.e., negative or positive labels), is complicated by the issue of imbalanced label distributions, as well as the limited amount of ground truth available to train supervised models or build rules. To tackle these challenges, we propose a novel framework to learn cost-sensitive models on both network hosts and external domains simultaneously, based on a bipartite connectivity graph constructed between them. We also explicitly incorporate behavioral features of the hosts computed from the network data, as well as lexical and reputational features computed for the external domains, into the proposed framework. Specifically, we model the predicted labels, measure the misclassification errors by the Hamming distance between the predicted and true labels, incorporate different costs for different misclassification types (i.e., false negative or false positive), and constrain connected nodes to share the same labels with high probability. The proposed framework is then formulated as an optimization problem, which minimizes the total cost, that is, the misclassification costs multiplied by the misclassification errors. As the Hamming distance function is non-differentiable, we introduce a continuous loss function to approximate it with a performance guarantee. We develop an effective algorithm with good convergence properties via the stochastic gradient descent technique. Experimental results on both synthetic data and a real network dataset collected from an enterprise demonstrate the effectiveness of the proposed framework.
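As a rough illustration of the kind of relaxation described in the abstract, the sketch below replaces the non-differentiable Hamming (0-1) error with a smooth, cost-weighted logistic surrogate and adds a quadratic agreement term over edges of the host-domain bipartite graph. The cost values and weighting are assumptions for illustration; this is not the paper's exact objective.

```python
# Minimal sketch (a simplification, not the paper's loss): a smooth stand-in for the
# cost-weighted Hamming error, plus a term that encourages the two endpoints of an
# edge in the host-domain bipartite graph to receive similar scores.
import numpy as np

def surrogate_loss(scores, y, edges, c_fn=5.0, c_fp=1.0, lam=0.1):
    """scores: real-valued predictions per node; y: labels in {-1, +1};
    edges: list of (i, j) node pairs from the bipartite connectivity graph."""
    margins = y * scores
    costs = np.where(y > 0, c_fn, c_fp)                    # missing a malicious node costs more
    misclass = np.sum(costs * np.log1p(np.exp(-margins)))  # smooth surrogate for the 0-1 error
    smooth = sum((scores[i] - scores[j]) ** 2 for i, j in edges)  # connected nodes should agree
    return misclass + lam * smooth

y = np.array([1, -1, 1, -1])
scores = np.array([0.8, -0.3, -0.2, -0.9])
print(surrogate_loss(scores, y, edges=[(0, 2), (1, 3)]))
```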
【Keywords】:
【Paper Link】 【Pages】:126-134
【Authors】: Roel Bertens ; Jilles Vreeken ; Arno Siebes
【Abstract】: Our world is filled with both beautiful and brainy people, but how often does a Nobel Prize winner also win a beauty pageant? Let us assume that someone who is both very beautiful and very smart is rarer than what we would expect from the combination of the number of beautiful and brainy people. Of course there will still always be some individuals who defy this stereotype; these beautiful brainy people are exactly the class of anomaly we focus on in this paper. They do not possess intrinsically rare qualities; it is the unexpected combination of factors that makes them stand out. In this paper we define the above-described class of anomaly and propose a method to quickly identify them in transaction data. Further, as we take a pattern-set-based approach, our method readily explains why a transaction is anomalous. The effectiveness of our method is thoroughly verified with a wide range of experiments on both real-world and synthetic data.
【Keywords】:
【Paper Link】 【Pages】:135-143
【Authors】: Heidar Davoudi ; Morteza Zihayat ; Aijun An
【Abstract】: User acquisition is one of the most challenging problems for online news providers. In fact, due to the availability of different news media, users have many choices in selecting their news source. To date, most digital news portals have tried to approach the solution indirectly by targeting user satisfaction through recommendation systems. In contrast, we address the problem directly by identifying valuable visitors who are likely potential subscribers in the future. First, we suggest that the decision to subscribe is not a sudden, instantaneous action, but an informed decision based on positive experience with the digital medium. As such, we propose effective engagement measures and show that they are effective in building a predictive model for subscription. We design a model that not only predicts potential subscribers but also answers queries about the subscription occurrence time. The proposed model can be used to predict the subscription time and to accurately recommend “potential users” to the current marketing campaign. We evaluate the proposed model using a real dataset from The Globe and Mail, a major newspaper in Canada. The experimental results show that the proposed model significantly outperforms traditional state-of-the-art approaches.
【Keywords】:
【Paper Link】 【Pages】:144-152
【Authors】: Dhivya Eswaran ; Stephan Günnemann ; Christos Faloutsos
【Abstract】: Given a friendship network, how certain are we that Smith is a progressive (vs. conservative)? How can we propagate these certainties through the network? While belief propagation marked the beginning of principled label propagation to classify nodes in a graph, its numerous variants proposed in the literature fail to take uncertainty into account during the propagation process. As we show, this limitation leads to counter-intuitive results for even simple graphs. Motivated by these observations, we formalize axioms that any node classification algorithm should obey and propose NetConf, which satisfies these axioms and handles arbitrary network effects (homophily/heterophily) at scale. Our contributions are: (1) Axioms: We state axioms that any node classification algorithm should satisfy; (2) Theory: NetConf is grounded in a Bayesian-theoretic framework to model uncertainties, has a closed-form solution and comes with precise convergence guarantees; (3) Practice: Our method is easy to implement and scales linearly with the number of edges in the graph. In experiments on real-world data, we always match or outperform BP while taking less processing time.
【Keywords】:
【Paper Link】 【Pages】:153-161
【Authors】: Yexun Zhang ; Yanfeng Wang ; Wenbin Cai ; Siyuan Zhou ; Ya Zhang
【Abstract】: In many classification tasks, the data distribution is imbalanced and different misclassifications involve different costs. In addition, the collected data often lack labels, and it is expensive and tedious to label them manually. Motivated by these two problems, we propose a novel active cost-sensitive classification algorithm based on the Expected Error Reduction (EER) framework, aiming to selectively label examples that can directly optimize the expected misclassification costs. However, the native EER (N-EER) framework is inefficient and impractical due to the considerable requirement for model retraining. In this paper, we propose an efficient EER (E-EER) to overcome the inefficiency of N-EER in the context of cost-sensitive classification, which is realized by incorporating the cost information into the expected loss calculation. We first present a formal formulation for EER, and then derive the active cost-sensitive classification algorithm. In order to achieve E-EER, we derive efficient model update rules for logistic regression (LR) and cost-sensitive support vector machines (C-SVM), which are employed as the base learners, to avoid model retraining. Furthermore, we theoretically analyze the error bound of our algorithm to provide a guarantee for its generalization performance. Extensive experiments demonstrate the effectiveness and efficiency of our method.
【Keywords】:
【Paper Link】 【Pages】:162-170
【Authors】: Michael T. Lash ; Qihang Lin ; W. Nick Street ; Jennifer G. Robinson ; Jeffrey W. Ohlmann
【Abstract】: Inverse classification is the process of perturbing an instance in a meaningful way such that it is more likely to conform to a specific class. Historical methods that address such a problem are often framed to leverage only a single classifier, or specific set of classifiers. These works are often accompanied by naive assumptions. In this work we propose generalized inverse classification (GIC), which avoids restricting the classification model that can be used. We incorporate this formulation into a refined framework in which GIC takes place. Under this framework, GIC operates on features that are immediately actionable. Each change incurs an individual cost, either linear or non-linear. Such changes are subjected to occur within a specified level of cumulative change (budget). Furthermore, our framework incorporates the estimation of features that change as a consequence of direct actions taken (indirectly changeable features). To solve such a problem, we propose three real-valued heuristic-based methods and two sensitivity analysis-based comparison methods, each of which is evaluated on two freely available real-world datasets. Our results demonstrate the validity and benefits of our formulation, framework, and methods.
【Keywords】:
【Paper Link】 【Pages】:171-179
【Authors】: Xiaowei Jia ; Ankush Khandelwal ; Guruprasad Nayak ; James Gerber ; Kimberly Carlson ; Paul C. West ; Vipin Kumar
【Abstract】: Successful land cover prediction can provide promising insights in applications where manual labeling is extremely difficult. However, traditional machine learning models are plagued by temporal variation and noisy features when directly applied to land cover prediction. Moreover, these models cannot take full advantage of the spatio-temporal relationships involved in land cover transitions. In this paper, we propose a novel spatio-temporal framework to discover the transitions among land covers and at the same time conduct classification at each time step. Based on the proposed model, we incrementally update the model parameters in the prediction process, so as to mitigate the impact of temporal variation. Our experiments in two challenging land cover applications demonstrate the superiority of the proposed method over multiple baselines. In addition, we show the efficacy of spatio-temporal transition modeling and incremental learning through extensive analysis.
【Keywords】:
【Paper Link】 【Pages】:180-188
【Authors】: Xinyue Liu ; Xiangnan Kong ; Ann B. Ragin
【Abstract】: The analysis of brain imaging data has attracted much attention recently. A popular analysis is to discover a network representation of the brain from neuroimaging data, where each node denotes a brain region and each edge represents a functional association or structural connection between two brain regions. Motivated by the multi-subject and multi-collection settings in neuroimaging studies, in this paper, we consider brain network discovery under two novel settings: 1) unified setting: given a collection of subjects, discover a single network that is good for all subjects; 2) contrasting setting: given two collections of subjects, discover a single network that best discriminates the two collections. We show that the existing formulation of graphical Lasso (GLasso) cannot address the above problems properly. Two novel models, UGLasso (Unified Graphical Lasso) and CGLasso (Contrasting Graphical Lasso), are proposed to address these two problems respectively. We evaluate our methods on synthetic data and two real-world functional magnetic resonance imaging (fMRI) datasets. Empirical results demonstrate the effectiveness of the proposed methods.
【Keywords】:
【Paper Link】 【Pages】:189-197
【Authors】: Bokai Cao ; Lifang He ; Xiaokai Wei ; Mengqi Xing ; Philip S. Yu ; Heide Klumpp ; Alex D. Leow
【Abstract】: Brain network embedding is the process of converting brain network data into discriminative representations of subjects, so that patients with brain disorders and normal controls can be easily separated. Computer-aided diagnosis based on such representations is potentially transformative for investigating disease mechanisms and for informing therapeutic interventions. However, existing methods either limit themselves to extracting graph-theoretical measures and subgraph patterns, or fail to incorporate brain network properties and domain knowledge in medical science. In this paper, we propose t-BNE, a novel Brain Network Embedding model based on constrained tensor factorization. t-BNE incorporates 1) the symmetric property of brain networks, 2) side information guidance to obtain representations consistent with auxiliary measures, 3) an orthogonal constraint to make the latent factors distinct from each other, and 4) a classifier learning procedure to introduce supervision from labeled data. The Alternating Direction Method of Multipliers (ADMM) framework is utilized to solve the optimization objective. We evaluate t-BNE on three EEG brain network datasets. Experimental results illustrate the superior performance of the proposed model on graph classification tasks, with significant improvements of 20.51%, 6.38% and 12.85%, respectively. Furthermore, the derived factors are visualized, which could be informative for investigating disease mechanisms under different emotion regulation tasks.
【Keywords】:
【Paper Link】 【Pages】:198-206
【Authors】: Chao Che ; Cao Xiao ; Jian Liang ; Bo Jin ; Jiayu Zhou ; Fei Wang
【Abstract】: Parkinson's disease (PD) is a chronic disease that develops over years and varies dramatically in its clinical manifestations. A preferred strategy to resolve this heterogeneity and thus enable better prognosis and targeted therapies is to segment out more homogeneous patient sub-populations. However, it is challenging to evaluate the clinical similarities among patients because of the longitudinality and temporality of their records. To address this issue, we propose a deep model that directly learns patient similarity from longitudinal and multi-modal patient records with a Recurrent Neural Network (RNN) architecture, which computes the similarity between two longitudinal patient record sequences by dynamically matching temporal patterns in patient sequences. Evaluations on real-world patient records demonstrate the promising utility and efficacy of the proposed architecture in personalized predictions.
【Keywords】:
【Paper Link】 【Pages】:207-215
【Authors】: Yale Chang ; Junxiang Chen ; Michael H. Cho ; Peter J. Castaldi ; Edwin K. Silverman ; Jennifer G. Dy
【Abstract】: Clustering is a challenging problem because a given data set can be grouped in multiple different ways. Which of these clustering solutions is interesting depends on the domain application. Thus, incorporating domain expert input often improves clustering performance. However, most existing semi-supervised clustering techniques can only incorporate instance-level constraints (a few labels or must-link/cannot-link constraints), which domain experts may not be comfortable providing in knowledge discovery problems because the categories are not known. Fortunately, domain experts often have an idea regarding the properties that clustering solutions should have in order to be useful in the domain application, based on domain-relevant scores. In this paper, we provide a framework for jointly optimizing the usefulness and quality of a clustering solution. Experiments on synthetic data, benchmark data, and a real-world disease subtyping problem demonstrate the usefulness of our proposed approach.
【Keywords】:
【Paper Link】 【Pages】:216-227
【Authors】: Xiao Fu ; Kejun Huang ; Otilia Stretcu ; Hyun Ah Song ; Evangelos E. Papalexakis ; Partha P. Talukdar ; Tom M. Mitchell ; Nicholas D. Sidiropoulos ; Christos Faloutsos ; Barnabás Póczos
【Abstract】: How close can we zoom in to observe brain activity? Our understanding is limited by the resolution of imaging modalities that exhibit good spatial but poor temporal resolution, or vice-versa. In this paper, we propose BrainZoom, an efficient imaging algorithm that cross-leverages multi-modal brain signals. BrainZoom (a) constructs high resolution brain images from multi-modal signals, (b) is scalable, and (c) is flexible in that it can easily incorporate various priors on the brain activities, such as sparsity, low rank, or smoothness. We carefully formulate the problem to tackle nonlinearity in the measurements (via variable splitting) and auto-scale between different modal signals, and judiciously design an inexact alternating optimization-based algorithmic framework to handle the problem with provable convergence guarantees. Our experiments using a popular realistic brain signal simulator to generate fMRI and MEG demonstrate that high spatio-temporal resolution brain imaging is possible from these two modalities. The experiments also suggest that smoothness seems to be the best prior, among several we tried.
【Keywords】:
【Paper Link】 【Pages】:228-236
【Authors】: Amit Dhurandhar ; Margareta Ackerman ; Xiang Wang
【Abstract】: Clustering is a widely-used data mining tool, which aims to discover partitions of similar items in data. We introduce a new clustering paradigm, accordant clustering, which enables the discovery of (predefined) group level insights. Unlike previous clustering paradigms that aim to understand relationships amongst the individual members, the goal of accordant clustering is to uncover insights at the group level through the analysis of their members. Group level insight can often support a call to action that cannot be informed through previous clustering techniques. We propose the first accordant clustering algorithm, and prove that it finds near-optimal solutions when data possesses inherent cluster structure. The insights revealed by accordant clusterings enabled experts in the field of medicine to isolate successful treatments for a neurodegenerative disease, and those in finance to discover patterns of unnecessary spending.
【Keywords】:
【Paper Link】 【Pages】:237-245
【Authors】: Anni Coden ; Marina Danilevsky ; Daniel Gruhl ; Linda Kato ; Meena Nagarajan
【Abstract】: Data analysis tasks often require grouping of information to identify trends and associations. However, as the number of elements rises to the hundreds and thousands the cost of having a person perform the groupings unassisted quickly becomes prohibitive. Previous approaches have combined traditional clustering techniques with manual interaction steps, yielding human-in-the-loop clustering algorithms that incorporate user feedback by reweighting features or adjusting a similarity function. But in the real world, many grouping tasks lack both a feature set and a well-defined (dis)similarity metric, having only a subject matter expert with an implicit understanding of the correct relationships between elements based on the domain and the task at hand. We present a refine-and-lock clustering interaction model and demonstrate its effectiveness for cognitive-assisted human clustering over other interaction models such as split/merge and must-link/can't-link. Our approach offers effective automatic clustering assistance even in the absence of clear features or a definitive similarity metric; ensures that every cluster has final user approval; and exhibits at least a 3.94× improvement over other interactive clustering approaches in time to completion.
【Keywords】:
【Paper Link】 【Pages】:246-254
【Authors】: Aghiles Salah ; Mohamed Nadif
【Abstract】: Co-clustering has proven effective for dealing with high dimensional sparse data, such as the document-term matrices encountered in text mining. Apart from being high dimensional and sparse, the data sets from the aforementioned domain are also directional in nature. Most existing co-clustering approaches are, however, based on popular modelling assumptions, such as Gaussian or Multinomial, which are inadequate for directional data. Moreover, it is well known that, due to high dimensionality and sparsity, co-clustering approaches, like one-sided clustering methods, tend to generate highly skewed solutions with very unbalanced or even empty clusters, especially when the number of required clusters is large. In this paper, we rely on the recently proposed block von Mises-Fisher mixture model (dbmovMFs), which constitutes a general framework for co-clustering directional data distributed on the surface of a unit hypersphere, i.e., L2-normalized data. In order to overcome the above difficulties, we propose to modify dbmovMFs in a principled way by introducing a conscience mechanism which discourages bad local solutions having empty or very small/large clusters. This gives rise to a new scalable co-clustering algorithm which is guaranteed to monotonically increase a spherical k-means-like criterion by intertwining row and column clusterings at each step. Moreover, empirical results on several real-world datasets provide strong support for the effectiveness of the proposed approach.
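A simplified analogue of the conscience idea can be shown with a spherical k-means assignment step in which clusters that have already won many points are handicapped, discouraging empty or overly large clusters. This sketch only illustrates the mechanism on one-sided clustering with assumed toy data; it is not the dbmovMFs co-clustering algorithm.

```python
# Illustration of a "conscience" (frequency-sensitive) assignment step in spherical
# k-means on L2-normalized rows: frequently winning clusters receive a handicap.
import numpy as np

def conscience_spherical_kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm rows
    centroids = X[rng.choice(len(X), k, replace=False)]
    counts = np.ones(k)                                    # how often each cluster has won
    for _ in range(n_iter):
        sims = X @ centroids.T                             # cosine similarities
        handicap = counts / counts.sum()                   # conscience: penalize frequent winners
        labels = np.argmax(sims - handicap, axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                m = members.sum(axis=0)
                centroids[c] = m / np.linalg.norm(m)
            counts[c] += (labels == c).sum()
    return labels, centroids

X = np.abs(np.random.default_rng(1).normal(size=(200, 50)))  # toy non-negative data
labels, _ = conscience_spherical_kmeans(X, k=10)
print(np.bincount(labels, minlength=10))                      # cluster sizes stay balanced
```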
【Keywords】:
【Paper Link】 【Pages】:255-263
【Authors】: Natali Ruchansky ; Mark Crovella ; Evimaria Terzi
【Abstract】: Matrix completion is a problem that arises in many data-analysis settings where the input consists of a partially-observed matrix (e.g., recommender systems, traffic matrix analysis). Classical approaches to matrix completion assume that the input partially-observed matrix is low rank. The success of these methods depends on the number of observed entries and the rank of the matrix; the larger the rank, the more entries need to be observed in order to accurately complete the matrix. In this paper, we deal with matrices that are not necessarily low rank themselves, but rather contain low-rank submatrices. We propose Targeted, a general framework for completing such matrices. In this framework, we first extract the low-rank submatrices and then apply a matrix-completion algorithm to these low-rank submatrices as well as to the remainder matrix separately. Although for the completion itself we use state-of-the-art completion methods, our results demonstrate that Targeted achieves significantly smaller reconstruction errors than other classical matrix-completion methods. One of the key technical contributions of the paper lies in the identification of the low-rank submatrices from the input partially-observed matrices.
【Keywords】:
【Paper Link】 【Pages】:264-272
【Authors】: Charalampos Mavroforakis ; Dóra Erdös ; Mark Crovella ; Evimaria Terzi
【Abstract】: In many applications, e.g., recommender systems and biological data analysis, the datasets of interest are positive definite (PD) matrices. Such matrices are usually similarity matrices, obtained by the multiplication of a matrix of preferences or observations with its transpose. Oftentimes, such real-world matrices are missing many entries and a fundamental data-analysis task, known by the term PD-matrix completion, is the inference of these missing entries. In this paper, we introduce the active version of PD-matrix completion, in which we assume access to an oracle that, at a given cost, returns the value of an unobserved entry of the PD matrix. In this setting, we consider the following question: “given a fixed budget, which entries should we query so that the completion of the new matrix is much more indicative of the underlying data?”. The main contribution of the paper is the formalization of the above question as the ActivePDCompletion problem and the design of novel and effective algorithms for solving it in practice.
【Keywords】:
【Paper Link】 【Pages】:273-281
【Authors】: Christian Böhm ; Martin Perdacher ; Claudia Plant
【Abstract】: Today's microprocessors consist of multiple cores, each of which can perform multiple additions, multiplications, or other operations simultaneously in one clock cycle. To maximize performance, two types of parallelism must be applied in a data mining algorithm: MIMD (Multiple Instruction Multiple Data), where different CPU cores execute different code and follow different threads of control, and SIMD (Single Instruction Multiple Data), where within a core, the same operation is executed at once on various data. It is commonly agreed among data mining practitioners and researchers that disproportionately few works consider the performance potential of today's popular micro-architectures. In this paper, we consider the widespread clustering algorithm K-means as a highly relevant use-case for knowledge discovery on big data. We propose Multi-core K-Means (MKM), a completely re-engineered clustering algorithm which applies MIMD and SIMD parallelism. MKM uses a sophisticated strategy for the access of data vectors and cluster representatives to minimize data transfer between main memory, cache, and registers. For SIMD parallelism it is also essential to avoid branching operations like if-then: we propose to code cluster IDs and distances in joint variables to perform the argmin operation SIMD-parallel and without any branching. Our experiments demonstrate a speed-up which is almost linear in the number of cores. On a pair of shared-memory quad-core processors, MKM is between 95 and 140 times faster than non-parallel K-means, 4–6 times faster than auto-vectorized fully parallel standard K-means, and 2.1 times faster than K-means based on BLAS.
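The branch-free argmin idea mentioned in the abstract can be illustrated outside of SIMD intrinsics: pack each (distance, cluster ID) pair into a single 64-bit key so that one plain min reduction recovers both the smallest distance and its ID. The sketch below mimics that principle in NumPy under the assumption of non-negative float32 distances; the paper's MKM applies it at the SIMD-register level in native code.

```python
# Pack (distance, cluster id) into one 64-bit key; a single branch-free min over the
# keys then yields both the smallest distance and its cluster id. For non-negative
# IEEE-754 floats the bit pattern is monotone, so integer comparison preserves order.
import numpy as np

def packed_argmin(distances):
    """distances: (n_points, k) array of non-negative float32 distances."""
    bits = distances.astype(np.float32).view(np.uint32).astype(np.uint64)
    ids = np.arange(distances.shape[1], dtype=np.uint64)
    keys = (bits << np.uint64(32)) | ids                 # high bits: distance, low bits: id
    best = keys.min(axis=1)                              # one branch-free reduction per point
    best_id = (best & np.uint64(0xFFFFFFFF)).astype(np.int64)
    best_dist = (best >> np.uint64(32)).astype(np.uint32).view(np.float32)
    return best_id, best_dist

d = np.random.default_rng(0).random((5, 4)).astype(np.float32)
ids, dist = packed_argmin(d)
assert np.array_equal(ids, d.argmin(axis=1))             # matches the ordinary argmin
print(ids, dist)
```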
【Keywords】:
【Paper Link】 【Pages】:282-290
【Authors】: Chang Wei Tan ; Geoffrey I. Webb ; François Petitjean
【Abstract】: Time series classification maps time series to labels. The nearest neighbour algorithm (NN) using the Dynamic Time Warping (DTW) similarity measure is a leading algorithm for this task. NN compares each time series to be classified to every time series in the training database. With a training database of N time series of length L, each classification requires O(N · L²) computations. The databases used in almost all prior research have been relatively small (with less than 10,000 samples) and much of the research has focused on making DTW's complexity linear with L, leading to a runtime complexity of O(N · L). As we demonstrate with an example in remote sensing, real-world time series databases are now reaching the million-to-billion scale. This wealth of training data brings the promise of higher accuracy, but raises a significant challenge because N is becoming the limiting factor. As DTW is not a metric, indexing objects in the space it induces is extremely challenging. We tackle this task in this paper. We develop TSI, a novel algorithm for Time Series Indexing which combines a hierarchy of K-means clustering with DTW-based lower-bounding. We show that, on large databases, TSI makes it possible to classify time series orders of magnitude faster than the state of the art.
【Keywords】:
【Paper Link】 【Pages】:291-299
【Authors】: Vishal Kakkar ; Shirish K. Shevade ; S. Sundararajan ; Dinesh Garg
【Abstract】: AUC (Area under the ROC curve) is an important performance measure for applications where the data is highly imbalanced. Efficient AUC optimization is a challenging research problem as the objective function is non-decomposable and non-continuous. Using a max-margin based surrogate loss function, the AUC optimization problem can be approximated as a pairwise RankSVM learning problem. Batch learning algorithms for solving the kernelized version of this problem suffer from scalability issues. Therefore, recent years have witnessed an increased interest in the development of online or single-pass algorithms that design a nonlinear classifier by maximizing the AUC performance. However, on many real-world datasets, the AUC performance of these classifiers was observed to be inferior to that of classifiers designed using batch learning algorithms. Further, many practical imbalanced data classification problems demand fast inference, which underlines the need for designing sparse nonlinear classifiers. Motivated by these observations, we design a scalable algorithm for maximizing the AUC performance by greedily adding the required number of basis functions into the classifier model. The resulting sparse classifier performs faster inference and its AUC performance is comparable with that of the classifier designed in batch mode. Our experimental results show that the achievable level of sparsity can be an order of magnitude larger than that achieved by the Kernel RankSVM model without significantly affecting the AUC performance.
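A toy sketch of the greedy flavour described in the abstract: candidate basis functions are RBF features centred at training points, and at each step the candidate whose inclusion most improves training AUC under a simple least-squares fit is added. The data, kernel width, and fitting procedure are all illustrative assumptions, not the authors' algorithm.

```python
# Greedy forward selection of RBF basis functions, scored by training AUC.
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 5
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.3 * rng.normal(size=n) > 1.2).astype(int)  # ~25% positives

def auc(scores, y):
    # Mann-Whitney formulation of the AUC (no tie handling; fine for continuous scores)
    order = np.argsort(scores)
    ranks = np.empty(n); ranks[order] = np.arange(1, n + 1)
    pos = y == 1
    return (ranks[pos].sum() - pos.sum() * (pos.sum() + 1) / 2) / (pos.sum() * (~pos).sum())

def rbf(center, gamma=0.5):
    return np.exp(-gamma * ((X - center) ** 2).sum(axis=1))

candidates = [rbf(X[i]) for i in rng.choice(n, 60, replace=False)]
selected, features = [], []
for _ in range(10):                                   # add 10 basis functions greedily
    best = None
    for j, col in enumerate(candidates):
        if j in selected:
            continue
        F = np.column_stack(features + [col])
        w, *_ = np.linalg.lstsq(F, y, rcond=None)     # simple least-squares score fit
        score = auc(F @ w, y)
        if best is None or score > best[0]:
            best = (score, j, col)
    selected.append(best[1]); features.append(best[2])
    print(f"bases={len(selected):2d}  train AUC={best[0]:.3f}")
```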
【Keywords】:
【Paper Link】 【Pages】:300-308
【Authors】: Ryan R. Curtin
【Abstract】: k-means is a widely used clustering algorithm, but for k clusters and a dataset size of N, each iteration of Lloyd's algorithm costs O(kN) time. This is problematic because increasingly, applications of k-means involve both large N and large k, and there are no accelerated variants that handle this situation. To this end, we propose a dual-tree algorithm that gives the exact same results as standard k-means; when using cover trees, we bound the single-iteration runtime of the algorithm as O(N + k log k), under some assumptions. To our knowledge these are the first sub-O(kN) bounds for exact Lloyd iterations. The algorithm performs competitively in practice, especially for large N and k in low dimensions. Further, the algorithm is tree-independent, so any type of tree may be used.
【Keywords】:
【Paper Link】 【Pages】:309-317
【Authors】: Wilhelmiina Hämäläinen ; Geoffrey I. Webb
【Abstract】: We present theoretical analysis and a suite of tests and procedures for addressing a broad class of redundant and misleading association rules that we call specious rules. Specious dependencies, also known as spurious, apparent, or illusory associations, refer to a well-known phenomenon where marginal dependencies are merely products of interactions with other variables and disappear when conditioned on those variables. The most extreme example is Yule-Simpson's paradox, where two variables present positive dependence in the marginal contingency table but negative dependence in all partial tables defined by different levels of a confounding factor. It is accepted wisdom that in data of any nontrivial dimensionality it is infeasible to control for all of the exponentially many possible confounds of this nature. In this paper, we consider the problem of specious dependencies in the context of statistical association rule mining. We define specious rules and show that they offer a unifying framework which covers many types of previously proposed redundant or misleading association rules. After theoretical analysis, we introduce practical algorithms for detecting and pruning specious association rules efficiently under many key goodness measures, including mutual information and exact hypergeometric probabilities. We demonstrate that the procedure greatly reduces the number of associations discovered, providing an elegant and effective solution to the problem of association mining discovering large numbers of misleading and redundant rules.
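A small worked example of the reversal the abstract describes (Yule-Simpson): two variables look positively associated in the marginal table yet are negatively associated within every stratum of a confounder. The counts below are the classical kidney-stone numbers, used purely as a convenient numerical illustration.

```python
# counts[z][x] = (number of cases with Y=1, total cases) for X = x in stratum z
counts = {
    "Z=a": {"X=1": (234, 270), "X=0": (81, 87)},
    "Z=b": {"X=1": (55, 80),   "X=0": (192, 263)},
}

def rate(pairs):
    hits, total = map(sum, zip(*pairs))
    return hits / total

for z, tbl in counts.items():
    print(z, " P(Y|X=1) =", round(tbl["X=1"][0] / tbl["X=1"][1], 3),
             " P(Y|X=0) =", round(tbl["X=0"][0] / tbl["X=0"][1], 3))

print("marginal  P(Y|X=1) =", round(rate([t["X=1"] for t in counts.values()]), 3),
      " P(Y|X=0) =", round(rate([t["X=0"] for t in counts.values()]), 3))
# Each stratum shows a negative effect of X on Y, yet the marginal table shows a
# positive one, so a rule "X -> Y" mined from marginal counts would be specious.
```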
【Keywords】:
【Paper Link】 【Pages】:318-326
【Authors】: Yao Zhang ; Bijaya Adhikari ; Steve T. K. Jan ; B. Aditya Prakash
【Abstract】: Given a social network, how can we find communities of nodes based on their diffusive characteristics? There exist two important types of nodes for information propagation: nodes that are influential (“kernel nodes”), and nodes that serve as “bridges” to boost the diffusion (“media nodes”). How can we find these nodes and uncover connections between them? In addition, it is also important to discover the hidden community structure of these nodes, which can help study their interactions, predict links and also understand the information flow in such networks. In this paper, we give an intuitive and novel optimization-based formulation for this task, which aims to discover media nodes as well as community structures of kernel nodes. We prove that our task is computationally challenging, and develop an effective and practical algorithm MeiKe (pronounced as ‘Mike’). It first obtains media nodes via a new successive-summarization-based approach, and then finds kernel nodes including their community structures. Experimental results show that MeiKe finds high-quality media and kernel communities which match our expectations and ground truth (outperforming non-trivial baselines by 40% in F1-score). Our case studies also demonstrate the applicability of MeiKe on a variety of datasets.
【Keywords】:
【Paper Link】 【Pages】:327-335
【Authors】: Suhang Wang ; Jiliang Tang ; Charu C. Aggarwal ; Yi Chang ; Huan Liu
【Abstract】: Network embedding aims to learn low-dimensional vector representations for the nodes of a given social network, facilitating many tasks in social network analysis such as link prediction. The vast majority of existing embedding algorithms are designed for unsigned social networks, i.e., social networks with only positive links. However, networks in social media can have both positive and negative links, and little work exists for signed social networks. From recent findings of signed network analysis, it is evident that negative links have distinct properties and added value beyond positive links, which brings about both challenges and opportunities for signed network embedding. In this paper, we propose a deep learning framework, SiNE, for signed network embedding. The framework optimizes an objective function guided by social theories that provide a fundamental understanding of signed social networks. Experimental results on two real-world social media datasets demonstrate the effectiveness of the proposed framework SiNE.
【Keywords】:
【Paper Link】 【Pages】:336-344
【Authors】: Esther Galbrun ; Behzad Golshan ; Aristides Gionis ; Evimaria Terzi
【Abstract】: Motivated by applications that arise in online social media and collaboration networks, there has been a lot of work on community search. In this class of problems, the goal is to find a subgraph that satisfies a certain connectivity requirement and contains a given collection of seed nodes. In this paper, we extend the community-search problem by associating each individual with a profile. The profile is a numeric score that quantifies the position of an individual with respect to a topic. We adopt a model where each individual starts with a latent profile and arrives at a conformed profile through a dynamic conformation process, which takes into account the individual's social interactions and the tendency to conform with one's social environment. In this framework, social tension arises from the differences between the conformed profiles of neighboring individuals as well as from the differences between individuals' conformed and latent profiles. Given a network of individuals, their latent profiles and this conformation process, we extend the community-search problem by requiring the output subgraphs to have low social tension. From the technical point of view, we study the complexity of this problem and propose algorithms for solving it effectively. Our experimental evaluation on a number of social networks reveals the efficacy and efficiency of our methods.
【Keywords】:
【Paper Link】 【Pages】:345-353
【Authors】: Marc Mitri ; Fragkiskos D. Malliaros ; Michalis Vazirgiannis
【Abstract】: Community detection constitutes an important task for investigating the internal structure of networks, with a plethora of applications in a wide range of disciplines. A particularly important point, which is rarely taken into account while developing community detection algorithms, is their sensitivity (or stability) to network uncertainty. In many cases, the input graph data is incomplete or noisy (e.g., due to noise introduced during the collection of the data or for privacy-preserving reasons). Then, the following question arises: how stable are the results produced by an algorithm with respect to the uncertainty (i.e., noise level) of the input data? In this paper, we propose a quantitative way to address this problem. We consider several graph perturbation models to introduce uncertainty into the graph. Then, we examine the sensitivity of an algorithm with respect to functional and structural characteristics of the detected communities under various perturbation levels. We study the performance of some of the most widely used community detection algorithms in practice, and our experimental results indicate that random-walk-based community detection algorithms tend to be robust under various conditions of network uncertainty.
【Keywords】:
【Paper Link】 【Pages】:354-362
【Authors】: Xuan-Hong Dang ; Hongyuan You ; Ambuj K. Singh ; Scott T. Grafton
【Abstract】: In many real-world applications, data is represented in the form of networks with structures and attributes changing over time. The dynamic changes not only happen at nodes/edges, forming local subnetwork processes, but also eventually influence global states of networks. The need to understand what these local network processes are, how they evolve, and how they consequently govern the progression of global network states has become increasingly important. In this paper, we explore these questions and develop a novel algorithm for mining a succinct set of subnetworks that are predictive and evolve along with the progression of global network states. Our algorithm is designed in the framework of logistic regression that fits a model for multiple states of network samples. Its objective function considers both the spatial network topology and the smooth temporal transition between adjacent global network states, and we show that its global optimum solution can be achieved via steepest descent. Extensive experimental analysis on both synthetic and real-world datasets demonstrates the effectiveness of our algorithm against competing methods, not only in prediction accuracy but also in terms of the domain relevance of the discovered subnetworks.
【Keywords】:
【Paper Link】 【Pages】:363-371
【Authors】: Hsiang-Fu Yu ; Mikhail Bilenko ; Chih-Jen Lin
【Abstract】: Many recommender systems have only implicit user feedback. The two possible ratings are positive and negative, but only part of the positive entries are observed. One-class matrix factorization (MF) is a popular approach for such scenarios that treats some missing entries as negative. Two major ways to select negative entries are to sub-sample a set of similar size to the observed positive entries, or to include all missing entries as negative. They are referred to as the “subsampled” and “full” approaches in this work, respectively. Currently, detailed comparisons between these two selection schemes on large-scale data are still lacking. One important reason is that the “full” approach leads to a hard optimization problem after treating all missing entries as negative. In this paper, we successfully develop efficient optimization techniques to solve this challenging problem so that the “full” approach becomes practically viable. We then compare in detail the two approaches, “subsampled” and “full”, for selecting negative entries. Results show that the “full” approach of including many more missing entries as negative yields better results.
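The two negative-selection schemes being compared can be sketched on a toy implicit-feedback matrix: the “subsampled” scheme draws about as many negatives as there are observed positives, while the “full” scheme treats every missing entry as a down-weighted negative. The weighting convention below is a common choice and only illustrative; the paper's optimization techniques for the full problem are not reproduced here.

```python
# Constructing training data under the "subsampled" and "full" negative-selection schemes
# for a toy implicit-feedback matrix (1 = observed positive, 0 = missing).
import numpy as np

rng = np.random.default_rng(0)
R = (rng.random((6, 8)) < 0.2).astype(float)           # sparse observed positives
pos = np.argwhere(R == 1)
missing = np.argwhere(R == 0)

# "subsampled": draw roughly as many negatives as there are positives
neg_idx = missing[rng.choice(len(missing), size=len(pos), replace=False)]
subsampled_train = [(i, j, 1.0) for i, j in pos] + [(i, j, 0.0) for i, j in neg_idx]

# "full": every missing entry is a negative, but down-weighted relative to positives
alpha = 0.1                                            # assumed confidence weight for negatives
W = np.where(R == 1, 1.0, alpha)                       # per-entry weights
full_targets = R                                       # all zeros treated as negatives

print(len(subsampled_train), "training triples under the subsampled scheme")
print(full_targets.size, "training entries under the full scheme,",
      int((W == alpha).sum()), "of them negatives with weight", alpha)
```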
【Keywords】:
【Paper Link】 【Pages】:372-380
【Authors】: Ziyu Lu ; Hui Li ; Nikos Mamoulis ; David W. Cheung
【Abstract】: Location-based social networks such as Foursquare and Plancast have gained increasing popularity. On those sites, users can organize and participate in group activities; hence, recommending venues to a group is of practical importance. In this paper, we study the problem of recommending venues to groups of users and propose a Hierarchical Bayesian Model (HBGG) for this purpose. First, a generative group geographical topic model (GG) which exploits group membership, group mobility regions and group preferences is proposed. We then integrate social structure into one-class collaborative filtering as social-based collaborative filtering (SOCF) to leverage social wisdom. Through the shared latent group features, HBGG connects the group geographical model with the SOCF framework for group recommendation. Experimental results on two real datasets show that our method outperforms state-of-the-art group recommenders, especially on cold-start user groups.
【Keywords】:
【Paper Link】 【Pages】:381-389
【Authors】: Chuxu Zhang ; Lu Yu ; Yan Wang ; Chirag Shah ; Xiangliang Zhang
【Abstract】: To address the issues of data sparsity and cold-start in recommender systems, social information (e.g., user-user trust links) has been introduced to complement rating data for improving the performance of traditional model-based recommendation techniques such as matrix factorization (MF) and Bayesian personalized ranking (BPR). Although effective, the utilization of explicit user-user relationships extracted directly from such social information has three main limitations. First, it is difficult to obtain explicit and reliable social links. Only a small portion of users explicitly indicate their trusted friends in recommender systems. Second, “cold-start” users are “cold” not only on rating but also on socializing. There is no significant amount of explicit social information that can be useful for “cold-start” users. Third, an active user can be socially connected with others who have different tastes/preferences. Direct usage of explicit social links may mislead recommendation. To address these issues, we propose to extract implicit and reliable social information from user feedback and identify the top-k semantic friends for each user. We incorporate the top-k semantic friends information into the MF and BPR frameworks to solve the problems of rating prediction and item ranking, respectively. The experimental results on three real-world datasets show that our proposed approaches achieve better results than the state-of-the-art MF with explicit social links (with a 3.0% improvement in RMSE) and social BPR (with a 9.1% improvement in AUC).
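One simple way to realize the idea of implicit “semantic friends” from user feedback is to rank users by the cosine similarity of their feedback vectors and keep the top-k as an implicit social graph; the paper's exact extraction procedure may differ, so treat this only as an illustrative sketch with made-up data.

```python
# Extract top-k "semantic friends" per user from an implicit-feedback matrix via
# cosine similarity; the result can be fed to MF/BPR as implicit social side information.
import numpy as np

rng = np.random.default_rng(0)
R = (rng.random((50, 200)) < 0.05).astype(float)       # toy user x item implicit feedback

def top_k_semantic_friends(R, k=5):
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                            # guard against all-zero users
    U = R / norms
    sim = U @ U.T                                      # cosine similarity between users
    np.fill_diagonal(sim, -np.inf)                     # a user is not their own friend
    return np.argsort(-sim, axis=1)[:, :k]             # indices of the k most similar users

friends = top_k_semantic_friends(R, k=5)
print(friends[0])   # implicit friends of user 0
```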
【Keywords】:
【Paper Link】 【Pages】:390-398
【Authors】: Daniel Basaran ; Eirini Ntoutsi ; Arthur Zimek
【Abstract】: A collection of datasets crawled from Amazon, “Amazon reviews”, is popular in the evaluation of recommendation systems. These datasets, however, contain redundancies (duplicated recommendations for variants of certain items). These redundancies went unnoticed in earlier uses of the datasets and thus led, to a certain extent, to incorrect conclusions in the evaluation of algorithms tested on them. We analyze the nature and amount of these redundancies and their impact on the evaluation of recommendation methods. While the general and obvious conclusion is that redundancies should be avoided and datasets should be carefully preprocessed, we observe more specifically that their impact depends on the complexity of the methods. With this work, we also want to raise awareness of the importance of data quality, model understanding, and appropriate evaluation.
【Keywords】:
【Paper Link】 【Pages】:399-407
【Authors】: Yang Li ; Suhang Wang ; Tao Yang ; Quan Pan ; Jiliang Tang
【Abstract】: Vacation rental websites such as Airbnb have become increasingly popular; their rentals are typically short-term and travel- or vacation-related. Reasonable rental prices play a crucial role in improving user experience and engagement on these websites. However, the unique properties of such rentals pose challenges not found in traditional house rentals, which are often long-term and study- or work-related. Therefore, in this paper we investigate the novel problem of price recommendation for vacation rental websites. We identify important factors that affect rental prices and propose a framework that consists of Multi-Scale Affinity Propagation (MSAP) to cluster houses, a Nash Equilibrium filter to remove unreasonable prices, and a Linear Regression model with Normal Noise (LRNN) to predict reasonable prices. Experimental results demonstrate the effectiveness of the proposed framework. We conduct further experiments to understand the important factors in rental price recommendation.
【Keywords】:
【Paper Link】 【Pages】:408-416
【Authors】: Abhinav Mishra ; Leman Akoglu
【Abstract】: Entity ranking by importance or authority through relational information is an important problem in network science. A large body of existing work addresses the problem for homogeneous networks. With the emergence of richer networks containing various types of entities and meta-data (e.g., attributes), in which edges carry rich semantic information, it becomes essential to build models that can leverage all available data in a meaningful way. In this work, we consider the ranking problem in heterogeneous information networks (HIN) with side information. Specifically, we introduce a new model called HINside that has two key properties: (i) it explicitly represents the interactions (i.e., authority transfer rates, or ATR) between different types of nodes, and (ii) it carefully incorporates the geo-location information of the entities to account for the distance and the competition between them. Besides an intuitive local formula, our model has a matrix form for which we derive a closed-form solution. Thanks to its closed form, HINside lends itself to be used within various learning-to-rank objectives for the estimation of its parameters (the ATR), given training data. We formulate two kinds of objective functions for parameter learning with efficient estimation procedures. We validate the effectiveness of our proposed model and the learning procedures on samples from two real-world graphs, where we show the advantages of HINside over popular existing models, including PageRank and degree centrality.
【Keywords】:
【Paper Link】 【Pages】:417-425
【Authors】: Bijaya Adhikari ; Yao Zhang ; Aditya Bharadwaj ; B. Aditya Prakash
【Abstract】: Modern networks are very large in size and also evolve with time. As their size grows, the complexity of performing network analysis grows as well. Getting a smaller representation of a temporal network with similar properties will help in various data mining tasks. In this paper, we study the novel problem of getting a smaller diffusion-equivalent representation of a set of time-evolving networks. We first formulate a well-founded and general temporal-network condensation problem based on the so-called system matrix of the network. We then propose NetCondense, a scalable and effective algorithm which solves this problem using careful transformations, in sub-quadratic running time and with linear space complexity. Our extensive experiments show that we can reduce the size of large real temporal networks (from multiple domains such as social, co-authorship and email) significantly without much loss of information. We also show the wide applicability of NetCondense by leveraging it for several tasks: for example, we use it to understand, explore and visualize the original datasets and to speed up algorithms for the influence-maximization problem on temporal networks.
【Keywords】:
【Paper Link】 【Pages】:426-434
【Authors】: Aristides Gionis ; Polina Rozenshtein ; Nikolaj Tatti ; Evimaria Terzi
【Abstract】: Network sparsification aims to reduce the number of edges of a network while maintaining its structural properties; such properties include shortest paths, cuts, spectral measures, and network modularity. Sparsification has multiple applications, such as speeding up graph-mining algorithms, graph visualization, and identifying the important network edges. In this paper we consider a novel formulation of the network-sparsification problem. In addition to the network, we also consider as input a set of communities. The goal is to sparsify the network so as to preserve its structure with respect to the given communities. We introduce two variants of the community-aware sparsification problem, leading to sparsifiers that satisfy different community connectedness properties. From the technical point of view, we prove hardness results and devise effective approximation algorithms. Our experimental results on a large collection of datasets demonstrate the effectiveness of our algorithms.
【Keywords】:
【Paper Link】 【Pages】:435-443
【Authors】: Leto Peel
【Abstract】: We address the problem of semi-supervised learning in relational networks, networks in which nodes are entities and links are the relationships or interactions between them. Typically this problem is conflated with graph-based semi-supervised learning (GSSL), because both problems represent the data as a graph and predict the missing class labels of nodes. However, not all graphs are created equal. In GSSL a graph is constructed, often from independent data, based on similarity; as such, edges tend to connect instances with the same class label. Relational networks, however, can be more heterogeneous, and edges do not always indicate similarity. For instance, instead of links tending to connect nodes with the same class label, they may tend to connect nodes with different class labels (link heterogeneity). Or having the same class label might not imply the same type of connectivity across the whole network (class heterogeneity); e.g., in a network of sexual interactions we may observe links between opposite genders in some parts of the graph and links between the same gender in others. Performing classification in networks with different types of heterogeneity is a hard problem that is made harder still by the fact that we do not know a priori the type or level of heterogeneity. In this work we present two scalable approaches to graph-based semi-supervised learning for the more general case of relational networks. We demonstrate these approaches on synthetic and real-world networks that display different link patterns within and between classes. Compared to state-of-the-art baseline approaches, ours give better classification performance without prior knowledge of how classes interact. In particular, our two-step label propagation algorithm gives consistently good accuracy and precision while being highly efficient: it can classify networks of over 1.6 million nodes and 30 million edges in around 12 seconds.
【Keywords】:
【Paper Link】 【Pages】:444-452
【Authors】: Jundong Li ; Liang Wu ; Osmar R. Zaïane ; Huan Liu
【Abstract】: Relational learning exploits relationships among instances manifested in a network to improve the predictive performance of many network mining tasks. Due to its empirical success, it has been widely applied in myriad domains. In many cases, individuals in a network are highly idiosyncratic: they not only connect to each other because of a composite of factors, but are also often described by high-dimensional content information specific to each individual. For example, in social media, user interests are quite diverse and personal, so posts by different users can differ significantly. Moreover, the social content of users is often of high dimensionality, which may degrade learning performance. Therefore, it is more appealing to tailor the prediction for each individual while alleviating the issues related to the curse of dimensionality. In this paper, we study the novel problem of Personalized Relational Learning and propose a principled framework, PRL, to personalize the prediction for each individual in a network. Specifically, we perform personalized feature selection and employ a small subset of discriminative features customized for each individual, together with some common features shared by all, to build a predictive model. On this account, the proposed personalized model is more human-interpretable. Experiments on real-world datasets show the superiority of the proposed PRL framework over traditional relational learning methods.
【Keywords】:
【Paper Link】 【Pages】:453-461
【Authors】: Daniel Rugeles ; Kaiqi Zhao ; Gao Cong ; Manoranjan Dash ; Shonali Krishnaswamy
【Abstract】: Biclustering is a data mining technique that allows simultaneous clustering of two variables. A common biclustering task for categorical variables is to find “heavy” biclusters, i.e., biclusters with high co-occurrence values. Although algorithms have been proposed to extract heavy biclusters, they provide little information about the relative importance of each bicluster, as well as the importance of the variables for each bicluster. To address these problems, there have been attempts to apply mixture models using information theory or Bayesian methods. Although they are able to rank the biclusters and the variables for each bicluster, they do not target the extraction of heavy biclusters. Furthermore, these models constrain the search for biclusters in such a way that every cell in the matrix must participate in some bicluster. We attempt to alleviate these limitations using dual topic models. First, we develop a generalized LDA topic model that extracts dual topics, i.e., topics in opposite directions: row-topics and column-topics. To obtain better topics, it applies mutual reinforcement, i.e., considering column-topics while constructing row-topics, and vice versa. Heavy biclusters, i.e., relationships with high co-occurrence, are extracted using thresholds. We show that our model, Dual Topic to Biclusters (DT2B), is effective in extracting heavy biclusters by experimenting on simulated data, a text corpus (NIPS author-document) and microarray gene expression data. Results show that the biclusters extracted by DT2B are of higher quality.
【Keywords】:
【Paper Link】 【Pages】:462-470
【Authors】: Xenophon Evangelopoulos ; Austin J. Brockmeier ; Tingting Mu ; John Yannis Goulermas
【Abstract】: In this work we propose a highly scalable algorithm for solving the combinatorial data analysis problem of seriation. Seriation is a technique for optimizing a permutation of data instances, with respect to some proximity measure such that nearby instances in the linear arrangement are more similar. One consistent objective function for seriation is the 2-SUM minimization problem, which uses the 2-norm between instance locations to penalize non-zero similarity values, and can be written as a quadratic function of the permutation vector. Recently, two convex relaxations of the 2-SUM problem have been proposed, which can be solved as constrained quadratic programs using interior point methods; however, the interior point solvers become expensive when the problem size increases. In this paper we present a graduated non-convexity method for vector-based relaxations of the 2-SUM that yields better approximate solutions and scales to very large problem sizes. We conduct a number of experiments on real and synthetic datasets. The experimental results demonstrate that our proposed algorithm outperforms other approaches that solve the 2-SUM, and is the only competitive approach that can scale to large problem sizes.
【Keywords】:
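For reference, the 2-SUM objective mentioned in the abstract above can be written in the following standard form, where A is the (symmetric) pairwise similarity matrix, π the permutation vector (π_i is the position of instance i in the linear arrangement), Π_n the set of permutations of {1,…,n}, and L_A the graph Laplacian of A; this matches the abstract's description of a quadratic function of the permutation vector.

```latex
\min_{\pi \in \Pi_n} \; \sum_{i<j} A_{ij}\,(\pi_i - \pi_j)^2
\;=\; \min_{\pi \in \Pi_n} \; \pi^{\top} L_A\, \pi,
\qquad L_A = \operatorname{diag}(A\mathbf{1}) - A .
```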
【Paper Link】 【Pages】:471-479
【Authors】: Arun Reddy Nelakurthi ; Hanghang Tong ; Ross Maciejewski ; Nadya Bliss ; Jingrui He
【Abstract】: Sentiment analysis has been studied for decades, and it is widely used in many real applications such as media monitoring. In sentiment analysis, when addressing the problem of limited labeled data from the target domain, transfer learning, or domain adaptation, has been successfully applied, which borrows information from a relevant source domain with abundant labeled data to improve the prediction performance in the target domain. The key to transfer learning is how to model the relatedness among different domains. For sentiment analysis, a common practice is to assume similar sentiment polarity for the common keywords shared by different domains. However, existing methods largely overlooked the human factor, i.e., the users who expressed such sentiment. In this paper, we address this problem by explicitly modeling the human factor related to sentiment classification. In particular, we assume that the content generated by the same user across different domains is biased in the same way in terms of the sentiment polarity. In other words, optimistic/pessimistic users demonstrate consistent sentiment patterns, no matter what the context is. To this end, we propose a new graph-based approach named U-Cross, which models the relatedness of different domains via both the shared users and keywords. It is non-parametric and semi-supervised in nature. Furthermore, we also study the problem of shared user selection to prevent ‘negative transfer’. In the experiments, we demonstrate the effectiveness of U-Cross by comparing it with existing state-of-the-art techniques on three real data sets.
【Keywords】:
【Paper Link】 【Pages】:480-488
【Authors】: Subhabrata Mukherjee ; Kashyap Popat ; Gerhard Weikum
【Abstract】: Online reviews provided by consumers are a valuable asset for e-Commerce platforms, influencing potential consumers in making purchasing decisions. However, these reviews are of varying quality, with the useful ones buried deep within a heap of non-informative reviews. In this work, we attempt to automatically identify review quality in terms of its helpfulness to the end consumers. In contrast to previous works in this domain exploiting a variety of syntactic and community-level features, we delve deep into the semantics of reviews as to what makes them useful, providing interpretable explanation for the same. We identify a set of consistency and semantic factors, all from the text, ratings, and timestamps of user-generated reviews, making our approach generalizable across all communities and domains. We explore review semantics in terms of several latent factors like the expertise of its author, his judgment about the fine-grained facets of the underlying product, and his writing style. These are cast into a Hidden Markov Model – Latent Dirichlet Allocation (HMM-LDA) based model to jointly infer: (i) reviewer expertise, (ii) item facets, and (iii) review helpfulness. Large-scale experiments on five real-world datasets from Amazon show significant improvement over state-of-the-art baselines in predicting and ranking useful reviews.
【Keywords】:
【Paper Link】 【Pages】:489-497
【Authors】: Ramakrishnan Kannan ; Hyenkyun Woo ; Charu C. Aggarwal ; Haesun Park
【Abstract】: The problem of outlier detection is extremely challenging in many domains such as text, in which the attribute values are typically non-negative, and most values are zero. In such cases, it often becomes difficult to separate the outliers from the natural variations in the patterns in the underlying data. In this paper, we present a matrix factorization method, which is naturally able to distinguish the anomalies with the use of low rank approximations of the underlying data. Our iterative algorithm TONMF is based on Block Coordinate Descent (BCD) framework. Our approach has significant advantages over traditional methods for text outlier detection. Finally, we present experimental results illustrating the effectiveness of our method over competing methods.
【Keywords】:
【Paper Link】 【Pages】:498-506
【Authors】: Tyler M. Tomita ; Mauro Maggioni ; Joshua T. Vogelstein
【Abstract】: Random Forest (RF) remains one of the most widely used general purpose classification methods. Two recent large-scale empirical studies demonstrated it to be the best overall classification method among a variety of methods evaluated. One of its main limitations, however, is that it is restricted to only axis-aligned recursive partitions of the feature space. Consequently, RF is particularly sensitive to the orientation of the data. Several studies have proposed “oblique” decision forest methods to address this limitation. However, these methods either have a time and space complexity significantly greater than RF, are sensitive to unit and scale, or empirically do not perform as well as RF on real data. One promising oblique method that was proposed alongside the canonical RF method, called Forest-RC (F-RC), has not received as much attention by the community. Despite it being just as old as RF, virtually no studies exist investigating its theoretical or empirical performance. In this work, we demonstrate that F-RC empirically outperforms RF and another recently proposed oblique method called Random Rotation Random Forest, while approximately maintaining the same computational complexity. Furthermore, a variant of F-RC which rank transforms the data prior to learning is especially invariant to affine transformations and robust to data corruption. Open source code is available.
【Keywords】:
【Paper Link】 【Pages】:507-515
【Authors】: Suhang Wang ; Yilin Wang ; Jiliang Tang ; Charu C. Aggarwal ; Suhas Ranganath ; Huan Liu
【Abstract】: Feature selection has been proven to be effective and efficient in preparing high-dimensional data for many mining and learning tasks. Features of real-world high-dimensional data such as words of documents, pixels of images and genes of microarray data, usually present inherent hierarchical structures. In a hierarchical structure, features could share certain properties. Such information has been exploited to help supervised feature selection but it is rarely investigated for unsupervised feature selection, which is challenging due to the lack of labels. Since real world data is often unlabeled, it is of practical importance to study the problem of feature selection with hierarchical structures in an unsupervised setting. In particular, we provide a principled method to exploit hierarchical structures of features and propose a novel framework HUFS, which utilizes the given hierarchical structures to help select features without labels. Experimental study on real-world datasets is conducted to assess the effectiveness of the proposed framework.
【Keywords】:
【Paper Link】 【Pages】:516-524
【Authors】: Lin Chen ; Jiliang Tang ; Baoxin Li
【Abstract】: Supervised multi-class learning arises in many application domains such as biology, computer vision, social network analysis, and information retrieval. These applications often involve high-dimensional data, which not only significantly increase the time and space requirement of the underlying algorithms but also degrade their performance due to the curse of dimensionality. Feature selection has been proven effective and efficient for preparing high-dimensional data for many learning tasks. Traditional feature selection algorithms for multi-class data assume the independence of label categories and select features with the capability to distinguish samples from different classes. However, class labels in multi-class data may be correlated and little work exists for exploiting label correlation in multi-class feature selection. In this paper, we investigate label correlation in feature selection for multi-class data. In particular, we provide a principled approach for capturing label correlation and propose an Embedded Supervised Feature Selection (ESFS) framework, which embeds label correlation modeling in supervised feature selection for multi-class data. Experiments on both synthetic data and various types of public benchmark datasets show that the proposed framework effectively captures the multi-class label correlation and significantly outperforms existing state-of-the-art baseline methods.
【Keywords】:
【Paper Link】 【Pages】:525-533
【Authors】: Kailash Budhathoki ; Jilles Vreeken
【Abstract】: Discovering correlated variables is one of the core problems in data analysis. Many measures for correlation have been proposed, yet correlation is surprisingly ill-defined in general: most, if not all, measures make very strong assumptions on the data distribution or the type of dependency they can detect. In this work, we provide a general theory of correlation that does not make any such assumptions. Simply put, we propose correlation by compression. To this end, we propose two correlation measures based on solid information-theoretic foundations, i.e., Kolmogorov complexity. The proposed correlation measures possess interesting properties desirable for any sensible correlation measure. However, Kolmogorov complexity is not computable, and hence we propose practical and computable instantiations based on the Minimum Description Length (MDL) principle. In practice, we can apply the proposed measures to any type of data by instantiating them with any lossless real-world compressor that rewards pairwise dependencies. Extensive experiments show that the correlation measures work well in practice, have high statistical power, and find meaningful correlations on binary data, while being easily extensible to other data types.
【Keywords】:
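To make the “correlation by compression” idea above concrete, here is a minimal, generic sketch (not the paper's MDL instantiation or its exact measures): it scores two byte sequences by how much a standard lossless compressor (zlib) saves when compressing them jointly rather than separately, normalized to roughly [0, 1].

```python
import random
import zlib

def clen(b: bytes) -> int:
    """Compressed length in bytes, a rough stand-in for description length."""
    return len(zlib.compress(b, 9))

def compression_correlation(x: bytes, y: bytes) -> float:
    """Generic 'correlation by compression' score: the fraction of the shorter
    sequence's description saved by compressing x and y together.
    Illustrative only; the paper's MDL-based measures differ in detail."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    saved = cx + cy - cxy
    return max(0.0, min(1.0, saved / min(cx, cy)))

x = b"0101100110" * 200
y = x[::-1]                                              # structurally related to x
rng = random.Random(0)
z = bytes(rng.getrandbits(8) for _ in range(len(x)))     # unrelated random noise
print(compression_correlation(x, y))                     # high: shared structure
print(compression_correlation(x, z))                     # near zero: little shared structure
```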
【Paper Link】 【Pages】:534-542
【Authors】: Rajiv Khanna ; Joydeep Ghosh ; Russell A. Poldrack ; Oluwasanmi Koyejo
【Abstract】: Modern treatments of structured Principal Component Analysis often focus on the estimation of a single component under various assumptions or priors, such as sparsity and smoothness, and then the procedure is extended to multiple components by sequential estimation interleaved with deflation. While prior work has highlighted the importance of proper deflation for ensuring the quality of the estimated components, to our knowledge, proposed techniques have only been developed and applied to non-probabilistic principal component analyses, and are not trivially extended to probabilistic analyses. This work introduces a novel, robust and efficient deflation method for Probabilistic Principal Component Analysis using tools recently developed for constrained probabilistic estimation via information projection. The components estimated using the proposed deflation regain some of the interpretability of classic PCA such as straightforward estimates of variance explained, while retaining the ability to incorporate rich prior structure. Moreover, sequential estimation allows for scaling probabilistic techniques to be at par with their deterministic counterparts. Experimental results on simulated data demonstrate the utility of the proposed deflation in terms of component recovery, and evaluation on neuroimaging data show both qualitative and quantitative improvements in the quality of the estimated components. We also present timing experiments on real data to illustrate the importance of sequential estimation with proper deflation for scalability.
【Keywords】:
【Paper Link】 【Pages】:543-551
【Authors】: John Boaz Lee ; Xiangnan Kong ; Yihan Bao ; Constance M. Moore
【Abstract】: The analysis of multiple time series data, which are generated from a networked system, has attracted much attention recently. This technique has been used in a wide range of applications including functional brain network analysis of neuroimaging data and social influence analysis. In functional brain network analysis, the activity of different brain regions can be represented as multiple time series. An important task in the analysis is to identify the latent network from the observed time series data. In this network, the edges (functional connectivity) capture the correlation between different time series (brain regions). Conventional network extraction approaches usually focus on capturing the connectivity through linear measures under unsupervised settings. In this paper, we study the problem of identifying deep nonlinear connections under group-contrasting settings, where we have two groups of time series samples, and the goal is to identify nonlinear connections that are discriminative across the two groups. We propose a method called GCC (Graph Construction CNN) which is based on deep convolutional neural networks for the task of network construction. The CNN in our model learns a nonlinear edge-weighting function to assign discriminative values to the edges of a network. Experiments on a real-world ADHD dataset show that our proposed method can effectively identify the nonlinear connections among different brain regions. We also demonstrate the extensibility of our proposed framework by combining it with an autoencoder to capture subgraph patterns from the constructed networks.
【Keywords】:
【Paper Link】 【Pages】:552-560
【Authors】: Sara Morsy ; George Karypis
【Abstract】: Grade prediction for courses not yet taken by students is important for guiding them when registering for next-term courses. Moreover, it can help their advisers design personalized degree plans and modify them based on the students' performance. In this paper, we present cumulative knowledge-based regression models with different course-knowledge spaces for the task of next-term grade prediction. These models utilize historical student-course grade data as well as information available about the courses that captures the relationships between courses in terms of the knowledge components they provide. Our experiments on a large dataset obtained from the College of Science and Engineering at the University of Minnesota show that our proposed methods achieve better performance than competing methods and that these performance gains are statistically significant.
【Keywords】:
【Paper Link】 【Pages】:561-569
【Authors】: Miguel Araujo ; Miguel Almeida ; Jaime Ferreira ; Luís Moura Silva ; Pedro Bizarro
【Abstract】: Bank transaction fraud results in over $13B annual losses for banks, merchants, and card holders worldwide. Much of this fraud starts with a Point-of-Compromise (a data breach or a “skimming” operation) where credit and debit card digital information is stolen, resold, and later used to perform fraud. We introduce this problem and present an automatic Points-of-Compromise (POC) detection procedure. BreachRadar is a distributed alternating algorithm that assigns a probability of being compromised to the different possible locations. We implement this method using Apache Spark and show its linear scalability in the number of machines and transactions. BreachRadar is applied to two datasets with billions of real transaction records and fraud labels where we provide multiple examples of real Points-of-Compromise we are able to detect. We further show the effectiveness of our method when injecting Points-of-Compromise in one of these datasets, simultaneously achieving over 90% precision and recall when only 10% of the cards have been victims of fraud.
【Keywords】:
【Paper Link】 【Pages】:570-578
【Authors】: Si Zhang ; Dawei Zhou ; Mehmet Yigit Yildirim ; Scott Alcorn ; Jingrui He ; Hasan Davulcu ; Hanghang Tong
【Abstract】: Dense subgraphs are fundamental patterns in graphs, and dense subgraph detection is often the key step of numerous graph mining applications. Most of the existing methods aim to find a single subgraph with a high density. However, dense subgraphs at different granularities could reveal more intriguing patterns in the underlying graph. In this paper, we propose to hierarchically detect dense subgraphs. The key idea of our method (HiDDen) is to envision the density of subgraphs as a relative measure to its background (i.e., the subgraph at the coarse granularity). Given that the hierarchical dense subgraph detection problem is essentially a nonconvex quadratic programming problem, we propose effective and efficient alternative projected gradient based algorithms to solve it. The experimental evaluations on real graphs demonstrate that (1) our proposed algorithms find subgraphs with an up to 40% higher density in almost every hierarchy; (2) the densities of different hierarchies exhibit a desirable variety across different granularities; (3) our projected gradient descent based algorithm scales linearly w.r.t the number of edges of the input graph; and (4) our methods are able to reveal interesting patterns in the underlying graphs (e.g., synthetic ID in financial fraud detection).
【Keywords】:
【Paper Link】 【Pages】:579-587
【Authors】: Yao Zhou ; Lei Ying ; Jingrui He
【Abstract】: Nowadays, crowdsourcing has been commonly used to enlist label information both effectively and efficiently. One major challenge in crowdsourcing is the diverse worker quality, which determines the accuracy of the label information provided by such workers. Motivated by the observation that in many crowdsourcing platforms, the same set of workers typically work on the same set of tasks, we propose to model the diverse worker quality by studying their behaviors across multiple related tasks. To this end, we propose an optimization framework named MultiC2 for learning from task and worker dual heterogeneity. It uses a weight tensor to represent the workers' behaviors across multiple tasks, and seeks the optimal solution of the tensor by exploiting its structured information. We then propose an iterative algorithm to solve the optimization framework and analyze its computational complexity. To infer the true label of an example, we construct a worker ensemble based on the estimated tensor, whose decisions are weighted using a set of entropy weights. Finally, we test the performance of MultiC2 on various data sets, and demonstrate its superiority over state-of-the-art crowdsourcing techniques.
【Keywords】:
【Paper Link】 【Pages】:588-596
【Authors】: Yan Zhao ; Xiao Fang ; David Simchi-Levi
【Abstract】: Randomized experiments have been used to assist decision-making in many areas. They help people select the optimal treatment for the test population with certain statistical guarantees. However, subjects can show significant heterogeneity in response to treatments. The problem of customizing treatment assignment based on subject characteristics is known as uplift modeling, differential response analysis, or personalized treatment learning in the literature. A key feature of uplift modeling is that the data is unlabeled: it is impossible to know whether the chosen treatment is optimal for an individual subject because the response under alternative treatments is unobserved. This presents a challenge to both the training and the evaluation of uplift models. In this paper we describe how to obtain an unbiased estimate of the key performance metric of an uplift model, the expected response. We present a new uplift algorithm which creates a forest of randomized trees. The trees are built with a splitting criterion designed to directly optimize their uplift performance based on the proposed evaluation method. Both the evaluation method and the algorithm apply to an arbitrary number of treatments and general response types. Experimental results on synthetic data and industry-provided data show that our algorithm leads to significant performance improvement over other applicable methods.
【Keywords】:
【Paper Link】 【Pages】:597-605
【Authors】: Robert Pienta ; Minsuk Kahng ; Zhiyuan Lin ; Jilles Vreeken ; Partha P. Talukdar ; James Abello ; Ganesh Parameswaran ; Duen Horng Chau
【Abstract】: Visualization is a powerful paradigm for exploratory data analysis. Visualizing large graphs, however, often results in excessive edge crossings and overlapping nodes. We propose a new scalable approach called Facets that helps users adaptively explore large million-node graphs from a local perspective, guiding them to focus on nodes and neighborhoods that are most subjectively interesting to them. We contribute novel ideas to measure this interestingness in terms of how surprising a neighborhood is given the background distribution, as well as how well it matches what the user has chosen to explore. Facets uses Jensen-Shannon divergence over information-theoretically optimized histograms to calculate the subjective interest and surprise scores. Participants in a user study found Facets easy to use, easy to learn, and exciting to use. Empirical runtime analyses demonstrated Facets's practical scalability on large real-world graphs with up to 5 million edges, returning results in under 1.5 seconds.
【Keywords】:
【Paper Link】 【Pages】:606-614
【Authors】: Fang Jin ; Feng Chen ; Rupinder Paul Khandpur ; Chang-Tien Lu ; Naren Ramakrishnan
【Abstract】: Event detection in online social media has primarily focused on identifying abnormal spikes, or bursts, in activity. However, disruptive events such as socio-economic disasters, civil unrest, and even power outages, often involve abnormal troughs or lack of activity, leading to absenteeism. We present the first study, to our knowledge, that models absenteeism and uses detected absenteeism instances as a basis for event detection in location-based social networks such as Twitter. The proposed framework addresses the challenges of (i) early detection of absenteeism, (ii) identifying the locus of the absenteeism, and (iii) identifying groups or communities underlying the absenteeism. Our approach uses the formalism of graph wavelets to represent the spatiotemporal structure of user activity in a location-based social network. This formalism facilitates multiscale analysis, enabling us to detect anomalous behavior at different graph resolutions, which in turn allows the identification of event locations and underlying groups. The effectiveness of our approach is evaluated using Twitter activity related to civil unrest events in Latin America.
【Keywords】:
【Paper Link】 【Pages】:615-623
【Authors】: Huda Nassar ; David F. Gleich
【Abstract】: A multimodal network encodes relationships between the same set of nodes in multiple settings, and network alignment is a powerful tool for transferring information and insight between a pair of networks. We propose a method for multimodal network alignment that computes a matrix which indicates the alignment, but produces the result as a low-rank factorization directly. We then propose new methods to compute approximate maximum weight matchings of low-rank matrices to produce an alignment. We evaluate our approach by applying it on synthetic networks and use it to de-anonymize a multimodal transportation network.
【Keywords】:
【Paper Link】 【Pages】:624-632
【Authors】: Jose Cadena ; Feng Chen ; Anil Vullikanti
【Abstract】: Scan statistics is a popular approach for detecting “hotspots” and “anomalies” in spatio-temporal and network data. This methodology involves maximizing a score function over all connected subgraphs, which is NP-hard in general. A number of heuristics have been proposed for these problems, but they do not provide any quality guarantees. In this paper, we develop a framework for designing algorithms that optimize a large class of scan statistics for networks, subject to connectivity constraints. Our algorithms run in time that scales linearly with the size of the graph and depends on a parameter we call the “effective solution size”, while providing rigorous approximation guarantees. In contrast, most prior methods have super-linear running times in terms of graph size. Extensive empirical evidence demonstrates the effectiveness and efficiency of our proposed algorithms in comparison with state-of-the-art methods. Our approach improves on the performance of all prior methods, giving an increase of over 25% in the score in some cases. Further, our algorithms scale to networks with up to a million nodes, which is 1–2 orders of magnitude larger than in all prior applications.
【Keywords】:
【Paper Link】 【Pages】:633-641
【Authors】: Xiao Huang ; Jundong Li ; Xia Hu
【Abstract】: Network embedding is to learn low-dimensional vector representations for nodes in a network. It has shown to be effective in a variety of tasks such as node classification and link prediction. While embedding algorithms on pure networks have been intensively studied, in many real-world applications, nodes are often accompanied with a rich set of attributes or features, aka attributed networks. It has been observed that network topological structure and node attributes are often strongly correlated with each other. Thus modeling and incorporating node attribute proximity into network embedding could be potentially helpful, though non-trivial, in learning better vector representations. Meanwhile, real-world networks often contain a large number of nodes and features, which put demands on the scalability of embedding algorithms. To bridge the gap, in this paper, we propose an accelerated attributed network embedding algorithm AANE, which enables the joint learning process to be done in a distributed manner by decomposing the complex modeling and optimization into many sub-problems. Experimental results on several real-world datasets demonstrate the effectiveness and efficiency of the proposed algorithm.
【Keywords】:
【Paper Link】 【Pages】:642-650
【Authors】: Yao Zhang ; Yun Xiong ; Xinyue Liu ; Xiangnan Kong ; Yangyong Zhu
【Abstract】: Sparse inverse covariance estimation has attracted much interest since it can recover the structure of the underlying Gaussian graphical model. This is a useful tool for demonstrating the connections among objects (nodes). Previous works on sparse inverse covariance estimation mainly focus on learning a single type of connection from the observed activities with a lasso, group lasso or tree-structure penalty. However, in many real-world applications, the observed activities on the nodes can be related to multiple types of connections. In this paper, we consider the problem of learning heterogeneous connectivities from the observed activities by incorporating meta paths, extracted from a heterogeneous information network (HIN), an information network with multiple types of nodes and links, into the conventional graphical lasso framework. We aim at extracting the strongest type of relation between any pair of entities and ignoring other minor relations. Specifically, we introduce two novel kinds of constraints: meta path constraints and exclusive constraints, which ensure a unique type of relation between a pair of objects. This problem is highly challenging due to the non-convex optimization. We propose a method based on the alternating direction method of multipliers (ADMM) to efficiently solve the problem. Experiments on both synthetic and real-world datasets illustrate the effectiveness of the proposed method.
【Keywords】:
【Paper Link】 【Pages】:651-659
【Authors】: Koray Mancuhan ; Chris Clifton
【Abstract】: Corporations are retaining ever-larger corpora of personal data; the frequency of breaches and the corresponding privacy impact have been rising accordingly. One way to mitigate this risk is through the use of anonymized data, limiting the exposure of individual data to only where it is absolutely needed. This would seem particularly appropriate for data mining, where the goal is generalizable knowledge rather than data on specific individuals. In practice, corporate data miners often insist on original data, for fear that they might “miss something” with anonymized or differentially private approaches. This paper provides a theoretical justification for the use of anonymized data. Specifically, we show that a support vector classifier trained on anatomized data satisfying ℓ-diversity should be expected to do as well as on the original data. Anatomy preserves all data values, but introduces uncertainty in the mapping between identifying and sensitive values, thus satisfying ℓ-diversity. The theoretical effectiveness of the proposed approach is validated using several publicly available datasets, showing that we outperform the state of the art for support vector classification using training data protected by k-anonymity, and are comparable to learning on the original data.
【Keywords】:
【Paper Link】 【Pages】:660-668
【Authors】: Reinhard Heckel ; Michail Vlachos
【Abstract】: The ease of digital data dissemination has spurred an amplified interest in technologies related to data privacy and right protection. We examine how both goals can be achieved simultaneously by constructing modified data instances that are both differentially private and right protected. The proposed method first produces a sketch of the dataset via random projection and then perturbs the sketch just enough to ensure privacy. The right-protection mechanism inserts small noise in the dataset which subsequently can be used to verify ownership. We provide analytical privacy, right-protection, and utility guarantees. Our utility guarantees ensure approximate preservation of pairwise distances, thus mining operations such as search, classification, and clustering can be performed on the differentially private and right protected dataset.
【Keywords】:
【Paper Link】 【Pages】:669-677
【Authors】: Michael Hay ; Liudmila Elagina ; Gerome Miklau
【Abstract】: Given a collection of rankings of a set of items, rank aggregation seeks to compute a ranking that can serve as a single best representative of the collection. Rank aggregation is a well-studied problem and a number of effective algorithmic solutions have been proposed in the literature. However, when individuals are asked to contribute a ranking, they may be concerned that their personal preferences will be disclosed inappropriately to others. This acts as a disincentive to individuals to respond honestly in expressing their preferences and impedes data collection and data sharing. We address this problem by investigating rank aggregation under differential privacy, which requires that a released output (here, the aggregate ranking computed from individuals' rankings) remain almost the same if any one individual's ranking is removed from the input. We propose a number of differentially-private rank aggregation algorithms: two are inspired by non-private approximate rank aggregators from the existing literature; another uses a novel rejection sampling method to sample privately from a complex distribution. For all the methods we propose, we quantify, both theoretically and empirically, the “cost” of privacy in terms of the quality of the rank aggregation computed.
【Keywords】:
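As background for the abstract above, here is one minimal differentially private rank aggregator, not one of the paper's proposed algorithms: it aggregates rankings by Borda scores and releases the ordering of Laplace-noised scores. It assumes each voter ranks all n items, so removing one voter changes the Borda score vector by at most n(n−1)/2 in L1 norm.

```python
import numpy as np

def dp_borda_aggregate(rankings, epsilon, seed=0):
    """Aggregate full rankings with the Laplace mechanism on Borda scores.
    rankings: shape (n_voters, n_items), entry = rank position (0 = best).
    Removing one voter changes the score vector by at most n(n-1)/2 in L1,
    so Laplace noise with scale n(n-1)/(2*epsilon) gives epsilon-DP."""
    rng = np.random.default_rng(seed)
    rankings = np.asarray(rankings)
    n_items = rankings.shape[1]
    borda = (n_items - 1 - rankings).sum(axis=0).astype(float)
    sensitivity = n_items * (n_items - 1) / 2.0
    noisy = borda + rng.laplace(scale=sensitivity / epsilon, size=n_items)
    return np.argsort(-noisy)            # item indices, best to worst

# three voters ranking four items (0 = best)
votes = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 2, 1, 3]]
print(dp_borda_aggregate(votes, epsilon=1.0))
```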
【Paper Link】 【Pages】:678-686
【Authors】: Shuai Yuan ; Pang-Ning Tan ; Kendra Spence Cheruvelil ; C. Emi Fergus ; Nicholas K. Skaff ; Patricia A. Soranno
【Abstract】: Hash-based feature learning is a widely-used data mining approach for dimensionality reduction and for building linear models that are comparable in performance to their nonlinear counterpart. Unfortunately, such an approach is inapplicable to many real-world data sets because they are often riddled with missing values. Substantial data preprocessing is therefore needed to impute the missing values before the hash-based features can be derived. Biases can be introduced during this preprocessing because it is performed independently of the subsequent modeling task, which can result in the models constructed from the imputed hash-based features being suboptimal. To overcome this limitation, we present a novel framework called H-FLIP that simultaneously estimates the missing values while constructing a set of nonlinear hash-based features from the incomplete data. The effectiveness of the framework is demonstrated through experiments using both synthetic and real-world data sets.
【Keywords】:
【Paper Link】 【Pages】:687-695
【Authors】: Keerthiram Murugesan ; Jaime G. Carbonell
【Abstract】: This paper presents a novel multitask multiple-kernel learning framework that efficiently learns the kernel weights by leveraging the relationships across multiple tasks. The idea is to automatically infer these task relationships in the RKHS space corresponding to the given base kernels. The problem is formulated as a regularization-based approach called Multi-Task Multiple Kernel Relationship Learning (MK-MTRL), which models the task relationship matrix from the weights learned in the latent feature spaces of task-specific base kernels. Unlike previous work, the proposed formulation allows one to incorporate prior knowledge for simultaneously learning several related tasks. We propose an alternating minimization algorithm to learn the model parameters, kernel weights and task relationship matrix. In order to tackle large-scale problems, we further propose a two-stage MK-MTRL online learning algorithm and show that it significantly reduces the computational time, while achieving performance comparable to that of the joint learning framework. Experimental results on benchmark datasets show that the proposed formulations outperform several state-of-the-art multitask learning methods.
【Keywords】:
【Paper Link】 【Pages】:696-704
【Authors】: Jussi Korpela ; Emilia Oikarinen ; Kai Puolamäki ; Antti Ukkonen
【Abstract】: Confidence intervals are a popular way to visualize and analyze data distributions. Unlike p-values, they can convey information both about statistical significance as well as effect size. However, very little work exists on applying confidence intervals to multivariate data. In this paper we define confidence intervals for multivariate data that extend the one-dimensional definition in a natural way. In our definition every variable is associated with its own confidence interval as usual, but a data vector can be outside of a few of these, and still be considered to be within the confidence area. We analyze the problem and show that the resulting confidence areas retain the good qualities of their one-dimensional counterparts: they are informative and easy to interpret. Furthermore, we show that the problem of finding multivariate confidence intervals is hard, but provide efficient approximate algorithms to solve the problem.
【Keywords】:
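The relaxed multivariate confidence area described above can be illustrated with a small sketch (illustrative only; the paper's interval-optimization algorithms are not reproduced): each variable gets its own quantile interval, and a data vector counts as inside the area if it violates at most k of them. The alpha and k values below are arbitrary choices.

```python
import numpy as np

def multivariate_confidence_area(data, alpha=0.05, k=1):
    """Per-variable intervals plus a violation budget: a vector is inside the
    confidence area if it falls outside at most k of the per-variable
    intervals.  Sketch of the definition in the abstract above."""
    lo = np.quantile(data, alpha / 2, axis=0)
    hi = np.quantile(data, 1 - alpha / 2, axis=0)
    violations = ((data < lo) | (data > hi)).sum(axis=1)
    inside = violations <= k
    return (lo, hi), inside

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
(lo, hi), inside = multivariate_confidence_area(X, alpha=0.1, k=2)
print(f"{inside.mean():.2%} of vectors fall inside the relaxed area")
```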
【Paper Link】 【Pages】:705-713
【Authors】: Nayyar A. Zaidi ; Geoffrey I. Webb
【Abstract】: With the emergence of big data, there has been a growing interest in optimization routines that lead to faster convergence of Logistic Regression (LR). Among many optimization methods such as Gradient Descent, Quasi-Newton, Conjugate Gradient, etc., the Trust-region based truncated Newton method (TRON) algorithm has been shown to converge the fastest. The TRON algorithm also forms an important component of the highly efficient and widely used liblinear package. It has been shown that the WANBIA-C trick of scaling with the log of the naive Bayes conditional probabilities can greatly accelerate the convergence of LR trained using (first-order) Gradient Descent and (approximate second-order) Quasi-Newton optimization. In this work we study the applicability of the WANBIA-C trick to TRON. We first devise a TRON algorithm optimizing the softmax objective function and then demonstrate that WANBIA-C style preconditioning can be beneficial for TRON, leading to an extremely fast (batch) LR algorithm. Second, we present a comparative analysis of one-vs-all LR and softmax LR in terms of the 0–1 Loss, Bias, Variance, RMSE, Log-Loss, Training and Classification time, and show that softmax LR leads to significantly better RMSE and Log-Loss. We evaluate our proposed approach on 51 benchmark datasets.
【Keywords】:
【Paper Link】 【Pages】:714-722
【Authors】: Yada Zhu ; Jianbo Li ; Jingrui He
【Abstract】: Many complex real applications involve the collection of time series data with multiple modalities and of multiple resolutions. For example, in aluminum smelting processes, the recorded process variables typically reflect various aspects of these processes, such as pressure and temperature, and they are often obtained with different time resolutions, such as every 5 minutes and every day. How can we effectively leverage both the multi-modality property and the multi-resolution property of the data for the sake of more accurate prediction of key process indicators (e.g., the cell temperature of the aluminum smelting processes)? Different from existing techniques, which can only model the multi-modality property or the multi-resolution property, in this paper, for the first time, we propose to jointly model the two properties such that the prediction results are consistent across multiple modalities and multiple resolutions. To this end, we construct an optimization framework, which is based on a novel regularizer imposing such consistency. Then, we design an effective and efficient optimization algorithm based on randomized block coordinate descent. Its performance is evaluated on both synthetic and real data sets, outperforming state-of-the-art techniques.
【Keywords】:
【Paper Link】 【Pages】:723-731
【Authors】: Kohei Miyaguchi ; Shin Matsushima ; Kenji Yamanishi
【Abstract】: Discovering a true sparse model capable of generating data is a challenging yet important problem for understanding the nature of the source of the data. A major part of the challenge arises from the fact that the number of possible sparse models grows exponentially as the dimensionality of the models increases. In this study, we consider a method for estimating the true model over an exponentially large number of sparse models based on the minimum description length principle. We show that a novel criterion derived by continuous relaxation of the stochastic complexity induces selection of the true model by solving the l1-regularization problem for which the hyperparameters are appropriately chosen. Moreover, we provide an efficient optimization algorithm for finding the appropriate hyperparameters and select the sparse model accordingly. The experimental results we obtained for the problem of sparse graphical modeling indicate that the proposed method estimates the true model effectively in comparison to existing methods for choosing hyperparameters to solve the l1-regularization problem.
【Keywords】:
【Paper Link】 【Pages】:732-740
【Authors】: Ching-Pei Lee ; Po-Wei Wang ; Weizhu Chen ; Chih-Jen Lin
【Abstract】: Distributed optimization has become an important research topic for dealing with the extremely large volumes of data available in Internet companies nowadays. Additional machines make computation less expensive, but inter-machine communication becomes prominent in the optimization process, and efficient optimization methods should reduce the amount of communication in order to achieve shorter overall running time. In this work, we utilize the advantages of the recently proposed, theoretically fast-convergent common-directions method, but tackle its main drawback of excessive spatial and computational costs to propose a limited-memory algorithm. The result is an efficient, linearly convergent optimization method for parallel/distributed optimization. We further discuss how our method can exploit the problem structure to efficiently train regularized empirical risk minimization (ERM) models. Experimental results show that our method outperforms state-of-the-art distributed optimization methods for ERM problems.
【Keywords】:
【Paper Link】 【Pages】:741-749
【Authors】: Martin Wistuba ; Nicolas Schilling ; Lars Schmidt-Thieme
【Abstract】: Automating machine learning by providing techniques that autonomously find the best algorithm, hyperparameter configuration and preprocessing is helpful for both researchers and practitioners. Therefore, it is not surprising that automated machine learning has become a very interesting field of research. While current research mainly focuses on finding good pairs of algorithms and hyperparameter configurations, we present an approach that automates the process of creating a top-performing ensemble of several layers, different algorithms and hyperparameter configurations. These kinds of ensembles are jokingly called Frankenstein ensembles and have proved their benefit on diverse data sets in many machine learning challenges. We compare our approach, Automatic Frankensteining, with the current state of the art for automated machine learning on 80 different data sets and show that it outperforms them on the majority using the same training time. Furthermore, we compare Automatic Frankensteining on a large-scale data set to more than 3,500 machine learning expert teams and are able to outperform more than 3,000 of them within 12 CPU hours.
【Keywords】:
【Paper Link】 【Pages】:750-758
【Authors】: Frank Schoeneman ; Suchismit Mahapatra ; Varun Chandola ; Nils Napp ; Jaroslaw Zola
【Abstract】: Spectral dimensionality reduction is frequently used to identify low-dimensional structure in high-dimensional data. However, learning manifolds, especially from streaming data, is expensive in both computation and memory. In this paper, we argue that a stable manifold can be learned using only a fraction of the stream, and the remaining stream can be mapped to the manifold in a significantly less costly manner. Identifying the transition point at which the manifold becomes stable is the key step. We present error metrics that allow us to identify the transition point for a given stream by quantitatively assessing the quality of a manifold learned using Isomap. We further propose an efficient mapping algorithm, called S-Isomap, that can be used to map new samples onto the stable manifold. We describe experiments on a variety of data sets that show that the proposed approach is computationally efficient without sacrificing accuracy.
【Keywords】:
【Paper Link】 【Pages】:759-767
【Authors】: Zhong Chen ; Zhide Fang ; Wei Fan ; Andrea Edwards ; Kun Zhang
【Abstract】: Sparse online learning and cost-sensitive learning are two important areas of machine learning and data mining research. Each has been well studied with many interesting algorithms developed. However, very limited published work addresses the joint study of these two fields. In this paper, to tackle the high-dimensional data streams with skewed distributions, we introduce a framework of cost-sensitive sparse online learning. Our proposed framework is a substantial extension of the influential Truncated Gradient (TG) method by formulating a new convex optimization problem, where the two mutual restraint factors, misclassification cost and sparsity, can be simultaneously and favorably balanced. We theoretically analyze the regret and cost bounds of the proposed algorithm, and pinpoint its theoretical merit compared to the existing related approaches. Large-scale empirical comparisons to five baseline methods on eight real-world streaming datasets demonstrate the encouraging performance of the developed method. Algorithm implementation and datasets are available upon request.
【Keywords】:
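The framework above extends the Truncated Gradient method; the sketch below shows one online update of truncated-gradient logistic regression with a simple per-class cost weight standing in for the paper's cost-sensitive formulation (the paper's exact objective and bounds are not reproduced). The step size eta, gravity g, threshold theta and cost weights are illustrative assumptions.

```python
import numpy as np

def truncated_gradient_update(w, x, y, eta=0.1, g=0.01, theta=1.0,
                              c_pos=1.0, c_neg=1.0):
    """One step of truncated-gradient online logistic regression (in the style
    of Langford, Li & Zhang, 2009), with per-class cost weights as a simple
    stand-in for cost sensitivity.  y is +1/-1, x a dense feature vector;
    g controls the truncation (sparsification) strength, theta its threshold."""
    cost = c_pos if y > 0 else c_neg
    margin = y * float(w @ x)
    grad = -cost * y * x / (1.0 + np.exp(margin))   # gradient of weighted log-loss
    w = w - eta * grad
    # shrink small weights toward zero to induce sparsity
    small = np.abs(w) <= theta
    w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - eta * g, 0.0)
    return w

# toy usage: heavier cost on the (rarer) positive class
rng = np.random.default_rng(0)
w = np.zeros(50)
for _ in range(1000):
    x = rng.normal(size=50)
    y = 1 if x[0] + 0.5 * x[1] > 1.0 else -1
    w = truncated_gradient_update(w, x, y, c_pos=5.0)
print(np.count_nonzero(w), "non-zero weights")
```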
【Paper Link】 【Pages】:768-776
【Authors】: Shujian Yu ; Zubin Abraham
【Abstract】: When using statistical models (such as a classifier) in a streaming environment, there is often a need to detect and adapt to concept drifts to mitigate any deterioration in the model's predictive performance over time. Unfortunately, the ability of popular concept drift approaches to detect such drifts in the relationship between the response and predictor variables often depends on the distribution characteristics of the data streams, as well as on their sensitivity to parameter tuning. This paper presents Hierarchical Linear Four Rates (HLFR), a framework that detects concept drifts for different data stream distributions (including imbalanced data) by leveraging a hierarchical set of hypothesis tests in an online setting. The performance of HLFR is compared to benchmark approaches using both simulated and real-world datasets spanning the breadth of concept drift types. HLFR significantly outperforms benchmark approaches in terms of accuracy, G-mean, recall, detection delay and adaptability across the various datasets.
【Keywords】:
【Paper Link】 【Pages】:777-785
【Authors】: Rose Yu ; Yaguang Li ; Cyrus Shahabi ; Ugur Demiryurek ; Yan Liu
【Abstract】: Traffic forecasting is a vital part of intelligent transportation systems. It becomes particularly challenging due to short-term (e.g., accidents, construction) and long-term (e.g., peak-hour, seasonal, weather) traffic patterns. While most previously proposed techniques focus on forecasting under normal conditions, a single framework for extreme-condition traffic forecasting does not exist. To address this need, we take a deep learning approach. We build a deep neural network based on long short-term memory (LSTM) units. We apply Deep LSTM to forecast peak-hour traffic and manage to identify unique characteristics of the traffic data. We further improve the model for post-accident forecasting with a Mixture Deep LSTM model, which jointly models normal-condition traffic and the pattern of accidents. We evaluate our model on a real-world large-scale traffic dataset in Los Angeles. When trained end-to-end with suitable regularization, our approach achieves a 30%–50% improvement over the baselines. We also demonstrate a novel technique for interpreting the model with signal stimulation and note interesting observations from the trained neural network.
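A minimal PyTorch sketch of a deep LSTM forecaster of the kind described above; the layer sizes, sensor count and window length are assumptions, and the Mixture Deep LSTM component for accident modelling is not included.

```python
# Minimal deep-LSTM traffic forecaster sketch (assumed sizes; not the paper's
# Mixture Deep LSTM, which additionally models accident patterns).
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_sensors, hidden=64, layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, num_layers=layers,
                            batch_first=True, dropout=dropout)
        self.head = nn.Linear(hidden, n_sensors)

    def forward(self, x):               # x: (batch, time, n_sensors)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])    # predict the next reading for every sensor

model = LSTMForecaster(n_sensors=200)            # 200 sensors is an arbitrary example size
x = torch.randn(32, 12, 200)                     # 12 past readings per sensor
y_hat = model(x)                                 # (32, 200)
loss = nn.MSELoss()(y_hat, torch.randn(32, 200))
loss.backward()                                  # gradients flow end-to-end
```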
【Keywords】:
【Paper Link】 【Pages】:786-794
【Authors】: Zongge Liu ; Hyun Ah Song ; Vladimir Zadorozhny ; Christos Faloutsos ; Nicholas D. Sidiropoulos
【Abstract】: In this paper, we address the challenge of recovering a time sequence of counts from aggregated historical data. For example, given a mixture of monthly and weekly sums, how can we find the daily counts of people infected with flu? In general, what is the best way to recover historical counts from aggregated, possibly overlapping historical reports, in the presence of missing values? Equally importantly, how much should we trust this reconstruction? We propose H-Fuse, a novel method that solves the above problems by allowing the injection of domain knowledge in a principled way, turning the task into a well-defined optimization problem. H-Fuse has the following desirable properties: (a) Effectiveness, recovering historical data from aggregated reports with high accuracy; (b) Self-awareness, providing an assessment of when the recovery is not reliable; (c) Scalability, with computational cost linear in the size of the input data. Experiments on real data (epidemiology counts from the Tycho project [13]) demonstrate that H-Fuse reconstructs the original data 30–81% better than the least squares method.
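The idea of injecting domain knowledge and turning recovery into a well-defined optimization can be illustrated with a smoothness-regularized least-squares sketch; the penalty choice and toy aggregation below are assumptions, not H-Fuse's exact formulation.

```python
# Recover a fine-grained series from aggregated reports via smoothness-regularized
# least squares (illustration of the idea; not H-Fuse's exact objective).
import numpy as np

def recover(A, b, n, lam=10.0):
    """A: (m, n) aggregation matrix (each row sums a window of days), b: (m,) reports."""
    D = np.diff(np.eye(n), n=2, axis=0)            # second-difference (smoothness) operator
    M = np.vstack([A, np.sqrt(lam) * D])           # stack data-fit and smoothness terms
    rhs = np.concatenate([b, np.zeros(D.shape[0])])
    x, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return x

# Toy example: weekly sums of a 28-day series.
x_true = 50 + 10 * np.sin(np.arange(28) / 3.0)
A = np.zeros((4, 28))
for w in range(4):
    A[w, 7 * w:7 * (w + 1)] = 1.0
x_hat = recover(A, A @ x_true, n=28)
print(np.round(x_hat[:7], 1))
```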
【Keywords】:
【Paper Link】 【Pages】:795-803
【Authors】: Apratim Bhattacharyya ; Jilles Vreeken
【Abstract】: Discovering the key structure of a database is one of the main goals of data mining. In pattern set mining we do so by discovering a small set of patterns that together describe the data well. The richer the class of patterns we consider, and the more powerful our description language, the better we will be able to summarise the data. In this paper we propose Squish, a novel greedy MDL-based method for summarising sequential data using rich patterns that are allowed to interleave. Experiments show that Squish is orders of magnitude faster than the state of the art, results in better models, and discovers meaningful semantics in the form of patterns that identify multiple choices of values.
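The MDL intuition — a pattern is worth keeping only if it shrinks the total description length — can be sketched as below; the greedy, non-interleaving cover and naive model cost are simplifying assumptions, not Squish's encoding.

```python
# MDL intuition sketch: does adding one serial pattern compress an event sequence?
# (Greedy, non-interleaving cover; Squish's richer patterns and search are not shown.)
import math
from collections import Counter

def code_length(symbols):
    """Bits of a simple empirical code over the symbol stream."""
    counts, n = Counter(symbols), len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

def gain_of_pattern(seq, pattern):
    # Replace non-overlapping occurrences of the pattern with a single new symbol.
    covered, i, p = [], 0, len(pattern)
    while i < len(seq):
        if tuple(seq[i:i + p]) == tuple(pattern):
            covered.append("P")
            i += p
        else:
            covered.append(seq[i])
            i += 1
    # Model cost: naively, the bits needed to spell out the pattern itself.
    model_bits = p * math.log2(len(set(seq)))
    return code_length(seq) - (code_length(covered) + model_bits)

seq = list("abcxabcyabczabc")
print(gain_of_pattern(seq, "abc"))   # positive gain -> the pattern helps compression
```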
【Keywords】:
【Paper Link】 【Pages】:804-812
【Authors】: Zhenhui Li ; Guanjie Zheng ; Amal Agarwal ; Lingzhou Xue ; Thomas Lauvaux
【Abstract】: Causality analysis, beyond “mere” correlations, has become increasingly important for scientific discoveries and policy decisions. Many of these real-world applications involve time series data. A key observation is that the causality between time series can vary significantly over time. For example, rain could cause severe traffic jams during rush hour but have little impact on traffic at midnight. However, previous studies mostly look at the whole time series when determining the causal relationship. Instead, we propose to detect the partial time intervals in which causality holds. As it is time-consuming to enumerate all time intervals and test causality for each one, we further propose an efficient algorithm that avoids unnecessary computations based on bounds of the F-test used in the Granger causality test. We use both synthetic and real datasets to demonstrate the efficiency of our pruning techniques and show that our method effectively discovers interesting causal intervals in the time series data.
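A minimal sketch of the Granger F-test applied to a single candidate interval; the paper's contribution, pruning intervals via bounds on this F-test, is not reproduced, and the lag order and toy data are assumptions.

```python
# Granger-causality F-test on one interval (sketch; the interval enumeration
# and F-test bounds used for pruning in the paper are not shown).
import numpy as np
from scipy import stats

def granger_f_test(y, x, lag=2):
    n = len(y)
    Y = y[lag:]
    ylags = [y[lag - k:n - k] for k in range(1, lag + 1)]
    xlags = [x[lag - k:n - k] for k in range(1, lag + 1)]
    X_r = np.column_stack([np.ones(n - lag)] + ylags)          # restricted: y lags only
    X_u = np.column_stack([np.ones(n - lag)] + ylags + xlags)  # unrestricted: + x lags
    rss = lambda M: np.sum((Y - M @ np.linalg.lstsq(M, Y, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(X_r), rss(X_u)
    df1, df2 = lag, (n - lag) - X_u.shape[1]
    F = ((rss_r - rss_u) / df1) / (rss_u / df2)
    return F, stats.f.sf(F, df1, df2)          # F statistic and p-value

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = np.zeros(300)
for t in range(2, 300):                        # y depends on lagged x only in [100, 200)
    y[t] = 0.5 * y[t - 1] + (0.8 * x[t - 1] if 100 <= t < 200 else 0.0) \
           + rng.normal(scale=0.3)
print(granger_f_test(y[100:200], x[100:200]))  # small p-value expected
print(granger_f_test(y[200:300], x[200:300]))  # large p-value expected
```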
【Keywords】:
【Paper Link】 【Pages】:813-821
【Authors】: Martin P. Seybold
【Abstract】: For a given sequence of location measurements, the goal of the geometric map matching problem is to compute a sequence of movements along edges of a spatially embedded graph that provides a ‘good explanation’ for the measurements. The problem becomes challenging because real-world data, such as traces or graphs from the OpenStreetMap project, does not exhibit homogeneous data quality: graph detail and errors vary between areas, and each trace has varying noise and precision. Hence, formalizing what a ‘good explanation’ is becomes quite difficult. We propose a novel map matching approach that locally adapts to the data quality by constructing what we call dominance decompositions. While our approach is computationally more expensive than previous approaches, our experiments show that it allows for high-quality map matching even in the presence of highly variable data quality, without parameter tuning.
【Keywords】:
【Paper Link】 【Pages】:822-830
【Authors】: Martin Jankowiak ; Manuel Gomez-Rodriguez
【Abstract】: Social media users and microbloggers post about a wide variety of (off-line) collective social activities as they participate in them, ranging from concerts and sporting events to political rallies and civil protests. In this context, people who take part in the same collective social activity often post closely related content from nearby locations at similar times, resulting in distinctive spatiotemporal patterns. Can we automatically detect these patterns and thus provide insights into the associated activities? In this paper, we propose a modeling framework for clustering streaming spatiotemporal data, the Spatial Dirichlet Hawkes Process (SDHP), which allows us to automatically uncover a wide variety of spatiotemporal patterns of collective social activity from geolocated online traces. Moreover, we develop an efficient, online inference algorithm based on Sequential Monte Carlo that scales to millions of geolocated posts. Experiments on synthetic data and real data gathered from Twitter show that our framework can recover a wide variety of meaningful social activity patterns in terms of both content and spatiotemporal dynamics, that it yields interesting insights about these patterns, and that it can be used to estimate the location from where a tweet was posted.
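The spatiotemporal intensity underlying Hawkes-style models of this kind can be sketched with an exponential temporal kernel and a Gaussian spatial kernel; the kernel forms and parameter values are assumptions, and the Dirichlet content model and Sequential Monte Carlo inference of the paper are not shown.

```python
# Spatiotemporal Hawkes intensity sketch: exponentially decaying temporal kernel
# times a Gaussian spatial kernel (parameters assumed; not the SDHP itself).
import numpy as np

def intensity(t, s, events, mu=0.1, alpha=0.5, beta=1.0, sigma=0.01):
    """events: list of (t_i, s_i) with t_i < t; s and s_i are 2-D locations (lon, lat)."""
    lam = mu                                         # background rate
    for t_i, s_i in events:
        if t_i >= t:
            continue
        temporal = alpha * beta * np.exp(-beta * (t - t_i))
        spatial = np.exp(-np.sum((np.asarray(s) - np.asarray(s_i)) ** 2)
                         / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
        lam += temporal * spatial                    # self-excitation from past posts
    return lam

events = [(0.5, (0.0, 0.0)), (0.8, (0.01, 0.0))]
print(intensity(1.0, (0.0, 0.0), events))
```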
【Keywords】:
【Paper Link】 【Pages】:831-839
【Authors】: Nikolaj Tatti
【Abstract】: One of the classic data mining tasks is to discover bursts: time intervals where events occur at an abnormally high rate. In this paper we revisit Kleinberg's seminal work, in which bursts are discovered using an exponential distribution with a varying rate parameter: the regions where it is more advantageous to set the rate higher are deemed bursty. The model depends on two parameters, the initial rate and the change rate. The initial rate, that is, the rate used when there is no burstiness, was set to the average rate over the whole sequence, while the change rate is provided by the user. We argue that these choices are suboptimal: they lead to worse likelihood and may cause existing bursts to be missed. We propose an alternative problem setting in which the model parameters are selected by optimizing the likelihood of the model. While this tweak is trivial from the problem-definition point of view, it changes the optimization problem greatly. To solve the problem in practice, we propose efficient (1 + ϵ) approximation schemes. Finally, we demonstrate empirically that with this setting we are able to discover bursts that would otherwise have gone undetected.
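For contrast with the proposed likelihood-optimized setting, here is a two-state sketch of the fixed-parameter model the abstract critiques: the base rate is the average rate, the burst rate is a user-supplied multiple s of it, and a penalty gamma is paid for entering the burst state.

```python
# Two-state Kleinberg-style burst detection over inter-arrival gaps (sketch of the
# fixed-parameter setting; the paper instead optimizes the rates by likelihood).
import numpy as np

def detect_bursts(timestamps, s=2.0, gamma=1.0):
    gaps = np.diff(np.sort(timestamps))
    base = 1.0 / gaps.mean()                     # rate when not bursty (average rate)
    rates = np.array([base, s * base])           # state 0 = base, state 1 = burst
    neg_log = -np.log(rates)[None, :] + gaps[:, None] * rates[None, :]  # -log exp. pdf
    trans = np.array([[0.0, gamma],              # penalty only for entering the burst state
                      [0.0, 0.0]])
    cost = np.array([0.0, np.inf])               # start in the base state
    back = []
    for nl in neg_log:                           # Viterbi over the two states
        total = cost[:, None] + trans + nl[None, :]
        back.append(total.argmin(axis=0))
        cost = total.min(axis=0)
    states = [int(cost.argmin())]                # backtrack the optimal state sequence
    for b in reversed(back[1:]):
        states.append(int(b[states[-1]]))
    return np.array(states[::-1])                # 1 marks gaps assigned to the burst state

ts = np.concatenate([np.arange(0, 10, 1.0), np.arange(10, 11, 0.1), np.arange(11, 20, 1.0)])
print(detect_bursts(ts))                         # the dense region around t=10 is flagged
```

In the paper's setting, the base and burst rates would instead be chosen to maximize the likelihood rather than fixed in advance.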
【Keywords】:
【Paper Link】 【Pages】:840-847
【Authors】:
【Abstract】: Backmatter includes the author index.
【Keywords】: