12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10-13, 2012. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:1-10
【Authors】: Gergely Ács ; Claude Castelluccia ; Rui Chen
【Abstract】: Differential privacy has emerged as one of the most promising privacy models for private data release. It can be used to release different types of data, and, in particular, histograms, which provide useful summaries of a dataset. Several differentially private histogram releasing schemes have been proposed recently. However, most of them directly add noise to the histogram counts, resulting in undesirable accuracy. In this paper, we propose two sanitization techniques that exploit the inherent redundancy of real-life datasets in order to boost the accuracy of histograms. They lossily compress the data and sanitize the compressed data. Our first scheme is an optimization of the Fourier Perturbation Algorithm (FPA) presented in [13]. It improves the accuracy of the initial FPA by a factor of 10. The other scheme relies on clustering and exploits the redundancy between bins. Our extensive experimental evaluation over various real-life and synthetic datasets demonstrates that our techniques preserve very accurate distributions and considerably improve the accuracy of range queries over attributed histograms.
【Keywords】: Fourier analysis; data compression; data privacy; pattern clustering; perturbation techniques; publishing; FPA; Fourier perturbation algorithm; attributed histograms; compressed data sanitization techniques; differential privacy models; differentially private histogram publishing; differentially private histogram releasing schemes; lossy compression; pattern clustering; real-life dataset inherent redundancy; real-life datasets; synthetic datasets; Data privacy; Databases; Discrete Fourier transforms; Histograms; Noise; Privacy; Sensitivity; Differential privacy; Fourier transform; clustering; histogram; lossy compression
【Paper Link】 【Pages】:11-20
【Authors】: B. Aditya Prakash ; Jilles Vreeken ; Christos Faloutsos
【Abstract】: Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper we answer this question affirmatively, and give an efficient method called NETSLEUTH for the well-known Susceptible-Infected virus propagation model. Essentially, we are after that set of seed nodes that best explain the given snapshot. We propose to employ the Minimum Description Length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give an highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NETSLEUTH can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experimentation on our method shows high accuracy in the detection of seed nodes, in addition to the correct automatic identification of their number. Moreover, we show NETSLEUTH scales linearly in the number of nodes of the graph.
【Keywords】: epidemics; graph theory; microorganisms; NETSLEUTH; culprit spotting; epidemics; graph snapshot; infected graph; infection; minimum description length principle; seed nodes; susceptible-infected virus propagation model; virus propagation ripple; Data models; Encoding; Equations; Integrated circuit modeling; Nickel; Reliability; Silicon; culprits; diffusion; epidemics; seeds
【Paper Link】 【Pages】:21-30
【Authors】: Ferit Akova ; Murat Dundar ; Yuan (Alan) Qi ; Bartek Rajwa
【Abstract】: We present a new direction for semi-supervised learning where self-adjusting generative models replace fixed ones and unlabeled data can potentially improve learning even when labeled data is only partially-observed. We model each class data by a mixture model and use a hierarchical Dirichlet process (HDP) to model observed as well as unobserved classes. We extend the standard HDP model to accommodate unlabeled samples and introduce a new sharing strategy, within the context of Gaussian mixture models, that restricts sharing with covariance matrices while leaving the mean vectors free. Our research is mainly driven by real-world applications with evolving data-generating mechanisms where obtaining a fully-observed labeled data set is impractical. We demonstrate the feasibility of the proposed approach for semi-supervised learning in two such applications.
【Keywords】: Gaussian processes; covariance matrices; data models; learning (artificial intelligence); Gaussian mixture models; HDP; covariance matrices; data model; data-generating mechanisms; hierarchical Dirichlet process; mean vectors; partially observed settings; real-world applications; self-adjusting generative models; semisupervised learning; sharing strategy; standard HDP model; unlabeled data; unobserved classes; Context modeling; Covariance matrix; Data models; Gaussian mixture model; Semisupervised learning; Vectors; class discovery; gaussian mixture model; hierarchical dirichlet process; partially-observed data sets; semi-supervised learning
【Paper Link】 【Pages】:31-40
【Authors】: Tahseen Al-Khateeb ; Mohammad M. Masud ; Latifur Khan ; Charu C. Aggarwal ; Jiawei Han ; Bhavani M. Thuraisingham
【Abstract】: Concept-evolution has recently received a lot of attention in the context of mining data streams. Concept-evolution occurs when a new class evolves in the stream. Although many recent studies address this issue, most of them do not consider the scenario of recurring classes in the stream. A class is called recurring if it appears in the stream, disappears for a while, and then reappears again. Existing data stream classification techniques either misclassify the recurring class instances as another class, or falsely identify the recurring classes as novel. This increases the prediction error of the classifiers, and in some cases causes unnecessary waste in memory and computational resources. In this paper we address the recurring class issue by proposing a novel "class-based" ensemble technique, which substitutes the traditional "chunk-based" ensemble approaches and correctly distinguishes between a recurring class and a novel one. We analytically and experimentally confirm the superiority of our method over state-of-the-art techniques.
【Keywords】: data mining; pattern classification; chunk- based ensemble approaches; class-based ensemble; class-based ensemble technique; classifier prediction error; computational resources; concept evolution; data stream classification techniques; data stream mining; memory waste; novel class detection; recurring class detection; state-of-the-art techniques; stream classification; Classification algorithms; Data models; Educational institutions; Electronic mail; Humans; Prediction algorithms; Training; novel class; recurring class; stream classification
【Paper Link】 【Pages】:41-50
【Authors】: Malak Alshawabkeh ; Javed A. Aslam ; Jennifer G. Dy ; David R. Kaeli
【Abstract】: Utilizing the concept of hypothesis margins to measure the quality of a set of features has been a growing line of research in the last decade. However, most previous algorithms have been developed under the large hypothesis margin principles of the 1-NN algorithm, such as Simba. Little attention has been paid so far to exploiting the hypothesis margins of boosting to evaluate features. Boosting is well known to maximize the training examples' hypothesis margins, in particular, the average margins which are known to be the first statistics that considers the whole margin distribution. In this paper, we describe how to utilize the training examples' mean margins of boosting to select features. A weight criterion, termed Margin Fraction (MF), is assigned to each feature that contributes to the average margin distribution combined in the final output produced by boosting. Applying the idea of MF to a sequential backward selection method, a new embedded selection algorithm is proposed, called SBS-MF. Experimentation is carried out using different data sets, which compares the proposed SBS-MF with two boosting based feature selection approaches, as well as to Simba. The results show that SBS-MF is effective in most of the cases.
【Keywords】: learning (artificial intelligence); statistics; 1-NN algorithm; MF; SBS-MF; Simba; average margin distribution; boosting based feature selection approach; boosting hypothesis margin; embedded selection algorithm; feature weighting; margin distribution; margin fraction; sequential backward selection method; statistics; training example mean margins; weight criterion; Additives; Boosting; Computers; Educational institutions; Predictive models; Training; Weight measurement; Feature selection; average margin; boosting
【Paper Link】 【Pages】:51-60
【Authors】: Fatemeh Azmandian ; Ayse Yilmazer ; Jennifer G. Dy ; Javed A. Aslam ; David R. Kaeli
【Abstract】: Effective outlier detection requires the data to be described by a set of features that captures the behavior of normal data while emphasizing those characteristics of outliers which make them different than normal data. In this work, we present a novel non-parametric evaluation criterion for filter-based feature selection which caters to outlier detection problems. The proposed method seeks the subset of features that represents the inherent characteristics of the normal dataset while forcing outliers to stand out, making them more easily distinguished by outlier detection algorithms. Experimental results on real datasets show the advantage of our feature selection algorithm compared to popular and state-of-the-art methods. We also show that the proposed algorithm is able to overcome the small sample space problem and perform well on highly imbalanced datasets. Furthermore, due to the highly parallelizable nature of the feature selection, we implement the algorithm on a graphics processing unit (GPU) to gain significant speedup over the serial version. The benefits of the GPU implementation are two-fold, as its performance scales very well in terms of the number of features, as well as the number of data points.
【Keywords】: data mining; graphics processing units; GPU-accelerated feature selection; filter-based feature selection; graphics processing unit; imbalanced datasets; local kernel density ratio; nonparametric evaluation criterion; normal data; outlier detection problems; small sample space problem; Detection algorithms; Educational institutions; Feature extraction; Graphics processing units; Kernel; Search problems; USA Councils; Feature Selection; GPU Acceleration; Imbalanced Data; Outlier Detection
【Paper Link】 【Pages】:61-70
【Authors】: Balamurugan P. ; Shirish Krishnaj Shevade ; T. Ravindra Babu
【Abstract】: Structural Support Vector Machines (SSVMs) have recently gained wide prominence in classifying structured and complex objects like parse-trees, image segments and Part-of-Speech (POS) tags. Typical learning algorithms used in training SSVMs result in model parameters which are vectors residing in a large-dimensional feature space. Such a high-dimensional model parameter vector contains many non-zero components which often lead to slow prediction and storage issues. Hence there is a need for sparse parameter vectors which contain a very small number of non-zero components. L1-regularizer and elastic net regularizer have been traditionally used to get sparse model parameters. Though L1-regularized structural SVMs have been studied in the past, the use of elastic net regularizer for structural SVMs has not been explored yet. In this work, we formulate the elastic net SSVM and propose a sequential alternating proximal algorithm to solve the dual formulation. We compare the proposed method with existing methods for L1-regularized Structural SVMs. Experiments on large-scale benchmark datasets show that the proposed dual elastic net SSVM trained using the sequential alternating proximal algorithm scales well and results in highly sparse model parameters while achieving a comparable generalization performance. Hence the proposed sequential alternating proximal algorithm is a competitive method to achieve sparse model parameters and a comparable generalization performance when elastic net regularized Structural SVMs are used on very large datasets.
【Keywords】: learning (artificial intelligence); pattern classification; support vector machines; L1-regularizer; complex object classification; dual elastic net SSVM; elastic net regularizer; high-dimensional model parameter vector; large-dimensional feature space; learning algorithms; nonzero components; scalable sparse structural SVM; sequential alternating proximal method; sparse model parameters; sparse parameter vectors; structural support vector machines; structured object classification; Computer science; Context; Electronic mail; Scalability; Support vector machine classification; Vectors; Alternating Proximal method; Structural SVMs
【Paper Link】 【Pages】:71-80
【Authors】: Suhrid Balakrishnan ; Sumit Chopra ; David Applegate ; Simon Urbanek
【Abstract】: Ever wonder why that Kia Ad ran during Iron Chef? Traditional advertising methodology on television is a fascinating mix of marketing, branding, measurement, and predictive modeling. While still a robust business, it is at risk with the recent growth of online and time-shifted (recorded) television. A particular issue is that traditional methods for television advertising are far less efficient than their counterparts in the online world which employ highly sophisticated computational techniques. This paper formalizes an approach to eliminate some of these inefficiencies by recasting the process of television advertising media campaign generation in a computational framework. We describe efficient mathematical approaches to solve for the task of finding optimal campaigns for specific target audiences. In two case studies, our campaigns report gains in key operational metrics of up to 56% compared to campaigns generated by traditional methods.
【Keywords】: advertising; mathematical analysis; television; Iron Chef; Kia ad; branding; computational techniques; computational television advertising; marketing; mathematical approaches; measurement; online television; operational metrics; predictive modeling; television advertising media campaign generation; time-shifted television; Advertising; Computational modeling; Measurement; Media; Optimization; Predictive models; TV; Computational Advertising; Media Campaign Generation; Optimization; Television Advertising
【Paper Link】 【Pages】:81-90
【Authors】: Nicola Barbieri ; Francesco Bonchi ; Giuseppe Manco
【Abstract】: We study social influence from a topic modeling perspective. We introduce novel topic-aware influence-driven propagation models that experimentally result to be more accurate in describing real-world cascades than the standard propagation models studied in the literature. In particular, we first propose simple topic-aware extensions of the well-known Independent Cascade and Linear Threshold models. Next, we propose a different approach explicitly modeling authoritativeness, influence and relevance under a topic-aware perspective. We devise methods to learn the parameters of the models from a dataset of past propagations. Our experimentation confirms the high accuracy of the proposed models and learning schemes.
【Keywords】: learning (artificial intelligence); social sciences; independent cascade model; influence-driven propagation model; learning scheme; linear threshold model; modeling authoritativeness; topic modeling perspective; topic-aware social influence propagation model; Atmospheric modeling; Computational modeling; Data models; Greedy algorithms; Integrated circuit modeling; Social network services
【Paper Link】 【Pages】:91-100
【Authors】: Mansurul Bhuiyan ; Mahmudur Rahman ; Mahmuda Rahman ; Mohammad Al Hasan
【Abstract】: Graphlet frequency distribution (GFD) has recently become popular for characterizing large networks. However, the computation of GFD for a network requires the exact count of embedded graphlets in that network, which is a computationally expensive task. As a result, it is practically infeasible to compute the GFD for even a moderately large network. In this paper, we propose GUISE, which uses a Markov Chain Monte Carlo (MCMC) sampling method for constructing the approximate GFD of a large network. Our experiments on networks with millions of nodes show that GUISE obtains the GFD within few minutes, whereas the exhaustive counting based approach takes several days.
【Keywords】: Markov processes; Monte Carlo methods; graph theory; sampling methods; GFD; GUISE; MCMC; Markov chain Monte Carlo sampling method; computationally expensive task; embedded graphlets; graphlet frequency distribution; large graph analysis; uniform graphlet sampling; Biological information theory; Context; Markov processes; Monte Carlo methods; Probability distribution; Radiation detectors; Vectors
【Paper Link】 【Pages】:101-110
【Authors】: Wei Bi ; James T. Kwok
【Abstract】: Hierarchical multilabel classification (HMC) allows an instance to have multiple labels residing in a hierarchy. A popular loss function used in HMC is the H-loss, which penalizes only the first classification mistake along each prediction path. However, the H-loss metric can only be used on tree-structured label hierarchies, but not on DAG hierarchies. Moreover, it may lead to misleading predictions as not all misclassifications in the hierarchy are penalized. In this paper, we overcome these deficiencies by proposing a hierarchy-aware loss function that is more appropriate for HMC. Using Bayesian decision theory, we then develop a Bayes-optimal classifier with respect to this loss function. Instead of requiring an exhaustive summation and search for the optimal multilabel, the proposed classification problem can be efficiently solved using a greedy algorithm on both tree-and DAG-structured label hierarchies. Experimental results on a large number of real-world data sets show that the proposed algorithm outperforms existing HMC methods.
【Keywords】: Bayes methods; decision theory; greedy algorithms; pattern classification; Bayes-optimal classifier; Bayesian decision theory; DAG-structured label hierarchy; H-loss metric; greedy algorithm; hierarchical multilabel classification; hierarchy-aware loss function; minimum Bayes risk; optimal multilabel; tree-structured label hierarchy; Bayesian methods; Bismuth; Decision theory; Greedy algorithms; Optimization; Prediction algorithms; Support vector machines; Bayesian decision theory; hierarchical classification; multilabel classification
【Paper Link】 【Pages】:111-120
【Authors】: Konstantinos Blekas ; Aristidis Likas
【Abstract】: We present a new regression mixture model where each mixture component is a multi-kernel version of the Relevance Vector Machine (RVM). In the proposed model, we exploit the enhanced modeling capability of RVMs due to their embedded sparsity enforcing properties. %The main contribution of this %work is the employment of RVM models as components of a mixture %model and their application to the time series clustering problem. Moreover, robustness is achieved with respect to the kernel parameters, by employing a weighted multi-kernel scheme. The mixture model is trained using the maximum a posteriori (MAP) approach, where the Expectation Maximization (EM) algorithm is applied offering closed form update equations for the model parameters. An incremental learning methodology is also presented to tackle the parameter initialization problem of the EM algorithm. The efficiency of the proposed mixture model is empirically demonstrated on the time series clustering problem using various artificial and real benchmark datasets and by performing comparisons with other regression mixture models.
【Keywords】: expectation-maximisation algorithm; learning (artificial intelligence); pattern clustering; regression analysis; support vector machines; time series; EM algorithm; MAP approach; RVM model; closed form update equation; expectation maximization algorithm; incremental learning methodology; kernel parameter; maximum-a-posteriori approach; mixture model; model parameter; multikernel relevance vector machines; parameter initialization problem; regression mixture model; sparsity enforcing property; time series clustering problem; weighted multikernel scheme; Covariance matrix; Data models; Hidden Markov models; Kernel; Mathematical model; Training; Vectors; Relevance Vector Machines; incremental EM learning; mixture models; multi-kernel; sparse prior
【Paper Link】 【Pages】:121-130
【Authors】: Ceren Budak ; Divyakant Agrawal ; Amr El Abbadi
【Abstract】: Recent studies on the diffusion of information in social networks have largely focused on models based on the influence of local friends. In this paper, we challenge the generalizability of this approach and revive theories introduced by social scientists in the context of diffusion of innovations to model user behavior. To this end, we study various diffusion models in two different online social networks, Digg and Twitter. We first evaluate the applicability of two representative local influence models and show that the behavior of most social networks users are not captured by these local models. Next, driven by theories introduced in the diffusion of innovations research, we introduce a novel diffusion model called Gaussian Logit Curve Model (GLCM) that models user behavior with respect to the behavior of the general population. Our analysis shows that GLCM captures user behavior significantly better than local models, especially in the context of Digg. Aiming to capture both the local and global signals, we introduce various hybrid models and evaluate them through statistical methods. Our methodology models each user separately, automatically determining which users are driven by their local relations and which users are better defined through adopter categories, therefore capturing the complexity of human behavior.
【Keywords】: Gaussian processes; behavioural sciences; innovation management; social networking (online); Digg; GLCM; Gaussian logit curve model; Twitter; diffusion models; human behavior complexity; hybrid models; information diffusion; innovation diffusion; local friends influence; local influence models; model user behavior; online social networks; social network users behavior; social scientists; statistical methods; Logistics; Mathematical model; Maximum likelihood estimation; Sociology; Technological innovation; Twitter; ?rth logistic regression; diffusion models; diffusion of innovations; gaussian logit curve; social networks
【Paper Link】 【Pages】:131-140
【Authors】: Kai-Wei Chang ; Biplab Deka ; Wen-mei W. Hwu ; Dan Roth
【Abstract】: Time series shapelet discovery algorithm finds subsequences from a set of time series for use as primitives for time series classification. This algorithm has drawn a lot of interest because of the interpretability of its results. However, computation requirements restrict the algorithm from dealing with large data sets and may limit its application in many domains. In this paper, we address this issue by redesigning the algorithm for implementation on highly parallel Graphics Process Units (GPUs). We investigate several concepts of GPU programming and propose a dynamic programming algorithm that is suitable for implementation on GPUs. Results show that the proposed GPU implementation significantly reduces the running time of the shapelet discovery algorithm. For example, on the largest sample dataset from the original authors, the running time is reduced from half a day to two minutes.
【Keywords】: dynamic programming; graphics processing units; pattern classification; time series; GPU programming; dynamic programming algorithm; graphics processing units; pattern-based time series classification; time series shapelet discovery algorithm; Graphics processing units; Heuristic algorithms; Instruction sets; Programming; Registers; Signal processing algorithms; Time series analysis; Classification; GPU; Pattern-based Classification; Time Series
【Paper Link】 【Pages】:141-150
【Authors】: Sanjay Chawla ; Yu Zheng ; Jiafeng Hu
【Abstract】: We propose a novel two-step mining and optimization framework for inferring the root cause of anomalies that appear in road traffic data. We model road traffic as a time-dependent flow on a network formed by partitioning a city into regions bounded by major roads. In the first step we identify link anomalies based on their deviation from their historical traffic profile. However, link anomalies on their own shed very little light on what caused them to be anomalous. In the second step we take a generative approach by modeling the flow in a network in terms of the origin-destination (OD) matrix which physically relates the latent flow between origin and destination and the observable flow on the links. The key insight is that instead of using all of link traffic as the observable vector we only use the link anomaly vector. By solving an L1 inverse problem we infer the routes (the origin-destination pairs) which gave rise to the link anomalies. Experiments on a very large GPS data set consisting on nearly eight hundred million data points demonstrate that we can discover routes which can clearly explain the appearance of link anomalies. The use of optimization techniques to explain observable anomalies in a generative fashion is, to the best of our knowledge, entirely novel.
【Keywords】: Global Positioning System; data mining; matrix algebra; road traffic; vectors; GPS data set; link anomaly vector; optimization framework; origin-destination matrix; road traffic anomalies; time-dependent flow; two-step mining; Covariance matrix; Global Positioning System; Optimization; Principal component analysis; Roads; Trajectory; Vectors; anomaly detection; data mining; gps data; road traffic
【Paper Link】 【Pages】:151-160
【Authors】: Yang Chen ; Feng Chen ; Jing Dai ; T. Charles Clancy ; Yao-Jan Wu
【Abstract】: This paper describes an efficient and effective design of Robust Spatio-Temporal Prediction based on Student's t distribution, namely, St-RSTP, to provide estimations based on observations over spatio-temporal neighbors. The proposed St-RSTP is more resilient to outliers or other small departures from model assumptions than its ancestor, the Spatio-Temporal Random Effects (STRE) model. STRE is a state-of-the-art statistical model with linear order complexity for large scale processing. However, it assumes Gaussian observations, which has the well-known limitation of non-robustness. In our StRSTP design, the measurement error follows Student's t distribution, instead of a traditional Gaussian distribution. This design reduces the influence of outliers, improves prediction quality, and keeps the problem analytically intractable. We propose a novel approximate inference approach, which approximates the model into the form that separates the high dimensional latent variables into groups, and then estimates the posterior distributions of different groups of variables separately in the framework of Expectation Propagation. As a good property, our approximate approach degeneralizes to the standard STRE based prediction, when the degree of freedom of the Student's t distribution is set to infinite. Extensive experimental evaluations based on both simulation and real-life data sets demonstrated the robustness and the efficiency of our Student-t prediction model. The proposed approach provides critical functionality for stochastic processes on spatio-temporal data.
【Keywords】: Gaussian processes; inference mechanisms; random processes; statistical distributions; Gaussian observation; STRE based prediction; STRE model; St-RSTP; Student-t distribution; approximate inference approach; expectation propagation; high dimensional latent variable; large scale processing; linear order complexity; measurement error; posterior distributions; robust spatio-temporal prediction; spatio-temporal random effect; statistical model; stochastic process; Approximation methods; Equations; Estimation; Mathematical model; Predictive models; Robustness; Vectors; Expectation Propagation; Spatio-Temporal Process; Students t Distribution
【Paper Link】 【Pages】:161-170
【Authors】: Radha Chitta ; Rong Jin ; Anil K. Jain
【Abstract】: Kernel clustering algorithms have the ability to capture the non-linear structure inherent in many real world data sets and thereby, achieve better clustering performance than Euclidean distance based clustering algorithms. However, their quadratic computational complexity renders them non-scalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large scale classification, to accelerate kernel clustering. The key idea behind the use of random Fourier maps for clustering is to project the data into a low-dimensional space where the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme which uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy similar to that achieved by state-of-the-art kernel clustering algorithms.
【Keywords】: Fourier transforms; computational complexity; pattern classification; pattern clustering; Euclidean distance based clustering algorithms; kernel clustering algorithms; large scale classification; nonlinear structure; quadratic computational complexity; random Fourier features; random Fourier maps; real world data sets; transformed data matrix; Accuracy; Approximation algorithms; Approximation methods; Clustering algorithms; Complexity theory; Kernel; Vectors; Kernel clustering; Kernel k-means; Random Fourier features; Scalability
【Paper Link】 【Pages】:171-180
【Authors】: Hanbo Dai ; Feida Zhu ; Ee-Peng Lim ; HweeHwa Pang
【Abstract】: Bipartite graphs can model many real life applications including users-rating-products in online marketplaces, users-clicking-webpages on the World Wide Web and users referring- users in social networks. In these graphs, the anomalousness of nodes in one partite often depends on that of their connected nodes in the other partite. Previous studies have shown that this dependency can be positive (the anomalousness of a node in one partite increases or decreases along with that of its connected nodes in the other partite) or negative (the anomalousness of a node in one partite rises or falls in opposite direction to that of its connected nodes in the other partite). In this paper, we unify both positive and negative mutual dependency relationships in an unsupervised framework for detecting anomalous nodes in bipartite graphs. This is the first work that integrates both mutual dependency principles to model the complete set of anomalous behaviors of nodes that cannot be identified by either principle alone. We formulate our principles and design an iterative algorithm to simultaneously compute the anomaly scores of nodes in both partites. Moreover, we mathematically prove that the ranking of nodes by anomaly scores in each partite converges. Our framework is examined on synthetic graphs and the results show that our model outperforms existing models with only positive or negative mutual dependency principles. We also apply our framework to two real life datasets: Goodreads as a users-rating-books setting and Buzzcity as a users-clicking advertisements setting. The results show that our method is able to detect suspected spamming users and spammed books in Goodreads and achieve higher precision in identifying fraudulent advertisement publishers than existing approaches.
【Keywords】: advertising; electronic publishing; graph theory; iterative methods; security of data; Buzzcity setting; Goodreads setting; World Wide Web; anomaly detection; bipartite graph; fraudulent advertisement publisher identification; iterative algorithm; mutual dependency principle; negative node dependency; node ranking; online marketplace; positive node dependency; social networks; spammed books detection; suspected spamming user detection; synthetic graph; unsupervised framework; users-clicking-webpages application; users-rating-products application; Bipartite graph; Computational modeling; Convergence; Eigenvalues and eigenfunctions; Image edge detection; Mathematical model; Vectors; Anomaly Detection; Bipartite Graph; Mutual Dependency; Mutual Reinforcement; Node Anomalies
【Paper Link】 【Pages】:181-190
【Authors】: Yuxiao Dong ; Jie Tang ; Sen Wu ; Jilei Tian ; Nitesh V. Chawla ; Jinghai Rao ; Huanhuan Cao
【Abstract】: Link prediction and recommendation is a fundamental problem in social network analysis. The key challenge of link prediction comes from the sparsity of networks due to the strong disproportion of links that they have potential to form to links that do form. Most previous work tries to solve the problem in single network, few research focus on capturing the general principles of link formation across heterogeneous networks. In this work, we give a formal definition of link recommendation across heterogeneous networks. Then we propose a ranking factor graph model (RFG) for predicting links in social networks, which effectively improves the predictive performance. Motivated by the intuition that people make friends in different networks with similar principles, we find several social patterns that are general across heterogeneous networks. With the general social patterns, we develop a transfer-based RFG model that combines them with network structure information. This model provides us insight into fundamental principles that drive the link formation and network evolution. Finally, we verify the predictive performance of the presented transfer model on 12 pairs of transfer cases. Our experimental results demonstrate that the transfer of general social patterns indeed help the prediction of links.
【Keywords】: graph theory; network theory (graphs); social networking (online); general social patterns; heterogeneous social networks; link formation principles; link prediction; link recommendation; network evolution; network structure information; ranking factor graph model; social network analysis; transfer-based RFG model; Correlation; Data mining; Data models; Indexes; Predictive models; Twitter; Factor graph; Heterogeneous networks; Link prediction; Recommendation; Social network analysis
【Paper Link】 【Pages】:191-200
【Authors】: Changying Du ; Fuzhen Zhuang ; Qing He ; Zhongzhi Shi
【Abstract】: Multi-task learning has proven to be useful to boost the learning of multiple related but different tasks. Meanwhile, latent semantic models such as LSA and LDA are popular and effective methods to extract discriminative semantic features of high dimensional dyadic data. In this paper, we present a method to combine these two techniques together by introducing a new matrix tri-factorization based formulation for semi-supervised latent semantic learning, which can incorporate labeled information into traditional unsupervised learning of latent semantics. Our inspiration for multi-task semantic feature learning comes from two facts, i.e., 1) multiple tasks generally share a set of common latent semantics, and 2) a semantic usually has a stable indication of categories no matter which task it is from. Thus to make multiple tasks learn from each other we wish to share the associations between categories and those common semantics among tasks. Along this line, we propose a novel joint Nonnegative matrix tri-factorization framework with the aforesaid associations shared among tasks in the form of a semantic-category relation matrix. Our new formulation for multi-task learning can simultaneously learn (1) discriminative semantic features of each task, (2) predictive structure and categories of unlabeled data in each task, (3) common semantics shared among tasks and specific semantics exclusive to each task. We give alternating iterative algorithm to optimize our objective and theoretically show its convergence. Finally extensive experiments on text data along with the comparison with various baselines and three state-of-the-art multi-task learning algorithms demonstrate the effectiveness of our method.
【Keywords】: learning (artificial intelligence); matrix decomposition; multiprogramming; pattern classification; classification; discriminative semantic features; high dimensional dyadic data; joint nonnegative matrix trifactorization framework; latent semantic models; multi-task semi-supervised semantic feature learning; semantic-category relation matrix; Data mining; Data models; Feature extraction; Iterative methods; Joints; Optimization; Semantics; joint nonnegative matrix tri-factorization; multi-task learning; semantic feature learning; semi-supervised learning; text classification
【Paper Link】 【Pages】:201-210
【Authors】: Liang Du ; Xuan Li ; Yi-Dong Shen
【Abstract】: Nonnegative matrix factorization (NMF) is a popular technique for learning parts-based representation and data clustering. It usually uses the squared residuals to quantify the quality of factorization, which is optimal specifically to zero-mean, Gaussian noise and sensitive to outliers in general cases. In this paper, we propose a robust NMF method based on the correntropy induced metric, which is much more insensitive to outliers. A half-quadratic optimization algorithm is developed to solve the proposed problem efficiently. The proposed method is further extended to handle outlier rows by incorporating structural knowledge about the outliers. Experimental results on data sets with and without apparent outliers demonstrate the effectiveness of the proposed algorithms.
【Keywords】: Gaussian noise; data handling; matrix decomposition; minimisation; Gaussian noise; correntropy induced metric; data clustering; half quadratic minimization; half quadratic optimization algorithm; parts based representation; robust nonnegative matrix factorization; structural knowledge; zero mean; Computer integrated manufacturing; Kernel; Linear programming; Matrix decomposition; Minimization; Optimization; Robustness; correntropy induced metric; half-quadratic optimization; robust non-negative matrix factorization
【Paper Link】 【Pages】:211-220
【Authors】: Dongsheng Duan ; Yuhua Li ; Ruixuan Li ; Rui Zhang ; Aiming Wen
【Abstract】: Topic modeling has become a widely used tool for document management due to its superior performance. However, there are few topic models distinguishing the importance of documents on different topics. In this paper, we investigate how to utilize the importance of documents to improve topic modeling and propose to incorporate link based ranking into topic modeling. Specifically, topical pagerank is used to compute the topic level ranking of documents, which indicates the importance of documents on different topics. By retreating the topical ranking of a document as the probability of the document involved in corresponding topic, a generalized relation is built between ranking and topic modeling. Based on the relation, a ranking based topic model Rank Topic is proposed. With Rank Topic, a mutual enhancement framework is established between ranking and topic modeling. Extensive experiments on paper citation data and Twitter data are conducted to compare the performance of Rank Topic with that of some state-of-the-art topic models. Experimental results show that Rank Topic performs much better than some baseline models and is comparable with the state-of-the-art link combined relational topic model (RTM) in generalization performance, document clustering and classification by setting a proper balancing parameter. It is also demonstrated in both quantitative and qualitative ways that topics detected by Rank Topic are more interpretable than those detected by some baseline models and still competitive with RTM.
【Keywords】: document handling; generalisation (artificial intelligence); pattern classification; pattern clustering; probability; RankTopic tool; Twitter data; balancing parameter; document classification; document clustering; document importance; document management; document ranking; generalization performance; generalized relation; mutual enhancement framework; paper citation data; probability; ranking based topic modeling; relational topic model; topical pagerank; Computational modeling; Data models; Educational institutions; Equations; Mathematical model; Noise; Web pages; Classification; Clustering; Document Network; Ranking; Topic Modeling
【Paper Link】 【Pages】:221-230
【Authors】: Fabon Dzogang ; Christophe Marsala ; Marie-Jeanne Lesot ; Maria Rifqi
【Abstract】: We propose an extension of the spherical K-means algorithm to deal with settings where the number of data points is largely inferior to the number of dimensions. We assume the data to lie in local and dense regions of the original space and we propose to embed each cluster into its specific ellipsoid. A new objective function is introduced, analytical solutions are derived for both the centroids and the associated ellipsoids. Furthermore, a study on the complexity of this algorithm highlights that it is of same order as the regular K-means algorithm. Results on both synthetic and real data show the efficiency of the proposed method.
【Keywords】: computational complexity; document handling; pattern clustering; algorithm complexity; document clustering; ellipsoid; ellipsoidal k-means; spherical k-means algorithm; Clustering algorithms; Ellipsoids; Feature extraction; Linear programming; Partitioning algorithms; Tuning; Vectors; clustering; feature selection; information retrieval; spherical k-means
【Paper Link】 【Pages】:231-240
【Authors】: Dóra Erdös ; Rainer Gemulla ; Evimaria Terzi
【Abstract】: Consider a social network and suppose that we are given the number of common friends between each pair of users. Can we reconstruct the underlying network? Similarly, consider a set of documents and the words that appear in them. If we know the number of common words for every pair of documents, as well as the number of common documents for every pair of words, can we infer which words appear in which documents? In this paper, we develop a general methodology for answering questions like the ones above. We formalize these questions in what we call the Reconstruct problem: Given information about the common neighbors of nodes in a network, our goal is to reconstruct the hidden binary matrix that indicates the presence or absence of relationships between individual nodes. We propose an effective and practical heuristic, which exploits properties of the singular value decomposition of the hidden binary matrix. More specifically, we show that using the available neighborhood information, we can reconstruct the hidden matrix by finding the components of its singular value decomposition and then combining them appropriately. Our extensive experimental study suggests that our methods are able to reconstruct binary matrices of different characteristics with up to 100% accuracy.
【Keywords】: data handling; graph theory; singular value decomposition; documents; graph reconstruction; hidden binary matrix reconstruction; individual nodes; neighborhood data; neighborhood information; reconstruct problem; singular value decomposition; social network; Bipartite graph; Matrix decomposition; Measurement; Motion pictures; Singular value decomposition; Symmetric matrices; Vectors
【Paper Link】 【Pages】:241-250
【Authors】: Mingyu Fan ; Hong Qiao ; Bo Zhang ; Xiaoqin Zhang
【Abstract】: Manifold learning is an important topic in pattern recognition and computer vision. However, most manifold learning algorithms implicitly assume the data are aligned on a single manifold, which is too strict in actual applications. Isometric feature mapping (Isomap), as a promising manifold learning method, fails to work on data which distribute on clusters in a single manifold or manifolds. In this paper, we propose a new multi-manifold learning algorithm (M-Isomap). The algorithm first discovers the data manifolds and then reduces the dimensionality of the manifolds separately. Meanwhile, a skeleton representing the global structure of whole data set is built and kept in low-dimensional space. Secondly, by referring to the low-dimensional representation of the skeleton, the embeddings of the manifolds are relocated to a global coordinate system. Compared with previous methods, these algorithms can keep both of the intra and inter manifolds geodesics faithfully. The features and effectiveness of the proposed multi-manifold learning algorithms are demonstrated and compared through experiments.
【Keywords】: data structures; differential geometry; feature extraction; M-Isomap; computer vision; dimensionality reduction; feature extraction; global coordinate system; global data structure; intermanifolds geodesics; intramanifolds geodesics; isomap; isometric feature mapping; isometric multimanifold learning; manifold learning method; multimanifold learning algorithm; pattern recognition; skeleton; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Heuristic algorithms; Manifolds; Skeleton; Vectors; Feature extraction; geodesic distance; multi-manifold learning
【Paper Link】 【Pages】:251-259
【Authors】: Yuqiang Fang ; Ruili Wang ; Bin Dai
【Abstract】: The key task in graph-oriented learning is constructing an informative graph to model the geometrical and discriminant structure of a data manifold. Since traditional graph construction methods are sensitive to noise and less datum-adaptive to changes in density, a new graph construction method so-called ℓ1-Graph has been proposed [1] recently. A graph construction method needs to have two important properties: sparsity and locality. However, the ℓ1-Graph is strong in sparsity property, but weak in locality. In order to overcome such limitation, we propose a new method of constructing an informative graph using automatic group sparse regularization based on the work of ℓ1-Graph, which is called as group sparse graph (GroupSp-Graph). The newly developed GroupSp-Graph has the same noise-insensitive property as ℓ1-Graph, and also can successively preserve the group and local information in the graph. In other words, the proposed group sparse graph has both properties of sparsity and locality simultaneously. Furthermore, we integrate the proposed graph with several graph-oriented learning algorithms: spectral embedding, spectral clustering, subspace learning and manifold regularized non-negative matrix factorization. The empirical studies on benchmark data sets show that the proposed algorithms achieve considerable improvement over classic graph constructing methods and the ℓ1-Graph method in various learning task.
【Keywords】: data analysis; geometry; graph theory; learning (artificial intelligence); matrix decomposition; pattern clustering; ℓ1-graph; GroupSp-graph; automatic group sparse regularization; automatic group sparsity; data analysis; data manifold; discriminant structure; geometrical structure; graph construction; graph-oriented learning; group sparse graph; informative graph; learning task; manifold regularized nonnegative matrix factorization; noise-insensitive property; sparsity property; spectral clustering; spectral embedding; subspace learning; Clustering algorithms; Educational institutions; Equations; Laplace equations; Manifolds; Noise; Sparse matrices; graph learning; non-negative matrix factoriza-tion; sparse representation; spectral embedding; subspace learning
【Paper Link】 【Pages】:260-269
【Authors】: Pedro Ferrera ; Ivan de Prado ; Eric Palacios ; Jose Luis Fernandez-Marquez ; Giovanna Di Marzo Serugendo
【Abstract】: This paper proposes Tuple Map Reduce, a new foundational model extending Map Reduce with the notion of tuples. Tuple Map Reduce allows to bridge the gap between the low-level constructs provided by Map Reduce and higher-level needs required by programmers, such as compound records, sorting or joins. This paper presents as well Pangool, an open-source framework implementing Tuple Map Reduce. Pangool eases the design and implementation of applications based on Map Reduce and increases their flexibility, still maintaining Hadoop's performance.
【Keywords】: distributed programming; public domain software; Hadoop; Pangool; compound records; foundational model; low-level constructs; open-source framework; tuple MapReduce; Companies; Compounds; Computational modeling; Data models; Mathematical model; Programming; Sorting; Big Data; Distributed systems; Hadoop; MapReduce; Scalability
【Paper Link】 【Pages】:270-279
【Authors】: Stephan Günnemann ; Phuong Dao ; Mohsen Jamali ; Martin Ester
【Abstract】: Assessing the significance of data mining results is an important step in the knowledge discovery process. While results might appear interesting at a first glance, they can often be explained by already known characteristics of the data. Randomization is an established technique for significance testing, and methods to assess data mining results on vector data or network data have been proposed. In many applications, however, both sources are simultaneously given. Since these sources are rarely independent of each other but highly correlated, naively applying existing randomization methods on each source separately is questionable. In this work, we present a method to assess the significance of mining results on graphs with binary features vectors. We propose a novel null model that preserves correlation information between both sources. Our randomization exploits an adaptive Metropolis sampling and interweaves attribute randomization and graph randomization steps. In thorough experiments, we demonstrate the application of our technique. Our results indicate that while simultaneously using both sources is beneficial, often one source of information is dominant for determining the mining results.
【Keywords】: data mining; graph theory; sampling methods; vectors; adaptive Metropolis sampling; attribute randomization; binary feature vector; correlation information; data mining result; graph randomization; information source; knowledge discovery process; network data; randomization technique; significance testing; vector data; Clustering algorithms; Correlation; Data mining; Data models; Markov processes; Testing; Vectors; data mining; graph; network; randomization; significance testing
【Paper Link】 【Pages】:280-289
【Authors】: Yu Hayashi ; Kenji Yamanishi
【Abstract】: We are concerned with the issue of tracking changes of variable dependencies from multivariate time series. Conventionally, this issue has been addressed in the batch scenario where the whole data set is given at once, and the change detection must be done in a retrospective way. This paper addresses this issue in a sequential scenario where multivariate data are sequentially input and the detection must be done in a sequential fashion. We propose a new method for sequential tracking of variable dependencies. In it we employ a Bayesian network as a representation of variable dependencies. The key ideas of our method are, 1) we extend the theory of dynamic model selection (DMS), which has been developed in the batch-learning scenario, into the sequential setting, and apply it to our issue, 2) we conduct the change detection sequentially using dynamic programming per a window where we employ the Hoeffding's bound to automatically determine the window size. We empirically demonstrate that our proposed method is able to perform change detection more efficiently than a conventional batch method. Further, we give a new framework of an application of variable dependency change detection, which we call Ad Impact Relation analysis (AIR). In it, we detect the time point when a commercial message advertisement has given an impact on the market and effectively visulaize the impact through network changes. We employ real data sets to demonstrate the validity of AIR.
【Keywords】: advertising data processing; belief networks; data analysis; dynamic programming; learning (artificial intelligence); time series; AIR; Bayesian network; DMS; Hoeffding's bound; ad impact relation analysis; batch-learning scenario; commercial message advertisement; conventional batch method; dynamic model selection; dynamic programming; multivariate data; multivariate time series; sequential fashion; sequential network change detection; sequential scenario; sequential setting; sequential tracking; variable dependency change detection; Algorithm design and analysis; Bayesian methods; Computational modeling; Data models; Encoding; Graphical models; Time series analysis; Advertisement; Bayesian Network; Dynamic Model Selection; Marketing; Netword Change Detection
【Paper Link】 【Pages】:290-298
【Authors】: Jingrui He ; Yada Zhu
【Abstract】: Many real problems of multi-task learning exhibit hierarchical task relatedness. In other words, the tasks are partitioned into multiple groups. Different tasks within the same group are related on the task-level, whereas different groups are related on the group-level. For example, in semiconductor manufacturing, the problem of wafer quality prediction can be considered as hierarchical multi-task learning, where each task corresponds to a single side of a chamber with side-level relatedness, and a group of tasks corresponds to a chamber of multiple sides with chamber-level relatedness. Motivated by this application, in this paper, we propose an optimization framework for hierarchical multi-task learning, which partitions all the input features into 2 sets based on their characteristics, and models task-level and group-level relatedness by imposing different constraints on the coefficient vectors of the 2 sets. This is different from existing work on task clustering where the goal is to uncover the grouping of tasks, the tasks do not exhibit group-level relatedness, and the input features are not discriminated in the prediction model to model task-level and group-level relatedness. To solve this framework, we propose the hear algorithm based on block coordinate descent, and demonstrate its effectiveness on both synthetic and real data sets from domains of semiconductor manufacturing and document classification.
【Keywords】: integrated circuit manufacture; learning (artificial intelligence); optimisation; pattern clustering; prediction theory; production engineering computing; semiconductor technology; block coordinate descent-based HEAR algorithm; chamber-level relatedness; coefficient vectors; document classification; group-level relatedness; hierarchical multitask learning; hierarchical task relatedness; optimization framework; prediction model; real data sets; semiconductor manufacturing; side-level relatedness; synthetic data sets; task clustering; task-level relatedness; wafer quality prediction; Linear programming; Manufacturing; Optimization; Prediction algorithms; Predictive models; Semiconductor device modeling; Vectors; hierarchical multi-task learning; task relatedness; wafer quality
【Paper Link】 【Pages】:299-308
【Authors】: Michael E. Houle ; Xiguo Ma ; Michael Nett ; Vincent Oria
【Abstract】: In data mining applications such as subspace clustering or feature selection, changes to the underlying feature set can require the reconstruction of search indices to support fundamental data mining tasks. For such situations, multi-step search approaches have been proposed that can accommodate changes in the underlying similarity measure without the need to rebuild the index. In this paper, we present a heuristic multi-step search algorithm that utilizes a measure of intrinsic dimension, the generalized expansion dimension (GED), as the basis of its search termination condition. Compared to the current state-of-the-art method, experimental results show that our heuristic approach is able to obtain significant improvements in both the number of candidates and the running time, while losing very little in the accuracy of the query results.
【Keywords】: data mining; pattern clustering; query processing; GED; data mining; dimensional testing; feature selection; feature set; generalized expansion dimension; heuristic multistep search algorithm; intrinsic dimension; multistep similarity search; query result; search indices; search termination condition; similarity measure; subspace clustering; Algorithm design and analysis; Approximation algorithms; Data mining; Heuristic algorithms; Indexes; Measurement; Vectors; Similarity search; adaptive similarity; intrinsic dimensionality; kNN; multi-step; nearest neighbor
【Paper Link】 【Pages】:309-318
【Authors】: Zhou Jin ; Jiuyong Li ; Lin Liu ; Thuc Duy Le ; Bing-Yu Sun ; Rujing Wang
【Abstract】: Discovering causal relationships in large databases of observational data is challenging. The pioneering work in this area was rooted in the theory of Bayesian network (BN) learning, which however, is a NP-complete problem. Hence several constraint-based algorithms have been developed to efficiently discover causations in large databases. These methods usually use the idea of BN learning, directly or indirectly, and are focused on causal relationships with single cause variables. In this paper, we propose an approach to mine causal rules in large databases of binary variables. Our method expands the scope of causality discovery to causal relationships with multiple cause variables, and we utilise partial association tests to exclude noncausal associations, to ensure the high reliability of discovered causal rules. Furthermore an efficient algorithm is designed for the tests in large databases. We assess the method with a set of real-world diagnostic data. The results show that our method can effectively discover interesting causal rules in large databases.
【Keywords】: Bayes methods; computational complexity; data mining; very large databases; Bayesian network learning; NP-complete problem; binary variable; causal rule discovery; causal rule mining; constraint-based algorithm; large database; observational data; partial association test; Association rules; Bayesian methods; Databases; Diseases; Equations; Reliability; Testing; causal rule; causality; data mining; partial association
【Paper Link】 【Pages】:319-328
【Authors】: Ramakrishnan Kannan ; Mariya Ishteva ; Haesun Park
【Abstract】: Matrix lower rank approximations such as non-negative matrix factorization (NMF) have been successfully used to solve many data mining tasks. In this paper, we propose a new matrix lower rank approximation called Bounded Matrix Low Rank Approximation (BMA) which imposes a lower and an upper bound on every element of a lower rank matrix that best approximates a given matrix with missing elements. This new approximation models many real world problems, such as recommender systems, and performs better than other methods, such as singular value decompositions (SVD) or NMF. We present an efficient algorithm to solve BMA based on coordinate descent method. BMA is different from NMF as it imposes bounds on the approximation itself rather than on each of the low rank factors. We show that our algorithm is scalable for large matrices with missing elements on multi core systems with low memory. We present substantial experimental results illustrating that the proposed method outperforms the state of the art algorithms for recommender systems such as Stochastic Gradient Descent, Alternating Least Squares with regularization, SVD++, Bias-SVD on real world data sets such as Jester, Movie lens, Book crossing, Online dating and Netflix.
【Keywords】: data mining; gradient methods; matrix algebra; bounded matrix low rank approximation model; coordinate descent method; data mining task; multicore systems; nonnegative matrix factorization; recommender systems; regularization; singular value decomposition; stochastic gradient descent; Approximation algorithms; Least squares approximation; Matrix decomposition; Recommender systems; Upper bound; Vectors; Low rank approximation; block; block coordinate de- scent method; bound constraints; matrix factorization; recommender systems; scalable algorithm
【Paper Link】 【Pages】:329-338
【Authors】: Namit Katariya ; Arun Iyer ; Sunita Sarawagi
【Abstract】: The goal of this work is to estimate the accuracy of a classifier on a large unlabeled dataset based on a small labeled set and a human labeler. We seek to estimate accuracy and select instances for labeling in a loop via a continuously refined stratified sampling strategy. For stratifying data we develop a novel strategy of learning r bit hash functions to preserve similarity in accuracy values. We show that our algorithm provides better accuracy estimates than existing methods for learning distance preserving hash functions. Experiments on a wide spectrum of real datasets show that our estimates achieve between 15% and 62% relative reduction in error compared to existing approaches. We show how to perform stratified sampling on unlabeled data that is so large that in an interactive setting even a single sequential scan is impractical. We present an optimal algorithm for performing importance sampling on a static index over the data that achieves close to exact estimates while reading three orders of magnitude less data.
【Keywords】: cryptography; importance sampling; learning (artificial intelligence); pattern classification; sampling methods; active classifier evaluation; continuously refined stratified sampling strategy; distance preserving hash function; importance sampling; labeling accuracy; labeling instance; learning strategy; unlabeled dataset classifier; Accuracy; Estimation; Humans; Labeling; Learning systems; Reliability; Vectors; Accuracy estimation; active evaluation; learning hash functions
【Paper Link】 【Pages】:339-348
【Authors】: Gilad Katz ; Asaf Shabtai ; Lior Rokach ; Nir Ofek
【Abstract】: Decision trees have three main disadvantages: reduced performance when the training set is small, rigid decision criteria and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree - a post-processing method which enables decision trees to better classify outlier instances. This method, which can be applied on any decision trees algorithm, uses confidence intervals in order to identify these hard-to-classify instances and proposes alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced or multi-class datasets in which an average improvement of 5%-9% in the AUC performance is reported.
【Keywords】: decision trees; pattern classification; set theory; AUC performance; ConfDTree; classification process; confidence intervals; decision criteria; decision trees algorithm; hard-to-classify instances; multiclass datasets; post-processing method; training set; Classification algorithms; Decision trees; Gaussian distribution; Prediction algorithms; Standards; Training; Vegetation; confidence intervals; decision trees; imbalanced datasets
【Paper Link】 【Pages】:349-358
【Authors】: Hyungsul Kim ; Yizhou Sun ; Julia Hockenmaier ; Jiawei Han
【Abstract】: Topic models, which factor each document into different topics and represent each topic as a distribution of terms, have been widely and successfully used to better understand collections of text documents. However, documents are also associated with further information, such as the set of real-world entities mentioned in them. For example, news articles are usually related to several people, organizations, countries or locations. Since those associated entities carry rich information, it is highly desirable to build more expressive, entity-based topic models, which can capture the term distributions for each topic, each entity, as well as each topic-entity pair. In this paper, we therefore introduce a novel Entity Topic Model (ETM) for documents that are associated with a set of entities. ETM not only models the generative process of a term given its topic and entity information, but also models the correlation of entity term distributions and topic term distributions. A Gibbs sampling-based algorithm is proposed to learn the model. Experiments on real datasets demonstrate the effectiveness of our approach over several state-of-the-art baselines.
【Keywords】: Markov processes; Monte Carlo methods; data mining; learning (artificial intelligence); text analysis; ETM; Gibbs sampling-based algorithm; document mining; entity information; entity term distribution correlation; entity topic models; model learning; text documents; topic information; topic term distributions; topic-entity pair; Analytical models; Computational modeling; Correlation; Data mining; Data models; Mathematical model; Vectors; data mining; entity; topic models
【Paper Link】 【Pages】:359-368
【Authors】: Flip Korn ; Ruilin Liu ; Wendy Hui Wang
【Abstract】: In many networks including Internet Service Providers, transportation monitoring systems and the electric grid, measurements from a set of objects are continuously taken over time and used for important decisions such as provisioning and pricing. It is therefore vital to understand the completeness and reliability of such data. Given the large volume of data generated by these systems, rather than enumerating the times and objects incurring missing or spurious data, it is more effective to provide patterns (groups of tuples) concisely summarizing trends that may not otherwise be readily observable. In this paper, we define the Graph Tableau Discovery Problem where the measured tuples can be thought of as edges in a bipartite graph on an ordered attribute (time) and an unordered attribute (object identifiers). We define the problem of finding an optimal summary, show that it is NP-complete, and then provide a polynomial-time approximation algorithm with guarantees to find a good summary. Experiments on real and synthetic data demonstrate the effectiveness and efficiency of our approach.
【Keywords】: computational complexity; data handling; graph theory; polynomial approximation; Internet service providers; NP-complete; data completeness; electric grid; graph tableau discovery problem; network monitoring systems; polynomial-time approximation algorithm; pricing; transportation monitoring systems; Approximation algorithms; Approximation methods; Greedy algorithms; Loss measurement; Monitoring; Silicon; Time measurement
【Paper Link】 【Pages】:369-378
【Authors】: Hardy Kremer ; Stephan Günnemann ; Arne Held ; Thomas Seidl
【Abstract】: Mining temporal multivariate data by clustering is an important research topic. In today's complex data, interesting patterns are often neither bound to the whole dimensional nor temporal extent of the data domain. This challenge is met by temporal subspace clustering methods. Their effectiveness, however, is impeded by aspects unavoidable in real world data: Misalignments between time series, for example caused by out-of-sync sensors, and measurement errors. Under these conditions, existing temporal subspace clustering approaches miss the patterns contained in the data. In this paper, we propose a novel clustering method that mines temporal subspace clusters reflected by sets of objects and relevant intervals. We enable flexible handling of misaligned time series by adaptively shifting time series in the time domain, and we achieve robustness to measurement errors by allowing certain fractions of deviating values in each relevant point in time. We show the effectiveness of our method in experiments on real and synthetic data.
【Keywords】: data mining; pattern clustering; time series; data domain; measurement errors; out-of-sync sensors; temporal multivariate data mining; temporal subspace clustering methods; time series; Clustering algorithms; Clustering methods; Data mining; Robustness; Time measurement; Time series analysis; Vectors
【Paper Link】 【Pages】:379-388
【Authors】: Hans-Peter Kriegel ; Peer Kröger ; Erich Schubert ; Arthur Zimek
【Abstract】: In this paper, we propose a novel outlier detection model to find outliers that deviate from the generating mechanisms of normal instances by considering combinations of different subsets of attributes, as they occur when there are local correlations in the data set. Our model enables to search for outliers in arbitrarily oriented subspaces of the original feature space. We show how in addition to an outlier score, our model also derives an explanation of the outlierness that is useful in investigating the results. Our experiments suggest that our novel method can find different outliers than existing work and can be seen as a complement of those approaches.
【Keywords】: data mining; statistical analysis; unsupervised learning; arbitrarily oriented subspace; data mining; normal instance mechanism; outlier detection model; outlier score; Correlation; Covariance matrix; Data models; Eigenvalues and eigenfunctions; Feature extraction; Principal component analysis; Vectors; anomaly detection; correlation; data mining; outlier detection
【Paper Link】 【Pages】:389-398
【Authors】: Himabindu Lakkaraju ; Indrajit Bhattacharya ; Chiranjib Bhattacharyya
【Abstract】: We study the problem of analyzing influence of various factors affecting individual messages posted in social media. The problem is challenging because of various types of influences propagating through the social media network that act simultaneously on any user. Additionally, the topic composition of the influencing factors and the susceptibility of users to these influences evolve over time. This problem has not been studied before, and off-the-shelf models are unsuitable for this purpose. To capture the complex interplay of these various factors, we propose a new non-parametric model called the Dynamic Multi-Relational Chinese Restaurant Process. This accounts for the user network for data generation and also allows the parameters to evolve over time. Designing inference algorithms for this model suited for large scale social-media data is another challenge. To this end, we propose a scalable and multi-threaded inference algorithm based on online Gibbs Sampling. Extensive evaluations on large-scale Twitter and Face book data show that the extracted topics when applied to authorship and commenting prediction outperform state-of-the-art baselines. More importantly, our model produces valuable insights on topic trends and user personality trends beyond the capability of existing approaches.
【Keywords】: data mining; inference mechanisms; sampling methods; social networking (online); stochastic processes; Facebook data; data generation; dynamic multirelational Chinese restaurant process; large scale social-media data; large-scale Twitter; multithreaded inference algorithm; nonparametric model; off-the-shelf model; online Gibbs Sampling; scalable inference algorithm; social media; topic composition; topic trend; user personality trend; Algorithm design and analysis; Analytical models; Data models; Equations; Inference algorithms; Mathematical model; Media; Non-parametric Modeling; Parallel Inference; Social Media Analysis
【Paper Link】 【Pages】:399-408
【Authors】: Bin Li ; Xingquan Zhu ; Lianhua Chi ; Chengqi Zhang
【Abstract】: Most studies on graph classification focus on designing fast and effective kernels. Several fast subtree kernels have achieved a linear time-complexity w.r.t. the number of edges under the condition that a common feature space (e.g., a subtree pattern list) is needed to represent all graphs. This will be infeasible when graphs are presented in a stream with rapidly emerging subtree patterns. In this case, computing a kernel matrix for graphs over the entire stream is difficult since the graphs in the expired chunks cannot be projected onto the unlimitedly expanding feature space again. This leads to a big trouble for graph classification over streams - Different portions of graphs have different feature spaces. In this paper, we aim to enable large-scale graph classification over streams using the classical ensemble learning framework, which requires the data in different chunks to be in the same feature space. To this end, we propose a Nested Subtree Hashing (NSH) algorithm to recursively project the multi-resolution subtree patterns of different chunks onto a set of common low-dimensional feature spaces. We theoretically analyze the derived NSH kernel and obtain a number of favorable properties: 1) The NSH kernel is an unbiased and highly concentrated estimator of the fast subtree kernel. 2) The bound of convergence rate tends to be tighter as the NSH algorithm steps into a higher resolution. 3) The NSH kernel is robust in tolerating concept drift between chunks over a stream. We also empirically test the NSH kernel on both a large-scale synthetic graph data set and a real-world chemical compounds data set for anticancer activity prediction. The experimental results validate that the NSH kernel is indeed efficient and robust for graph classification over streams.
【Keywords】: computational complexity; file organisation; learning (artificial intelligence); pattern classification; trees (mathematics); ensemble learning; large-scale graph classification; linear time-complexity; multi-resolution subtree patterns; nested subtree hash kernels; Chemical compounds; Convergence; Data mining; Feature extraction; Kernel; Robustness; Vectors; Graph classification; data stream mining; graph hash kernels; nested subtree hashing
【Paper Link】 【Pages】:409-418
【Authors】: Jun Li ; Peng Zhang ; Yanan Cao ; Ping Liu ; Li Guo
【Abstract】: Behavior targeting (BT) is a promising tool for online advertising. The state-of-the-art BT methods, which are mainly based on regression models, have two limitations. First, learning regression models for behavior targeting is difficult since user clicks are typically several orders of magnitude fewer than views. Second, the user interests are not fixed, but often transient and influenced by media and pop culture. In this paper, we propose to formulate behavior targeting as a classification problem. Specifically, we propose to use an SVM ensemble for behavior prediction. The challenge of using ensemble SVM for BT stems from the computational complexity (it takes 53 minutes in our experiments to predict behavior for 32 million users, which is inadequate for online application). To this end, we propose a fast ensemble SVM prediction framework, which builds an indexing structure for SVM ensemble to achieve sub-linear prediction time complexity. Experimental results on real-world large scale behavior targeting data demonstrate that the proposed method is efficient and outperforms existing linear regression based BT models.
【Keywords】: advertising data processing; behavioural sciences; computational complexity; indexing; learning (artificial intelligence); pattern classification; regression analysis; support vector machines; BT method; SVM ensemble indexing; behavior prediction; classification problem; computational complexity; indexing structure; learning regression model; online advertising; real-world large scale behavior targeting data; sublinear prediction time complexity; Conferences; Data mining; Behavior Targeting; SVM index; ensemble SVM
【Paper Link】 【Pages】:419-428
【Authors】: Wen Li ; Lixin Duan ; Ivor Wai-Hung Tsang ; Dong Xu
【Abstract】: We propose a multi-view learning approach called co-labeling which is applicable for several machine learning problems where the labels of training samples are uncertain, including semi-supervised learning (SSL), multi-instance learning (MIL) and max-margin clustering (MMC). Particularly, we first unify those problems into a general ambiguous problem in which we simultaneously learn a robust classifier as well as find the optimal training labels from a finite label candidate set. To effectively utilize multiple views of data, we then develop our co-labeling approach for the general multi-view ambiguous problem. In our work, classifiers trained on different views can teach each other by iteratively passing the predictions of training samples from one classifier to the others. The predictions from one classifier are considered as label candidates for the other classifiers. To train a classifier with a label candidate set for each view, we adopt the Multiple Kernel Learning (MKL) technique by constructing the base kernel through associating the input kernel calculated from input features with one label candidate. Compared with the traditional co-training method which was specifically designed for SSL, the advantages of our co-labeling are two-fold: 1) it can be applied to other ambiguous problems such as MIL and MMC, 2) it is more robust by using the MKL method to integrate multiple labeling candidates obtained from different iterations and biases. Promising results on several real-world multi-view data sets clearly demonstrate the effectiveness of our proposed co-labeling for both MIL and SSL.
【Keywords】: learning (artificial intelligence); pattern classification; pattern clustering; set theory; MIL; MKL technique; MMC; SSL; ambiguous problems; colabeling approach; finite label candidate set; input kernel; machine learning problems; max-margin clustering; multiinstance learning; multiple kernel learning technique; multiple labeling candidates; multiview ambiguous problem; multiview learning approach; optimal training labels; real-world multiview data sets; robust classifier; semisupervised learning; training samples; Kernel; Labeling; Prediction algorithms; Robustness; Semisupervised learning; Supervised learning; Training; TBIR; ambiguous learning; multi-instance learning; multiple kernel learning; semi-supervised learning
【Paper Link】 【Pages】:429-438
【Authors】: Frank Lin ; William W. Cohen
【Abstract】: Spectral clustering methods are elegant and effective graph-based node clustering methods, but they do not allow mixed membership clustering. We describe an approach that first transforms the data from a node-centric representation to an edge-centric one, and then use this representation to define a scalable and competitive mixed membership alternative to spectral clustering methods. Experimental results show the proposed approach improves substantially in mixed membership clustering tasks over node clustering methods.
【Keywords】: graph theory; pattern clustering; competitive mixed membership clustering; edge-centric representation; general approach; graph-based node clustering methods; node-centric representation; scalable approach; scalable mixed membership; spectral clustering methods; Bipartite graph; Clustering algorithms; Clustering methods; Communities; Social network services; Sparse matrices; Vectors; clustering; scalable methods; unsupervised learning; large scale learning; mixed membership clustering
【Paper Link】 【Pages】:439-448
【Authors】: Bo Liu ; Gao Cong ; Dong Xu ; Yifeng Zeng
【Abstract】: Influence maximization is a fundamental research problem in social networks. Viral marketing, one of its applications, is to get a small number of users to adopt a product, which subsequently triggers a large cascade of further adoptions by utilizing "Word-of-Mouth" effect in social networks. Influence maximization problem has been extensively studied recently. However, none of the previous work considers the time constraint in the influence maximization problem. In this paper, we propose the time constrained influence maximization problem. We show that the problem is NP-hard, and prove the monotonicity and submodularity of the time constrained influence spread function. Based on this, we develop a greedy algorithm with performance guarantees. To improve the algorithm scalability, we propose two Influence Spreading Path based methods. Extensive experiments conducted over four public available datasets demonstrate the efficiency and effectiveness of the Influence Spreading Path based methods.
【Keywords】: greedy algorithms; optimisation; social networking (online); NP hard; greedy algorithm; influence spreading path based method; monotonicity; social networks; submodularity; time constrained influence maximization problem; time constrained influence spread function; viral marketing; Approximation algorithms; Delay; Greedy algorithms; Integrated circuit modeling; Social network services; Time factors
【Paper Link】 【Pages】:449-458
【Authors】: Chuanren Liu ; Hui Xiong ; Yong Ge ; Wei Geng ; Matt Perkins
【Abstract】: Rapid growth in the development of real-time location system solutions has led to an increased interest in indoor location-aware services, such as hospital asset management. Although there are extensive studies in the literature on the analysis of outdoor location traces, the studies of indoor location traces are less touched and fragmented. To that end, in this paper, we provide a focused study of indoor location traces collected by the sensors attached to medical devices in a hospital environment. Along this line, we first introduce some unique properties of these indoor location traces. We show that they can capture the movement patterns of the medical devices, which are tightly coupled with the work flow in the controlled hospital environment. Based on this observation, we propose a stochastic model for context-aware anomaly detection in indoor location traces, which exploits the hospital work flow and models the movements of medical devices as transitions in finite state machines. In detail, we first develop a density-based method to identify the hotspots filled with high-level abnormal activities in the indoor environment. The discovered hotspots serve as the context for nearby trajectories. Then, we introduce an N-gram based method for measuring the degree of anomaly based on the detected hotspots, which is able to predict the missing events possibly due to the devices being stolen. Besides, to address the noisy nature of the indoor sensor networks, we also propose an iterative algorithm to estimate the transition probabilities. This algorithm allows to effectively recover the missing location records which are critical for the abnormality estimation. Finally, the experimental results on the real-world date sets validate the effectiveness of the proposed context-aware anomaly detection method for identifying abnormal events.
【Keywords】: data mining; finite state machines; hospitals; indoor radio; iterative methods; mobile computing; real-time systems; N-gram based method; context-aware anomaly detection; density-based method; finite state machine; hospital asset management; hospital environment; hotspots; indoor location traces; indoor sensor networks; iterative algorithm; medical device; real-time location system; stochastic model; transition probability; Buildings; Clustering algorithms; Context modeling; Hospitals; Sensors; Stochastic processes; Trajectory; Anomaly Detection; Indoor Trajectory; Stochastic Modeling
【Paper Link】 【Pages】:459-468
【Authors】: Lin Liu ; Ruoming Jin ; Charu C. Aggarwal ; Yelong Shen
【Abstract】: Many graphs in practical applications are not deterministic, but are probabilistic in nature because the existence of the edges is inferred with the use of a variety of statistical approaches. In this paper, we will examine the problem of clustering uncertain graphs. Uncertain graphs are best clustered with the use of a possible worlds model in which the most reliable clusters are discovered in the presence of uncertainty. Reliable clusters are those which are not likely to be disconnected in the context of different instantiations of the uncertain graph. We present experimental results which illustrate the effectiveness of our model and approach.
【Keywords】: graph theory; pattern clustering; statistical analysis; uncertain systems; reliable clustering; reliable clusters; statistical approaches; uncertain graphs; worlds model; Channel coding; Clustering algorithms; Equations; Linear programming; Reliability; clustering; reliability; uncertain graph
【Paper Link】 【Pages】:469-478
【Authors】: Xutong Liu ; Feng Chen ; Chang-Tien Lu
【Abstract】: Spatial kriging is a widely used predictive model for spatial datasets. In spatial kriging model, the observations are assumed to be Gaussian for computational convenience. However, its predictive accuracy could be significantly compromised if the observations are contaminated by outliers. This deficiency can be systematically addressed by increasing the robustness of spatial kriging model using heavy tailed distributions, such as the Huber, Laplace, and Student's t distributions. This paper presents a novel Robust and Reduced Rank Spatial Kriging Model (R3-SKM), which is resilient to the influences of outliers and allows for fast spatial inference. Furthermore, three effective and efficient algorithms are proposed based on R3-SKM framework that can perform robust parameter estimation, spatial prediction, and spatial outlier detection with a linear-order time complexity. Extensive experiments on both simulated and real data sets demonstrated the robustness and efficiency of our proposed techniques.
【Keywords】: Gaussian processes; computational complexity; inference mechanisms; parameter estimation; prediction theory; statistical distributions; Gaussian; Huber distribution; Laplace distribution; R3-SKM; linear-order time complexity; parameter estimation; predictive accuracy; predictive model; reduced rank spatial kriging model; robust prediction; spatial dataset; spatial inference; spatial outlier detection; spatial prediction; student t distribution; Approximation algorithms; Approximation methods; Gaussian approximation; Robustness; Spatial databases; Vectors; Laplace Approximation; Outlier Detection; Robust Estimation
【Paper Link】 【Pages】:479-488
【Authors】: Wei Lu ; Laks V. S. Lakshmanan
【Abstract】: Influence maximization is the problem of finding a set of influential users in a social network such that the expected spread of influence under a certain propagation model is maximized. Much of the previous work has neglected the important distinction between social influence and actual product adoption. However, as recognized in the management science literature, an individual who gets influenced by social acquaintances may not necessarily adopt a product (or technology), due, e.g., to monetary concerns. In this work, we distinguish between influence and adoption by explicitly modeling the states of being influenced and of adopting a product. We extend the classical Linear Threshold (LT) model to incorporate prices and valuations, and factor them into users' decision-making process of adopting a product. We show that the expected profit function under our proposed model maintains submodularity under certain conditions, but no longer exhibits monotonicity, unlike the expected influence spread function. To maximize the expected profit under our extended LT model, we employ an unbudgeted greedy framework to propose three profit maximization algorithms. The results of our detailed experimental study on three real-world datasets demonstrate that of the three algorithms, PAGE, which assigns prices dynamically based on the profit potential of each candidate seed, has the best performance both in the expected profit achieved and in running time.
【Keywords】: greedy algorithms; marketing; optimisation; pricing; profitability; social networking (online); PAGE algorithm; expected influence spread function; expected profit function; linear threshold model; monetary concern; network propagation model; price; product adoption; profit maximization; social acquaintance; social influence; social network; unbudgeted greedy framework; valuation; viral marketing; Computational modeling; Cost accounting; Integrated circuit modeling; Pricing; Social network services; Vectors; Profit maximization; influence maximization; social networks; viral marketing
【Paper Link】 【Pages】:489-498
【Authors】: Dijun Luo ; Chris H. Q. Ding ; Heng Huang
【Abstract】: We propose a nontrivial strategy to parallelize a series of data mining and machine learning problems, including 1-class and 2-class support vector machines, nonnegative least square problems, and $ell_1$ regularized regression (LASSO) problems. Our strategy fortunately leads to extremely simple multiplicative algorithms which can be straightforwardly implemented in parallel computational environments, such as Map Reduce, or CUDA. We provide rigorous analysis of the correctness and convergence of the algorithm. We demonstrate the scalability and accuracy of our algorithms in comparison with other current leading algorithms.
【Keywords】: data mining; learning (artificial intelligence); regression analysis; support vector machines; 1-class support vector machine; 2-class support vector machine; CUDA; LASSO problem; Map Reduce; data mining; machine learning problem; multiplicative algorithm; nonnegative least square problem; nontrivial strategy; parallel computational environment; regularized regression; Algorithm design and analysis; Convergence; Data mining; Graphics processing units; Machine learning algorithms; Optimization; Support vector machines; Big Data; CUDA; LASSO; MapReduce; Support Vector Machine
【Paper Link】 【Pages】:499-508
【Authors】: Michael Mampaey ; Siegfried Nijssen ; Ad Feelders ; Arno J. Knobbe
【Abstract】: Subgroup discovery systems are concerned with finding interesting patterns in labeled data. How these systems deal with numeric and nominal data has a large impact on the quality of their results. In this paper, we consider two ways to extend the standard pattern language of subgroup discovery: using conditions that test for interval membership for numeric attributes, and value set membership for nominal attributes. We assume a greedy search setting, that is, iteratively refining a given subgroup, with respect to a (convex) quality measure. For numeric attributes, we propose an algorithm that finds the optimal interval in linear (rather than quadratic) time, with respect to the number of examples and split points. Similarly, for nominal attributes, we show that finding the optimal set of values can be achieved in linear (rather than exponential) time, with respect to the number of examples and the size of the domain of the attribute. These algorithms operate by only considering subgroup refinements that lie on a convex hull in ROC space, thus significantly narrowing down the search space. We further provide efficient algorithms specifically for the popular Weighted Relative Accuracy quality measure, taking advantage of some of its properties. Our algorithms are shown to perform well in practice, and furthermore provide additional expressive power leading to higher-quality results.
【Keywords】: data mining; ROC space; convex hull; greedy search setting; interesting patterns; interval membership; labeled data; linear time; nominal attributes; nominal data; numeric attributes; numeric data; standard pattern language; subgroup descriptions; subgroup discovery systems; value set membership; weighted relative accuracy quality measure; Accuracy; Complexity theory; Context; Data mining; Decision trees; Gain measurement; Weight measurement; ROC analysis; convex functions; nominal data; numeric data; subgroup discovery
【Paper Link】 【Pages】:509-518
【Authors】: Hossein Maserrat ; Jian Pei
【Abstract】: Compression plays an important role in social network analysis from both practical and theoretical points of view. Although there are a few pioneering studies on social network compression, they mainly focus on lossless approaches. In this paper, we tackle the novel problem of community preserving lossy compression of social networks. The trade-off between space and information preserved in a lossy compression presents an interesting angle for social network analysis, and, at the same time, makes the problem very challenging. We propose a sequence graph compression approach, discuss the design of objective functions towards community preservation, and present an interesting and practically effective greedy algorithm. Our experimental results on both real data sets and synthetic data sets demonstrate the promise of our method.
【Keywords】: data compression; graph theory; greedy algorithms; social networking (online); community preserving lossy compression; greedy algorithm; lossless approaches; objective functions; sequence graph compression approach; social network analysis; synthetic data sets; Algorithm design and analysis; Communities; Educational institutions; Equations; Linear programming; Noise; Social network services; communities; compression; social networks
【Paper Link】 【Pages】:519-528
【Authors】: Pauli Miettinen
【Abstract】: Boolean matrix factorization is a method to decompose a binary matrix into two binary factor matrices. Akin to other matrix factorizations, the factor matrices can be used for various data analysis tasks. Many (if not most) real-world data sets are dynamic, though, meaning that new information is recorded over time. Incorporating this new information into the factorization can require a re-computation of the factorization -- something we cannot do if we want to keep our factorization up-to-date after each update. This paper proposes a method to dynamically update the Boolean matrix factorization when new data is added to the data base. This method is extended with a mechanism to improve the factorization with a trade-off in speed of computation. The method is tested with a number of real-world and synthetic data sets including studying its efficiency against off-line methods. The results show that with good initialization the proposed online and dynamic methods can beat the state-of-the-art offline Boolean matrix factorization algorithms.
【Keywords】: Boolean algebra; data analysis; database management systems; matrix decomposition; binary factor matrices; binary matrix; data analysis tasks; dynamic Boolean matrix factorizations; dynamic methods; off-line methods; online methods; real-world data sets; recomputation; state-of-the-art offline Boolean matrix factorization algorithms; synthetic data sets; Algorithm design and analysis; Approximation methods; Data mining; Encoding; Heuristic algorithms; Matrices; Vectors; Boolean matrix factorization; Dynamic algorithms; On-line algorithms
【Paper Link】 【Pages】:529-538
【Authors】: Emmanuel Müller ; Ira Assent ; Patricia Iglesias Sánchez ; Yvonne Mülle ; Klemens Böhm
【Abstract】: Outlier mining is an important task for finding anomalous objects. In practice, however, there is not always a clear distinction between outliers and regular objects as objects have different roles w.r.t. different attribute sets. An object may deviate in one subspace, i.e. a subset of attributes. And the same object might appear perfectly regular in other subspaces. One can think of subspaces as multiple views on one database. Traditional methods consider only one view (the full attribute space). Thus, they miss complex outliers that are hidden in multiple subspaces. In this work, we propose Outrank, a novel outlier ranking concept. Outrank exploits subspace analysis to determine the degree of outlierness. It considers different subsets of the attributes as individual outlier properties. It compares clustered regions in arbitrary subspaces and derives an outlierness score for each object. Its principled integration of multiple views into an outlierness measure uncovers outliers that are not detectable in the full attribute space. Our experimental evaluation demonstrates that Outrank successfully determines a high quality outlier ranking, and outperforms state-of-the-art outlierness measures.
【Keywords】: data analysis; data mining; OutRank; data mining; database; outlier ranking concept; outlierness degree determination; outlierness score; subspace analysis; Abstracts; Clustering algorithms; Conferences; Data mining; Databases; Educational institutions; Thyristors; clusterings; multiple subspaces; outlier ranking
【Paper Link】 【Pages】:539-548
【Authors】: Seth A. Myers ; Jure Leskovec
【Abstract】: In networks, contagions such as information, purchasing behaviors, and diseases, spread and diffuse from node to node over the edges of the network. Moreover, in real-world scenarios multiple contagions spread through the network simultaneously. These contagions not only propagate at the same time but they also interact and compete with each other as they spread over the network. While traditional empirical studies and models of diffusion consider individual contagions as independent and thus spreading in isolation, we study how different contagions interact with each other as they spread through the network. We develop a statistical model that allows for competition as well as cooperation of different contagions in information diffusion. Competing contagions decrease each other's probability of spreading, while cooperating contagions help each other in being adopted throughout the network. We evaluate our model on 18,000 contagions simultaneously spreading through the Twitter network. Our model learns how different contagions interact with each other and then uses these interactions to more accurately predict the diffusion of a contagion through the network. Moreover, the model also provides a compelling hypothesis for the principles that govern content interaction in information diffusion. Most importantly, we find very strong effects of interactions between contagions. Interactions cause a relative change in the spreading probability of a contagion by 71% on the average.
【Keywords】: gradient methods; information management; pattern clustering; probability; social networking (online); Twitter network; contagion competition; contagion cooperation; information diffusion; network contagion; spreading probability; statistical model; Context; Data models; Mathematical model; Media; Predictive models; Twitter; Twitter; competing contagions; information diffusion; social media
【Paper Link】 【Pages】:549-558
【Authors】: Ankur Narang ; Abhinav Srivastava ; Naga Praveen Kumar Katta
【Abstract】: Big data analytics is a hot research area both in academia and industry. It envisages processing massive amounts of data at high rates to generate new insights leading to positive impact (for both users and providers) of industries such as E-commerce, Telecom, Finance, Life Sciences and so forth. We consider collaborative filtering (CF) and Clustering algorithms that are key fundamental analytics kernels that help in achieving these aims. High throughput CF and co-clustering on highly sparse and massive datasets, along with a high prediction accuracy, is a computationally challenging problem. In this paper, we present a novel hierarchical design for soft real-time (less than 1-minute.) distributed co-clustering based collaborative filtering algorithm. We study both the online and offline variants of this algorithm. Theoretical analysis of the time complexity of our algorithm proves the efficacy of our approach. Further, we present the impact of load balancing based optimizations on multi-core cluster architectures. Using the Netflix dataset(900M training ratings with replication) as well as the Yahoo KDD Cup(2.3B training ratings with replication) datasets, we demonstrate the performance and scalability of our algorithm on a large multi-core cluster architecture. In offline mode, our distributed algorithm demonstrates around 4x better performance (on Blue Gene/P) as compared to the best prior work, along with high accuracy. In online mode, we demonstrated around 3x better performance compared to baseline MPI implementation. To the best of our knowledge, our algorithm provides the best known online and offline performance and scalability results with high accuracy on multi-core cluster architectures.
【Keywords】: collaborative filtering; computational complexity; distributed algorithms; multiprocessing systems; resource allocation; Netflix dataset; analytics kernels; clustering algorithm; data analytics; distributed algorithm; high performance offline distributed collaborative filtering; load balancing based optimization; multicore cluster architecture; online distributed collaborative filtering algorithm; time complexity; Algorithm design and analysis; Approximation methods; Clustering algorithms; Collaboration; Matrix decomposition; Partitioning algorithms; Training; Distributed Collaborative Filtering; Parallel Performance Optimizations; Performance & Scalability Analysis
【Paper Link】 【Pages】:559-565
【Authors】: Eric Nichols ; Charles DuHadway ; Hrishikesh B. Aradhye ; Richard F. Lyon
【Abstract】: Online video presents a great opportunity for up-and-coming singers and artists to be visible to a worldwide audience. However, the sheer quantity of video makes it difficult to discover promising musicians. We present a novel algorithm to automatically identify talented musicians using machine learning and acoustic analysis on a large set of "home singing" videos. We describe how candidate musician videos are identified and ranked by singing quality. To this end, we present new audio features specifically designed to directly capture singing quality. We evaluate these vis-a-vis a large set of generic audio features and demonstrate that the proposed features have good predictive performance. We also show that this algorithm performs well when videos are normalized for production quality.
【Keywords】: acoustic signal processing; learning (artificial intelligence); music; social networking (online); video signal processing; YouTube videos; acoustic analysis; home singing videos; machine learning; online video; talented musicians discovery; Feature extraction; Histograms; Standards; Tuning; Vectors; Videos; YouTube; YouTube; intonation; melody; music; singing; talent discovery; video
【Paper Link】 【Pages】:566-574
【Authors】: Feiping Nie ; Hua Wang ; Xiao Cai ; Heng Huang ; Chris H. Q. Ding
【Abstract】: The low-rank matrix completion problem is a fundamental machine learning problem with many important applications. The standard low-rank matrix completion methods relax the rank minimization problem by the trace norm minimization. However, this relaxation may make the solution seriously deviate from the original solution. Meanwhile, most completion methods minimize the squared prediction errors on the observed entries, which is sensitive to outliers. In this paper, we propose a new robust matrix completion method to address these two problems. The joint Schatten p-norm and ℓp-norm are used to better approximate the rank minimization problem and enhance the robustness to outliers. The extensive experiments are performed on both synthetic data and real world applications in collaborative filtering and social network link prediction. All empirical results show our new method outperforms the standard matrix completion methods.
【Keywords】: collaborative filtering; learning (artificial intelligence); minimisation; social networking (online); ℓp-norm minimization; Schatten p-norm; collaborative filtering; low-rank matrix completion problem; machine learning; outlier; rank minimization problem; social network link prediction; squared prediction error; trace norm minimization; Collaboration; Joints; Minimization; Motion pictures; Robustness; Social network services; Standards; low-rank matrix recovery; matrix completion; optimization; recommendation system
【Paper Link】 【Pages】:575-584
【Authors】: Hidekazu Oiwa ; Shin Matsushima ; Hiroshi Nakagawa
【Abstract】: We propose a new truncation framework for online supervised learning. Learning a compact predictive model in an online setting has recently attracted a great deal of attention. The combination of online learning with sparsity-inducing regularization enables faster learning with a smaller memory space than a conventional learning framework. However, a simple combination of these triggers the truncation of weights whose corresponding features rarely appear, even if these features are crucial for prediction. Furthermore, it is difficult to emphasize these features in advance while preserving the advantages of online learning. We develop an extensional truncation framework to Dual Averaging, which retains rarely occurring but informative features. Our proposed framework integrates information on all previous sub gradients of the loss functions into a regularization term. Our enhancement of a conventional L1-regularization accomplishes the automatic adjustment of each feature's truncations. This extension enables us to identify and retain rare but informative features without preprocessing. In addition, our framework achieves the same computational complexity and regret bound as standard Dual Averaging. Experiments demonstrated that our framework outperforms other sparse online learning algorithms.
【Keywords】: computational complexity; learning (artificial intelligence); compact predictive model; computational complexity; conventional L1-regularization; dual averaging; extensional truncation framework; healing truncation bias; informative features; loss functions; online supervised learning; regularization term; self-weighted truncation framework; sparsity-inducing regularization; Educational institutions; Equations; Indexes; Optimization; Prediction algorithms; Predictive models; Vectors; Feature Selection; Online Learning; Sentiment Analysis; Sparsity-inducing Regularization; Supervised Learning
【Paper Link】 【Pages】:585-594
【Authors】: Tapio Pahikkala ; Antti Airola ; Fabian Gieseke ; Oliver Kramer
【Abstract】: Regularized least-squares classification is one of the most promising alternatives to standard support vector machines, with the desirable property of closed-form solutions that can be obtained analytically, and efficiently. While the supervised, and mostly binary case has received tremendous attention in recent years, unsupervised multi-class settings have not yet been considered. In this work we present an efficient implementation for the unsupervised extension of the multi-class regularized least-squares classification framework, which is, to the best of the authors' knowledge, the first one in the literature addressing this task. The resulting kernel-based framework efficiently combines steepest descent strategies with powerful meta-heuristics for avoiding local minima. The computational efficiency of the overall approach is ensured through the application of matrix algebra shortcuts that render efficient updates of the intermediate candidate solutions possible. Our experimental evaluation indicates the potential of the novel method, and demonstrates its superior clustering performance over a variety of competing methods on real-world data sets.
【Keywords】: gradient methods; least squares approximations; matrix algebra; pattern classification; pattern clustering; support vector machines; closed-form solutions; clustering performance; kernel-based framework; matrix algebra shortcuts; meta-heuristics; steepest descent strategies; support vector machines; unsupervised multiclass regularized least-squares classification; Clustering algorithms; Kernel; Optimization; Support vector machines; Training; Unsupervised learning; Vectors; Maximum Margin Clustering; Multi-Class Regularized Least-Squares Classification; Unsupervised Learning
【Paper Link】 【Pages】:595-604
【Authors】: Bei Pan ; Ugur Demiryurek ; Cyrus Shahabi
【Abstract】: For the first time, real-time high-fidelity spatiotemporal data on transportation networks of major cities have become available. This gold mine of data can be utilized to learn about traffic behavior at different times and locations, potentially resulting in major savings in time and fuel, the two important commodities of 21st century. As a first step towards the utilization of this data, in this paper, we study the real-world data collected from Los Angeles County transportation network in order to incorporate the data's intrinsic behavior into a time-series mining technique to enhance its accuracy for traffic prediction. In particular, we utilized the spatiotemporal behaviors of rush hours and events to perform a more accurate prediction of both short-term and long-term average speed on road-segments, even in the presence of infrequent events (e.g., accidents). Our result shows that taking historical rush-hour behavior we can improve the accuracy of traditional predictors by up to 67% and 78% in short-term and long-term predictions, respectively. Moreover, we can incorporate the impact of an accident to improve the prediction accuracy by up to 91%.
【Keywords】: data mining; road accidents; road traffic; time series; traffic engineering computing; Los Angeles County; accident impact; data mining; data utilization; long-term prediction; prediction accuracy; real-world transportation data; rush hour traffic behavior; short-term prediction; spatiotemporal data; time-series mining technique; traffic behavior; traffic prediction; transportation network; Accidents; Accuracy; Autoregressive processes; Data mining; Data models; Predictive models; Transportation; event impact analysis; time-series mining; traffic prediction; transportation data
【Paper Link】 【Pages】:605-614
【Authors】: Debprakash Patnaik ; Srivatsan Laxman ; Badrish Chandramouli ; Naren Ramakrishnan
【Abstract】: Discovering frequent episodes over event sequences is an important data mining problem. Existing methods typically require multiple passes over the data, rendering them unsuitable for streaming contexts. We present the first streaming algorithm for mining frequent episodes over a window of recent events in the stream. We derive approximation guarantees for our algorithm in terms of: (i) the separation of frequent episodes from infrequent ones, and (ii) the rate of change of stream characteristics. Our parameterization of the problem provides a new sweet spot in the tradeoff between making distributional assumptions over the stream and algorithmic efficiencies of mining. We illustrate how this yields significant benefits when mining practical streams from neuroscience and telecommunications logs.
【Keywords】: data mining; data mining; dynamic event streams; efficient episode mining; event sequences; neuroscience; telecommunications logs; Algorithm design and analysis; Approximation algorithms; Approximation methods; Data mining; Electronic mail; Frequency shift keying; Photonic band gap; Approximation Algorithms; Data Streams; Event Sequences; Frequent Episodes; Pattern Discovery; Streaming Algorithms
【Paper Link】 【Pages】:615-624
【Authors】: Mitat Poyraz ; Zeynep Hilal Kilimci ; Murat Can Ganiz
【Abstract】: It has been shown that Latent Semantic Indexing (LSI) takes advantage of implicit higher-order (or latent) structure in the association of terms and documents. Higher order relations in LSI capture "latent semantics". Inspired by this, a novel Bayesian framework for classification named Higher Order Naïve Bayes (HONB), which can explicitly make use of these higher-order relations, has been introduced previously. We present a novel semantic smoothing method named Higher Order Smoothing (HOS) for the Naive Bayes algorithm. HOS is built on a similar graph based data representation of HONB which allows semantics in higher order paths to be exploited. Additionally, we take the concept one step further in HOS and exploited the relationships between instances of different classes in order to improve the parameter estimation when dealing with insufficient labeled data. As a result, we have not only been able to move beyond instance boundaries, but also class boundaries to exploit the latent information in higher-order paths. The results of our extensive experiments demonstrate the value of HOS on several benchmark datasets.
【Keywords】: Bayes methods; indexing; pattern classification; smoothing methods; text analysis; Bayesian framework; HONB; HOS; LSI; higher order naïve Bayes; higher order paths; higher order smoothing; implicit higher-order structure; latent semantic indexing; semantic smoothing method; text classification; Classification algorithms; Niobium; Semantics; Smoothing methods; Support vector machines; Text categorization; Training; Higher Order Naive Bayes; Higher Order Smoothing; Naive Bayes; Semantic Smoothing; Text Classification
【Paper Link】 【Pages】:625-634
【Authors】: Jana Schmidt ; Stefan Kramer
【Abstract】: Probabilistic real time automata (PRTAs) are a representation of dynamic processes arising in the sciences and industry. Currently, the induction of automata is divided into two steps: the creation of the prefix tree acceptor (PTA) and the merge procedure based on clustering of the states. These two steps can be very time intensive when a PRTA is to be induced for massive or even unbounded data sets. The latter one can be efficiently processed, as there exist scalable online clustering algorithms. However, the creation of the PTA still can be very time consuming. To overcome this problem, we propose a genuine online PRTA induction approach that incorporates new instances by first collapsing them and then using a maximum frequent pattern based clustering. The approach is tested against a predefined synthetic automaton and real-world data sets, for which the approach is scalable and stable. Moreover, we present a broad evaluation on a real world disease group data set that shows the applicability of such a model to the analysis of medical processes.
【Keywords】: Internet; automata theory; diseases; medical computing; pattern clustering; disease group data set; dynamic process representation; massive data sets; maximum frequent pattern based clustering; medical process analysis; merge procedure; online PRTA induction approach; online clustering algorithms; online induction; prefix tree acceptor creation; probabilistic real time automata; real-world data sets; state clustering; synthetic automaton; unbounded data sets; Automata; Data models; Diseases; History; Merging; Probabilistic logic; Real-time systems; Probabilistic real time automata; maximum frequent pattern based clustering; online induction
【Paper Link】 【Pages】:635-644
【Authors】: Xiaoxiao Shi ; Philip S. Yu
【Abstract】: Combining correlated data sources may help improve the learning performance of a given task. For example, in recommendation problems, one can combine (1) user profile database (e.g. genders, age, etc.), (2) users' log data (e.g., clickthrough data, purchasing records, etc.), and (3) users' social network (useful in social targeting) to build a recommendation model. All these data sources provide informative but heterogeneous features. For instance, user profile database usually has nominal features reflecting users' background, log data provides term-based features about users' historical behaviors, and social network database has graph relational features. Given multiple heterogeneous data sources, one important challenge is to find a unified feature subspace that captures the knowledge from all sources. To this aim, we propose a principle of collective component analysis (CoCA), in order to handle dimensionality reduction across a mixture of vector-based features and graph relational features. The CoCA principle is to find a feature subspace with maximal variance under two constraints. First, there should be consensus among the projections from different feature spaces. Second, the similarity between connected data (in any of the network databases) should be maximized. The optimal solution is obtained by solving an eigenvalue problem. Moreover, we discuss how to use prior knowledge to distinguish informative data sources, and optimally weight them in CoCA. Since there is no previous model that can be directly applied to solve the problem, we devised a straightforward comparison method by performing dimension reduction on the concatenation of the data sources. Three sets of experiments show that CoCA substantially outperforms the comparison method.
【Keywords】: data mining; principal component analysis; social networking (online); vectors; CoCA; collective component analysis; correlated data sources; dimensionality reduction; graph relational features; heterogeneous feature space; maximal variance; multiple heterogeneous data sources; recommendation problem; social network database; social targeting; term-based feature; user historical behavior; user log data; user profile database; user social network; vector-based feature; Data models; Databases; Eigenvalues and eigenfunctions; Noise; Optimization; Principal component analysis; Social network services; component; formatting; style; styling
【Paper Link】 【Pages】:645-654
【Authors】: Mark Smith ; Steven Reece ; Stephen J. Roberts ; Iead Rezek
【Abstract】: Novelty, or abnormality, detection aims to identify patterns within data streams that do not conform to expected behaviour. This paper introduces a novelty detection technique using a combination of Gaussian Processes and extreme value theory to identify anomalous behaviour in streaming data. The proposed combination of continuous and count stochastic processes is a principled approach towards dynamic extreme value modeling that accounts for the dynamics in the time series, the streaming nature of its observation as well as its sampling process. The approach is tested on both synthetic and real data, showing itself to be effective in our primary application of maritime vessel track analysis.
【Keywords】: Gaussian processes; data handling; marine engineering; media streaming; sampling methods; time series; Gaussian processes; anomalous behaviour; count stochastic processes; data streams; dynamic extreme value modeling; extreme value theory; maritime vessel track analysis; online maritime abnormality detection; sampling process; streaming data; time series; Context; Covariance matrix; Data models; Equations; Gaussian processes; Kernel; Mathematical model; Extreme Value; Gaussian Process; Maritime Traffic; Novelty Detection; Outlier Detection
【Paper Link】 【Pages】:655-664
【Authors】: Christina Teflioudi ; Faraz Makari ; Rainer Gemulla
【Abstract】: We discuss parallel and distributed algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries. We focus on in-memory algorithms that run on a small cluster of commodity nodes, even very large problems can be handled effectively in such a setup. Our DALS, ASGD, and DSGD++ algorithms are novel variants of the popular alternating least squares and stochastic gradient descent algorithms, they exploit thread-level parallelism, in-memory processing, and asynchronous communication. We provide some guidance on the asymptotic performance of each algorithm and investigate the performance of both our algorithms and previously proposed Map Reduce algorithms in large-scale experiments. We found that DSGD++ outperforms competing methods in terms of overall runtime, memory consumption, and scalability. Using DSGD++, we can factor a matrix with 10B entries on 16 compute nodes in around 40 minutes.
【Keywords】: data mining; gradient methods; least squares approximations; parallel algorithms; ASGD algorithm; DALS algorithm; DSGD++ algorithm; Map Reduce algorithm; alternating least squares algorithm; asymptotic performance; asynchronous communication; commodity node; data mining; distributed algorithm; distributed matrix completion; in-memory algorithm; in-memory processing; large-scale matrix completion; memory consumption; parallel algorithm; stochastic gradient descent algorithm; thread-level parallelism; Algorithm design and analysis; Clustering algorithms; Convergence; Distributed algorithms; Instruction sets; Schedules; Training; ALS; parallel and distributed matrix factorization; recommender systems; stochastic gradient descent
【Paper Link】 【Pages】:665-674
【Authors】: Ze Tian ; Huanan Zhang ; Rui Kuang
【Abstract】: Detecting DNA copy number variations (CNVs) from arrayCGH or genotyping-array data to correlate with cancer outcomes is crucial for understanding the molecular mechanisms underlying cancer. Previous methods either focus on detecting CNVs in each individual patient sample or common CNVs across all the patient samples. These methods ignore the discrepancies introduced by the heterogeneity in the patient samples, which implies that common CNVs might only be shared within some groups of samples instead of all samples. In this paper, we propose a latent feature model that couples sparse sample group selection with fused lasso on CNV components to identify group-specific CNVs. Assuming a given group structure on patient samples by clinical information, sparse group selection on fused lasso (SGS-FL) identifies the optimal latent CNV components, each of which is specific to the samples in one or several groups. The group selection for each CNV component is determined dynamically by an adaptive algorithm to achieve a desired sparsity. Simulation results show that SGS-FL can more accurately identify the latent CNV components when there is a reliable underlying group structure in the samples. In the experiments on arrayCGH breast cancer and bladder cancer datasets, SGS-FL detected CNV regions that are more relevant to cancer, and provided latent feature weights that can be used for better sample classification.
【Keywords】: biology computing; cancer; learning (artificial intelligence); molecular biophysics; pattern classification; CNV component; adaptive algorithm; arrayCGH data; bladder cancer dataset; breast cancer dataset; cancer molecular mechanism; cancer outcome; clinical information; fused lasso component; genotyping-array data; group-specific DNA copy number variation; latent feature model; latent feature weight; patient sample; sample classification; sparse group selection; Arrays; Cancer; DNA; Feature extraction; Matrix decomposition; Optimization; Probes; DNA copy number variations; fused lasso; group lasso; sparse group learning
【Paper Link】 【Pages】:675-684
【Authors】: Grigorios Tzortzis ; Aristidis Likas
【Abstract】: Exploiting multiple representations, or views, for the same set of instances within a clustering framework is a popular practice for boosting clustering accuracy. However, some of the available sources may be misleading (due to noise, errors in measurement etc.) in revealing the true structure of the data, thus, their inclusion in the clustering process may have negative influence. This aspect seems to be overlooked in the multi-view literature where all representations are equally considered. In this work, views are expressed in terms of given kernel matrices and a weighted combination of the kernels is learned in parallel to the partitioning. Weights assigned to kernels are indicative of the quality of the corresponding views' information. Additionally, the combination scheme incorporates a parameter that controls the admissible sparsity of the weights to avoid extremes and tailor them to the data. Two efficient iterative algorithms are proposed that alternate between updating the view weights and recomputing the clusters to optimize the intra-cluster variance from different perspectives. The conducted experiments reveal the effectiveness of our methodology compared to other multi-view methods.
【Keywords】: iterative methods; learning (artificial intelligence); matrix algebra; pattern clustering; combination scheme; intracluster variance; iterative algorithms; kernel matrices; kernel-based weighted multiview clustering framework; learning; view information; view weights; Clustering algorithms; Convergence; Kernel; Noise measurement; Optimization; Partitioning algorithms; Training; kernel k-means; multi-view clustering; multiple kernel learning
【Paper Link】 【Pages】:685-694
【Authors】: Nguyen Xuan Vinh ; Madhu Chetty ; Ross L. Coppel ; Pramod P. Wangikar
【Abstract】: Learning optimal Bayesian networks (BN) from data is NP-hard in general. Nevertheless, certain BN classes with additional topological constraints, such as the dynamic BN (DBN) models, widely applied in specific fields such as systems biology, can be efficiently learned in polynomial time. Such algorithms have been developed for the Bayesian-Dirichlet (BD), Minimum Description Length (MDL), and Mutual Information Test (MIT) scoring metrics. The BD-based algorithm admits a large polynomial bound, hence it is impractical for even modestly sized networks. The MDL-and MIT-based algorithms admit much smaller bounds, but require a very restrictive assumption that all variables have the same cardinality, thus significantly limiting their applicability. In this paper, we first propose an improvement to the MDL-and MIT-based algorithms, dropping the equicardinality constraint, thus significantly enhancing their generality. We also explore local Markov blanket based algorithms for constructing BN in the context of DBN, and show an interesting result: under the faithfulness assumption, the mutual information test based local Markov blanket algorithms yield the same network as learned by the global optimization MIT-based algorithm. Experimental validation on small and large scale genetic networks demonstrates the effectiveness of our proposed approaches.
【Keywords】: Markov processes; belief networks; computational complexity; genetic algorithms; learning (artificial intelligence); topology; BD-based algorithm; BN classes; Bayesian-Dirichlet; DBN model; MDL-based algorithm; MIT scoring metrics; NP-hard; cardinality; dynamic BN model; genetic network; global optimization MIT-based algorithm; learning dynamic Bayesian network; local algorithm; minimum description length; mutual information test based local Markov blanket algorithm; optimal Bayesian network; polynomial bound; polynomial time; systems biology; topological constraint; Algorithm design and analysis; Bayesian methods; Measurement; Mutual information; Polynomials; Time complexity; dynamic Bayesian network; gene regulatory network; global optimization; polynomial time algorithms
【Paper Link】 【Pages】:695-704
【Authors】: Byron C. Wallace ; Issa J. Dahabreh
【Abstract】: Obtaining good probability estimates is imperative for many applications. The increased uncertainty and typically asymmetric costs surrounding rare events increases this need. Experts (and classification systems) often rely on probabilities to inform decisions. However, we demonstrate that class probability estimates attained via supervised learning in imbalanced scenarios systematically underestimate the probabilities for minority class instances, despite ostensibly good overall calibration. To our knowledge, this problem has not previously been explored. Motivated by our exposition of this issue, we propose a simple, effective and theoretically motivated method to mitigate the bias of probability estimates for imbalanced data that bags estimators calibrated over balanced bootstrap samples. This approach drastically improves performance on the minority instances without greatly affecting overall calibration. We show that additional uncertainty can be exploited via a Bayesian approach by considering posterior distributions over bagged probability estimates.
【Keywords】: Bayes methods; estimation theory; learning (artificial intelligence); pattern classification; statistical distributions; Bayesian approach; class probability estimation; classification system; imbalanced data; posterior distribution; supervised learning; Bagging; Calibration; Estimation; Logistics; Mathematical model; Sensitivity; class imbalance; probability estimates
【Paper Link】 【Pages】:705-714
【Authors】: Joyce Jiyoung Whang ; Xin Sui ; Inderjit S. Dhillon
【Abstract】: Clustering of social networks is an important task for their analysis, however, most existing algorithms do not scale to the massive size of todayâs social networks. A popular class of graph clustering algorithms for large-scale networks, such as PMetis, KMetis and Graclus, is based on a multilevel framework. Generally, these multilevel algorithms work reasonably well on networks with a few million vertices. However, when the network size increases to the scale of 10 million vertices or greater, the performance of these algorithms rapidly degrades. Furthermore, an inherent property of social networks, the power law degree distribution, makes these algorithms infeasible to apply to large-scale social networks. In this paper, we propose a scalable and memory-efficient clustering algorithm for large-scale social networks. We name our algorithm GEM, by mixing two key concepts of the algorithm, Graph Extraction and weighted kernel k-Means. GEM efficiently extracts a good skeleton graph from the original graph, and propagates the clustering result of the extracted graph to the rest of the network. Experimental results show that GEM produces clusters of quality comparable to or better than existing state-of-the-art graph clustering algorithms, while it is much faster and consumes much less memory. Furthermore, the parallel implementation of GEM, called PGEM, not only produces higher quality of clusters but also achieves much better scalability than most current parallel graph clustering algorithms.
【Keywords】: graph theory; pattern clustering; social networking (online); GEM; Graclus; KMetis; PMetis; graph extraction; large-scale social networks; memory-efficient clustering; multilevel framework; parallel graph clustering algorithms; power law degree distribution; scalable clustering; weighted kernel k-means; Algorithm design and analysis; Clustering algorithms; Kernel; Memory management; Skeleton; Twitter; clustering; graph clustering; graph partitioning; kernel k-means; scalable computing; social networks
【Paper Link】 【Pages】:715-724
【Authors】: Linli Xu ; Bo Li ; Enhong Chen
【Abstract】: An ensemble is composed of a set of base learners that make predictions jointly. The generalization performance of an ensemble has been justified both theoretically and in practice. However, existing ensemble learning methods sometimes produce unnecessarily large ensembles, with an expense of extra computational costs and memory consumption. The purpose of ensemble pruning is to select a subset of base learners with comparable or better prediction performance. In this paper, we formulate the ensemble pruning problem into a combinatorial optimization problem with the goal to maximize the accuracy and diversity at the same time. Solving this problem exactly is computationally hard. Fortunately, we can relax and reformulate it as a constrained eigenvector problem, which can be solved with an efficient algorithm that is guaranteed to converge globally. Convincing experimental results demonstrate that this optimization based ensemble pruning algorithm outperforms the state-of-the-art heuristics in the literature.
【Keywords】: combinatorial mathematics; eigenvalues and eigenfunctions; learning (artificial intelligence); optimisation; combinatorial optimization problem; constrained eigen-optimization; constrained eigenvector problem; ensemble learning method; ensemble pruning; Accuracy; Bagging; Complexity theory; Optimization; Predictive models; Training; Vectors; ensemble pruning; optimization
【Paper Link】 【Pages】:725-734
【Authors】: Xinxing Xu ; Ivor W. Tsang ; Dong Xu
【Abstract】: Data ambiguities exist in many data mining and machine learning applications such as text categorization and image retrieval. For instance, it is generally beneficial to utilize the ambiguous unlabeled documents to learn a more robust classifier for text categorization under the semi-supervised learning setting. To handle general data ambiguities, we present a unified kernel learning framework named Input-Output Kernel Learning (IOKL). Based on our framework, we further propose a novel soft margin group sparse Multiple Kernel Learning (MKL) formulation by introducing a group kernel slack variable to each group of base input-output kernels. Moreover, an efficient block-wise coordinate descent algorithm with an analytical solution for the kernel combination coefficients is developed to solve the proposed formulation. We conduct comprehensive experiments on benchmark datasets for both semi-supervised learning and multiple instance learning tasks, and also apply our IOKL framework to a computer vision application called text-based image retrieval on the NUS-WIDE dataset. Promising results demonstrate the effectiveness of our proposed IOKL framework.
【Keywords】: data mining; learning (artificial intelligence); operating system kernels; text analysis; NUS-WIDE dataset; ambiguous unlabeled documents; benchmark datasets; block wise coordinate descent algorithm; computer vision application; data mining; general data ambiguity; group kernel slack; input output kernel learning; kernel combination coefficients; kernel learning framework; machine learning application; multiple instance learning task; multiple kernel learning formulation; robust classifier; semisupervised learning setting; text based image retrieval; text categorization; Kernel; Linear programming; Semisupervised learning; Support vector machines; Training; Uncertainty; Vectors; Group Multiple Kernel Learning; Input-Output Kernel Learning; Multi-Instance Learning; Semi-supervised Learning; Text-based Image Retrieval
【Paper Link】 【Pages】:735-744
【Authors】: Zhao Xu ; Kristian Kersting ; Christian Bauckhage
【Abstract】: Spectral hashing (SH) seeks compact binary codes of data points so that Hamming distances between codes correlate with data similarity. Quickly learning such codes typically boils down to principle component analysis (PCA). However, this is only justified for normally distributed data. For proportional data (normalized histograms), this is not the case. Due to the sum-to-unity constraint, features that are as independent as possible will not all be uncorrelated. In this paper, we show that a linear-time transformation efficiently copes with sum-to-unity constraints: first, we select a small number K of diverse data points by maximizing the volume of the simplex spanned by these prototypes; second, we represent each data point by means of its cosine similarities to the K selected prototypes. This maximum volume hashing is sensible since each dimension in the transformed space is likely to follow a von Mises (vM) distribution, and, in very high dimensions, the vM distribution closely resembles a Gaussian distribution. This justifies to employ PCA on the transformed data. Our extensive experiments validate this: maximum volume hashing outperforms spectral hashing and other state of the art techniques.
【Keywords】: Gaussian distribution; cryptography; learning (artificial intelligence); principal component analysis; Gaussian distribution; Hamming distance; PCA; binary code; data learning; data similarity; linear-time transformation; maximum volume hashing; normalized histogram; principle component analysis; proportional data hashing; simplex volume; spectral hashing; sum-to-unity constraint; von Mises distribution; Binary codes; Eigenvalues and eigenfunctions; Gaussian distribution; Optimization; Principal component analysis; Semantics; Vectors; Dimensionality Reduction; Proportional Data; Spectral Hashing; von Mises Distribution
【Paper Link】 【Pages】:745-754
【Authors】: Jaewon Yang ; Jure Leskovec
【Abstract】: Nodes in real-world networks organize into densely linked communities where edges appear with high concentration among the members of the community. Identifying such communities of nodes has proven to be a challenging task mainly due to a plethora of definitions of a community, intractability of algorithms, issues with evaluation and the lack of a reliable gold-standard ground-truth. In this paper we study a set of 230 large real-world social, collaboration and information networks where nodes explicitly state their group memberships. For example, in social networks nodes explicitly join various interest based social groups. We use such groups to define a reliable and robust notion of ground-truth communities. We then propose a methodology which allows us to compare and quantitatively evaluate how different structural definitions of network communities correspond to ground-truth communities. We choose 13 commonly used structural definitions of network communities and examine their sensitivity, robustness and performance in identifying the ground-truth. We show that the 13 structural definitions are heavily correlated and naturally group into four classes. We find that two of these definitions, Conductance and Triad-participation-ratio, consistently give the best performance in identifying ground-truth communities. We also investigate a task of detecting communities given a single seed node. We extend the local spectral clustering algorithm into a heuristic parameter-free community detection method that easily scales to networks with more than hundred million nodes. The proposed method achieves 30% relative improvement over current local clustering methods.
【Keywords】: information networks; pattern clustering; social networking (online); collaboration networks; conductance; densely linked communitty; gold-standard ground-truth community; heuristic parameter-free community detection method; information networks; interest based social groups; local spectral clustering algorithm; network community evaluation; node community detection; real-world networks; social networks; triad-participation-ratio; Collaboration; Communities; Image edge detection; Measurement; Robustness; Social network services; Community detection; Community scoring functions; Modularity; Network communities
【Paper Link】 【Pages】:755-764
【Authors】: Yang Yang ; Nitesh V. Chawla ; Yizhou Sun ; Jiawei Han
【Abstract】: Link prediction is an important task in network analysis, benefiting researchers and organizations in a variety of fields. Many networks in the real world, for example social networks, are heterogeneous, having multiple types of links and complex dependency structures. Link prediction in such networks must model the influence propagating between heterogeneous relationships to achieve better link prediction performance than in homogeneous networks. In this paper, we introduce Multi-Relational Influence Propagation (MRIP), a novel probabilistic method for heterogeneous networks. We demonstrate that MRIP is useful for predicting links in sparse networks, which present a significant challenge due to the severe disproportion of the number of potential links to the number of real formed links. We also explore some factors that can inform the task of classification yet remain unexplored, such as temporal information. In this paper we make use of the temporal-related features by carefully investigating the issues of feasibility and generality. In accordance with our work in unsupervised learning, we further design an appropriate supervised approach in heterogeneous networks. Our experiments on co-authorship prediction demonstrate the effectiveness of our approach.
【Keywords】: information networks; pattern classification; probability; unsupervised learning; MRIP probabilistic method; classification task; coauthorship prediction; dependency structure; heterogeneous network; link prediction; multirelational influence propagation; multirelational network; network analysis; social network; sparse network; unsupervised learning; Correlation; Diseases; Equations; Genetics; Integrated circuit modeling; Time series analysis; Unsupervised learning; Heterogeneous Network; Link Prediction; Temporal Analysis
【Paper Link】 【Pages】:765-774
【Authors】: Hsiang-Fu Yu ; Cho-Jui Hsieh ; Si Si ; Inderjit S. Dhillon
【Abstract】: Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updatesrank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.
【Keywords】: computational complexity; convergence; gradient methods; matrix decomposition; optimisation; parallel algorithms; recommender systems; stochastic processes; ALS; CCD++ algorithm; SGD; alternating least squares; classical optimization approach; coordinate descent based methods; cubic time complexity; distributed systems; large-scale datasets; large-scale problems; multicore systems; parallel algorithms; parallel matrix factorization; recommender systems; scalable coordinate descent; slow convergence; stable convergence; stochastic gradient descent; synthetic dataset; update rule; update sequences; updates rank-one factors; web-scale datasets; Acceleration; Charge coupled devices; Convergence; Least squares approximation; Recommender systems; Time complexity; Low rank approximation; Matrix factorization; Parallelization; Recommender systems
【Paper Link】 【Pages】:775-784
【Authors】: Guo-Xun Yuan ; Kwan-Liu Ma
【Abstract】: Sparse linear support vector machines have been widely applied to variable selection in many applications. For large data, managing the cost of training a sparse model with good predication performance is an essential topic. In this work, we propose a scalable training algorithm for large-scale data with millions of examples and features. We develop a dual alternating direction method for solving L1-regularized linear SVMs. The learning procedure simply involves quadratic programming in the same form as the standard SVM dual, followed by a soft-thresholding operation. The proposed training algorithm possesses two favorable properties. First, it is a decomposable algorithm by which a large problem can be reduced to small ones. Second, the sparsity of intermediate solutions is maintained throughout the training process. It naturally promotes the solution sparsity by soft-thresholding. We demonstrate that, by experiments, our method outperforms state-of-the-art approaches on large-scale benchmark data sets. We also show that it is well suited for training large sparse models on a distributed system.
【Keywords】: pattern classification; quadratic programming; support vector machines; L1-regularized linear SVM; decomposable algorithm; distributed system; dual alternating direction method; large-scale data; learning procedure; predication performance; quadratic programming; scalable training; soft-thresholding operation; sparse classifier; sparse linear SVM; sparse linear support vector machine; variable selection; Approximation methods; Convergence; Data models; Minimization; Standards; Support vector machines; Training; L1 regularization; large-scale linear classification; optimization techniques
【Paper Link】 【Pages】:785-794
【Authors】: Jesin Zakaria ; Abdullah Mueen ; Eamonn J. Keogh
【Abstract】: Time series clustering has become an increasingly important research topic over the past decade. Most existing methods for time series clustering rely on distances calculated from the entire raw data using the Euclidean distance or Dynamic Time Warping distance as the distance measure. However, the presence of significant noise, dropouts, or extraneous data can greatly limit the accuracy of clustering in this domain. Moreover, for most real world problems, we cannot expect objects from the same class to be equal in length. As a consequence, most work on time series clustering only considers the clustering of individual time series "behaviors," e.g., individual heart beats or individual gait cycles, and contrives the time series in some way to make them all equal in length. However, contriving the data in such a way is often a harder problem than the clustering itself. In this work, we show that by using only some local patterns and deliberately ignoring the rest of the data, we can mitigate the above problems and cluster time series of different lengths, i.e., cluster one heartbeat with multiple heartbeats. To achieve this we exploit and extend a recently introduced concept in time series data mining called shapelets. Unlike existing work, our work demonstrates for the first time the unintuitive fact that shapelets can be learned from unlabeled time series. We show, with extensive empirical evaluation in diverse domains, that our method is more accurate than existing methods. Moreover, in addition to accurate clustering results, we show that our work also has the potential to give insights into the domains to which it is applied.
【Keywords】: pattern clustering; time series; unsupervised learning; Euclidean distance measure; clustering accuracy; dynamic time warping distance measure; time series clustering; unsupervised shapelet concept; Clustering algorithms; Data mining; Earth; Euclidean distance; Time measurement; Time series analysis; Vectors; clustering; shapelets; time series; unsupervised
【Paper Link】 【Pages】:795-803
【Authors】: Yan Zhou ; Murat Kantarcioglu ; Bhavani M. Thuraisingham
【Abstract】: Practical machine learning and data mining problems often face shortage of labeled training data. Self-training algorithms are among the earliest attempts of using unlabeled data to enhance learning. Traditional self-training algorithms label unlabeled data on which classifiers trained on limited training data have the highest confidence. In this paper, a self-training algorithm that decreases the disagreement region of hypotheses is presented. The algorithm supplements the training set with self-labeled instances. Only instances that greatly reduce the disagreement region of hypotheses are labeled and added to the training set. Empirical results demonstrate that the proposed self-training algorithm can effectively improve classification performance.
【Keywords】: data mining; learning (artificial intelligence); data mining problems; disagreement region; labeled training data; limited training data; machine learning; selection by rejection; self labeled instances; self training algorithm; unlabeled data; Accuracy; Algorithm design and analysis; Distributed databases; Labeling; Noise; Semisupervised learning; Training; self-training; semi-supervised learning
【Paper Link】 【Pages】:804-809
【Authors】: Blake Anderson ; Curtis B. Storlie ; Terran Lane
【Abstract】: With the increasing prevalence of richer, more complex data sources, learning with multiple views is becoming more widespread. Multiple kernel learning (MKL) has been developed to address this problem, but in general, the solutions provided by traditional MKL are restricted to a classification objective function. In this work, we develop a novel multiple kernel learning algorithm that is based on a spectral clustering objective function which is able to find an optimal kernel weight vector for the clustering problem. We go on to show how this optimization problem can be cast as a semidefinite program and efficiently solved using off-the-shelf interior point methods.
【Keywords】: invasive software; learning (artificial intelligence); mathematical programming; pattern classification; pattern clustering; MKL; classification objective function; clustering problem; complex data source; malware; multiple kernel learning algorithm; multiple kernel learning clustering; off-the-shelf interior point method; optimal kernel weight vector; optimization problem; semidefinite program; spectral clustering; Clustering algorithms; Equations; Kernel; Laplace equations; Linear programming; Malware; Vectors; Clustering; Convex Optimization; Malware; Multiple Kernel Learning
【Paper Link】 【Pages】:810-815
【Authors】: Cuneyt Gurcan Akcora ; Barbara Carminati ; Elena Ferrari
【Abstract】: In this paper, we explore the risks of friends in social networks caused by their friendship patterns, by using real life social network data and starting from a previously defined risk model. Particularly, we observe that risks of friendships can be mined by analyzing users' attitude towards friends of friends. This allows us to give new insights into friendship and risk dynamics on social networks.
【Keywords】: data privacy; risk management; social networking (online); friendship; friendship pattern; friendship risk; risk dynamics; risk model; social network; user attitude; Clustering algorithms; Computational modeling; Estimation; Labeling; Logistics; Privacy; Social network services; Friendship; Online Social Networks; Privacy; Risk Model
【Paper Link】 【Pages】:816-821
【Authors】: Md. Hijbul Alam ; SangKeun Lee
【Abstract】: The number of opinions and reviews about different products and services is growing online. Users frequently look for important aspects of a product or service in the reviews. Usually, they are interested in semantic (i.e., sentiment-oriented) aspects. However, extracting semantic aspects with supervised methods is very expensive. We propose a domain independent unsupervised model to extract semantic aspects, and conduct qualitative and quantitative experiments to evaluate the extracted aspects. The experiments show that our model effectively extracts semantic aspects with correlated top words. In addition, the conducted evaluation on aspect sentiment classification shows that our model outperforms other models by 5-7% in terms of macro-average F1.
【Keywords】: Internet; pattern classification; F1 macro-average; aspect sentiment classification; domain independent unsupervised model; online reviews; qualitative experiments; quantitative experiments; semantic aspect discovery; sentiment-oriented aspect; top word correlation; Airports; Analytical models; Computational modeling; Contamination; Context; Joints; Semantics; aspect discovery; opinion mining; sentiment analysis; topic model
【Paper Link】 【Pages】:822-827
【Authors】: Mohammad Ali Bagheri ; Qigang Gao ; Sergio Escalera
【Abstract】: Among the proposed methods to deal with multi-class classification problems, the Error-Correcting Output Codes (ECOC) represents a powerful framework. The key factor in designing any ECOC matrix is the independency of the binary classifiers, without which the ECOC method would be ineffective. This paper proposes an efficient new approach to the ECOC framework in order to improve independency among classifiers. The underlying rationale for our work is that we design three-dimensional codematrix, where the third dimension is the feature space of the problem domain. Using rough set-based feature selection, a new algorithm, named "Rough Set Subspace ECOC (RSS-ECOC)" is proposed. We introduce the Quick Multiple Reduct algorithm in order to generate a set of reducts for a binary problem, where each reduct is used to train a dichotomizer. In addition to creating more independent classifiers, ECOC matrices with longer codes can be built. The numerical experiments in this study compare the classification accuracy of the proposed RSS-ECOC with classical ECOC, one-versus-one, and one-versus-all methods on 24 UCI datasets. The results show that the proposed technique increases the classification accuracy in comparison with the state of the art coding methods.
【Keywords】: error correction codes; matrix algebra; pattern classification; rough set theory; binary classifier; classification accuracy; dichotomizer training; multiclass classification problem; quick multiple reduct algorithm; rough set subspace ECOC; rough set subspace error-correcting output code matrix; rough set-based feature selection; three-dimensional codematrix; Accuracy; Algorithm design and analysis; Data mining; Decoding; Encoding; Training; Vectors; Error Correcting Output Codes; Feature subspace; Multiclass classification; Rough Set
【Paper Link】 【Pages】:828-833
【Authors】: Gilles Bisson ; Clement Grimal
【Abstract】: In many applications, entities of the domain are described through different views that clustering methods often process one by one. We introduce here the architecture MVSim, that is able to deal simultaneously with all the information contained in such multi-view datasets by using several instances of a co-similarity algorithm. We show that this architecture provides better results than both single-view and multi-view approaches and that it can be easily parallelized thus reducing both time and space complexities of the computations.
【Keywords】: computational complexity; parallel processing; pattern clustering; MVSim architecture; clustering method; cosimilarity algorithm; multiview dataset coclustering; parallelizable approach; space complexity; time complexity; Clustering algorithms; Clustering methods; Complexity theory; Computer architecture; Damping; Silicon; Symmetric matrices; Co-clustering; Multi-view and Similarity Learning
【Paper Link】 【Pages】:834-839
【Authors】: Anveshi Charuvaka ; Huzefa Rangwala
【Abstract】: Several biological databases organize information in taxonomies/hierarchies. These databases differ in terms of curation process, input data, coverage and annotation errors. SCOP and CATH are examples of two databases that classify proteins hierarchically into structurally related groups based on experimentally determined structures. Given the large number of protein sequences with unavailable structure, there is a need to develop prediction methods to classify protein sequences into structural classes. We have developed a novel classification approach that utilizes the underlying relationships across multiple hierarchical source databases within a multi-task learning (MTL) framework. MTL is used to simultaneously learn multiple related tasks, and has been shown to improve generalization performance. Specifically, we have developed and evaluated an MTL approach for predicting the structural class, as defined by two hierarchical databases, CATH and SCOP, using protein sequence information only. We define one task per node of the hierarchies and formulate the MTL problem as a combination of these binary classification tasks. Our experimental evaluation demonstrates that the MTL approach that integrates both the hierarchies outperforms the base-line approach that trains independent models per task, as well as a MTL approach that integrates tasks across a single hierarchical database. We also performed extensive experiments that evaluate different regularization penalties and incorporate different task relationships that achieve superior classification performance.
【Keywords】: biology computing; database management systems; learning (artificial intelligence); multiprogramming; proteins; CATH; SCOP; annotation errors; biological databases; curation process; dual hierarchies; multi-task learning; multiple hierarchical source databases; protein sequences; proteins classification; taxonomies; Context; Databases; Mathematical model; Optimization; Proteins; Topology; Training
【Paper Link】 【Pages】:840-845
【Authors】: Yu-Wei Chu ; Chih-Hua Tai ; Ming-Syan Chen ; Philip S. Yu
【Abstract】: Information network analysis has drawn a lot attention in recent years. Among all the aspects of network analysis, similarity measure of nodes has been shown useful in many applications, such as clustering, link prediction and community identification, to name a few. As linkage data in a large network is inherently sparse, it is noted that collecting more data can improve the quality of similarity measure. This gives different parties a motivation to cooperate. In this paper, we address the problem of link-based similarity measure of nodes in an information network distributed over different parties. Concerning the data privacy, we propose a privacy-preserving Sim Rank protocol based on fully-homomorphic encryption to provide cryptographic protection for the links.
【Keywords】: cryptographic protocols; data privacy; information analysis; information networks; clustering application; community identification application; cryptographic protection; data collection; data privacy; distributed information network; fully-homomorphic encryption; information network analysis; link prediction application; link-based similarity measure; node similarity measure; privacy-preserving SimRank protocol; Ciphers; Encryption; Joints; Motion pictures; Protocols; Vectors; Privacy; Similarity
【Paper Link】 【Pages】:846-851
【Authors】: Fatemeh Dorri ; Ali Ghodsi
【Abstract】: A main problem in machine learning is to predict the response variables of a test set given the training data and its corresponding response variables. A predictive model can perform satisfactorily only if the training data is an appropriate representative of the test data. This is usually reflected in the assumption that the training data and the test data are drawn from the same underlying probability distribution. However, the assumption may not be correct in many applications for various reasons. We propose a method based on kernel distribution embedding and Hilbert-Schmidt Independence Criterion (HSIC) to address this problem. The proposed method explores a new representation of the data in a new feature space with two properties: (i) the distributions of the training and the test data sets are as close as possible in the new feature space, and (ii) the important structural information of the data is preserved. The algorithm can reduce the dimensionality of the data while it preserves the aforementioned properties and therefore it can be seen as a dimensionality reduction method as well. Our method has a closed-form solution and the experimental results show that it works well in practice.
【Keywords】: data handling; data structures; learning (artificial intelligence); statistical distributions; HSIC; Hilbert-Schmidt independence criterion; component analysis; data dimensionality; data representation; dimensionality reduction; kernel distribution embedding; machine learning; predictive model; probability distribution; response variable; test data; training data; Algorithm design and analysis; Error analysis; Kernel; Predictive models; Probability distribution; Training; Training data; Domain Adaptation; Hilbert- Schmidt independence criterion; Kernel Embedding
【Paper Link】 【Pages】:852-857
【Authors】: Mingyu Fan ; Xiaoqin Zhang ; Zhouchen Lin ; Zhongfei Zhang ; Hujun Bao
【Abstract】: Manifold learning is an important feature extraction approach in data mining. This paper presents a new semi-supervised manifold learning algorithm, called Multi-Manifold Discriminative Analysis (Multi-MDA). The proposed method is designed to explore the discriminative information hidden in geodesic distances. The main contributions of the proposed method are: 1) we propose a semi-supervised graph construction method which can effectively capture the multiple manifolds structure of the data, 2) each data point is replaced with an associated feature vector whose elements are the graph distances from it to the other data points. Information of the nonlinear structure is contained in the feature vectors which are helpful for classification, 3) we propose a new semi-supervised linear dimension reduction method for feature vectors which introduces the class information into the manifold learning process and establishes an explicit dimension reduction mapping. Experiments on benchmark data sets are conducted to show the effectiveness of the proposed method.
【Keywords】: data mining; feature extraction; graph theory; learning (artificial intelligence); vectors; Multi-MDA algorithm; data mining; data structure; discriminative information; explicit dimension reduction mapping; feature vector; geodesic based semisupervised multimanifold feature extraction; geodesic distance; graph distance; manifold learning; multimanifold discriminative analysis; semisupervised graph construction method; semisupervised linear dimension reduction method; Accuracy; Algorithm design and analysis; Feature extraction; Manifolds; Principal component analysis; Training; Vectors; Feature extraction; geodesic distance; manifold learning
【Paper Link】 【Pages】:858-863
【Authors】: Meng Fang ; Xingquan Zhu ; Bin Li ; Wei Ding ; Xindong Wu
【Abstract】: The emergence of social tagging and crowdsourcing systems provides a unique platform where multiple weak labelers can form a crowd to fulfill a labeling task. Yet crowd labelers are often noisy, inaccurate, and have limited labeling knowledge, and worst of all, they act independently without seeking complementary knowledge from each other to improve labeling performance. In this paper, we propose a Self-Taught Active Learning (STAL) paradigm, where imperfect labelers are able to learn complementary knowledge from one another to expand their knowledge sets and benefit the underlying active learner. We employ a probabilistic model to characterize the knowledge of each labeler through which a weak labeler can learn complementary knowledge from a stronger peer. As a result, the self-taught active learning process eventually helps achieve high classification accuracy with minimized labeling costs and labeling errors.
【Keywords】: learning (artificial intelligence); probability; active learner; classification accuracy; crowdsourcing system; labeling cost minimisation; labeling error minimisation; labeling task; probabilistic model; self-taught active learning; social tagging; Computer science; Educational institutions; Graphical models; Labeling; Learning systems; Reliability; Uncertainty; active learning; crowd; self-taught
【Paper Link】 【Pages】:864-869
【Authors】: Zheng Fang ; Zhongfei (Mark) Zhang
【Abstract】: Multiple feature views arise in various important data classification scenarios. However, finding a consensus feature view from multiple feature views for a classifier is still a challenging task. We present a new classification framework using the multi-label correlation information to address the problem of simultaneously combining multiple feature views and maximum margin classification. Under this framework, we propose a novel algorithm that iteratively computes the multiple view feature mapping matrices, the consensus feature view representation, and the coefficients of the classifier. Extensive experimental evaluations demonstrate the effectiveness and promise of this framework as well as the algorithm for discovering a consensus view from multiple feature views.
【Keywords】: learning (artificial intelligence); matrix algebra; pattern classification; classifier coefficient; consensus feature view; consensus feature view representation; data classification; maximum margin classification; multilabel correlation information; multiple view feature mapping matrices; multiview multilabel learning; Correlation; Data mining; Feature extraction; Linear programming; Optimization; Training; Training data; consensus representation; feature mapping; label dependence maximization; maximum margin classification; multi-view learning
【Paper Link】 【Pages】:870-875
【Authors】: Mario Frank ; Ben Dong ; Adrienne Porter Felt ; Dawn Song
【Abstract】: Android and Facebook provide third-party applications with access to users' private data and the ability to perform potentially sensitive operations (e.g., post to a user's wall or place phone calls). As a security measure, these platforms restrict applications' privileges with permission systems: users must approve the permissions requested by applications before the applications can make privacy-or security-relevant API calls. However, recent studies have shown that users often do not understand permission requests and are unsure of which permissions are typical for applications. As a first step towards simplifying permission systems, we cluster a corpus of 188,389 Android applications and 27,029 Facebook applications to find patterns in permission requests. Using a method for Boolean matrix factorization to find overlapping clusters of permissions, we find that Facebook permission requests follow a clear structure that can be fitted well with only five patterns, whereas Android applications demonstrate more complex permission requests. We also find that low-reputation applications often deviate from the permission request patterns that we identified for high-reputation applications, which suggests that permission request patterns can be indicative of user satisfaction or application quality.
【Keywords】: Boolean algebra; application program interfaces; data mining; data privacy; matrix decomposition; operating systems (computers); social networking (online); Android; Boolean matrix factorization; Facebook; application quality; high-reputation application; low-reputation application; overlapping permission clusters; permission request pattern mining; permission system; privacy-or security-relevant API call; private data; security measure; third-party application; user satisfaction; Androids; Facebook; Hardware; Humanoid robots; Malware; Smart phones; Training; Android; Facebook; Permissions; Smartphones; Unsupervised learning; pattern mining
【Paper Link】 【Pages】:876-881
【Authors】: Liang Ge ; Jing Gao ; Xiao Yu ; Wei Fan ; Aidong Zhang
【Abstract】: We investigate how to estimate information trustworthiness by considering multiple information sources jointly in a latent matrix space. We particularly focus on user review and recommendation systems, as there are multiple platforms where people can rate items and services that they have purchased, and many potential customers rely on these opinions to make decisions. Information trustworthiness is a serious problem because ratings are generated freely by end-users so that many stammers take advantage of freedom of speech to promote their business or damage reputation of competitors. We propose to simply use customer ratings to estimate each individual source's reliability by exploring correlations among multiple sources. Ratings of items are provided by users of diverse tastes and styles, and thus may appear noisy and conflicting across sources, however, they share some underlying common behavior. Therefore, we can group users based on their opinions, and a source is reliable on an item if its opinions given by latent groups are consistent across platforms. Inspired by this observation, we solve the problem by a two-step model -- a joint matrix factorization procedure followed by reliability score computation. We propose two effective approaches to decompose rating matrices as the products of group membership and group rating matrices, and then compute consistency degrees from group rating matrices as source reliability scores. We conduct experiments on both synthetic data and real user ratings collected from Orbitz, Priceline and Trip Advisor on all the hotels in Las Vegas and New York City. Results show that the proposed method is able to give accurate estimates of source reliability and thus successfully identify inconsistent, conflicting and unreliable information.
【Keywords】: data privacy; matrix decomposition; recommender systems; user interfaces; Las Vegas; New York City; Orbitz; Priceline; Trip Advisor; customer rating; group membership matrix; group rating matrix; information source; item rating; latent matrix space; local information trustworthiness estimation; multisource joint matrix factorization; recommendation system; reliability score computation; user review system; Algorithm design and analysis; Cities and towns; Convergence; Joints; Matrix decomposition; Optimization; Reliability; Joint Matrix Factorization; Recommendation System; Trustworthiness
【Paper Link】 【Pages】:882-887
【Authors】: Quanquan Gu ; Jiawei Han
【Abstract】: Active learning on graphs has received increasing interest in the past years. In this paper, we propose a textit{nonadaptive} active learning approach on graphs, based on generalization error bound minimization. In particular, we present a data-dependent error bound for a graph-based learning method, namely learning with local and global consistency (LLGC). We show that the empirical transductive Rademacher complexity of the function class for LLGC provides a natural criterion for active learning. The resulting active learning approach is to select a subset of nodes on a graph such that the empirical transductive Rademacher complexity of LLGC is minimized. We propose a simple yet effective sequential optimization algorithm to solve it. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art active learning methods on graphs.
【Keywords】: computational complexity; generalisation (artificial intelligence); graph theory; learning (artificial intelligence); minimisation; LLGC learning; active learning; data-dependent error bound; empirical transductive Rademacher complexity; error bound minimization approach; generalization error bound minimization; graph-based learning method; learning-with-local-and-global consistency; sequential optimization algorithm; Algorithm design and analysis; Complexity theory; Learning systems; Machine learning; Minimization; Optimization; Vectors; Active Learning; Generalization Error Bound; Graph; Sequential Optimization
【Paper Link】 【Pages】:888-893
【Authors】: Yuhong Guo ; Min Xiao
【Abstract】: In cross-lingual text classification problems, it is costly and time-consuming to annotate documents for each individual language. To avoid the expensive re-labeling process, domain adaptation techniques can be applied to adapt a learning system trained in one language domain to another language domain. In this paper we develop a transductive subspace representation learning method to address domain adaptation for cross-lingual text classifications. The proposed approach is formulated as a nonnegative matrix factorization problem and solved using an iterative optimization procedure. Our empirical study on cross-lingual text classification tasks shows the proposed approach consistently outperforms a number of comparison methods.
【Keywords】: classification; iterative methods; learning (artificial intelligence); matrix decomposition; optimisation; text analysis; cross-lingual text classification; document annotation; domain adaptation technique; iterative optimization; language domain; learning system; nonnegative matrix factorization problem; relabeling process; transductive subspace representation learning method; Accuracy; Learning systems; Linear programming; Optimization; Standards; Training; Vectors; cross-lingual text classification; domain adaptation; representation learning
【Paper Link】 【Pages】:894-899
【Authors】: Shrutendra K. Harsola ; Prasad M. Deshpande ; Jayant R. Haritsa
【Abstract】: We address the problem of mining targeted association rules over multidimensional market-basket data. Here, each transaction has, in addition to the set of purchased items, ancillary dimension attributes associated with it. Based on these dimensions, transactions can be visualized as distributed over cells of an n-dimensional cube. In this framework, a targeted association rule is of the form {X → Y}R, where R is a convex region in the cube and X → Y is a traditional association rule within region R. We first describe the TOARM algorithm, based on classical techniques, for identifying targeted association rules. Then, we discuss the concepts of bottom-up aggregation and cubing, leading to the Cell Union technique. This approach is further extended, using notions of cube-count interleaving and credit-based pruning, to derive the Ice Cube algorithm. Our experiments demonstrate that Ice Cube consistently provides the best execution time performance, especially for large and complex data cubes.
【Keywords】: data mining; marketing data processing; IceCube algorithm; TOARM algorithm; ancillary dimension attribute; bottom-up aggregation concept; cell union technique; credit-based pruning notion; cube-count interleaving notion; cubing concept; data cube; multidimensional market-basket data; targeted association rule mining; Aggregates; Algorithm design and analysis; Association rules; Filtering algorithms; Generators; Itemsets; association rule mining; data cube; localized rules
【Paper Link】 【Pages】:900-905
【Authors】: Hao Huang ; Hong Qin ; Shinjae Yoo ; Dantong Yu
【Abstract】: The primary originality of this paper lies at the fact that we have made the first attempt to apply quantum mechanics theory to anomaly (outlier) detection in high-dimensional datasets for data mining. We propose Fermi Density Descriptor (FDD) which represents the probability of measuring a fermion at a specific location for anomaly detection. We also quantify and examine different Laplacian normalization effects and choose the best one for anomaly detection. Both theoretical proof and quantitative experiments demonstrate that our proposed FDD is substantially more discriminative and robust than the commonly-used algorithms.
【Keywords】: data mining; quantum theory; security of data; statistical analysis; Fermi density descriptor; Laplacian normalization effect; anomaly detection algorithm; data mining; fermion measurement probability; outlier detection; quantum mechanics theory; Distribution functions; Eigenvalues and eigenfunctions; Equations; Laplace equations; Manifolds; Quantum mechanics; Robustness; Anomaly Detection; Quantum Mechanics
【Paper Link】 【Pages】:906-911
【Authors】: Katsuhiko Ishiguro ; Akisato Kimura ; Koh Takeuchi
【Abstract】: The amount and variety of multimedia data such as images, movies and music available on over social networks are increasing rapidly. However, the ability to analyze and exploit these unorganized multimedia data remains inadequate, even with state-of-the-art media processing techniques. Our finding in this paper is that the emerging social curation service is a promising information source for the automatic understanding and mining of images distributed and exchanged via social media. One remarkable virtue of social curation service datasets is that they are weakly supervised: the content in the service is manually collected, selected and maintained by users. This is very different from other social information sources, and we can utilize this characteristics for media content mining without expensive media processing techniques. In this paper we present a machine learning system for predicting view counts of images in social curation data as the first step to automatic image content evaluation. Our experiments confirm that the simple features extracted from a social curation corpus are much superior in terms of count prediction than the gold-standard image features of computer vision research.
【Keywords】: data mining; feature extraction; image processing; learning (artificial intelligence); multimedia computing; social networking (online); automatic image content evaluation; automatic image mining; automatic image understanding; feature extraction; machine learning system; media content mining; movies; multimedia data; music; social curation service; social media; social network; view count prediction; Computer vision; Context; Data mining; Feature extraction; Histograms; Media; Motion pictures; Social curation; automatic image understanding and evaluation; feature extraction; regression
【Paper Link】 【Pages】:912-917
【Authors】: Shangpu Jiang ; Daniel Lowd ; Dejing Dou
【Abstract】: A number of text mining and information extraction projects such as Text Runner and NELL seek to automatically build knowledge bases from the rapidly growing amount of information on the web. In order to scale to the size of the web, these projects often employ ad hoc heuristics to reason about uncertain and contradictory information rather than reasoning jointly about all candidate facts. In this paper, we present a Markov logic-based system for cleaning an extracted knowledge base. This allows a scalable system such as NELL to take advantage of joint probabilistic inference, or, conversely, allows Markov logic to be applied to a web scale problem. Our system uses only the ontological constraints and confidence values of the original system, along with human-labeled data if available. The labeled data can be used to calibrate the confidence scores from the original system or learn the effectiveness of individual extraction patterns. To achieve scalability, we introduce a neighborhood grounding method that only instantiates the part of the network most relevant to the given query. This allows us to partition the knowledge cleaning task into tractable pieces that can be solved individually. In experiments on NELL's knowledge base, we evaluate several variants of our approach and find that they improve both F1 and area under the precision-recall curve.
【Keywords】: Internet; Markov processes; data mining; formal logic; inference mechanisms; knowledge based systems; ontologies (artificial intelligence); query processing; text analysis; Markov logic-based system; NELL; Web information; Web scale problem; ad hoc heuristics; confidence score; confidence value; extraction pattern; human-labeled data; information extraction; knowledge base extraction; knowledge cleaning task; neighborhood grounding method; ontological constraint; precision-recall curve; probabilistic inference; query; text mining; text runner; Data mining; Joints; Knowledge based systems; Logistics; Markov processes; Ontologies; Training data; Information extraction; Markov logic; knowledge base; ontology; text mining
【Paper Link】 【Pages】:918-923
【Authors】: Kyomin Jung ; Wooram Heo ; Wei Chen
【Abstract】: Influence maximization is the problem of selecting top k seed nodes in a social network to maximize their influence coverage under certain influence diffusion models. In this paper, we propose a novel algorithm IRIE that integrates the advantages of influence ranking (IR) and influence estimation (IE) methods for influence maximization in both the independent cascade (IC) model and its extension IC-N that incorporates negative opinion propagations. Through extensive experiments, we demonstrate that IRIE matches the influence coverage of other algorithms while scales much better than all other algorithms. Moreover IRIE is much more robust and stable than other algorithms both in running time and memory usage for various density of networks and cascade size. It runs up to two orders of magnitude faster than other state-of-the-art algorithms such as PMIA for large networks with tens of millions of nodes and edges, while using only a fraction of memory.
【Keywords】: belief maintenance; marketing; optimisation; social networking (online); IC model; IRIE algorithm; independent cascade model; influence diffusion model; influence estimation method; influence maximization; influence ranking method; negative opinion propagation; social network; viral marketing; Algorithm design and analysis; Computational modeling; Greedy algorithms; Integrated circuit modeling; Mathematical model; Social network services; independent cascade model; influence maximization; social network analysis; social network mining; viral marketing
【Paper Link】 【Pages】:924-929
【Authors】: Faisal Kamiran ; Asim Karim ; Xiangliang Zhang
【Abstract】: Social discrimination (e.g., against females) arising from data mining techniques is a growing concern worldwide. In recent years, several methods have been proposed for making classifiers learned over discriminatory data discrimination-aware. However, these methods suffer from two major shortcomings: (1) They require either modifying the discriminatory data or tweaking a specific classification algorithm and (2) They are not flexible w.r.t. discrimination control and multiple sensitive attribute handling. In this paper, we present two solutions for discrimination-aware classification that neither require data modification nor classifier tweaking. Our first and second solutions exploit, respectively, the reject option of probabilistic classifier(s) and the disagreement region of general classifier ensembles to reduce discrimination. We relate both solutions with decision theory for better understanding of the process. Our experiments using real-world datasets demonstrate that our solutions outperform existing state-of-the-art methods, especially at low discrimination which is a significant advantage. The superior performance coupled with flexible control over discrimination and easy applicability to multiple sensitive attributes makes our solutions an important step forward in practical discrimination-aware classification.
【Keywords】: data mining; decision theory; pattern classification; probability; data mining; decision theory; discrimination control; discrimination-aware classification; probabilistic classifier; sensitive attribute handling; social discrimination; Accuracy; Communities; Data mining; Decision trees; Logistics; Probabilistic logic; Standards; classification; decision theory; ensembles; social discrimination
【Paper Link】 【Pages】:930-935
【Authors】: Yoonseop Kang ; Saehoon Kim ; Seungjin Choi
【Abstract】: Hashing seeks an embedding of high-dimensional objects into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes with small Hamming distances. A variety of hashing methods have been developed, but most of them resort to a single view (representation) of data. However, objects are often described by multiple representations. For instance, images are described by a few different visual descriptors (such as SIFT, GIST, and HOG), so it is desirable to incorporate multiple representations into hashing, leading to multi-view hashing. In this paper we present a deep network for multi-view hashing, referred to as deep multi-view hashing, where each layer of hidden nodes is composed of view-specific and shared hidden nodes, in order to learn individual and shared hidden spaces from multiple views of data. Numerical experiments on image datasets demonstrate the useful behavior of our deep multi-view hashing (DMVH), compared to recently-proposed multi-modal deep network as well as existing shallow models of hashing.
【Keywords】: cryptography; image coding; image representation; learning (artificial intelligence); Hamming distance; binary code; data single view representation; deep learning; hashing method; high-dimensional object; image dataset; multiple representation; multiview hashing; shared hidden node; similarity-preserving low-dimensional Hamming space; view-specific hidden node; visual descriptor; Binary codes; Computational modeling; Hamming distance; Linear programming; Machine learning; Training; Visualization; deep learning; harmonium; hashing; multi-view learning; restricted Boltzmann machines
【Paper Link】 【Pages】:936-941
【Authors】: Pooyan Khajehpour Tadavani ; Ali Ghodsi
【Abstract】: In the space of high-dimensional data, it is generally reasonable to assume that the data points are on (or close to) one or more submanifolds. Each of these submanifolds can be modeled by a number of linear subspaces. This is in fact the main intuition behind a majority of subspace clustering algorithms. In many cases, however, the subspaces computed by these algorithms consist of disconnected subsets of the underlying submanifolds and therefore, do not form localized and compact clusters. To address this problem, we propose "Low Dimensional Localized Clustering (LDLC)", a new method for subspace clustering. Unlike existing methods, LDLC respects the topology of the underling submanifolds and assigns the data points to localized clusters such that the total reconstruction error is minimized. This is a valuable property in many tasks, such as semi-supervised classification, data visualization and local dimensionality reduction. We establish connections between LDLC, K-Means, and VQPCA from different perspectives, and validate our method through various experiments on synthetic and real data sets.
【Keywords】: pattern clustering; topology; VQPCA; data visualization; high-dimensional data; k-means; linear subspace; local dimensionality reduction; low dimensional localized clustering; semisupervised classification; submanifold; subspace clustering algorithm; topology; Clustering algorithms; Data visualization; Image reconstruction; Linear programming; Manifolds; Measurement uncertainty; Partitioning algorithms; Clustering; Dimensionality Reduction; Manifold
【Paper Link】 【Pages】:942-947
【Authors】: Deguang Kong ; Chris H. Q. Ding
【Abstract】: Linear Discriminant Analysis (LDA) is widely used for dimension reduction in classification tasks. However, standard LDA formulation is not semi definite positive (s.d.p), and thus it is difficult to obtain the global optimal solution when standard LDA formulation is combined with other loss functions or graph embedding. In this paper, we present an alternative approach to LDA. We rewrite the LDA criterion as a convex formulation (semi-definite positive LDA, i.e., sdpLDA) using the largest eigen-value of the generalized eigen-value problem of standard LDA. We give applications by incorporating sdpLDA as a regularization term into discriminant regression analysis. Another application is to incorporate sdpLDA into standard Laplacian embedding, which utilizes the supervised information to improve the Laplacian embedding performance. Proposed sdpLDA formulation can be used for both multi-class classification tasks. Extensive experiments results on 10 multi-class datasets indicate promising results of proposed method.
【Keywords】: convex programming; data analysis; eigenvalues and eigenfunctions; graph theory; pattern classification; regression analysis; LDA criterion; classification tasks; convex formulation; dimension reduction; discriminant regression analysis; eigenvalue; global optimal solution; graph embedding; sdpLDA; semi-definite positive linear discriminant analysis; standard Laplacian embedding; Accuracy; Convex functions; Eigenvalues and eigenfunctions; Kernel; Laplace equations; Linear regression; Standards; LDA; kernel LDA; multi-class; multi-label; semi-definite positive
【Paper Link】 【Pages】:948-953
【Authors】: Jérôme Kunegis ; Jörg Fliege
【Abstract】: We present a method for trust prediction based on no diagonal decompositions of the asymmetric adjacency matrix of a directed network. The method we propose is based on a no diagonal decomposition into directed components (DEDICOM), which we use to learn the coefficients of a matrix polynomial of the network's adjacency matrix. We show that our method can be used to compute better low-rank approximations to a polynomial of a network's adjacency matrix than using the singular value decomposition, and that a higher precision can be achieved at the task of predicting directed links than by undirected or bipartite methods.
【Keywords】: directed graphs; matrix decomposition; polynomial matrices; trusted computing; asymmetric adjacency matrix; bipartite methods; decomposition into directed components; directed network; low-rank approximations; matrix polynomial coefficients; network adjacency matrix; nondiagonal decompositions; singular value decomposition; trust prediction; undirected methods; Approximation methods; Eigenvalues and eigenfunctions; Matrix decomposition; Polynomials; Singular value decomposition; Symmetric matrices; Training; decomposition into directed components; trust
【Paper Link】 【Pages】:954-959
【Authors】: Byunghan Lee ; Joonhong Park ; Sungroh Yoon
【Abstract】: Metagenomic sequencing has become a crucial tool for obtaining a gene catalogue of operational taxonomic units (OTUs) in a microbial community. High-throughput pyrosequencing is a next-generation sequencing technique very popular in microbial community analysis due to its longer read length compared to alternative methods. Computational tools are inevitable to process raw data from pyrosequencers, and in particular, noise removal is a critical data-mining step to obtain robust sequence reads. However, the slow rate of existing denoisers has bottlenecked the whole pyrosequencing process, let alone hindering efforts to improve robustness. To address these, we propose a new approach that can accelerate the denoising process substantially. By using our approach, it now takes only about 2 hours to denoise 62,873 pyrosequenced amplicons from a mixture of 91 full-length 16S rRNA clones. It would otherwise take nearly 2.5 days if existing software tools were used. Furthermore, our approach can effectively reduce overestimating the number of OTUs, producing 6.7 times fewer species-level OTUs on average than a state-of-the-art alternative under the same condition. Leveraged by our approach, we hope that metagenomic sequencing will become an even more appealing tool for microbial community analysis.
【Keywords】: biology computing; data mining; genomics; graphics processing units; molecular biophysics; 16S rRNA clone; GPU; data mining; data processing; gene catalogue; graphics processing unit; high-throughput pyrosequencing technique; metagenomic sequencing; microbial community; noise removal; operational taxonomic unit; pyrosequenced amplicons denoising; sequence read; software tool; time 2 hour; Acceleration; Communities; Graphics processing units; Instruction sets; Noise; Noise reduction; Robustness; GPU; amplicons; biomedical informatics; cluster analysis; metagenomics; pyrosequencing
【Paper Link】 【Pages】:960-965
【Authors】: Wonyeol Lee ; Jinha Kim ; Hwanjo Yu
【Abstract】: Influence maximization problem with applications to viral marketing has gained much attention. Underlying influence diffusion models affect influence maximizing nodes because they focus on difference aspect of influence diffusion. Nevertheless, existing diffusion models overlook two important aspects of real-world marketing - continuous trials and time restriction. This paper proposes a new realistic influence diffusion model called Continously activated and Time-restricted IC (CT-IC) model which generalizes the IC model by embedding the above two aspects. We first prove that CT-IC model satisfies two crucial properties - monotonicity and submodularity. We then provide an efficient method for calculating exact influence spread when a social network is restricted to a directed tree and a simple path. Finally, we propose a scalable algorithm for influence maximization under CT-IC model called CT-IPA. Our experiments show that CT-IC model provides seeds of higher influence spread than IC model and CT-IPA is four orders of magnitude faster than the greedy algorithm while providing similar influence spread to the greedy algorithm.
【Keywords】: marketing data processing; social networking (online); trees (mathematics); CT-IC; continuous trials; continuously activated cascade model; directed tree; influence diffusion; influence maximization problem; monotonicity property; social network; submodularity property; time restriction; time-restricted independent cascade model; viral marketing; Computational modeling; Data models; Greedy algorithms; Integrated circuit modeling; Mathematical model; Social network services; influence diffusion model; influence maximization; viral marketing social networks
【Paper Link】 【Pages】:966-971
【Authors】: Yifeng Li ; Alioune Ngom
【Abstract】: Sparse representation involves two relevant procedures - sparse coding and dictionary learning. Learning a dictionary from data provides a concise knowledge representation. Learning a dictionary in a higher feature space might allow a better representation of a signal. However, it is usually computationally expensive to learn a dictionary if the numbers of training data and(or) dimensions are very large using existing algorithms. In this paper, we propose a kernel dictionary learning framework for three models. We reveal that the optimization has dimension-free and parallel properties. We devise fast active-set algorithms for this framework. We investigated their performance on classification. Experimental results show that our kernel sparse representation approaches can obtain better accuracy than their linear counterparts. Furthermore, our active-set algorithms are faster than the existing interior-point and proximal algorithms.
【Keywords】: data structures; knowledge representation; learning (artificial intelligence); set theory; signal classification; signal representation; classification performance; dictionary learning procedure; fast active-set algorithm; interior-point algorithm; kernel dictionary learning framework; kernel sparse representation approach; knowledge representation; proximal algorithm; signal representation; sparse coding procedure; Accuracy; Dictionaries; Equations; Kernel; Mathematical model; Optimization; Training; $l_1$ regularization; dictionary learning; kernel sparse representation; non-negative least squares; sparse coding
【Paper Link】 【Pages】:972-977
【Authors】: Nedim Lipka ; Benno Stein ; James G. Shanahan
【Abstract】: Automated text classification is one of the most important learning technologies to fight information overload. However, the information society is not only confronted with an information flood but also with an increase in "information volatility", by which we understand the fact that kind and distribution of a data source's emissions can significantly vary. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed among others in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution we partition a test sample by means of clustering. Then, using repetitive resampling with different margin distributions over the clustering, the effectiveness characteristics is studied. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering both theoretically and on standard text corpora. An important result is a heuristic for expected effectiveness estimation that is solely based on the initial test sample and that can be computed without resampling.
【Keywords】: pattern classification; pattern clustering; sampling methods; statistical distributions; text analysis; expected effectiveness estimation; information flood; information overload; information volatility; learning technology; margin distribution; model selection; probabilistic lower bound; repetitive resampling; subclass distribution shift; text classification solution; topic popularity; Clustering algorithms; Estimation; Machine learning; Mathematical model; Media; Standards; Vectors; Classification; Concept Drift; Model Selection; clustering; unknown distributions
【Paper Link】 【Pages】:978-983
【Authors】: Eric Yi Liu ; Zhishan Guo ; Xiang Zhang ; Vladimir Jojic ; Wei Wang
【Abstract】: Recent studies [1] -- [5] have suggested using constraints in the form of relative distance comparisons to represent domain knowledge: d(a, b) <; d(c, d) where d(·) is the distance function and a, b, c, d are data objects. Such constraints are readily available in many problems where pairwise constraints are not natural to obtain. In this paper we consider the problem of learning a Mahalanobis distance metric from supervision in the form of relative distance comparisons. We propose a simple, yet effective, algorithm that minimizes a convex objective function corresponding to the sum of squared residuals of constraints. We also extend our model and algorithm to promote sparsity in the learned metric matrix. Experimental results suggest that our method consistently outperforms existing methods in terms of clustering accuracy. Furthermore, the sparsity extension leads to more stable estimation when the dimension is high and only a small amount of supervision is given.
【Keywords】: convex programming; learning (artificial intelligence); minimisation; pattern clustering; Mahalanobis distance metric learning; clustering accuracy; convex objective function minimisation; distance function; learned metric matrix; pairwise constraint; relative distance comparison; squared residual minimisation; Accuracy; Clustering algorithms; Convergence; Covariance matrix; Linear programming; Measurement; Symmetric matrices; Mahalanobis metric; metric learning; relative comparisons
【Paper Link】 【Pages】:984-989
【Authors】: Junqiang Liu ; Ke Wang ; Benjamin C. M. Fung
【Abstract】: Utility mining emerged recently to address the limitation of frequent itemset mining by introducing interestingness measures that reflect both the statistical significance and the user's expectation. Among utility mining problems, utility mining with the itemset share framework is a hard one as no anti-monotone property holds with the interestingness measure. The state-of-the-art works on this problem all employ a two-phase, candidate generation approach, which suffers from the scalability issue due to the huge number of candidates. This paper proposes a high utility itemset growth approach that works in a single phase without generating candidates. Our basic approach is to enumerate itemsets by prefix extensions, to prune search space by utility upper bounding, and to maintain original utility information in the mining process by a novel data structure. Such a data structure enables us to compute a tight bound for powerful pruning and to directly identify high utility itemsets in an efficient and scalable way. We further enhance the efficiency significantly by introducing recursive irrelevant item filtering with sparse data, and a lookahead strategy with dense data. Extensive experiments on sparse and dense, synthetic and real data suggest that our algorithm outperforms the state-of-the-art algorithms over one order of magnitude.
【Keywords】: data mining; information filtering; candidate generation approach; dense data; frequent itemset mining; high utility itemset discovery; high utility itemset growth approach; interestingness measure; lookahead strategy; prefix extension; recursive irrelevant item filtering; sparse data; statistical significance; user expectation; utility mining; utility upper bounding; Data mining; Educational institutions; Electronic mail; Itemsets; Scalability; Upper bound; Utility mining; frequent itemsets; high utility itemsets; pattern mining
【Paper Link】 【Pages】:990-995
【Authors】: Steven Loscalzo ; Robert William Wright ; Kevin Acunto ; Lei Yu
【Abstract】: Autonomous agents are emerging in diverse areas and many rely on reinforcement learning (RL) to learn optimal control policies by acting in the environment. This form of learning generates large amounts of transition dynamics data, which can be mined to improve the agent's understanding of the environment. There could be many uses for this data, here we focus on mining it to identify a relevant feature subspace. This is vital since RL performs poorly in high-dimensional spaces, such as those that autonomous agents would commonly face in real-world problems. This paper demonstrates the necessity and feasibility of integrating data mining into the learning process while an agent is learning, enabling it to learn to act by both acting and understanding. Doing so requires overcoming challenges regarding data quantity and quality, and difficulty measuring feature relevance with respect to the control policy. We propose the progressive mining framework to address these challenges by relying on cyclic interaction between data mining and RL. We show that a feature selection algorithm developed under this framework, PROFESS, can improve RL scalability better than a competing approach.
【Keywords】: control engineering computing; data mining; learning (artificial intelligence); multi-agent systems; optimal control; autonomous agent; autonomous control; data mining; feature selection algorithm; feature subspace identification; optimal control policy; reinforcement learning; transition dynamics progressive mining; Aggregates; Algorithm design and analysis; Autonomous agents; Data mining; Heuristic algorithms; Sensors; Silicon; progressive feature selection; reinforcement learning; transition dynamics data
【Paper Link】 【Pages】:996-1001
【Authors】: Qiang Lou ; Zoran Obradovic
【Abstract】: Identifying informative biomarkers from a large pool of candidates is the key step for accurate prediction of an individual's health status. In clinical applications traditional static feature selection methods that flatten the temporal data cannot be directly applied since the patient's observed clinical condition is a temporal multivariate time series where different variables can capture various stages of temporal change in the patient's health status. In this study, in order to identify informative genes in temporal microarray data, a margin based feature selection filter is proposed. The proposed method is based on well-established machine learning techniques without any assumptions about the data distribution. The objective function of temporal margin-based feature selection is defined to maximize each subject's temporal margin in its own relevant subspace. In the objective function, the uncertainty in calculating nearest neighbors is taken into account by considering the change in feature weights in each iteration. A fixed-point gradient descent method is proposed to solve the formulated objective function. The experimental results on both synthetic and real data provide evidence that the proposed method can identify more informative features than the alternatives that flatten the temporal data in advance.
【Keywords】: data analysis; genetics; gradient methods; medical computing; pattern recognition; clinical application; data distribution; fixed-point gradient descent method; individual health status prediction; informative biomarker candidate identification; iteration; machine learning; margin based feature selection filter; nearest neighbor; temporal high-dimensional gene expression data analysis; temporal margin-based feature selection; temporal microarray data; Gene expression; Linear programming; Optimization; Time measurement; Time series analysis; Uncertainty; Vectors; feature selection; high dimensional; margin; multivariate time series data; temporal data
【Paper Link】 【Pages】:1002-1007
【Authors】: Zhiwu Lu ; Yuxin Peng
【Abstract】: This paper presents a graph-based method for heterogeneous constraint propagation on multi-modal data using constrained sparse representation. Since heterogeneous pair wise constraints are defined over pairs of data points from different modalities, heterogeneous constraint propagation is more challenging than the transitional homogeneous constraint propagation on single-modal data which has been studied extensively in previous work. The main difficulty of heterogeneous constraint propagation lies in how to effectively propagate heterogeneous pair wise constraints across different modalities. To address this issue, we decompose heterogeneous constraint propagation into semi-supervised learning sub problems which can then be efficiently solved by graph-based label propagation. Moreover, we develop a constrained sparse representation method for graph construction over each modality using homogeneous pair wise constraints. The experimental results in cross-modal retrieval have shown the superior performance of our heterogeneous constraint propagation.
【Keywords】: constraint handling; data analysis; graph theory; learning (artificial intelligence); constrained sparse representation method; cross-modal retrieval; graph construction; graph-based label propagation; graph-based method; heterogeneous constraint propagation; heterogeneous pairwise constraints; homogeneous pairwise constraints; modality; multimodal data analysis; semisupervised learning subproblems; single-modal data; transitional homogeneous constraint propagation; Correlation; Encyclopedias; Equations; Laplace equations; Semantics; Semisupervised learning; Sparse matrices; cross-modal retrieval; heterogeneous constraint propagation; multimodal data; sparse representation
【Paper Link】 【Pages】:1008-1013
【Authors】: Dhruv Mahajan ; Sundararajan Sellamanickam ; Subhajit Sanyal ; Amit Madaan
【Abstract】: In this paper we propose a novel classification based framework for finding a small number of images that summarize a given concept. Our method exploits metadata information available with the images to get category information using Latent Dirichlet Allocation. Using this category information for each image, we solve the underlying classification problem by building a sparse classifier model for each concept. We demonstrate that the images that specify the sparse model form a good summary. In particular, our summary satisfies important properties such as likelihood, diversity and balance in both visual and semantic sense. Furthermore, the framework allows users to specify desired distributions over categories to create personalized summaries. Experimental results on seven broad query types show that the proposed method performs better than state-of-the-art methods.
【Keywords】: image classification; Latent dirichlet allocation; balance property; classification based framework; concept summarization; diversity property; image classification; likelihood property; metadata information; sparse classifier model; Kernel; Linear programming; Measurement; Optimization; Semantics; Vectors; Visualization; classification; concept summarization; metadata
【Paper Link】 【Pages】:1014-1019
【Authors】: Son T. Mai ; Sebastian Goebl ; Claudia Plant
【Abstract】: Recently, fiber segmentation has become an emerging technique in neuroscience. Grouping fiber tracts into anatomical meaningful bundles allows to study the structure of the brain and to investigate onset and progression of neurodegenerative and mental diseases. In this paper, we propose a novel technique for fiber tracts based on shape similarity and connection similarity. For shape similarity, we propose some new techniques adapted from existing similarity measures for trajectory data. We also propose a new technique called Warped Longest Common Subsequence (WLCS) for which we additionally developed a lower-bounding distance function to speed up the segmentation process. Our segmentation is based on an outlier-robust density-based clustering algorithm. Extensive experiments on diffusion tensor images demonstrate the efficiency and effectiveness of our technique.
【Keywords】: diseases; image segmentation; medical image processing; tensors; WLCS; diffusion tensor images; fiber segmentation; lower-bounding distance function; mental diseases; neurodegenerative diseases; neuroscience; outlier-robust density-based clustering algorithm; segmentation algorithm; segmentation process; warped longest common subsequence; white matter fiber tracts; Clustering algorithms; Gold; Noise; Robustness; Shape; Shape measurement; Standards; Diffusion Tensor Imaging; Fiber Segmentation; Fiber Similarity Measure; Neuroscience
【Paper Link】 【Pages】:1020-1025
【Authors】: Julian J. McAuley ; Jure Leskovec ; Dan Jurafsky
【Abstract】:
Most online reviews consist of plain-text feedback together with a single numeric score. However, understanding the multiple aspects' that contribute to users' ratings may help us to better understand their individual preferences. For example, a user's impression of an audio book presumably depends on aspects such as the story and the narrator, and knowing their opinions on these aspects may help us to recommend better products. In this paper, we build models for rating systems in which such dimensions are explicit, in the sense that users leave separate ratings for each aspect of a product. By introducing new corpora consisting of five million reviews, rated with between three and six aspects, we evaluate our models on three prediction tasks: First, we uncover which parts of a review discuss which of the rated aspects. Second, we summarize reviews by finding the sentences that best explain a user's rating. Finally, since aspect ratings are optional in many of the datasets we consider, we recover ratings that are missing from a user's evaluation. Our model matches state-of-the-art approaches on existing small-scale datasets, while scaling to the real-world datasets we introduce. Moreover, our model is able to
disentangle' content and sentiment words: we automatically learn content words that are indicative of a particular aspect as well as the aspect-specific sentiment words that are indicative of a particular rating.
【Keywords】: information services; learning (artificial intelligence); user interfaces; aspect rating; content word; learning attitude; learning attribute; multiaspect review; numeric score; online review; plain-text feedback; rating recovery; rating system; sentiment word; user audio book impression; user evaluation; Bipartite graph; Correlation; Data models; Optimization; Predictive models; Training; Unsupervised learning; machine learning; segmentation; sentiment analysis; summarization
【Paper Link】 【Pages】:1026-1031
【Authors】: Phong Nguyen ; Jun Wang ; Melanie Hilario ; Alexandros Kalousis
【Abstract】: The notion of meta-mining has appeared recently and extends traditional meta-learning in two ways. First it provides support for the whole data-mining process. Second it pries open the so called algorithm black-box approach where algorithms and workflows also have descriptors. With the availability of descriptors both for datasets and data-mining workflows we are faced with a problem the nature of which is much more similar to those appearing in recommendation systems. In order to account for the meta-mining specificities we derive a novel metric-based-learning recommender approach. Our method learns two homogeneous metrics, one in the dataset and one in the workflow space, and a heterogeneous one in the dataset-workflow space. All learned metrics reflect similarities established from the dataset-workflow preference matrix. The latter is constructed from the performance results obtained by the application of workflows to datasets. We demonstrate our method on meta-mining over biological (microarray datasets) problems. The application of our method is not limited to the meta-mining problem, its formulation is general enough so that it can be applied on problems with similar requirements.
【Keywords】: data mining; learning (artificial intelligence); recommender systems; algorithm black-box approach; biological problem; data mining; dataset-workflow preference matrix; dataset-workflow space; datasets; heterogeneous metric; heterogeneous similarity measure learning; homogeneous metric; hybrid recommendation; metalearning; metamining notion; metric-based-learning recommender approach; microarray dataset; recommendation system; Data mining; Learning systems; Linear programming; Machine learning; Measurement; Optimization; Vectors; Hybrid Recommendation; Meta-Learning; Meta-Mining; Metric-Learning
【Paper Link】 【Pages】:1032-1037
【Authors】: Feng Niu ; Ce Zhang ; Christopher Re ; Jude W. Shavlik
【Abstract】: Markov logic is a knowledge-representation language that allows one to specify large graphical models. However, the resulting large graphical models can make inference for Markov logic a computationally challenging problem. Recently, dual decomposition (DD) has become a popular approach for scalable inference on graphical models. We study how to apply DD to scale up inference in Markov logic. A standard approach for DD first partitions a graphical model into multiple tree-structured sub problems. We apply this approach to Markov logic and show that DD can outperform prior inference approaches. Nevertheless, we observe that the standard approach for DD is suboptimal as it does not exploit the rich structure often present in the Markov logic program. Thus, we describe a novel decomposition strategy that partitions a Markov logic program into parts based on its structure. A crucial advantage of our approach is that we can use specialized algorithms for portions of the input problem -- some of which have been studied for decades, e.g., coreference resolution. Empirically, we show that our program-level decomposition approach outperforms both non-decomposition and graphical model-based decomposition approaches to Markov logic inference on several data-mining tasks.
【Keywords】: data mining; formal logic; inference mechanisms; knowledge representation languages; solid modelling; trees (mathematics); DD first partition approach; Markov logic; coreference resolution algorithm; data mining task; dual decomposition; graphical model; graphical model-based decomposition approach; inference approach; knowledge representation language; nondecomposition based decomposition approach; program-level decomposition approach; scaling inference; tree-structured subproblem; Graphical models; Inference algorithms; Markov processes; Master-slave; Message passing; Scalability; Standards
【Paper Link】 【Pages】:1038-1043
【Authors】: Anastasios Noulas ; Salvatore Scellato ; Neal Lathia ; Cecilia Mascolo
【Abstract】: Mobile location-based services are thriving, providing an unprecedented opportunity to collect fine grained spatio-temporal data about the places users visit. This multi-dimensional source of data offers new possibilities to tackle established research problems on human mobility, but it also opens avenues for the development of novel mobile applications and services. In this work we study the problem of predicting the next venue a mobile user will visit, by exploring the predictive power offered by different facets of user behavior. We first analyze about 35 million check-ins made by about 1 million Foursquare users in over 5 million venues across the globe, spanning a period of five months. We then propose a set of features that aim to capture the factors that may drive users' movements. Our features exploit information on transitions between types of places, mobility flows between venues, and spatio-temporal characteristics of user check-in patterns. We further extend our study combining all individual features in two supervised learning models, based on linear regression and M5 model trees, resulting in a higher overall prediction accuracy. We find that the supervised methodology based on the combination of multiple features offers the highest levels of prediction accuracy: M5 model trees are able to rank in the top fifty venues one in two user check-ins, amongst thousands of candidate items in the prediction list.
【Keywords】: data mining; learning (artificial intelligence); regression analysis; social networking (online); trees (mathematics); user interfaces; Foursquare user; M5 model trees; data collection; data multidimensional source; fine grained spatio-temporal data; human mobility; information exploitation; linear regression; location-based services; mobile application; mobile service; next place prediction; prediction accuracy; supervised learning model; user behavior; user check-in; user mobility feature mining; Accuracy; Cities and towns; Filtering; Humans; Mobile communication; Predictive models; Supervised learning; data mining; human mobility; location-based services
【Paper Link】 【Pages】:1044-1049
【Authors】: Orlando Ohashi ; Luís Torgo
【Abstract】: Many real world data mining applications involve analyzing geo-referenced data. Frequently, this type of data sets are incomplete in the sense that not all geographical coordinates have measured values of the variable(s) of interest. This incompleteness may be caused by poor data collection, measurement errors, costs management and many other factors. These missing values may cause several difficulties in many applications. Spatial imputation/interpolation methods try to fill in these unknown values in geo-referenced data sets. In this paper we propose a new spatial imputation method based on machine learning algorithms and a series of data pre-processing steps. The key distinguishing factor of this method is allowing the use of data from faraway regions, contrary to the state of the art on spatial data mining. Images (e.g. from a satellite or video surveillance cameras) may also suffer from this incompleteness where some pixels are missing, which again may be caused by many factors. An image can be seen as a spatial data set in a Cartesian coordinates system, where each pixel (location) registers some value (e.g. degree of gray on a black and white image). Being able to recover the original image from a partial or incomplete version of the reality is a key application in many domains (e.g. surveillance, security, etc.). In this paper we evaluate our general methodology for spatial interpolation on this type of problems. Namely, we check the ability of our method to fill in unknown pixels on several images. We compare it to state of the art methods and provide strong experimental evidence of the advantages of our proposal.
【Keywords】: data mining; image processing; interpolation; learning (artificial intelligence); regression analysis; Cartesian coordinates system; black image; data mining application; data preprocessing step; geo-referenced data; gray image; image recovery; machine learning algorithm; multiple regression; spatial data mining; spatial imputation method; spatial interpolation; white image; Context; Data models; Interpolation; Predictive models; Proposals; Spatial databases; data pre-processing; spatial prediction
【Paper Link】 【Pages】:1050-1055
【Authors】: Masayuki Okabe ; Seiji Yamada
【Abstract】: A method for creating a constrained clustering ensemble by learning the priorities of pair wise constraints is proposed in this paper. This method integrates multiple clusters produced by using a simple constrained K-means algorithm that we modify to utilize the constraints priorities. The cluster ensemble is executed according to a boosting framework, which adaptively learns the constraints priorities and provides them for the modified constrained K-means to create diverse clusters that finally improve the clustering performance. The experimental results show that our proposed method outperforms the original constrained K-means and is comparable to several state-of-the-art constrained clustering methods.
【Keywords】: learning (artificial intelligence); pattern clustering; boosting framework; clustering performance; constrained clustering ensemble; constraint priority learning; pairwise constraint; simple constrained K-means algorithm; Boosting; Clustering algorithms; Clustering methods; Computational efficiency; Glass; Kernel; Measurement; Boosting; Cluster Ensemble; Clustering; K-means; Learning Kernel Matrix
【Paper Link】 【Pages】:1056-1061
【Authors】: Amichai Painsky ; Saharon Rosset
【Abstract】: The availability of large microarray data has led to a growing interest in biclustering methods in the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclusteing problem (also known as projected clustering) for gene expression data sets, in which each row can only be a member of a single bicluster while columns can participate in multiple clusters. This type of biclustering may be adequate, for example, for clustering groups of cancer patients where each patient (row) is expected to be carrying only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). In this paper we present a novel method to identify these exclusive row biclusters through a combination of existing biclustering algorithms and combinatorial auction techniques. We devise an approach for tuning the threshold for our algorithm based on comparison to a null model in the spirit of the Gap statistic approach. We demonstrate our approach on both synthetic and real-world gene expression data and show its power in identifying large span non-overlapping rows sub matrices, while considering their unique nature. The Gap statistic approach succeeds in identifying appropriate thresholds in all our examples.
【Keywords】: biology computing; cancer; combinatorial mathematics; genetics; molecular biophysics; pattern clustering; statistics; Gap statistic approach; biclustering method; cancer type; combinatorial auction approach; exclusive row biclustering; gene expression; microarray data; null model; projected clustering; similarity measure; Algorithm design and analysis; Cancer; Clustering algorithms; Educational institutions; Gene expression; Optimization; Partitioning algorithms; Biclustering; Exclusive Row Biclustering; Gene Expression; Projected Clustering
【Paper Link】 【Pages】:1062-1067
【Authors】: Niketan Pansare ; Chris Jermaine ; Peter J. Haas ; Nitendra Rajput
【Abstract】: Virtually all work on topic modeling has assumed that the topics are to be learned over a text-based document corpus. However, there exist important applications where topic models must be learned over an audio corpus of spoken language. Unfortunately, speech-to-text programs can have very low accuracy. We therefore propose a novel topic model for spoken language that incorporates a statistical model of speech-to-text software behavior. Crucially, our model exploits the uncertainty numbers returned by the software. Our ideas apply to any domain in which it would be useful to build a topic model over data in which uncertainties are explicitly represented.
【Keywords】: speech processing; statistical analysis; text analysis; audio corpus; speech-to-text program; speech-to-text software behavior; spoken language; statistical model; text-based document corpus; topic model; Accuracy; Biological system modeling; Computational modeling; Data models; Software; Uncertainty; Vectors; Speech recognition; Text analysis; Uncertain data
【Paper Link】 【Pages】:1068-1073
【Authors】: Chengbin Peng ; Ka-Chun Wong ; Alyn Rockwood ; Xiangliang Zhang ; Jinling Jiang ; David Keyes
【Abstract】: Non-negative matrix factorization (NMF) provides the advantage of parts-based data representation through additive only combinations. It has been widely adopted in areas like item recommending, text mining, data clustering, speech denoising, etc. In this paper, we provide an algorithm that allows the factorization to have linear or approximately linear constraints with respect to each factor. We prove that if the constraint function is linear, algorithms within our multiplicative framework will converge. This theory supports a large variety of equality and inequality constraints, and can facilitate application of NMF to a much larger domain. Taking the recommender system as an example, we demonstrate how a specialized weighted and constrained NMF algorithm can be developed to fit exactly for the problem, and the tests justify that our constraints improve the performance for both weighted and unweighted NMF algorithms under several different metrics. In particular, on the Movie lens data with 94% of items, the Constrained NMF improves recall rate 3% compared to SVD50 and 45% compared to SVD150, which were reported as the best two in the top-N metric.
【Keywords】: approximation theory; data structures; singular value decomposition; NMF; constrained nonnegative matrix factorization; constraint function; data clustering; equality constraint; inequality constraint; item recommendation; linear constraint; multiplicative algorithm; parts-based data representation; recall rate; singular value decomposition; speech denoising; text mining; Approximation algorithms; Clustering algorithms; Convergence; Matrix decomposition; Measurement; Motion pictures; Vectors; Linear Constraints; Multiplicative Algorithm; Non-negative Matrix Factorization
【Paper Link】 【Pages】:1074-1079
【Authors】: Huida Qiu ; Yan Liu ; Niranjan A. Subrahmanya ; Weichang Li
【Abstract】: Recent developments in industrial systems provide us with a large amount of time series data from sensors, logs, system settings and physical measurements, etc. These data are extremely valuable for providing insights about the complex systems and could be used to detect anomalies at early stages. However, the special characteristics of these time series data, such as high dimensions and complex dependencies between variables, as well as its massive volume, pose great challenges to existing anomaly detection algorithms. In this paper, we propose Granger graphical models as an effective and scalable approach for anomaly detection whose results can be readily interpreted. Specifically, Granger graphical models are a family of graphical models that exploit the temporal dependencies between variables by applying L1-regularized learning to Granger causality. Our goal is to efficiently compute a robust "correlation anomaly" score for each variable via Granger graphical models that can provide insights on the possible reasons of anomalies. We evaluate the effectiveness of our proposed algorithms on both synthetic and application datasets. The results show the proposed algorithm achieves significantly better performance than other baseline algorithms and is scalable for large-scale applications.
【Keywords】: learning (artificial intelligence); manufacturing data processing; manufacturing systems; reliability; solid modelling; time series; Granger causality; Granger graphical model; L1-regularized learning; anomaly detection algorithm; correlation anomaly score; industrial system; time-series anomaly detection; Algorithm design and analysis; Data models; Graphical models; Optimization; Principal component analysis; Stochastic processes; Time series analysis; Anomaly Detection; Time Series Analysis
【Paper Link】 【Pages】:1080-1085
【Authors】: Umaa Rebbapragada ; Carla E. Brodley ; Damien Sulla-Menashe ; Mark A. Friedl
【Abstract】: Active Label Correction (ALC) is an interactive method that cleans an established training set of mislabeled examples in conjunction with a domain expert. ALC presumes that the expert who conducts this review is either more accurate than the original annotator or has access to additional resources that ensure a high quality label. A high-cost re-review is possible because ALC proceeds iteratively, scoring the full training set but selecting only small batches of examples that are likely mislabeled. The expert reviews each batch and corrects any mislabeled examples, after which the classifier is retrained and the process repeats until the expert terminates it. We compare several instantiations of ALC to fully-automated methods that attempt to discard or correct label noise in a single pass. Our empirical results show that ALC outperforms single-pass methods in terms of selection efficiency and classifier accuracy. We evaluate the best ALC instantiation on our motivating task of detecting mislabeled and poorly formulated sites within a land cover classification training set from the geography domain.
【Keywords】: geography; learning (artificial intelligence); pattern classification; terrain mapping; ALC method; active label correction; classifier accuracy; classifier retraining; domain expert; geography domain; land cover classification training; selection efficiency; single-pass method; Accuracy; Labeling; Noise; Noise level; Training; Training data; Uncertainty; data cleaning; label noise; land cover classification; supervised learning
【Paper Link】 【Pages】:1086-1091
【Authors】: Rafael Geraldeli Rossi ; Thiago de Paulo Faleiros ; Alneu de Andrade Lopes ; Solange Oliveira Rezende
【Abstract】: Usually, algorithms for categorization of numeric data have been applied for text categorization after a preprocessing phase which assigns weights for textual terms deemed as attributes. However, due to characteristics of textual data, some algorithms for data categorization are not efficient for text categorization. Characteristics of textual data such as sparsity and high dimensionality sometimes impair the quality of general purpose classifiers. Here, we propose a text classifier based on a bipartite heterogeneous network used to represent textual document collections. Such algorithm induces a classification model assigning weights to objects that represents terms of the textual document collection. The induced weights correspond to the influence of the terms in the classification of documents they appear. The least-mean-square algorithm is used in the inductive process. Empirical evaluation using a large amount of textual document collections shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM and Naïve Bayes algorithms.
【Keywords】: least mean squares methods; network theory (graphs); pattern classification; text analysis; C4.5 algorithm; IMBHN algorithm; Naive Bayes algorithm; SVM algorithm; bipartite heterogeneous network; classification model; dimensionality characteristics; general purpose classifier; inductive model generation; k-NN algorithm; k-nearest neighbor; least mean square algorithm; numeric data categorization algorithm; sparsity characteristics; support vector machines; text categorization; textual data characteristics; textual document collection; textual term; Accuracy; Data models; Equations; Mathematical model; Niobium; Training; Vectors; Heterogeneous Network; Text Categorization
【Paper Link】 【Pages】:1092-1097
【Authors】: Budhaditya Saha ; Dinh Q. Phung ; Duc-Son Pham ; Svetha Venkatesh
【Abstract】: We present a novel method for document clustering using sparse representation of documents in conjunction with spectral clustering. An ℓ1-norm optimization formulation is posed to learn the sparse representation of each document, allowing us to characterize the affinity between documents by considering the overall information instead of traditional pair wise similarities. This document affinity is encoded through a graph on which spectral clustering is performed. The decomposition into multiple subspaces allows documents to be part of a sub-group that shares a smaller set of similar vocabulary, thus allowing for cleaner clusters. Extensive experimental evaluations on two real-world datasets from Reuters-21578 and 20Newsgroup corpora show that our proposed method consistently outperforms state-of-the-art algorithms. Significantly, the performance improvement over other methods is prominent for this datasets.
【Keywords】: data structures; document handling; graph theory; optimisation; pattern clustering; 20Newsgroup dataset; L1-norm optimization formulation; Reuters-21578 dataset; document affinity; graph; pairwise similarity; sparse subspace representation; spectral clustering; spectral document clustering; subspace decomposition; Clustering algorithms; Eigenvalues and eigenfunctions; Indexing; Laplace equations; Matrix decomposition; Sparse matrices; Symmetric matrices; Document clustering; Sparse representation
【Paper Link】 【Pages】:1098-1103
【Authors】: Chun-Wei Seah ; Ivor Wai-Hung Tsang ; Yew-Soon Ong ; Qi Mao
【Abstract】: In the absence of the labeled samples in a domain referred to as target domain, Domain Adaptation (DA) techniques come in handy. Generally, DA techniques assume there are available source domains that share similar predictive function with the target domain. Two core challenges of DA typically arise, variance that exists between source and target domains, and the inherent source hypothesis bias. In this paper, we first propose a Stability Transfer criterion for selecting relevant source domains with low variance. With this criterion, we introduce a TARget learning Assisted by Source Classifier Adaptation (TARASCA) method to address the two core challenges that have impeded the performances of DA techniques. To verify the robustness of TARASCA, extensive experimental studies are carried out with comparison to several state-of-the-art DA methods on the real-world Sentiment and Newsgroups datasets, where various settings for the class ratios of the source and target domains are considered.
【Keywords】: learning (artificial intelligence); pattern classification; statistical analysis; DA technique; Newsgroups dataset; Sentiment dataset; TARASCA method; domain adaptation technique; domain variance; predictive function learning; source domain; stability transfer criterion; target domain; target learning assisted by source classifier adaptation; Joints; Kernel; Prediction algorithms; Robustness; Stability criteria; Standards; Support vector machines; Domain Adaptation; Source Hypothesis bias; Transfer Learning
【Paper Link】 【Pages】:1104-1109
【Authors】: Ming Shao ; Carlos Castillo ; Zhenghong Gu ; Yun Fu
【Abstract】: One of the most important challenges in machine learning is performing effective learning when there are limited training data available. However, there is an important case when there are sufficient training data coming from other domains (source). Transfer learning aims at finding ways to transfer knowledge learned from a source domain to a target domain by handling the subtle differences between the source and target. In this paper, we propose a novel framework to solve the aforementioned knowledge transfer problem via low-rank representation constraints. This is achieved by finding an optimal subspace where each datum in the target domain can be linearly represented by the corresponding subspace in the source domain. Extensive experiments on several databases, i.e., Yale B, CMU PIE, UB Kin Face databases validate the effectiveness of the proposed approach and show the superiority to the existing, well-established methods.
【Keywords】: database management systems; learning (artificial intelligence); CMU PIE database; UB Kin Face database; Yale B database; knowledge transfer problem; low-rank representation constraint; machine learning; training data; transfer subspace learning; Convergence; Databases; Knowledge transfer; Learning systems; Machine learning; Principal component analysis; Silicon; domain adaptation; low-rank; transfer learning
【Paper Link】 【Pages】:1110-1115
【Authors】: Yelong Shen ; Ruoming Jin ; Dejing Dou ; Nafisa Afrin Chowdhury ; Junfeng Sun ; Brigitte Piniewski ; David Kil
【Abstract】: Modeling and predicting human behaviors, such as the activity level and intensity, is the key to prevent the cascades of obesity, and help spread wellness and healthy behavior in a social network. In this work, we propose a Socialized Gaussian Process (SGP) for socialized human behavior modeling. In the proposed SGP model, we naturally incorporates human's personal behavior factor and social correlation factor into a unified model, where basic Gaussian Process model is leveraged to capture individual's personal behavior pattern. Furthermore, we extend the Gaussian Process Model to socialized Gaussian Process (SGP) which aims to capture social correlation phenomena in the social network. The detailed experimental evaluation has shown the SGP model achieves the best prediction accuracy compared with other baseline methods.
【Keywords】: Gaussian processes; behavioural sciences; social sciences; SGP model; health social network; healthy behavior; human activity level; human behavior prediction; human intensity; personal behavior factor; prediction accuracy; social correlation factor; social correlation phenomenon; socialized Gaussian process model; wellness behavior; Accuracy; Correlation; Educational institutions; Gaussian processes; Humans; Predictive models; Social network services; Human Behavior Prediction; Socialized Gaussian Process
【Paper Link】 【Pages】:1116-1121
【Authors】: Jafar Tanha ; Maarten van Someren ; Hamideh Afsarmanesh
【Abstract】: We present an algorithm for multiclass Semi-Supervised learning which is learning from a limited amount of labeled data and plenty of unlabeled data. Existing semi-supervised algorithms use approaches such as one-versus-all to convert the multiclass problem to several binary classification problems which is not optimal. We propose a multiclass semi-supervised boosting algorithm that solves multiclass classification problems directly. The algorithm is based on a novel multiclass loss function consisting of the margin cost on labeled data and two regularization terms on labeled and unlabeled data. Experimental results on a number of UCI datasets show that the proposed algorithm performs better than the state-of-the-art boosting algorithms for multiclass semi-supervised learning.
【Keywords】: learning (artificial intelligence); pattern classification; AdaBoost algorithm; binary classification problem; labeled data; margin cost; multiclass loss function; multiclass semisupervised boosting algorithm; multiclass semisupervised learning; one-versus-all approach; regularization term; unlabeled data; Boosting; Linear programming; Optimization; Prediction algorithms; Semisupervised learning; Training; Semi-Supervised Learning; boosting; multiclass classification
【Paper Link】 【Pages】:1122-1127
【Authors】: Duy Tin Truong ; Roberto Battiti
【Abstract】: Supervised alternative clusterings is the problem of finding a set of clusterings which are of high quality and different from a given negative clustering. The task is therefore a clear multi-objective optimization problem. Optimizing two conflicting objectives requires dealing with trade-offs. Most approaches in the literature optimize these objectives sequentially or indirectly, resulting in solutions which are dominated. We develop a multi-objective algorithm, called COGNAC, able to optimize the objectives directly and simultaneously and producing solutions approximating the Pareto front. COGNAC performs the recombination operator at the cluster level instead of the object level as in traditional genetic algorithms. It can accept arbitrary clustering quality and dissimilarity objectives and provide solutions dominating those of other state-of-the-art algorithms. COGNAC can also be used to generate a sequence of alternative clusterings, each of which is guaranteed to be different from all previous ones.
【Keywords】: Pareto optimisation; approximation theory; genetic algorithms; learning (artificial intelligence); pattern clustering; COGNAC algorithm; Pareto front approximation; arbitrary clustering quality; cluster-oriented genetic algorithm; conflicting objective; dissimilarity objective; multiobjective optimization problem; negative clustering; recombination operator; supervised alternative clustering; Birds; Clustering algorithms; Complexity theory; Face; Genetic algorithms; Sociology; Statistics; alternative clustering; cluster-oriented; genetic algorithm; multi-objective optimization
【Paper Link】 【Pages】:1128-1133
【Authors】: Carlos Vivaracho-Pascual ; María Arancha Simón Hurtado ; Esperanza Manso Martínez ; Juan Manuel Pascual-Gaspar
【Abstract】: Score Normalization is a usual technique in pattern recognition to standardize the classifier output ranges so as to, for example, fuse these outputs. The use of score normalization in biometric recognition is a very important part of the system, specially in those based on behavioral traits, such as written signature or voice, conditioning the final system performance. Then, many works can be found that focus on the problem. A successful new approach for client threshold prediction, based on Multiple Linear Prediction, has been presented in recent works. Here, a new approach for score normalization, based on this proposal for biometric manuscript signature user verification, is shown. This proposal is compared with the state of the art methods, achieving an improvement of 19% and 16% for Equal Error Rate (EER) and 60% and 26% for Detection Cost Function (DCF) performance measures, for random and skilled forgeries, respectively.
【Keywords】: biometrics (access control); handwriting recognition; DCF performance measure; biometric manuscript signature user verification; biometric signature recognition; client threshold prediction; detection cost function; equal error rate; multiple linear prediction; pattern recognition; random forgery; score normalization; skilled forgery; voice; written signature; Data models; Equations; Forgery; Mathematical model; Predictive models; Proposals; Standards; Biometric Signature Recognition; Client Threshold Prediction; Score Normalization
【Paper Link】 【Pages】:1134-1139
【Authors】: Bingsheng Wang ; Feng Chen ; Haili Dong ; Arnold P. Boedihardjo ; Chang-Tien Lu
【Abstract】: As the issue of freshwater shortage is increasing daily, it's critical to take effective measures for water conservation. Based on previous studies, device level consumption could lead to significant conservation of freshwater. However, current smart meter deployments only produce low sample rate aggregated data. In this paper, we examine the task of separating whole-home water consumption into its component appliances. A key challenge is to address the unique features of low sample rate data. To this end, we propose Sparse Coding with Featured Discriminative Dictionary (SCFDD) by incorporating inherent shape and activation features to capture the discriminative characteristics of devices. In addition, extensive experiments were performed to validate the effectiveness of SCFDD.
【Keywords】: data mining; encoding; signal processing; smart meters; water conservation; SCFDD; activation features; component appliances; device level consumption; featured discriminative dictionary; freshwater shortage; signal disaggregation; smart meter; sparse coding; water conservation; whole home water consumption; Dictionaries; Encoding; Market research; Performance evaluation; Shape; Testing; Water conservation; Disaggregation; Discriminative dictionary; Low sample rate; Shape and activation features
【Paper Link】 【Pages】:1140-1145
【Authors】: Jialei Wang ; Peilin Zhao ; Steven C. H. Hoi
【Abstract】: Both cost-sensitive classification and online learning have been studied extensively in data mining and machine learning communities, respectively. It is a bit surprising that there was very limited comprehensive study for addressing an important intersecting problem, that is, cost-sensitive online classification. In this paper, we formally study this problem, and propose a new framework for cost-sensitive online classification by exploiting the idea of online gradient descent techniques. Based on the framework, we propose a family of cost-sensitive online classification algorithms, which are designed to directly optimize two well-known cost-sensitive measures: (i) maximization of weighted sum of sensitivity and specificity, and (ii) minimization of weighted misclassification cost. We analyze the theoretical bounds of the cost-sensitive measures made by the proposed algorithms, and extensively examine their empirical performance on a variety of cost-sensitive online classification tasks.
【Keywords】: data mining; gradient methods; learning (artificial intelligence); minimisation; pattern classification; cost-sensitive measure; cost-sensitive online classification; data mining; machine learning; online gradient descent technique; online learning; sensitivity sum; specificity sum; weighted misclassification cost minimization; Accuracy; Data mining; Machine learning; Machine learning algorithms; Prediction algorithms; Sensitivity; classification; cost-sensitive learning; online learning
【Paper Link】 【Pages】:1146-1151
【Authors】: Xiang Wang ; Buyue Qian ; Ian Davidson
【Abstract】: In many real-world applications we can model the data as a graph with each node being an instance and the edges indicating a degree of similarity. Side information is often available in the form of labels for a small subset of instances, which gives rise to two problem settings and two types of algorithms. In the label propagation style algorithms, the known labels are propagated to the unlabeled nodes. In the constrained clustering style algorithms, known labels are first converted to pair wise constraints (Must-Link and Cannot-Link), then a constrained cut is computed as a tradeoff between minimizing the cut cost and maximizing the constraint satisfaction. Both techniques are evaluated by their ability to recover the ground truth labeling, i.e. by 0/1 loss function either directly on the labels or on the pair wise relations derived from the labels. These two fields have developed separately, but in this paper, we show that they are indeed related. This insight allows us to propose a novel way to generate constraints from the propagated labels, which our empirical study shows outperforms and is more stable than the state-of-the-art label propagation and constrained spectral clustering algorithms.
【Keywords】: data mining; graph theory; learning (artificial intelligence); pattern clustering; cannot-link constraint; constrained clustering style algorithm; constrained spectral clustering; constraint satisfaction; cut cost; data mining; graph edge; graph node; ground truth labeling; label propagation style algorithm; loss function; machine learning; must-link constraint; pairwise constraint; side information; similarity degree; Clustering algorithms; Computer science; Educational institutions; Eigenvalues and eigenfunctions; Indexes; Labeling; Partitioning algorithms; constrained spectral clustering; label propagation; semi-supervised learning
【Paper Link】 【Pages】:1152-1157
【Authors】: Yuanhong Wang ; Yang Liu ; Xiaohui Yu
【Abstract】: Collaborative filtering (CF) aims to produce user specific recommendations based on other users' ratings of items. Most existing CF methods rely only on users' overall ratings of items, ignoring the variety of opinions users may have towards different aspects of the items. Using the movie domain as a case study, we propose a framework that is able to capture users' opinions on different aspects from the textual reviews, and use that information to improve the effectiveness of CF. This framework has two components, an opinion mining component and a rating inference component. The former extracts and summarizes the opinions on multiple aspects from the reviews, generating ratings on the various aspects. The latter component, on the other hand, infers the overall ratings of items based on the aspect ratings, which forms the basis for item recommendation. Our core contribution is in the proposal of a tensor factorization approach for the rating inference. Operating on the tensor composed of the overall and aspect ratings, this approach is able to capture the intrinsic relationships between users, items, and aspects, and provide accurate predictions on unknown ratings. Experiments on a movie dataset show that our proposal significantly improves the prediction accuracy compared with two baseline methods.
【Keywords】: collaborative filtering; data mining; recommender systems; tensors; CF; aspect ratings; aspect-based opinion mining; collaborative filtering; item recommendation; item user rating; movie dataset; movie domain; opinion extraction; opinion mining component; opinion summarization; rating inference component; tensor factorization approach; textual reviews; Collaboration; Data mining; Educational institutions; Estimation; Motion pictures; Proposals; Tensile stress; Collaborative Filtering; Opinion Mining; Recommendation System; Sentiment Analysis; Tensor Factorization
【Paper Link】 【Pages】:1158-1163
【Authors】: Tong Xu ; Dong Liu ; Enhong Chen ; Huanhuan Cao ; Jilei Tian
【Abstract】: Recently, the boom of media contents on the Internet raises challenges in managing them effectively and thus requires automatic media annotation techniques. Motivated by the observation that media contents are usually shared frequently in online communities and thus have a lot of social diffusion records, we propose a novel media annotating approach depending on these social diffusion records instead of metadata. The basic assumption is that the social diffusion records reflect the common interests (CI) between users, which can be analyzed for generating annotations. With this assumption, we present a novel CI-based social diffusion model and translate the automatic annotating task into the CI-based diffusion maximization (CIDM) problem. Moreover, we propose to solve the CIDM problem through two optimization tasks, corresponding to the training and test stages in supervised learning. Extensive experiments on real-world data sets show that our approach can effectively generate high quality annotations, and thus demonstrate the capability of social diffusion analysis in annotating media.
【Keywords】: Internet; content management; learning (artificial intelligence); optimisation; social networking (online); CI-based diffusion maximization problem; CI-based social diffusion model; CIDM problem; Internet; automatic media content annotation techniques; common interests; online community; optimization tasks; social diffusion analysis; social diffusion records; supervised learning; test stage; training stage; Computational modeling; Data mining; Feature extraction; Media; Motion pictures; Optimization; Training; Automatic annotation; common interests; diffusion maximization; social diffusion; social media
【Paper Link】 【Pages】:1164-1169
【Authors】: Hedong Yang ; Lijie Wen ; Jianmin Wang
【Abstract】: Process mining links traditional model-driven Business Process Management and data mining by means of deriving knowledge from event logs to improve operational business processes. As an impact factor of the quality of process mining results, the degree of completeness of the given event log should be necessarily measured. In this paper an approach is proposed in the context of mining control-flow dependencies to evaluate the local completeness of an event log without knowing any information about the original process model. Experiment results show that the proposed approach works robustly and gives better estimation than approaches available.
【Keywords】: business process re-engineering; data mining; data mining; event log; local completeness evaluation; mining control-flow dependency; model-driven business process management; process mining; Algorithm design and analysis; Business; Data mining; Data models; Estimation; Probabilistic logic; Process control; business process management; information completeness; process mining
【Paper Link】 【Pages】:1170-1175
【Authors】: Jaewon Yang ; Jure Leskovec
【Abstract】: One of the main organizing principles in real-world networks is that of network communities, where sets of nodes organize into densely linked clusters. Communities in networks often overlap as nodes can belong to multiple communities at once. Identifying such overlapping communities is crucial for the understanding the structure as well as the function of real-world networks. Even though community structure in networks has been widely studied in the past, practically all research makes an implicit assumption that overlaps between communities are less densely connected than the non-overlapping parts themselves. Here we validate this assumption on 6 large scale social, collaboration and information networks where nodes explicitly state their community memberships. By examining such ground-truth communities we find that the community overlaps are more densely connected than the non-overlapping parts, which is in sharp contrast to the conventional wisdom that community overlaps are more sparsely connected than the communities themselves. Practically all existing community detection methods fail to detect communities with dense overlaps. We propose Community-Affiliation Graph Model, a model-based community detection method that builds on bipartite node-community affiliation networks. Our method successfully captures overlapping, non-overlapping as well as hierarchically nested communities, and identifies relevant communities more accurately than the state-of-the-art methods in networks ranging from biological to social and information networks.
【Keywords】: graph theory; information networks; pattern clustering; social networking (online); bipartite node-community affiliation networks; collaboration networks; community memberships; community-affiliation graph model; densely linked clusters; ground-truth communities; hierarchically nested communities; information networks; model-based community detection method; nonoverlapping communities; overlapping network community detection; real-world networks; social networks; Collaboration; Communities; Image edge detection; Organizing; Reliability; YouTube; Community detection; Overlapping communities
【Paper Link】 【Pages】:1176-1181
【Authors】: Jinfeng Yi ; Tianbao Yang ; Rong Jin ; Anil K. Jain ; Mehrdad Mahdavi
【Abstract】: Data clustering is an important task and has found applications in numerous real-world problems. Since no single clustering algorithm is able to identify all different types of cluster shapes and structures, ensemble clustering was proposed to combine different partitions of the same data generated by multiple clustering algorithms. The key idea of most ensemble clustering algorithms is to find a partition that is consistent with most of the available partitions of the input data. One problem with these algorithms is their inability to handle uncertain data pairs, i.e. data pairs for which about half of the partitions put them into the same cluster and the other half do the opposite. When the number of uncertain data pairs is large, they can mislead the ensemble clustering algorithm in generating the final partition. To overcome this limitation, we propose an ensemble clustering approach based on the technique of matrix completion. The proposed algorithm constructs a partially observed similarity matrix based on the data pairs whose cluster memberships are agreed upon by most of the clustering algorithms in the ensemble. It then deploys the matrix completion algorithm to complete the similarity matrix. The final data partition is computed by applying an efficient spectral clustering algorithm to the completed matrix. Our empirical studies with multiple real-world datasets show that the proposed algorithm performs significantly better than the state-of-the-art algorithms for ensemble clustering.
【Keywords】: data handling; matrix algebra; pattern clustering; cluster memberships; cluster shapes; cluster structures; data clustering; data partition; matrix completion algorithm; real-world problems; robust ensemble clustering algorithm; similarity matrix; spectral clustering algorithm; state-of-the-art algorithms; uncertain data pair handling; Clustering algorithms; Matrix decomposition; Measurement uncertainty; Partitioning algorithms; Robustness; Sparse matrices; Ensemble Clustering; Low Rank; Matrix Completion; Sparse
【Paper Link】 【Pages】:1182-1187
【Authors】: Mingxuan Yuan ; Lei Chen ; Weixiong Rao ; Hong Mei
【Abstract】: The privacy protection of graph data has become more and more important in recent years. Many works have been proposed to publish a privacy preserving graph. All these works prefer publishing a graph, which guarantees the protection of certain privacy with the smallest change to the original graph. However, there is no guarantee on how the utilities are preserved in the published graph. In this paper, we propose a general fine-grained adjusting framework to publish a privacy protected and utility preserved graph. With this framework, the data publisher can get a trade-off between the privacy and utility according to his customized preferences. We used the protection of a weighted graph as an example to demonstrate the implementation of this framework.
【Keywords】: data privacy; graph theory; publishing; customized preferences; general fine-grained adjusting framework; general privacy protected graph publishing framework; graph data privacy protection; privacy preserving graph; utility preserved graph publishing framework; weighted graph; Data privacy; Equations; Mathematical model; Privacy; Publishing; Social network services; privacy; weighted graph
【Paper Link】 【Pages】:1188-1193
【Authors】: Jeong-Min Yun ; Saehoon Kim ; Seungjin Choi
【Abstract】: Hashing, which involves learning binary codes to embed high-dimensional data into a similarity-preserving low-dimensional Hamming space, is often formulated as linear dimensionality reduction followed by binary quantization. Linear dimensionality reduction, based on maximum variance formulation, requires leading eigenvectors of data covariance or graph Laplacian matrix. Computing leading singular vectors or eigenvectors in the case of high-dimension and large sample size, is a main bottleneck in most of data-driven hashing methods. In this paper we address the use of generalized Nystrom method where a subset of rows and columns are used to approximately compute leading singular vectors of the data matrix, in order to improve the scalability of hashing methods in the case of high-dimensional data with large sample size. Especially we validate the useful behavior of generalized Nystrom approximation with uniform sampling, in the case of a recently-developed hashing method based on principal component analysis (PCA) followed by an iterative quantization, referred to as PCA+ITQ, developed by Gong and Lazebnik. We compare the performance of generalized Nystrom approximation with uniform and non-uniform sampling, to the full singular value decomposition (SVD) method, confirming that the uniform sampling improves the computational and space complexities dramatically, while the performance is not much sacrificed. In addition we present low-rank approximation error bounds for generalized Nystrom approximation with uniform sampling, which is not a trivial extension of available results on the non-uniform sampling case.
【Keywords】: Hamming codes; approximation theory; binary codes; computational complexity; eigenvalues and eigenfunctions; file organisation; graph theory; iterative methods; matrix algebra; principal component analysis; quantisation (signal); sampling methods; PCA-based hashing method; SVD method; binary quantization; computational complexities; data covariance; data matrix; data-driven hashing methods; eigenvectors; generalized Nystrom approximation; graph Laplacian matrix; high-dimensional data; iterative quantization; large sample size; leading singular vectors; learning binary codes; linear dimensionality reduction; low-rank approximation error bounds; maximum variance formulation; nonuniform sampling; principal component analysis-based hashing method; similarity-preserving low-dimensional Hamming space; singular value decomposition method; space complexities; uniform sampling; Approximation algorithms; Approximation error; Binary codes; Principal component analysis; Quantization; Vectors; CUR decomposition; generalized Nystrom approximation; hashing; pseudoskeleton approximation; uniform sampling
【Paper Link】 【Pages】:1194-1199
【Authors】: Xianchao Zhang ; Shaoping Zhu ; Wenxin Liang
【Abstract】: The Twitter social network has become a target platform for both promoters and stammers to disseminate their target messages. There are a large number of campaigns containing coordinated spam or promoting accounts in Twitter, which are more harmful than the traditional methods, such as email spamming. Since traditional solutions mainly check individual accounts or messages, it is an urgent task to detect spam and promoting campaigns in Twitter. In this paper, we propose a scalable framework to detect both spam and promoting campaigns. Our framework consists of three steps: firstly linking accounts who post URLs for similar purposes, secondly extracting candidate campaigns which may exist for spam or promoting purpose and finally distinguishing their intents. One salient aspect of the framework is introducing a URL-driven estimation method to measure the similarity between accounts' purposes of posting URLs, the other one is proposing multiple features to distinguish the candidate campaigns based on a machine learning method. Over a large-scale dataset from Twitter, we can extract the actual campaigns with high precision and recall and distinguish the majority of the candidate campaigns correctly.
【Keywords】: learning (artificial intelligence); social networking (online); unsolicited e-mail; Twitter social network; URL-driven estimation method; candidate campaign extraction; candidate campaigns; coordinated spam; email spamming; large-scale dataset; machine learning method; message dissemination; salient aspect; scalable campaign promotion framework; scalable spam detection framework; Educational institutions; Estimation; Feature extraction; Twitter; Unsolicited electronic mail; campaign detect; similarity measure; social spam
【Paper Link】 【Pages】:1200-1205
【Authors】: Xinjie Zhou ; Xiaojun Wan ; Jianguo Xiao
【Abstract】: Opinion target extraction is a subtask of opinion mining which is very useful in many applications. In this study, we investigate the problem in a cross-language scenario which leverages the rich labeled data in a source language for opinion target extraction in a different target language. The English labeled corpus is used as training set. We generate two Chinese training datasets with different features. Two labeling models for Chinese opinion target extraction are learned based on Conditional Random Fields (CRF). After that, we use a monolingual co-training algorithm to improve the performance of both models by leveraging the enormous unlabeled Chinese review texts on the web. Experimental results show the effectiveness of our proposed approach.
【Keywords】: data mining; natural language processing; text analysis; training; CRF-based learning; Chinese opinion target extraction; Chinese review texts; English labeled corpus; conditional random fields-based learning; cross-language opinion target extraction; monolingual cotraining algorithm; opinion mining; rich labeled data; source language; target language; training set; Algorithm design and analysis; Corporate acquisitions; Data mining; Feature extraction; Information retrieval; Labeling; Training; cross-language information extraction; opinion mining; opinion target extraction
【Paper Link】 【Pages】:1206-1211
【Authors】: Yan Zhou ; Murat Kantarcioglu ; Bhavani M. Thuraisingham
【Abstract】: Data mining tasks are made more complicated when adversaries attack by modifying malicious data to evade detection. The main challenge lies in finding a robust learning model that is insensitive to unpredictable malicious data distribution. In this paper, we present a sparse relevance vector machine ensemble for adversarial learning. The novelty of our work is the use of individualized kernel parameters to model potential adversarial attacks during model training. We allow the kernel parameters to drift in the direction that minimizes the likelihood of the positive data. This step is interleaved with learning the weights and the weight priors of a relevance vector machine. Our empirical results demonstrate that an ensemble of such relevance vector machine models is more robust to adversarial attacks.
【Keywords】: belief networks; data mining; learning (artificial intelligence); minimisation; support vector machines; adversaries attack; data mining tasks; evade detection; individualized kernel parameters; malicious data; relevance vector machine ensembles; relevance vector machine models; robust learning model; sparse Bayesian adversarial learning; sparse relevance vector; Data models; Error analysis; Kernel; Support vector machines; Training; Training data; Vectors; adversarial learning; kernel parameter learning; relevance vector machine; spare Bayesian learning
【Paper Link】 【Pages】:1212-1217
【Authors】: Hengshu Zhu ; Enhong Chen ; Kuifei Yu ; Huanhuan Cao ; Hui Xiong ; Jilei Tian
【Abstract】: In this paper, we illustrate how to extract personal context-aware preferences from the context-rich device logs (i.e., context logs) for building novel personalized context-aware recommender systems. A critical challenge along this line is that the context log of each individual user may not contain sufficient data for mining his/her context-aware preferences. Therefore, we propose to first learn common context-aware preferences from the context logs of many users. Then, the preference of each user can be represented as a distribution of these common context-aware preferences. Specifically, we develop two approaches for mining common context-aware preferences based on two different assumptions, namely, context independent and context dependent assumptions, which can fit into different application scenarios. Finally, extensive experiments on a real-world data set show that both approaches are effective and outperform baselines with respect to mining personal context-aware preferences for mobile users.
【Keywords】: data mining; mobile computing; recommender systems; context dependent assumptions; context log; context-aware preference mining; context-rich device logs; mobile users; personal context-aware preference mining; personalized context-aware recommender systems; sufficient data mining; Context; Context modeling; Data mining; Markov processes; Mobile communication; Recommender systems; Training; Context-Aware Recommendation; Mobile Users; Personal Context-Aware Preferences
【Paper Link】 【Pages】:1218-1223
【Authors】: Bo Zong ; Yinghui Wu ; Ambuj K. Singh ; Xifeng Yan
【Abstract】: In social networks, information and influence diffuse among users as cascades. While the importance of studying cascades has been recognized in various applications, it is difficult to observe the complete structure of cascades in practice. In this paper we study the cascade inference problem following the independent cascade model, and provide a full treatment from complexity to algorithms: (a) we propose the idea of consistent trees as the inferred structures for cascades, these trees connect source nodes and observed nodes with paths satisfying the constraints from the observed temporal information. (b) We introduce metrics to measure the likelihood of consistent trees as inferred cascades, as well as several optimization problems for finding them. (c) We show that the decision problems for consistent trees are in general NP-complete, and that the optimization problems are hard to approximate. (d) We provide approximation algorithms with performance guarantees on the quality of the inferred cascades, as well as heuristics. We experimentally verify the efficiency and effectiveness of our inference algorithms, using real and synthetic data.
【Keywords】: approximation theory; constraint satisfaction problems; decision theory; inference mechanisms; optimisation; social networking (online); NP-complete problems; approximation algorithms; consistent trees; constraint satisfaction; decision problems; independent cascade model; inferred cascade quality; inferred structures; optimization problems; performance guarantees; social networks; source nodes; temporal information; underlying information cascade structure inference problem; Approximation algorithms; Approximation methods; Inference algorithms; Measurement; Steiner trees; Uncertainty; Vegetation; cascade prediction; information diffusion