13. ICDM 2013:Dallas, TX, USA

2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013. IEEE Computer Society 【DBLP Link】

Paper Num: 159 || Session Num: 2

Regular Papers 94
Short Papers 65

Regular Papers 94

【Paper Link】【Pages】:1-10

【Authors】: Aaron B. Adcock ; Blair D. Sullivan ; Michael W. Mahoney

【Abstract】: Although large social and information networks are often thought of as having hierarchical or tree-like structure, this assumption is rarely tested. We have performed a detailed empirical analysis of the tree-like properties of realistic informatics graphs using two very different notions of tree-likeness: Gromov's d-hyperbolicity, which is a notion from geometric group theory that measures how tree-like a graph is in terms of its metric structure, and tree decompositions, tools from structural graph theory which measure how tree-like a graph is in terms of its cut structure. Although realistic informatics graphs often do not have meaningful tree-like structure when viewed with respect to the simplest and most popular metrics, e.g., the value of d or the tree width, we conclude that many such graphs do have meaningful tree-like structure when viewed with respect to more refined metrics, e.g., a size-resolved notion of d or a closer analysis of the tree decompositions. We also show that, although these two rigorous notions of tree-likeness capture very different tree-like structures in worst-case, for realistic informatics graphs they empirically identify surprisingly similar structure. We interpret this tree-like structure in terms of the recently-characterized "nested core-periphery" property of large informatics graphs, and we show that the fast and scalable k-core heuristic can be used to identify this tree-like structure.

【Keywords】: data mining; information networks; social networking (online); trees (mathematics); Gromov d-hyperbolicity; core-periphery property; cut structure; geometric group theory; hierarchical structure; information network; k-core heuristic; large social networks; metric structure; realistic informatics graphs; structural graph theory; tree decompositions; tree width; tree-like structure; Erbium; Extraterrestrial measurements; Facebook; Graph theory; Informatics; Peer-to-peer computing; hyperbolicity; information networks; k-core; network science; network structure; tree decompositions

2. Subgraph Enumeration in Dynamic Graphs.

【Paper Link】【Pages】:11-20

【Authors】: Abhijin Adiga ; Anil Kumar S. Vullikanti ; Dante Wiggins

【Abstract】: A fundamental problem in many applications involving social and biological networks is to identify and count the number of embeddings of a given small sub graph in a large graph. Often, they involve dynamic graphs, in which the graph changes incrementally (e.g., by edge addition/deletion). We study the Dynamic Sub graph Enumeration (DSE) Problem, where the goal is to maintain a dynamic data structure to solve the sub graph enumeration problem efficiently when the graph changes incrementally. We develop a new data structure that combines two techniques: (i) the color-coding technique of Alon et al., 2008, for enumerating trees, and (ii) a dynamic data structure for maintaining the h-index of the graph (developed by Eppstein and Spiro, 2009). We derive worst case bounds for the update time in terms of the h-index of the graph and the maximum degree. We also study the empirical performance of our algorithm in a large set of real networks, and find significant improvement over the static methods.

【Keywords】: data mining; data structures; graph theory; trees (mathematics); DSE problem; biological networks; color-coding technique; data mining; dynamic data structure; dynamic subgraph enumeration problem; enumerating trees; h-index; social networks; subgraph enumeration; worst case bounds; Algorithm design and analysis; Approximation algorithms; Approximation methods; Color; Data structures; Heuristic algorithms; Silicon; Subgraph enumeration; color coding; h-index

3. A Masking Index for Quantifying Hidden Glitches.

【Paper Link】【Pages】:21-30

【Authors】: Laure Berti-Equille ; Ji Meng Loh ; Tamraparni Dasu

【Abstract】: Data glitches are errors in a data set, they are complex entities that often span multiple attributes and records. When they co-occur in data, the presence of one type of glitch can hinder the detection of another type of glitch. This phenomenon is called masking. In this paper, we define two important types of masking, and we propose a novel, statistically rigorous indicator called masking index for quantifying the hidden glitches in four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration, it enables a user to measure the extent of masking and hence the confidence in the data. In this sense, it is a valuable data quality index for measuring the true cleanliness of the data. It is also an objective and quantitative basis for choosing an anomaly detection method that is best suited for the glitches that are present in any given data set. We demonstrate the utility and effectiveness of the masking index by intensive experiments on synthetic and real-world datasets.

【Keywords】: data analysis; anomaly detection method; data exploration; data glitch detection; data quality index; data quality profiling; masking index; missing values; outliers; real-world datasets; statistical rigorous indicator; synthetic datasets; true cleanliness; Arrays; Cleaning; Data mining; Equations; Indexes; Robustness; Software; Anomaly detection; data cleaning; duplicate record identification; masking; missing values; outlier detection

4. CSI: Charged System Influence Model for Human Behavior Prediction.

【Paper Link】【Pages】:31-40

【Authors】: Yuanjun Bi ; Weili Wu ; Yuqing Zhu

【Abstract】: Social influence has been widely studied in areas of viral marketing, information diffusion and health care. Currently, most influence models only deal with a single influence without the interference of other influences. Also, the influence spreading in previous models must be triggered by individuals who have been activated by the influence. In this paper, we argue that it is the attraction from a specific influence makes an individual choose to spread it among multiple influences. Inspired by charged system theory in physics, a new influence model is proposed, considering individual features and social structure features. It also gives a natural description about how individuals make decisions among multiple influences. Then a novel algorithm based on this model is provided to predict human behavior. Extensive experiments on three real-world datasets demonstrate that our model and algorithm statistically outperform the state-of-the-art methods in terms of prediction accuracy.

【Keywords】: behavioural sciences computing; social networking (online); CSI; charged system influence model; human behavior prediction; prediction accuracy; social structure features; Correlation; Force; Gaussian distribution; Integrated circuit modeling; Prediction algorithms; Predictive models; Social network services

5. Context-Aware MIML Instance Annotation.

【Paper Link】【Pages】:41-50

【Authors】: Forrest Briggs ; Xiaoli Z. Fern ; Raviv Raich

【Abstract】: In multi-instance multi-label (MIML) instance annotation, the goal is to learn an instance classifier while training on a MIML dataset, which consists of bags of instances paired with label sets, instance labels are not provided in the training data. The MIML formulation can be applied in many domains. For example, in an image domain, bags are images, instances are feature vectors representing segments in the images, and the label sets are lists of objects or categories present in each image. Although many MIML algorithms have been developed for predicting the label set of a new bag, only a few have been specifically designed to predict instance labels. We propose MIML-ECC (ensemble of classifier chains), which exploits bag-level context through label correlations to improve instance-level prediction accuracy. The proposed method is scalable in all dimensions of a problem (bags, instances, classes, and feature dimension), and has no parameters that require tuning (which is a problem for prior methods). In experiments on two image datasets, a bioacoustics dataset, and two artificial datasets, MIML-ECC achieves higher or comparable accuracy in comparison to several recent methods and baselines.

【Keywords】: bioacoustics; learning (artificial intelligence); pattern classification; ubiquitous computing; MIML dataset; MIML formulation; MIML-ECC; artificial datasets; bag-level context; bioacoustics dataset; classifier chain ensemble; context-aware MIML instance annotation; image datasets; instance classifier learning; instance labels; instance-level prediction accuracy; label correlations; multi-instance multilabel instance annotation; training data; Accuracy; Bismuth; Context; Probabilistic logic; Training; Training data; Vectors; ECC; MIML; MLC; classifier chains; ensemble; instance annotation; multi-label; multiple instance

6. Maximizing Expected Model Change for Active Learning in Regression.

【Paper Link】【Pages】:51-60

【Authors】: Wenbin Cai ; Ya Zhang ; Jun Zhou

【Abstract】: Active learning is well-motivated in many supervised learning tasks where unlabeled data may be abundant but labeled examples are expensive to obtain. The goal of active learning is to maximize the performance of a learning model using as few labeled training data as possible, thereby minimizing the cost of data annotation. So far, there is still very limited work on active learning for regression. In this paper, we propose a new active learning framework for regression called Expected Model Change Maximization (EMCM), which aims to choose the examples that lead to the largest change to the current model. The model change is measured as the difference between the current model parameters and the updated parameters after training with the enlarged training set. Inspired by the Stochastic Gradient Descent (SGD) update rule, the change is estimated as the gradient of the loss with respect to a candidate example for active learning. Under this framework, we derive novel active learning algorithms for both linear regression and nonlinear regression to select the most informative examples. Extensive experimental results on the benchmark data sets from UCI machine learning repository have demonstrated that the proposed algorithms are highly effective in choosing the most informative examples and robust to various types of data distributions.

【Keywords】: expectation-maximisation algorithm; gradient methods; learning (artificial intelligence); regression analysis; stochastic processes; EMCM; SGD update rule; UCI machine learning repository; active learning algorithms; active learning framework; data annotation; data distributions; enlarged training set; expected model change maximization; labeled training data; nonlinear regression; stochastic gradient descent update rule; supervised learning; unlabeled data; Current measurement; Data models; Linear regression; Machine learning algorithms; Regression tree analysis; Training; Training data; Active learning; Expected Model Change Maximization; Linear Regression; Nonlinear regression

7. The Pairwise Gaussian Random Field for High-Dimensional Data Imputation.

【Paper Link】【Pages】:61-70

【Authors】: Zhuhua Cai ; Christopher M. Jermaine ; Zografoula Vagena ; Dionysios Logothetis ; Luis Leopoldo Perez

【Abstract】: In this paper, we consider the problem of imputation (recovering missing values) in very high-dimensional data with an arbitrary covariance structure. The modern solution to this problem is the Gaussian Markov random field (GMRF). The problem with applying a GMRF to very high-dimensional data imputation is that while the GMRF model itself can be useful even for data having tens of thousands of dimensions, utilizing a GMRF requires access to a sparsified, inverse covariance matrix for the data. Computing this matrix using even state-of-the-art methods is very costly, as it typically requires first estimating the covariance matrix from the data (at a O(nm2) cost for m dimensions and n data points) and then performing a regularized inversion of the estimated covariance matrix, which is also very expensive. This is impractical for even moderately-sized, high-dimensional data sets. In this paper, we propose a very simple alternative to the GMRF called the pair wise Gaussian random field or PGRF for short. The PGRF is a graphical, factor-based model. Unlike traditional Gaussian or GMRF models, a PGRF does not require a covariance or correlation matrix as input. Instead, a PGRF takes as input a set of p (dimension, dimension) pairs for which the user suspects there might be a strong correlation or anti-correlation. This set of pairs defines the graphical structure of the model, with a simple Gaussian factor associated with each of the p (dimension, dimension) pairs. Using this structure, it is easy to perform simultaneous inference and imputation of the model. The key benefit of the approach is that the time required for the PGRF to perform inference is approximately linear with respect to p, where p will typically be much smaller than the number of entries in a m×m covariance or precision matrix.

【Keywords】: Gaussian processes; computational complexity; random processes; Gaussian factor; PGRF; arbitrary covariance structure; graphical factor-based model; graphical structure; high-dimensional data imputation; moderately-sized high-dimensional data sets; pairwise Gaussian random field; Computational modeling; Correlation; Covariance matrices; Data models; Gaussian distribution; Markov processes; Monte Carlo methods; classification; imputation

8. Controlling Attribute Effect in Linear Regression.

【Paper Link】【Pages】:71-80

【Authors】: Toon Calders ; Asim Karim ; Faisal Kamiran ; Wasif Ali ; Xiangliang Zhang

【Abstract】: In data mining we often have to learn from biased data, because, for instance, data comes from different batches or there was a gender or racial bias in the collection of social data. In some applications it may be necessary to explicitly control this bias in the models we learn from the data. This paper is the first to study learning linear regression models under constraints that control the biasing effect of a given attribute such as gender or batch number. We show how propensity modeling can be used for factoring out the part of the bias that can be justified by externally provided explanatory attributes. Then we analytically derive linear models that minimize squared error while controlling the bias by imposing constraints on the mean outcome or residuals of the models. Experiments with discrimination-aware crime prediction and batch effect normalization tasks show that the proposed techniques are successful in controlling attribute effects in linear regression models.

【Keywords】: data mining; learning (artificial intelligence); regression analysis; attribute biasing effect; attribute effect control; batch effect normalization tasks; batch number; data mining; discrimination-aware crime prediction; gender; learning linear regression models; propensity modeling; racial bias; social data collection; squared error minimization; Analytical models; Biological system modeling; Data mining; Data models; Linear regression; Predictive models; Vectors; Batch Effects; Fair Data Mining; Linear Regression; Propensity Score

9. Active Matrix Completion.

【Paper Link】【Pages】:81-90

【Authors】: Shayok Chakraborty ; Jiayu Zhou ; Vineeth Nallure Balasubramanian ; Sethuraman Panchanathan ; Ian Davidson ; Jieping Ye

【Abstract】: Recovering a matrix from a sampling of its entries is a problem of rapidly growing interest and has been studied under the name of matrix completion. It occurs in many areas of engineering and applied science. In most machine learning and data mining applications, it is possible to leverage the expertise of human oracles to improve the performance of the system. It is therefore natural to extend this idea of "human-in-the-loop" to the matrix completion problem. However, considering the enormity of data in the modern era, manually completing all the entries in a matrix will be an expensive process in terms of time, labor and human expertise, human oracles can only provide selective supervision to guide the solution process. Thus, appropriately identifying a subset of missing entries (for manual annotation) in an incomplete matrix is of paramount practical importance, this can potentially lead to better reconstructions of the incomplete matrix with minimal human effort. In this paper, we propose novel algorithms to address this issue. Since the query locations are actively selected by the algorithms, we refer to these methods as active matrix completion algorithms. The proposed techniques are generic and the same frameworks can be used in a wide variety of applications including recommendation systems, transductive / multi-label active learning, active learning in regression and active feature acquisition among others. Our extensive empirical analysis on several challenging real-world datasets certify the merit and versatility of the proposed frameworks in efficiently exploiting human intelligence in data mining / machine learning applications.

【Keywords】: data mining; learning (artificial intelligence); matrix algebra; active matrix completion; data mining applications; empirical analysis; human intelligence; human-in-the-loop; incomplete matrix reconstruction; machine learning; matrix recovery; query location; Collaboration; Covariance matrices; Equations; Manuals; Matrix decomposition; Prediction algorithms; Uncertainty

10. Modeling Temporal Adoptions Using Dynamic Matrix Factorization.

【Paper Link】【Pages】:91-100

【Authors】: Freddy Chong Tat Chua ; Richard Jayadi Oentaryo ; Ee-Peng Lim

【Abstract】: The problem of recommending items to users is relevant to many applications and the problem has often been solved using methods developed from Collaborative Filtering (CF). Collaborative Filtering model-based methods such as Matrix Factorization have been shown to produce good results for static rating-type data, but have not been applied to time-stamped item adoption data. In this paper, we adopted a Dynamic Matrix Factorization (DMF) technique to derive different temporal factorization models that can predict missing adoptions at different time steps in the users' adoption history. This DMF technique is an extension of the Non-negative Matrix Factorization (NMF) based on the well-known class of models called Linear Dynamical Systems (LDS). By evaluating our proposed models against NMF and TimeSVD++ on two real datasets extracted from ACM Digital Library and DBLP, we show empirically that DMF can predict adoptions more accurately than the NMF for several prediction tasks as well as outperforming TimeSVD++ in some of the prediction tasks. We further illustrate the ability of DMF to discover evolving research interests for a few author examples.

【Keywords】: collaborative filtering; matrix decomposition; recommender systems; ACM Digital Library; DBLP; DMF technique; LDS; NMF; TimeSVD++; collaborative filtering model-based method; dynamic matrix factorization technique; linear dynamical systems; missing adoptions; nonnegative matrix factorization; real dataset extraction; recommender systems; temporal adoption modeling; user adoption history; Data models; Heuristic algorithms; Kalman filters; Mathematical model; Predictive models; Probabilistic logic; Vectors; Dynamic Matrix Factorization; Kalman Filter; Linear Dynamical Systems; State Space Models

11. Modeling Preferences with Availability Constraints.

【Paper Link】【Pages】:101-110

【Authors】: Bing Tian Dai ; Hady Wirawan Lauw

【Abstract】: User preferences are commonly learned from historical data whereby users express preferences for items, e.g., through consumption of products or services. Most work assumes that a user is not constrained in their selection of items. This assumption does not take into account the availability constraint, whereby users could only access some items, but not others. For example, in subscription-based systems, we can observe only those historical preferences on subscribed (available) items. However, the objective is to predict preferences on unsubscribed (unavailable) items, which do not appear in the historical observations due to their (lack of) availability. To model preferences in a probabilistic manner and address the issue of availability constraint, we develop a graphical model, called Latent Transition Model (LTM) to discover users' latent interests. LTM is novel in incorporating transitions in interests when certain items are not available to the user. Experiments on a real-life implicit feedback dataset demonstrate that LTM is effective in discovering customers' latent interests, and it achieves significant improvements in prediction accuracy over baselines that do not model transitions.

【Keywords】: consumer behaviour; data handling; probability; psychology; LTM; availability constraint; customer latent interests; graphical model; latent transition model; real-life implicit feedback dataset; unavailable items; unsubscribed items; user latent interests; user preference modelling; Availability; Cable TV; Data models; Industries; Probability distribution; Watches; graphical model; latent interests; topic model; topic transition; user preferences

12. wRACOG: A Gibbs Sampling-Based Oversampling Technique.

【Paper Link】【Pages】:111-120

【Authors】: Barnan Das ; Narayanan Chatapuram Krishnan ; Diane J. Cook

【Abstract】: As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing over sampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not well represented which leads to high misclassification error. We introduce wRACOG, a Gibbs sampling-based over sampling approach to synthetically generating and strategically selecting new minority class samples. The Gibbs sampler uses the joint probability distribution of data attributes to generate new minority class samples in the form of a Markov chain. wRACOG iteratively learns a model by selecting samples from the Markov chain that have the highest probability of being misclassified. We validate the effectiveness of wRACOG using five UCI datasets and one new application domain dataset. A comparative study of wRACOG with three other well-known resampling methods provides evidence that wRACOG offers a definite improvement in classification accuracy for minority class samples over other methods.

【Keywords】: Markov processes; iterative methods; learning (artificial intelligence); pattern classification; sampling methods; statistical distributions; Gibbs sampling-based oversampling technique; Markov chain; UCI datasets; application domain dataset; classilication accuracy improvement; data attributes; imbalanced class distribution problem; iterative learning; machine learning techniques; misclassilication error; probability distribution; strategically selected minority class; synthetically generated minority class; target class labels; wRACOG; Equations; Joints; Markov processes; Predictive models; Probability distribution; Random variables; Sensitivity; Gibbs sampling; Imbalanced class distribution; Markov chain Monte Carlo (MCMC); oversampling

13. Efficient Visualization of Large-Scale Data Tables through Reordering and Entropy Minimization.

【Paper Link】【Pages】:121-130

【Authors】: Nemanja Djuric ; Slobodan Vucetic

【Abstract】: Visualization of data tables with n examples and m columns using heat maps provides a holistic view of the original data. As there are n! ways to order rows and m! ways to order columns, and data tables are typically ordered without regard to visual inspection, heat maps of the original data tables often appear as noisy images. However, if rows and columns of a data table are ordered such that similar rows and similar columns are grouped together, a heat map may provide a deep insight into the underlying data distribution. We propose an information-theoretic approach to produce a well-ordered data table. In particular, we search for ordering that minimizes entropy of residuals of predictive coding applied on the ordered data table. This formalization leads to a novel ordering procedure, EM-ordering, that can be applied separately on rows and columns. For ordering of rows, EM-ordering repeats until convergence the steps of (1) rescaling columns and (2) solving a Traveling Salesman Problem (TSP) where rows are treated as cities. To allow fast ordering of large data tables, we propose an efficient TSP heuristic with modest O(n log(n)) time complexity. When compared to the existing state-of-the-art reordering approaches, we show that the method often provides heat maps of higher visual quality, while being significantly more scalable. Moreover, analysis of real-world traffic and financial data sets using the proposed method, which allowed us to readily gain deeper insights about the data, further confirmed that EM-ordering can be a valuable tool for visual exploration of large-scale data sets.

【Keywords】: computational complexity; data visualisation; entropy; financial data processing; traffic engineering computing; travelling salesman problems; EM-ordering; TSP heuristic; data distribution; entropy minimization; financial data set analysis; heat map; information-theoretic approach; large-scale data set visual exploration; large-scale data table visualization; ordered data table; predictive coding residual; real-world traffic analysis; reordering; rescaling columns; time complexity; traveling salesman problem; Cities and towns; Clustering algorithms; Data visualization; Entropy; Heating; Minimization; Principal component analysis; data reordering; data seriation; data visualization; heatmap; large-scale data; traveling salesman problem

14. Local and Global Discriminative Learning for Unsupervised Feature Selection.

【Paper Link】【Pages】:131-140

【Authors】: Liang Du ; Zhiyong Shen ; Xuan Li ; Peng Zhou ; Yi-Dong Shen

【Abstract】: In this paper, we consider the problem of feature selection in unsupervised learning scenario. Recently, spectral feature selection methods, which leverage both the graph Laplacian and the learning mechanism, have received considerable attention. However, when there are lots of irrelevant or noisy features, such graphs may not be reliable and then mislead the selection of features. In this paper, we propose the Local and Global Discriminative learning for unsupervised Feature Selection (LGDFS), which integrates a global and a set of locally linear regression model with weighted l₂-norm regularization into a unified learning framework. By exploring the discriminative and geometrical information in the weighted feature space, which alleviates the effects of the irrelevant features, our approach can find the most representative features to well respect the cluster structure of the data. Experimental results on several benchmark data sets are provided to validate the effectiveness of the proposed approach.

【Keywords】: feature selection; graph theory; pattern clustering; regression analysis; unsupervised learning; LGDFS; benchmark data sets; data cluster structure; discriminative information; geometrical information; global discriminative learning; graph Laplacian; local discriminative learning; locally linear regression model; spectral feature selection method; unsupervised feature selection; unsupervised learning scenario; weighted feature space; weighted l₂-norm regularization; Clustering algorithms; Complexity theory; Cost function; Estimation; Laplace equations; Manifolds

15. Generative Maximum Entropy Learning for Multiclass Classification.

【Paper Link】【Pages】:141-150

【Authors】: Ambedkar Dukkipati ; Gaurav Pandey ; Debarghya Ghoshdastidar ; Paramita Koley ; D. M. V. Satya Sriram

【Abstract】: Maximum entropy approach to classification is very well studied in applied statistics and machine learning and almost all the methods that exists in literature are discriminative in nature. In this paper, we introduce a maximum entropy classification method with feature selection for large dimensional data such as text datasets that is generative in nature. To tackle the curse of dimensionality of large data sets, we employ conditional independence assumption (Naive Bayes) and we perform feature selection simultaneously, by enforcing a 'maximum discrimination' between estimated class conditional densities. For two class problems, in the proposed method, we use Jeffreys (J) divergence to discriminate the class conditional densities. To extend our method to the multi-class case, we propose a completely new approach by considering a multi-distribution divergence: we replace Jeffreys divergence by Jensen-Shannon (JS) divergence to discriminate conditional densities of multiple classes. In order to reduce computational complexity, we employ a modified Jensen-Shannon divergence (JS_GM), based on AM-GM inequality. We show that the resulting divergence is a natural generalization of Jeffreys divergence to a multiple distributions case. As far as the theoretical justifications are concerned we show that when one intends to select the best features in a generative maximum entropy approach, maximum discrimination using J-divergence emerges naturally in binary classification. Performance and comparative study of the proposed algorithms have been demonstrated on large dimensional text and gene expression datasets that show our methods scale up very well with large dimensional datasets.

【Keywords】: Bayes methods; biology computing; computational complexity; feature selection; learning (artificial intelligence); maximum entropy methods; pattern classification; text analysis; AM-GM inequality; JS divergence; Jeffreys divergence; Jensen-Shannon divergence; binary classification; class conditional density discrimination; class conditional density estimation; computational complexity reduction; conditional independence assumption; feature selection; gene expression datasets; generative maximum entropy learning; large dimensional data; large dimensional text dataset; maximum discrimination; maximum entropy classification method; multiclass classification; multidistribution divergence; naive Bayes; Computational modeling; Data models; Entropy; Estimation; Mathematical model; Training; Training data; Jefferys Divergence; Jensen-Shannon Divergence; Maximum Entropy; Text categorization

16. A Parameter-Free Spatio-Temporal Pattern Mining Model to Catalog Global Ocean Dynamics.

【Paper Link】【Pages】:151-160

【Authors】: James H. Faghmous ; Matthew Le ; Muhammed Uluyol ; Vipin Kumar ; Snigdhansu Chatterjee

【Abstract】: As spatio-temporal data have become ubiquitous, an increasing challenge facing computer scientists is that of identifying discrete patterns in continuous spatio-temporal fields. In this paper, we introduce a parameter-free pattern mining application that is able to identify dynamic anomalies in ocean data, known as ocean eddies. Despite ocean eddy monitoring being an active field of research, we provide one of the first quantitative analyses of the performance of the most used monitoring algorithms. We present an incomplete information validation technique, that uses the performance of two methods to construct an imperfect ground truth to test the significance of patterns discovered as well as the relative performance of pattern mining algorithms. These methods, in addition to the validation schemes discussed provide researchers new directions in analyzing large unlabeled climate datasets.

【Keywords】: data mining; geophysics computing; oceanography; dynamic anomalies identify; global ocean dynamic cataloging; incomplete information validation technique; ocean data; ocean eddies; ocean eddy monitoring; parameter-free spatiotemporal pattern mining model; pattern mining algorithms; unlabeled climate datasets; Data mining; Monitoring; Noise; Ocean temperature; Satellites; Sea surface; ocean eddies; pattern mining; spatio-temporal data mining

17. Transfer Learning across Networks for Collective Classification.

【Paper Link】【Pages】:161-170

【Authors】: Meng Fang ; Jie Yin ; Xingquan Zhu

【Abstract】: This paper addresses the problem of transferring useful knowledge from a source network to predict node labels in a newly formed target network. While existing transfer learning research has primarily focused on vector-based data, in which the instances are assumed to be independent and identically distributed, how to effectively transfer knowledge across different information networks has not been well studied, mainly because networks may have their distinct node features and link relationships between nodes. In this paper, we propose a new transfer learning algorithm that attempts to transfer common latent structure features across the source and target networks. The proposed algorithm discovers these latent features by constructing label propagation matrices in the source and target networks, and mapping them into a shared latent feature space. The latent features capture common structure patterns shared by two networks, and serve as domain-independent features to be transferred between networks. Together with domain-dependent node features, we thereafter propose an iterative classification algorithm that leverages label correlations to predict node labels in the target network. Experiments on real-world networks demonstrate that our proposed algorithm can successfully achieve knowledge transfer between networks to help improve the accuracy of classifying nodes in the target network.

【Keywords】: information networks; iterative methods; learning (artificial intelligence); network theory (graphs); pattern classification; collective classification; common latent structure feature transfer; domain-dependent node features; information networks; iterative classification algorithm; knowledge transfer; label correlations; label propagation matrices; link relationships; node classification accuracy improvement; node features; node label prediction; real-world networks; shared latent feature space; source network; target network; transfer learning algorithm; vector-based data; Convergence; Iterative methods; Knowledge engineering; Knowledge transfer; Optimization; Prediction algorithms; Subspace constraints; Network; Transfer learning

18. Distributed Column Subset Selection on MapReduce.

【Paper Link】【Pages】:171-180

【Authors】: Ahmed K. Farahat ; Ahmed Elgohary ; Ali Ghodsi ; Mohamed S. Kamel

【Abstract】: Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. The solution to this problem is of a crucial importance in the big data era as it enables data analysts to understand the insights of the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and it then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection, and it then solves a generalized column subset selection problem at each machine in which a subset of columns are selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.

【Keywords】: data analysis; distributed algorithms; greedy algorithms; MapReduce algorithm; corresponding matrix; data analysts; distributed column subset selection; generalized column subset selection problem; low-rank approximation; random projection; Approximation algorithms; Approximation methods; Cascading style sheets; Data handling; Data storage systems; Distributed databases; Information management; Big Data; Column Subset Selection; Distributed Computing; Greedy Algorithms; MapReduce

19. Compression-Based Graph Mining Exploiting Structure Primitives.

【Paper Link】【Pages】:181-190

【Authors】: Jing Feng ; Xiao He ; Nina Hubig ; Christian Böhm ; Claudia Plant

【Abstract】: How can we retrieve information from sparse graphs? Traditional graph mining approaches focus on discovering dense patterns inside complex networks, for example modularity-based or cut-based methods. However, most real world data sets are very sparse. Nevertheless, traditional approaches tend to omit interesting sparse patterns like stars. In this paper, we propose a novel graph mining technique modeling the transitivity and the hub ness of a graph using structure primitives. We exploit these structure primitives for effective graph compression using the Minimum Description Length Principle. The compression rate is an unbiased measure for the transitivity or hub ness and therefore provides interesting insights into the structure of even very sparse graphs. Since real graphs can be composed of sub graphs of different structures, we propose a novel algorithm CXprime (Compression-based exploiting Primitives) for clustering graphs using our coding scheme as an objective function. In contrast to traditional graph clustering methods, our algorithm automatically recognizes different types of sub graphs without requiring the user to specify input parameters. Additionally we propose a novel link prediction algorithm based on the detected substructures, which increases the quality of former methods. Extensive experiments evaluate our algorithms on synthetic and real data.

【Keywords】: data compression; data mining; graph theory; pattern clustering; CXprime algorithm; coding scheme; compression rate; compression-based graph mining technique; cut-based methods; dense pattern discovery; example modularity-based method; graph clustering methods; information retrieval; link prediction algorithm; minimum description length principle; objective function; sparse graphs; star parse patterns; structure primitives; Clustering algorithms; Communities; Data mining; Encoding; Entropy; Prediction algorithms; Receivers; Compression; Graph mining; Link prediction; Minimum Description Length; Partition

20. Online Estimation of Discrete Densities.

【Paper Link】【Pages】:191-200

【Authors】: Michael Geilke ; Eibe Frank ; Andreas Karwath ; Stefan Kramer

【Abstract】: We address the problem of estimating a discrete joint density online, that is, the algorithm is only provided the current example and its current estimate. The proposed online estimator of discrete densities, EDDO (Estimation of Discrete Densities Online), uses classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not provide a reliable estimate, we also consider ensembles of classifier chains and ensembles of weighted classifier chains. For all density estimators, we provide consistency proofs and propose algorithms to perform certain inference tasks. The empirical evaluation of the estimators is conducted in several experiments and on data sets of up to several million instances: We compare them to density estimates computed from Bayesian structure learners, evaluate them under the influence of noise, measure their ability to deal with concept drift, and measure the run-time performance. Our experiments demonstrate that, even though designed to work online, EDDO delivers estimators of competitive accuracy compared to batch Bayesian structure learners and batch variants of EDDO.

【Keywords】: Bayes methods; estimation theory; learning (artificial intelligence); pattern classification; random processes; Bayesian structure learners; EDDO batch variants; consistency proofs; density estimators; discrete joint density estimation; estimation of discrete densities online; online estimator; probability estimation; weighted random classifier chains; Bayes methods; Density measurement; Estimation; Inference algorithms; Joints; Noise measurement; Radiation detectors; (ensembles of) classifier chains; Hoeffding trees; data streams; density estimation

21. Extraction of Interpretable Multivariate Patterns for Early Diagnostics.

【Paper Link】【Pages】:201-210

【Authors】: Mohamed F. Ghalwash ; Vladan Radosavljevic ; Zoran Obradovic

【Abstract】: Leveraging temporal observations to predict a patient's health state at a future period is a very challenging task. Providing such a prediction early and accurately allows for designing a more successful treatment that starts before a disease completely develops. Information for this kind of early diagnosis could be extracted by use of temporal data mining methods for handling complex multivariate time series. However, physicians usually prefer to use interpretable models that can be easily explained, rather than relying on more complex black-box approaches. In this study, a temporal data mining method is proposed for extracting interpretable patterns from multivariate time series data, which can be used to assist in providing interpretable early diagnosis. The problem is formulated as an optimization based binary classification task addressed in three steps. First, the time series data is transformed into a binary matrix representation suitable for application of classification methods. Second, a novel convex-concave optimization problem is defined to extract multivariate patterns from the constructed binary matrix. Then, a mixed integer discrete optimization formulation is provided to reduce the dimensionality and extract interpretable multivariate patterns. Finally, those interpretable multivariate patterns are used for early classification in challenging clinical applications. In the conducted experiments on two human viral infection datasets and a larger myocardial infarction dataset, the proposed method was more accurate and provided classifications earlier than three alternative state-of-the-art methods.

【Keywords】: concave programming; convex programming; data mining; data reduction; diseases; integer programming; matrix algebra; medical computing; patient diagnosis; pattern classification; time series; binary matrix representation; complex black-box approach; complex multivariate time series data; convex-concave optimization problem; dimensionality reduction; early classification methods; early diagnostics; human viral infection datasets; interpretable multivariate pattern extraction; mixed integer discrete optimization formulation; myocardial infarction dataset; optimization based binary classification task; patient health state prediction; temporal data mining methods; temporal observations; Data mining; Diseases; Logistics; Optimization; Time series analysis; Vectors; early classification; early diagnosis; interpretability; multivariate time series; pattern extraction

22. Search Behavior Based Latent Semantic User Segmentation for Advertising Targeting.

【Paper Link】【Pages】:211-220

【Authors】: Xueqing Gong ; Xinyu Guo ; Rong Zhang ; Xiaofeng He ; Aoying Zhou

【Abstract】: The popularity of internet usage greatly motivates the online advertising activities. Compared to advertising on traditional media, online advertising has rich information as well as necessary techniques to achieve precise user targeting. This rich information includes the search behaviors of a user, such as queries issued, or the ads clicked by the user. For popular websites with large number of active users, ad delivery targeting at individual users puts too much burden on the system. User segmentation is an alternative way to relieve this burden by grouping users of similar interests together, then the ad delivery system targets the user segments to display relevant ads, instead of individual users. Existing user segmentation work either adapts clustering methods without considering the hidden semantics embedded in the data, such as K-means, or treats users as data instance and clusters users indirectly even if the latent semantics is incorporated into the transformed data, such as PLSA or LDA. In this paper, we present a search behavior based latent semantic user segmentation method and validate its effectiveness on new ads. Instead of treating users as data instances, they are used as attributes of user issued queries or clicked ads which are considered to be data instances. LDA is then applied to this data set to directly obtain the user segments. Compared to popular K-means clustering, our approach achieves higher CTR values on new ads, with only simple search information.

【Keywords】: Internet; Web sites; advertising; behavioural sciences; pattern clustering; query processing; CTR values; Internet; K-means clustering; Websites; ad delivery system; ad delivery targeting; advertising targeting; online advertising activities; precise user targeting; search behavior based latent semantic user segmentation method; search information; user search behaviors; Advertising; Clustering algorithms; Clustering methods; Noise; Optimization; Predictive models; Semantics; LDA; advertising targeting; user segmentation

23. Mixed Membership Subspace Clustering.

【Paper Link】【Pages】:221-230

【Authors】: Stephan Günnemann ; Christos Faloutsos

【Abstract】: Clustering is one of the fundamental data mining tasks. While traditional clustering techniques assign each object to a single cluster only, in many applications it has been observed that objects might belong to multiple clusters with different degrees. In this work, we present a Bayesian framework to tackle the challenge of mixed membership clustering for vector data. We exploit the ideas of subspace clustering where the relevance of dimensions might be different for each cluster. Combining the relevance of the dimensions with the cluster membership degree of the objects, we propose a novel type of mixture model able to represent data containing mixed membership subspace clusters. For learning our model, we develop an efficient algorithm based on variational inference allowing easy parallelization. In our empirical study on synthetic and real data we show the strengths of our novel clustering technique.

【Keywords】: Bayes methods; belief networks; data mining; inference mechanisms; pattern clustering; variational techniques; Bayesian framework; data mining; mixed membership subspace clustering technique; object cluster membership degree; real data; synthetic data; variational inference; vector data; Adaptation models; Approximation methods; Bayes methods; Data models; Equations; Random variables; Vectors; mixed membership clustering; model based clustering; subspace clustering; variational inference

24. Spectral Subspace Clustering for Graphs with Feature Vectors.

【Paper Link】【Pages】:231-240

【Authors】: Stephan Günnemann ; Ines Färber ; Sebastian Raubach ; Thomas Seidl

【Abstract】: Clustering graphs annotated with feature vectors has recently gained much attention. The goal is to detect groups of vertices that are densely connected in the graph as well as similar with respect to their feature values. While early approaches treated all dimensions of the feature space as equally important, more advanced techniques consider the varying relevance of dimensions for different groups. In this work, we propose a novel clustering method for graphs with feature vectors based on the principle of spectral clustering. Following the idea of subspace clustering, our method detects for each cluster an individual set of relevant features. Since spectral clustering is based on the eigendecomposition of the affinity matrix, which strongly depends on the choice of features, our method simultaneously learns the grouping of vertices and the affinity matrix. To tackle the fundamental challenge of comparing the clustering structures for different feature subsets, we define an objective function that is unbiased regarding the number of relevant features. We develop the algorithm SSCG and we show its application for multiple real-world datasets.

【Keywords】: eigenvalues and eigenfunctions; graph theory; learning (artificial intelligence); matrix algebra; pattern clustering; SSCG; affinity matrix eigendecomposition; feature space; feature values; feature vector; graph clustering; learning; real-world datasets; spectral subspace clustering; vertex grouping; Clustering methods; Equations; Feature extraction; Kernel; Linear programming; Reactive power; Vectors; attributed graphs; graphs; networks; spectral clustering; subspace clustering

25. Permutation-Based Sequential Pattern Hiding.

【Paper Link】【Pages】:241-250

【Authors】: Robert Gwadera ; Aris Gkoulalas-Divanis ; Grigorios Loukides

【Abstract】: Sequence data are increasingly shared to enable mining applications, in various domains such as marketing, telecommunications, and healthcare. This, however, may expose sensitive sequential patterns, which lead to intrusive inferences about individuals or leak confidential information about organizations. This paper presents the first permutation-based approach to prevent this threat. Our approach hides sensitive patterns by replacing them with carefully selected permutations that avoid changes in the set of frequent nonsensitive patterns (side-effects) and in the ordering information of sequences (distortion). By doing so, it retains data utility in sequence mining and tasks based on item set properties, as permutation preserves the support of items, unlike deletion, which is used in existing works. To realize our approach, we develop an efficient and effective algorithm for generating permutations with minimal side-effects and distortion. This algorithm also avoids implausible symbol orderings that may exist in certain applications. In addition, we propose a method to hide sensitive patterns from a sequence dataset. Extensive experiments verify that our method allows significantly more accurate data analysis than the state-of the-art approach.

【Keywords】: data encapsulation; data mining; distortion; data analysis; data mining; data utility; distortion; implausible symbol orderings; intrusive inferences; item set property; minimal side-effects; permutation-based sequential pattern hiding; sensitive sequential patterns; sequence data; sequence dataset; sequence mining; Algorithm design and analysis; Companies; Data mining; Insurance; Itemsets; Pattern matching; data privacy; permutation; sequential pattern hiding

26. Utilizing URLs Position to Estimate Intrinsic Query-URL Relevance.

【Paper Link】【Pages】:251-260

【Authors】: Xiaogang Han ; Wenjun Zhou ; Xing Jiang ; Hengjie Song ; Ming Zhong ; Toyoaki Nishida

【Abstract】: Query-URL relevance (QUR) is an important criterion to measure the quality of commercial search engines. However, the traditional way to collect high-quality QURs is time-consuming and labor-intensive since it is primarily based on human judges. To address these issues, numerous models have been studied to automatically infer the QURs. Unlike the prior studies in this literature, we first empirically analyze the correlation between multiple annotators' judgments on QURs and URL position in ranking lists. By doing so, we reveal and justify the potential impacts of URL position on inferring intrinsic QURs. Inspired by this finding, a position-sensitive model (PSM) is proposed to infer QURs more accurately. In contrast with most existing approaches that attempt to construct the direct relationship between QURs and the features characterizing query-URL pairs, PSM assumes that the QUR is connected with the features through URL position. We conducted the experiments in real search engine Baidu.com, and compared the experimental results to those of the typical methods used in similar tasks, reporting significant gains over click-through rate and the normalized discounted cumulative gains (NDCGs).

【Keywords】: feature extraction; query processing; relevance feedback; search engines; NDCG; PSM; QUR position; URL position; annotator judgments; click-through rate; commercial search engines; high-quality QURs; intrinsic query-URL relevance estimation; normalized discounted cumulative gains; position-sensitive model; query-URL pairs; ranking lists; Accuracy; Correlation; Educational institutions; Electronic mail; Search engines; Training; Vectors; Evaluation component; Relevance; Seach Engines

27. Parameter-Free Audio Motif Discovery in Large Data Archives.

【Paper Link】【Pages】:261-270

【Authors】: Yuan Hao ; Mohammad Shokoohi-Yekta ; George Papageorgiou ; Eamonn J. Keogh

【Abstract】: The discovery of repeated structure, i.e. motifs/near-duplicates, is often the first step in exploratory data mining. As such, the last decade has seen extensive research efforts in motif discovery algorithms for text, DNA, time series, protein sequences, graphs, images, and video. Surprisingly, there has been less attention devoted to finding repeated patterns in audio sequences, in spite of their ubiquity in science and entertainment. While there is significant work for the special case of motifs in music, virtually all this work makes many assumptions about data (often to the point of being genre specific) and thus these algorithms do not generalize to audio sequences containing animal vocalizations, industrial processes, or a host of other domains that we may wish to explore. In this work we introduce a novel technique for finding audio motifs. Our method does not require any domain-specific tuning and is essentially parameter-free. We demonstrate our algorithm on very diverse domains, finding audio motifs in laboratory mice vocalizations, wild animal sounds, music, and human speech. Our experiments demonstrate that our ideas are effective in discovering objectively correct or subjectively plausible motifs. Moreover, we show our novel probabilistic early abandoning approach is efficient, being two to three orders of magnitude faster than brute-force search, and thus faster than real-time for most problems.

【Keywords】: audio signal processing; data mining; music; probability; animal vocalizations; audio motif finding technique; audio sequence repeated pattern finding; brute-force search; domain-specific tuning; exploratory data mining; human speech; industrial processes; laboratory mice vocalizations; large data archives; parameter-free audio motif discovery; probabilistic early abandoning approach; repeated structure discovery; wild animal sounds; Data mining; Feature extraction; Heuristic algorithms; Mice; Music; Spectrogram; Speech; anytime algorithm; audio motif; spectrogram

28. Kernel Density Metric Learning.

【Paper Link】【Pages】:271-280

【Authors】: Yujie He ; Wenlin Chen ; Yixin Chen ; Yi Mao

【Abstract】: This paper introduces a supervised metric learning algorithm, called kernel density metric learning (KDML), which is easy to use and provides nonlinear, probability-based distance measures. KDML constructs a direct nonlinear mapping from the original input space into a feature space based on kernel density estimation. The nonlinear mapping in KDML embodies established distance measures between probability density functions, and leads to correct classification on datasets for which linear metric learning methods would fail. It addresses the severe challenge to kNN when features are from heterogeneous domains and, as a result, the Euclidean or Mahalanobis distance between original feature vectors is not meaningful. Existing metric learning algorithms can then be applied to the KDML features. We also propose an integrated optimization algorithm that learns not only the Mahalanobis matrix but also kernel bandwidths, the only hyper-parameters in the nonlinear mapping. KDML can naturally handle not only numerical features, but also categorical ones, which is rarely found in previous metric learning algorithms. Extensive experimental results on various datasets show that KDML significantly improves existing metric learning algorithms in terms of kNN classification accuracy.

【Keywords】: learning (artificial intelligence); matrix algebra; optimisation; pattern classification; probability; Euclidean distance; KDML; Mahalanobis distance; Mahalanobis matrix; dataset classification; direct nonlinear mapping; feature space; feature vectors; heterogeneous domain; input space; integrated optimization algorithm; kNN classification accuracy; kernel bandwidth; kernel density estimation; kernel density metric learning; probability density functions; probability-based distance measures; supervised metric learning algorithm; Density measurement; Euclidean distance; Kernel; Learning systems; Optimization; Vectors

29. Classification of Multi-dimensional Streaming Time Series by Weighting Each Classifier's Track Record.

【Paper Link】【Pages】:281-290

【Authors】: Bing Hu ; Yanping Chen ; Jesin Zakaria ; Liudmila Ulanova ; Eamonn J. Keogh

【Abstract】: Extensive research on time series classification in the last decade has produced fast and accurate algorithms for the single-dimensional case. However, the increasing prevalence of inexpensive sensors has reinforced the need for algorithms to handle multi-dimensional time series. For example, modern smartphones have at least a dozen sensors capable of producing streaming time series, and hospital-based (and increasingly, home-based) medical devices can produce time series streams from more than twenty sensors. The two most common ways to generalize from single to multi-dimensional data are to use all the streams or just the single best stream as determined at training time. However, as we show here, both approaches can be very brittle. Moreover, neither approach exploits the observation that different sensors may be considered "experts" on different classes. In this work, we introduce a novel framework for multi-dimensional time series classification that weights the class prediction from each time series stream. These weights are based not only on each stream's previous track record on the class it is currently predicting, but also on the distance from the unlabeled object. As we demonstrate with extensive experiments on real data, our method is more accurate than current approaches and particularly robust in the face of concept drift or sensor noise.

【Keywords】: medical administrative data processing; pattern classification; time series; class prediction; hospital-based medical devices; multidimensional streaming time series classification; Classification algorithms; Footwear; Sensors; Time series analysis; Training; Training data; Wrist; classification; multi-dimensional time series

30. Identifying Transformative Scientific Research.

【Paper Link】【Pages】:291-300

【Authors】: Yi-Hung Huang ; Chun-Nan Hsu ; Kristina Lerman

【Abstract】: Transformative research refers to research that shifts or disrupts established scientific paradigms. Notable examples include the discovery of high-temperature superconductivity that disrupted the theory established 30 years ago. Identifying potential transformative research early and accurately is important for funding agencies to maximize the impact of their investments. It also helps scientists identify and focus their attention on promising emerging works. This paper presents a data driven approach where citation patterns of scientific papers are analyzed to quantify how much a potential challenger idea shifts an established paradigm. The key idea is that transformative research creates an observable disruption in the structure of "information cascades," chains of references that can be traced back to the papers establishing some scientific paradigm. Such a disruption is visible soon after the challenger's introduction. We define a disruption score to quantify the disruption and develop an algorithm to compute it from a large citation network. Experimental results show that our approach can successfully identify transformative scientific papers that disrupt established paradigms in Physics and Computer Science, regardless of whether the challenger paradigm is an instant hit or a classic whose contribution is formally recognized with a Nobel Prize decades later.

【Keywords】: citation analysis; data handling; data-driven approach; high-temperature superconductivity; information cascades; large citation network; transformative scientific research identification; Communities; Computer science; Data mining; High-temperature superconductors; Physics; Reliability; cascades; citation network; diffusion; information spread

31. Min-Max Hash for Jaccard Similarity.

【Paper Link】【Pages】:301-309

【Authors】: Jianqiu Ji ; Jianmin Li ; Shuicheng Yan ; Qi Tian ; Bo Zhang

【Abstract】: Min-wise hash is a widely-used hashing method for scalable similarity search in terms of Jaccard similarity, while in practice it is necessary to compute many such hash functions for certain precision, leading to expensive computational cost. In this paper, we introduce an effective method, i.e. the min-max hash method, which significantly reduces the hashing time by half, yet it has a provably slightly smaller variance in estimating pair wise Jaccard similarity. In addition, the estimator of min-max hash only contains pair wise equality checking, thus it is especially suitable for approximate nearest neighbor search. Since min-max hash is equally simple as min-wise hash, many extensions based on min-wise hash can be easily adapted to min-max hash, and we show how to combine it with b-bit minwise hash. Experiments show that with the same length of hash code, min-max hash reduces the hashing time to half as much as that of min-wise hash, while achieving smaller mean squared error (MSE) in estimating pair wise Jaccard similarity, and better best approximate ratio (BAR) in approximate nearest neighbor search.

【Keywords】: mean square error methods; minimax techniques; pattern clustering; search problems; approximate nearest neighbor search; b-bit minwise hash; best approximate ratio; hash code; hashing method; hashing time reduction; mean squared error; min-max hash estimator; min-max hash method; min-wise hash; pairwise Jaccard similarity estimation; pairwise equality checking; scalable similarity search; Approximation algorithms; Approximation methods; Computational efficiency; Computer science; Educational institutions; Laboratories; Nearest neighbor searches; Jaccard similarity; approximate nearest neighbor search; min-max hash; min-wise hash

32. GRIAS: An Entity-Relation Graph Based Framework for Discovering Entity Aliases.

【Paper Link】【Pages】:310-319

【Authors】: Lili Jiang ; Ping Luo ; Jianyong Wang ; Yuhong Xiong ; Bingduan Lin ; Min Wang ; Ning An

【Abstract】: Recognizing the various aliases of an entity is a critical task for many applications, including Web search, recommendation system, and e-discovery. The goal of this paper is to accurately identify entity aliases, especially the long tail ones in the unstructured data. Our solution GRIAS (abbr. for a Graph-based framework for discovering entity Aliases) is motivated by the entity relationships collected from both the structured and unstructured data. These relationships help to build an entity-relation graph, and the graph-based similarity is calculated between an entity and its alias candidates which are first chosen by our proposed candidate selection method. Extensive experimental results on two real-world datasets demonstrate both the effectiveness and efficiency of the proposed framework.

【Keywords】: data mining; graph theory; GRIAS; Web search; candidate selection method; e-discovery; entity-relation graph; graph-based framework for discovering entity aliases; graph-based similarity; recommendation system; structured data; unstructured data; Cameras; Databases; Earth Observing System; Graphics; Organizations; Terminology; alias similarity; entity alias; entity-relation graph

33. Focal-Test-Based Spatial Decision Tree Learning: A Summary of Results.

【Paper Link】【Pages】:320-329

【Authors】: Zhe Jiang ; Shashi Shekhar ; Xun Zhou ; Joseph Knight ; Jennifer Corcoran

【Abstract】: Given a raster spatial framework, as well as training and test sets, the spatial decision tree learning (SDTL) problem aims to minimize classification errors as well as salt-and-pepper noise. The SDTL problem is important due to many societal applications such as land cover classification in remote sensing. However, the SDTL problem is challenging due to the spatial autocorrelation of class labels, and the potentially exponential number of candidate trees. Related work is limited due to the use of local-test-based decision nodes, which can not adequately model spatial autocorrelation during test phase, leading to high salt-and-pepper noise. In contrast, we propose a focal-test-based spatial decision tree (FTSDT) model, where the tree traversal direction for a location is based on not only local but also focal (i.e., neighborhood) properties of the location. Experimental results on real world remote sensing datasets show that the proposed approach reduces salt-and-pepper noise and improves classification accuracy.

【Keywords】: decision trees; geophysical image processing; image classification; image denoising; learning (artificial intelligence); remote sensing; FTSDT model; SDTL problem; candidate trees; classification error minimization; focal-test-based spatial decision tree learning problem; land cover classification; real world remote sensing datasets; remote sensing; salt-and-pepper noise; societal applications; spatial autocorrelation; tree traversal direction; Correlation; Decision trees; Noise; Noise measurement; Prediction algorithms; Remote sensing; Training; focal test; spatial autocorrelation; spatial data mining; spatial decision tree

34. Conformal Prediction Using Decision Trees.

【Paper Link】【Pages】:330-339

【Authors】: Ulf Johansson ; Henrik Boström ; Tuve Löfström

【Abstract】: Conformal prediction is a relatively new framework in which the predictive models output sets of predictions with a bound on the error rate, i.e., in a classification context, the probability of excluding the correct class label is lower than a predefined significance level. An investigation of the use of decision trees within the conformal prediction framework is presented, with the overall purpose to determine the effect of different algorithmic choices, including split criterion, pruning scheme and way to calculate the probability estimates. Since the error rate is bounded by the framework, the most important property of conformal predictors is efficiency, which concerns minimizing the number of elements in the output prediction sets. Results from one of the largest empirical investigations to date within the conformal prediction framework are presented, showing that in order to optimize efficiency, the decision trees should be induced using no pruning and with smoothed probability estimates. The choice of split criterion to use for the actual induction of the trees did not turn out to have any major impact on the efficiency. Finally, the experimentation also showed that when using decision trees, standard inductive conformal prediction was as efficient as the recently suggested method cross-conformal prediction. This is an encouraging results since cross-conformal prediction uses several decision trees, thus sacrificing the interpretability of a single decision tree.

【Keywords】: conformal mapping; decision trees; pattern classification; prediction theory; probability; class label; conformal prediction framework; cross-conformal prediction; decision trees; output prediction sets; predictive models; probability estimates; pruning scheme; split criterion; Accuracy; Calibration; Decision trees; Prediction algorithms; Predictive models; Probability; Standards; Conformal prediction; Decision trees

35. An Unsupervised Algorithm for Learning Blocking Schemes.

【Paper Link】【Pages】:340-349

【Authors】: Mayank Kejriwal ; Daniel P. Miranker

【Abstract】: A pair wise comparison of data objects is a requisite step in many data mining applications, but has quadratic complexity. In applications such as record linkage, blocking methods may be applied to reduce the cost. That is, the data is first partitioned into a set of blocks, and pair wise comparisons computed for pairs within each block. To date, blocking methods have required the blocking scheme be given, or the provision of training data enabling supervised learning algorithms to determine a blocking scheme. In either case, a domain expert is required. This paper develops an unsupervised method for learning a blocking scheme for tabular data sets. The method is divided into two phases. First, a weakly labeled training set is generated automatically in time linear in the number of records of the entire dataset. The second phase casts blocking key discovery as a Fisher feature selection problem. The approach is compared to a state-of-the-art supervised blocking key discovery algorithm on three real-world databases and achieves favorable results.

【Keywords】: data mining; expert systems; unsupervised learning; Fisher feature selection problem; cost reduction; data mining; data objects; domain expert; learning blocking schemes; quadratic complexity; record linkage; supervised blocking key discovery algorithm; supervised learning algorithms; tabular data sets; unsupervised learning algorithm; weakly labeled training set; Complexity theory; Couplings; Indexing; Measurement; Personnel; Training; Blocking; Record Linkage

36. Semantic Frame-Based Document Representation for Comparable Corpora.

【Paper Link】【Pages】:350-359

【Authors】: Hyungsul Kim ; Xiang Ren ; Yizhou Sun ; Chi Wang ; Jiawei Han

【Abstract】: Document representation is a fundamental problem for text mining. Many efforts have been done to generate concise yet semantic representation, such as bag-of-words, phrase, sentence and topic-level descriptions. Nevertheless, most existing techniques counter difficulties in handling monolingual comparable corpus, which is a collection of monolingual documents conveying the same topic. In this paper, we propose the use of frame, a high-level semantic unit, and construct frame-based representations to semantically describe documents by bags of frames, using an information network approach. One major challenge in this representation is that semantically similar frames may be of different forms. For example, "radiation leaked" in one news article can appear as "the level of radiation increased" in another article. To tackle the problem, a text-based information network is constructed among frames and words, and a link-based similarity measure called SynRank is proposed to calculate similarity between frames. As a result, different variations of the semantically similar frames are merged into a single descriptive frame using clustering, and a document can then be represented as a bag of representative frames. It turns out that frame-based document representation not only is more interpretable, but also can facilitate other text analysis tasks such as event tracking effectively. We conduct both qualitative and quantitative experiments on three comparable news corpora, to study the effectiveness of frame-based document representation and the similarity measure SynRank, respectively, and demonstrate that the superior performance of frame-based document representation on different real-world applications.

【Keywords】: data mining; data structures; pattern clustering; text analysis; SynRank; bag-of-words; comparable corpora; event tracking; high-level semantic unit; link-based similarity measure; monolingual comparable corpus handling; pattern clustering; semantic frame-based document representation; single descriptive frame; text analysis tasks; text mining; text-based information network approach; topic-level descriptions; Context; Data mining; Earthquakes; Equations; Labeling; Semantics; Tsunami; Clustering; Document Representation; Graph Similarity

37. Regularization Paths for Sparse Nonnegative Least Squares Problems with Applications to Life Cycle Assessment Tree Discovery.

【Paper Link】【Pages】:360-369

【Authors】: Jingu Kim ; Naren Ramakrishnan ; Manish Marwah ; Amip Shah ; Haesun Park

【Abstract】: The nonnegative least squares problems are useful in applications where the physical nature of problem domain permits only additive linear combinations. We discuss the l₁-regularized nonnegative least squares (L1-NLS) problem, where l1-regularization is used to induce sparsity. Although l₁-regularization has been successfully used in least squares regression, when combined with nonnegativity constraints, developments of algorithms and their understandings have been limited. We propose an algorithm that generates the entire regularization paths of the L1-NLS problem. We prove the correctness of the proposed algorithm and illustrate a novel application in environmental sustainability. The application relates to life cycle assessment (LCA), a technique used to estimate environmental impact during the entire lifetime of a product. We address an inverse problem in LCA. Given environmental impact factors of a target product and of a large library of constituents, the goal is to reverse engineer an inventory tree for the product. Using real-world data sets, we demonstrate how our L1-NLS approach controls the size of discovered trees, and how the full regularization paths effectively illustrate the spectrum of discovered trees with varying sparsity and compositions.

【Keywords】: data handling; inventory management; inverse problems; least squares approximations; product life cycle management; production engineering computing; regression analysis; reverse engineering; sustainable development; trees (mathematics); L1-NLS problem; LCA technique; environmental impact estimation; environmental impact factors; environmental sustainability; inventory tree; inverse problem; l₁-regularized nonnegative least square problem; least squares regression; life cycle assessment tree discovery; nonnegative least squares problems; nonnegativity constraints; product lifetime; real-world data sets; reverse engineer; sparse nonnegative least square problem regularization path; Books; Electronic publishing; Indexes; Libraries; Materials; Vegetation; full regularization path; l1-regularization; life cycle assessment; nonnegative least squares

38. A High-Dimensional Set Top Box Ad Targeting Algorithm Including Experimental Comparisons to Traditional TV Algorithms.

【Paper Link】【Pages】:370-378

【Authors】: Brendan Kitts ; Dyng Au ; Brian Burdick

【Abstract】: We present a method for targeting ads on television that works on today's TV systems. The method works by mining vast amounts of Set Top Box data, as well as advertiser customer data. From both sources the system builds demographic profiles, and then looks for media that have the highest match per dollar to the customer profile. The method was tested in four live television campaigns, comprising over 22,000 airings, and we present experimental results.

【Keywords】: advertising; data mining; digital television; set-top boxes; TV algorithms; advertiser customer data mining; customer profile; demographic profiles; high-dimensional set-top box ad targeting algorithm; live television campaigns; set-top box data mining; Advertising; Educational institutions; Media; Sociology; Statistics; TV; Vectors; advertising; set top box; targeting; television

39. Efficient Algorithms for Selecting Features with Arbitrary Group Constraints via Group Lasso.

【Paper Link】【Pages】:379-388

【Authors】: Deguang Kong ; Chris H. Q. Ding

【Abstract】: Feature structure information plays an important role for regression and classification tasks. We consider a more generic problem: group lasso problem, where structures over feature space can be represented as a combination of features in a group. These groups can be either overlapped or non-overlapped, which are specified in different structures, e.g., structures over a line, a tree, a graph or even a forest. We propose a new approach to solve this generic group lasso problem, where certain features are selected in a group, and an arbitrary family of subset is allowed. We employ accelerated proximal gradient method to solve this problem, where a key step is solve the associated proximal operator. We propose a fast method to compute the proximal operator, where its convergence is rigorously proved. Experimental results on different structures (e.g., group, tree, graph structures) demonstrate the efficiency and effectiveness of the proposed algorithm.

【Keywords】: feature selection; gradient methods; pattern classification; regression analysis; accelerated proximal gradient method; arbitrary group constraints; associated proximal operator; classification task; feature selection; feature space; feature structure information; group lasso problem; regression task; Acceleration; Convergence; Gradient methods; Indexes; Input variables; Standards; Vegetation; exclusive lasso; feature group constraint; feature selection; group lasso; lasso

40. BIG-ALIGN: Fast Bipartite Graph Alignment.

【Paper Link】【Pages】:389-398

【Authors】: Danai Koutra ; Hanghang Tong ; David Lubensky

【Abstract】: How can we find the virtual twin (i.e., the same or similar user) on Linked In for a user on Facebook? How can we effectively link an information network with a social network to support cross-network search? Graph alignment - the task of finding the node correspondences between two given graphs - is a fundamental building block in numerous application domains, such as social networks analysis, bioinformatics, chemistry, pattern recognition. In this work, we focus on aligning bipartite graphs, a problem which has been largely ignored by the extensive existing work on graph matching, despite the ubiquity of those graphs (e.g., users-groups network). We introduce a new optimization formulation and propose an effective and fast algorithm to solve it. We also propose a fast generalization of our approach to align unipartite graphs. The extensive experimental evaluations show that our method outperforms the state-of-art graph matching algorithms in both alignment accuracy and running time, being up to 10x more accurate or 174x faster on real graphs.

【Keywords】: graph theory; optimisation; pattern matching; social networking (online); BIG-ALIGN; Facebook; Linked In; bioinformatics; chemistry; cross-network search; fast bipartite graph alignment; fast generalization; graph matching algorithms; information network; optimization formulation; pattern recognition; social network; social network analysis; virtual twin; Bipartite graph; Cost function; Facebook; LinkedIn; Probabilistic logic; Sparse matrices

41. Blocking Simple and Complex Contagion by Edge Removal.

【Paper Link】【Pages】:399-408

【Authors】: Chris J. Kuhlman ; Gaurav Tuli ; Samarth Swarup ; Madhav V. Marathe ; S. S. Ravi

【Abstract】: Eliminating interactions among individuals is an important means of blocking contagion spread, e.g., closing schools during an epidemic or shutting down electronic communication channels during social unrest. We study contagion blocking in networked populations by identifying edges to remove from a network, thus blocking contagion transmission pathways. We formulate various problems to minimize contagion spread and show that some are efficiently solvable while others are formally hard. We also compare our hardness results to those from node blocking problems and show interesting differences between the two. Our main problem is not only hard, but also has no approximation guarantee, unless P=NP. Therefore, we devise a heuristic for the problem and compare its performance to state-of-the-art heuristics from the literature. We show, through results of 12 (network, heuristic) combinations on three real social networks, that our method offers considerable improvement in the ability to block contagions in weighted and unweighted networks. We also conduct a parametric study to understand the limitations of our approach.

【Keywords】: computational complexity; network theory (graphs); social sciences; complex contagion; contagion blocking; edge removal; networked populations; node blocking problems; social networks; social unrest; unweighted networks; weighted networks; Approximation methods; Context; Contracts; Educational institutions; Integrated circuit modeling; Public healthcare; Social network services; complex contagion; contagion blocking; simple contagion

42. Guiding Autonomous Agents to Better Behaviors through Human Advice.

【Paper Link】【Pages】:409-418

【Authors】: Gautam Kunapuli ; Phillip Odom ; Jude W. Shavlik ; Sriraam Natarajan

【Abstract】: Inverse Reinforcement Learning (IRL) is an approach for domain-reward discovery from demonstration, where an agent mines the reward function of a Markov decision process by observing an expert acting in the domain. In the standard setting, it is assumed that the expert acts (nearly) optimally, and a large number of trajectories, i.e., training examples are available for reward discovery (and consequently, learning domain behavior). These are not practical assumptions: trajectories are often noisy, and there can be a paucity of examples. Our novel approach incorporates advice-giving into the IRL framework to address these issues. Inspired by preference elicitation, a domain expert provides advice on states and actions (features) by stating preferences over them. We evaluate our approach on several domains and show that with small amounts of targeted preference advice, learning is possible from noisy demonstrations, and requires far fewer trajectories compared to simply learning from trajectories alone.

【Keywords】: Markov processes; decision theory; learning (artificial intelligence); software agents; IRL framework; Markov decision process; autonomous agents; domain expert; domain-reward discovery; human advice; inverse reinforcement learning; learning domain behavior; preference elicitation; reward function; Data mining; Educational institutions; Equations; Learning (artificial intelligence); Noise measurement; Training; Trajectory

43. TL-PLSA: Transfer Learning between Domains with Different Classes.

【Paper Link】【Pages】:419-427

【Authors】: Anastasia Krithara ; Georgios Paliouras

【Abstract】: A new transfer learning method is presented in this paper, addressing a particularly hard transfer learning problem: the case where the target domain shares only a subset of its classes with the source domain and only unlabeled data are provided for the target domain. This is a situation that occurs frequently in real-world applications, such as the multiclass document classification problems that motivated our work. The proposed approach is a transfer learning variant of the Probabilistic Latent Semantic Analysis (PLSA) model that we name TL-PLSA. Unlike most approaches in the literature, TL-PLSA captures both the difference of the domains and the commonalities of the class sets, given no labelled data from the target domain. We perform experiments over three different datasets and show the difficulty of the task, as well as the promising results that we obtained with the new method.

【Keywords】: learning (artificial intelligence); probability; TL-PLSA; multiclass document classification problems; probabilistic latent semantic analysis model; source domain; target domain; transfer learning method; unlabeled data; Data models; Graphical models; Mathematical model; Probabilistic logic; Semantics; Training; Training data; PLSA; multiclass classification; transfer learning

44. Learning, Analyzing and Predicting Object Roles on Dynamic Networks.

【Paper Link】【Pages】:428-437

【Authors】: Kang Li ; Suxin Guo ; Nan Du ; Jing Gao ; Aidong Zhang

【Abstract】: Dynamic networks are structures with objects and links between the objects that vary in time. Temporal information in dynamic networks can be used to reveal many important phenomena such as bursts of activities in social networks and human communication patterns in email networks. In this area, one very important problem is to understand dynamic patterns of object roles. For instance, will a user become a peripheral node in a social network? Could a website become a hub on the Internet? Will a gene be highly expressed in gene-gene interaction networks in the later stage of a cancer? In this paper, we propose a novel approach that identifies the role of each object, tracks the changes of object roles over time, and predicts the evolving patterns of the object roles in dynamic networks. In particular, a probability model is proposed to extract latent features of object roles from dynamic networks. The extracted latent features are discriminative in learning object roles and are capable of characterizing network structures. The probability model is then extended to learn the dynamic patterns and make predictions on object roles. We assess our method on two data sets on the tasks of exploring how users' importance and political interests evolve as time progresses on dynamic networks. Overall, the extensive experimental evaluations confirm the effectiveness of our approach for identifying, analyzing and predicting object roles on dynamic networks.

【Keywords】: data mining; feature extraction; learning (artificial intelligence); network theory (graphs); probability; Internet; Web site; dynamic networks; dynamic patterns; gene-gene interaction networks; latent feature extraction; network structures; object role analysis; object role identification; object role learning; object role patterns; object role prediction; probability model; social networks; Analytical models; Bayes methods; Bismuth; Data mining; Feature extraction; Predictive models; Social network services; Dynamic Network; Object Role

45. Tag-Weighted Dirichlet Allocation.

【Paper Link】【Pages】:438-447

【Authors】: Shuangyin Li ; Guan Huang ; Ruiyang Tan ; Rong Pan

【Abstract】: In the past two decades, there has been a huge amount of document data with rich tag information during the evolution of the Internet, which can be called semi-structured data. These semi-structured data contain both unstructured features (e.g., plain text) and metadata, such as tags in html files or author and venue information in research articles. It's of great interest to model such kind of data. Most previous works focused on modeling the unstructured data. Some other methods have been proposed to model the unstructured data with specific tags. To build a general model for semi-structured documents remains an important problem in terms of both model fitness and efficiency. In this paper, we propose a novel method to model the tagged documents by a so-called Tag-Weighted Dirichlet Allocation (TWDA). TWDA is a framework that leverages both the tags and words in each document to infer the topic components for the documents. This allows not only to learn the document-topic and topic-word distributions, but also to infer the tag-topic distributions for text mining (e.g., classification, clustering, and recommendations). Moreover, TWDA can automatically infer the probabilistic weights of tags for each document, that can be used to predict the tags in one document. We present an efficient variational inference method with an EM algorithm for estimating the model parameters. The experimental results show the effectiveness, efficiency and robustness of our TWDA approach by comparing it with the state-of-the-art methods on four corpora in document modeling, tags prediction and text classification.

【Keywords】: data mining; expectation-maximisation algorithm; inference mechanisms; parameter estimation; pattern classification; pattern clustering; probability; recommender systems; text analysis; EM algorithm; HTML files; TWDA framework; author information; document tag leveraging; document word leveraging; document-topic distribution learning; metadata; model efficiency; model fitness; model parameter estimation; plain text; probabilistic weights; research articles; semistructured document data; tag information; tag prediction; tag-topic distribution inference; tag-weighted Dirichlet allocation; tagged document modelling; text classification; text clustering; text mining; text recommendations; topic components; topic-word distribution learning; variational inference method; venue information; Analytical models; Data models; Mathematical model; Predictive models; Probabilistic logic; Resource management; Vectors; Dirichlet allocation; Tag-Weighted; topic model; variational inference

46. Mining Probabilistic Frequent Spatio-Temporal Sequential Patterns with Gap Constraints from Uncertain Databases.

【Paper Link】【Pages】:448-457

【Authors】: Yuxuan Li ; James Bailey ; Lars Kulik ; Jian Pei

【Abstract】: Uncertainty is common in real-world applications, for example, in sensor networks and moving object tracking, resulting in much interest in item set mining for uncertain transaction databases. In this paper, we focus on pattern mining for uncertain sequences and introduce probabilistic frequent spatial-temporal sequential patterns with gap constraints. Such patterns are important for the discovery of knowledge given uncertain trajectory data. We propose a dynamic programming approach for computing the frequentness probability of these patterns, which has linear time complexity, and we explore its embedding into pattern enumeration algorithms using both breadth-first search and depth-first search strategies. Our extensive empirical study shows the efficiency and effectiveness of our methods for synthetic and real-world datasets.

【Keywords】: computational complexity; data mining; dynamic programming; tree searching; breadth-first search strategy; depth-first search strategy; dynamic programming approach; gap constraints; knowledge discovery; linear time complexity; pattern enumeration algorithms; probabilistic frequent spatio-temporal sequential pattern mining; uncertain databases; uncertain trajectory data; Data mining; Databases; Dynamic programming; Mathematical model; Probabilistic logic; Trajectory; Uncertainty; Sequential patterns; Spatial-temporal data; Uncertain databases; Uncertain pattern mining

47. Mining Following Relationships in Movement Data.

【Paper Link】【Pages】:458-467

【Authors】: Zhenhui Li ; Fei Wu ; Margaret Crofoot

【Abstract】: Movement data have been widely collected from GPS and sensors, allowing us to analyze how moving objects interact in terms of space and time and to learn about the relationships that exist among the objects. In this paper, we investigate an interesting relationship that has not been adequately studied so far: the following relationship. Intuitively, a follower has similar trajectories as its leader but always arrives at a location with some time lag. The challenges in mining the following relationship are: (1) the following time lag is usually unknown and varying, (2) the trajectories of the follower and leader are not identical, and (3) the relationship is subtle and only occurs in a short period of time. In this paper, we propose a simple but practical method that addresses all these challenges. It requires only two intuitive parameters and is able to mine following time intervals between two trajectories in linear time. We conduct comprehensive experiments on both synthetic and real datasets to demonstrate the effectiveness of our method.

【Keywords】: data mining; directed graphs; following time lag; mining following relationships; movement data; Animals; Correlation; Data mining; Educational institutions; Heuristic algorithms; Time series analysis; Trajectory

【Paper Link】【Pages】:468-477

【Authors】: Qi Liu ; Biao Xiang ; Lei Zhang ; Enhong Chen ; Chang Tan ; Ji Chen

【Abstract】: Recent years have witnessed the increased interests in exploiting influence in social networks for many applications. To the best of our knowledge, from the computational aspect of social influence analysis, most of existing work focus on either describing the influence propagation process or identifying the set of most influential seed nodes. However, these work usually do not distinguish the "independent influence" of each single seed node after removing other seeds. Since it is important to quickly figure out the real contribution of each seed, in this paper we propose to measure the seed's independent influence by a linear social influence model. Specifically, we first describe the linear social influence model, and then define the independent influence under this model for eliminating the "mutual enrichment" between seed nodes. Meanwhile, we find that the influence of a set of nodes is actually the sum of their independent influence, and we also give upper bounds for independent influence. Moreover, these findings are evaluated by two applications, i.e., ranking the seeds by their independent influence and identifying the Top-K influential ones. Finally, the experimental results on several real-world datasets validate the effectiveness and efficiency of the proposed independent social influence measures.

【Keywords】: social networking (online); independent social influence analysis; independent social influence measures; influence propagation process; influential seed nodes; linear computation; linear social influence model; social networks; top-k influential seed identification; Computational modeling; Equations; Integrated circuit modeling; Mathematical model; Upper bound; Vectors; Independent; Linear Computation; Ranking; Social influence; Upper Bound

49. Learning Imbalanced Multi-class Data with Optimal Dichotomy Weights.

【Paper Link】【Pages】:478-487

【Authors】: Xu-Ying Liu ; Qian-Qian Li ; Zhi-Hua Zhou

【Abstract】: Class-imbalance is very common in real data mining tasks. Previous studies focused on binary-class imbalance problem, whereas multi-class imbalance problem is more challenging. Error correcting output codes (ECOC) technique can be applied to class-imbalance problem, however, the standard ECOC aims at maximizing accuracy, ignoring the fact that, when class-imbalance is really a problem, the minority classes are more important than the majority classes. To enable ECOC to tackle multi-class imbalance, it is desired to have an appropriate code matrix, an effective learning strategy and a decoding strategy emphasizing the minority classes. In this paper, based on the aforementioned consideration, we propose the imECOC method which works on dichotomies to handle both the between-class imbalance and within-class imbalance. As the dichotomy classifiers contribute differently to the final prediction, imECOC assigns weights to dichotomies and uses weighted distance for decoding, where the optimal dichotomy weights are obtained by minimizing a weighted loss in favor of the minority classes. Experimental results on fourteen data sets show that, imECOC performs significantly better than many state-of-the-art multi-class imbalance learning methods, no matter whether multi-class F1, G-mean or AUC are used as evaluation measures.

【Keywords】: data mining; decoding; error correction codes; learning (artificial intelligence); pattern classification; AUC; ECOC technique; between-class imbalance; binary-class imbalance problem; code matrix; data mining tasks; decoding strategy; error correcting output codes; evaluation measures; imECOC; imbalanced multiclass data learning strategy; majority classes; minority classes; multiclass F1; optimal dichotomy weights; weighted distance; weighted loss minimization; within-class imbalance; Accuracy; Complexity theory; Decoding; Encoding; Learning systems; Standards; Training; Class-imbalance learning; ECOC; Multi-class

50. Mining Statistically Significant Sequential Patterns.

【Paper Link】【Pages】:488-497

【Authors】: Cécile Low-Kam ; Chedy Raïssi ; Mehdi Kaytoue ; Jian Pei

【Abstract】: Recent developments in the frequent pattern mining framework uses additional measures of interest to reduce the set of discovered patterns. We introduce a rigorous and efficient approach to mine statistically significant, unexpected patterns in sequences of item sets. The proposed methodology is based on a null model for sequences and on a multiple testing procedure to extract patterns of interest. Experiments on sequences of replays of a video game demonstrate the scalability and the efficiency of the method to discover unexpected game strategies.

【Keywords】: computer games; data mining; statistical analysis; frequent pattern mining framework; game strategies; interest pattern extraction; item set sequences; multiple testing procedure; pattern discovery; sequences null model; statistically significant sequential pattern mining; video game; Computational modeling; Data mining; Entropy; Hidden Markov models; Itemsets; Random variables; Sequential pattern; null model; significance test

51. Mining Summaries of Propagations.

【Paper Link】【Pages】:498-507

【Authors】: Lucrezia Macchia ; Francesco Bonchi ; Francesco Gullo ; Luca Chiarandini

【Abstract】: Analyzing the traces left by a meme of information propagating through a social network or by a user browsing a website can help to unveil the structure and dynamics of such complex networks. This may in turn open the door to concrete applications, such as finding influential users for a topic in a social network, or detecting the typical structure of a web browsing session that leads to a product purchase. In this paper we define the problem of mining summaries of propagations as a constrained pattern-mining problem. A propagation is a DAG where an entity (e.g., information exchanged in a social network, or a user browsing a website) flows following the underlying hierarchical structure of the nodes. A summary is a set of propagations that (i) involve a similar population of nodes, and (ii) exhibit a coherent hierarchical structure when merged altogether to form a single graph. The first constraint is defined based on the Jaccard coefficient, while the definition of the second one relies on the graph-theoretic concept of "agony" of a graph. It turns out that both constraints satisfy the downward closure property, thus enabling Apriori-like algorithms. However, motivated by the fact that computing agony is much more expensive than computing Jaccard, we devise two algorithms that explore the search space differently. The first algorithm is an Apriori-like, bottom-up method that checks both the constraints level-by-level. The second algorithm consists of a first phase where the search space is pruned as much as possible by exploiting the Jaccard constraint only, while involving the second constraint only afterwards, in a subsequent phase. We test our algorithms on four real-world datasets. Quantitative results reveal that the choice of the most efficient algorithm depends on the selectivity of the two constraints. Qualitative results show the relevance of the extracted summaries in a number of real-world scenarios.

【Keywords】: data mining; directed graphs; social networking (online); Apriori-like algorithms; Apriori-like bottom-up method; DAG; Jaccard coefficient; Jaccard constraint; Web browsing session; Web site; agony; complex networks; constrained pattern mining problem; downward-closure property; entity; graph-theoretic concept; hierarchical structure; information propagation; search space; social network; summary mining; Data mining; Databases; Lattices; Periodic structures; Social network services; Sociology; Statistics; Apriori; graphs; influence maximization; information propagation; pattern mining; social networks; web browsing

52. Active Density-Based Clustering.

【Paper Link】【Pages】:508-517

【Authors】: Son T. Mai ; Xiao He ; Nina Hubig ; Claudia Plant ; Christian Böhm

【Abstract】: The density-based clustering algorithm DBSCAN is a fundamental technique for data clustering with many attractive properties and applications. However, DBSCAN requires specifying all pair wise (dis)similarities among objects that can be non-trivial to obtain in many applications. To tackle this problem, in this paper, we propose a novel active density-based clustering algorithm, named Act-DBSCAN, which works under a restricted number of used pair wise similarities. Act-DBSCAN exploits the pair wise lower-bounding (LB) similarities to initialize the cluster structure. Then, it adaptively selects the most informative pair wise LB similarities to update with the real ones in order to reconstruct the result until the budget limitation is reached. The goal is to approximate as much as possible the true clustering result with each update. Our Act-DBSCAN framework is built upon a proposed probabilistic model to score the impact of the update of each pair wise LB similarity on the change of the intermediate clustering structure. Deriving from this scoring system and the monotonicity and reduction property of our active clustering process, we propose the two efficient algorithms to iteratively select and update pair wise similarities and cluster structure. Experiments on real datasets show that Act-DBSCAN acquires good clustering results with only a few pair wise similarities, and requires only a small fraction of all pair wise similarities to reach the DBSCAN results. Act-DBSCAN also outperforms other related techniques such as active spectral clustering.

【Keywords】: pattern clustering; probability; Act-DBSCAN framework; active density-based clustering algorithm; active spectral clustering; cluster structure; data clustering; informative pairwise LB similarities; informative pairwise LB similarity selection; intermediate clustering structure; monotonicity property; pairwise lower-bounding similarities; reduction property; scoring system; true clustering; Clustering algorithms; Estimation; Kernel; Merging; Noise; Probabilistic logic; Training; Active clustering; Active learning; Density-based clustering

53. Explaining Outliers by Subspace Separability.

【Paper Link】【Pages】:518-527

【Authors】: Barbora Micenková ; Raymond T. Ng ; Xuan Hong Dang ; Ira Assent

【Abstract】: Outliers are extraordinary objects in a data collection. Depending on the domain, they may represent errors, fraudulent activities or rare events that are subject of our interest. Existing approaches focus on detection of outliers or degrees of outlierness (ranking), but do not provide a possible explanation of how these objects deviate from the rest of the data. Such explanations would help user to interpret or validate the detected outliers. The problem addressed in this paper is as follows: given an outlier detected by an existing algorithm, we propose a method that determines possible explanations for the outlier. These explanations are expressed in the form of subspaces in which the given outlier shows separability from the inliers. In this manner, our proposed method complements existing outlier detection algorithms by providing additional information about the outliers. Our method is designed to work with any existing outlier detection algorithm and it also includes a heuristic that gives a substantial speedup over the baseline strategy.

【Keywords】: data acquisition; data analysis; data collection; inliers; outlier detection algorithms; outlierness degrees; subspace separability; Accuracy; Data visualization; Databases; Detection algorithms; Extraterrestrial measurements; Feature extraction; Gaussian distribution; data exploration; outlier explanation; subspace selection

54. Quantification Trees.

【Paper Link】【Pages】:528-536

【Authors】: Letizia Milli ; Anna Monreale ; Giulio Rossetti ; Fosca Giannotti ; Dino Pedreschi ; Fabrizio Sebastiani

【Abstract】: In many applications there is a need to monitor how a population is distributed across different classes, and to track the changes in this distribution that derive from varying circumstances, an example such application is monitoring the percentage (or "prevalence") of unemployed people in a given region, or in a given age range, or at different time periods. When the membership of an individual in a class cannot be established deterministically, this monitoring activity requires classification. However, in the above applications the final goal is not determining which class each individual belongs to, but simply estimating the prevalence of each class in the unlabeled data. This task is called quantification. In a supervised learning framework we may estimate the distribution across the classes in a test set from a training set of labeled individuals. However, this may be sub optimal, since the distribution in the test set may be substantially different from that in the training set (a phenomenon called distribution drift). So far, quantification has mostly been addressed by learning a classifier optimized for individual classification and later adjusting the distribution it computes to compensate for its tendency to either under-or over-estimate the prevalence of the class. In this paper we propose instead to use a type of decision trees (quantification trees) optimized not for individual classification, but directly for quantification. Our experiments show that quantification trees are more accurate than existing state-of-the-art quantification methods, while retaining at the same time the simplicity and understandability of the decision tree framework.

【Keywords】: decision trees; learning (artificial intelligence); pattern classification; class prevalence estimation; classifier learning; decision tree framework; distribution drift; individual classification; labeled individual training set; quantification trees; supervised learning framework; test set; unlabeled data; Accuracy; Decision trees; Estimation; Sociology; Standards; Training; Vegetation

55. Mining Evolving Network Processes.

【Paper Link】【Pages】:537-546

【Authors】: Misael Mongiovì ; Petko Bogdanov ; Ambuj K. Singh

【Abstract】: Processes within real world networks evolve according to the underlying graph structure. A number of examples exists in diverse network genres: botnet communication growth, moving traffic jams [1], information foraging [2] in document networks (WWW and Wikipedia), and spread of viral memes or opinions in social networks. The network structure in all the above examples remains relatively fixed, while the shape, size and position of the affected network regions change gradually with time. Traffic jams grow, move, shrink and eventually disappear. Public attention shifts among current hot topics inducing a similar shift of highly accessed Wikipedia articles. Discovery of such smoothly evolving network processes has the potential to expose the intrinsic mechanisms of complex network dynamics, enable new data-driven models and improve network design. We introduce the novel problem of Mining smoothly evolving processes (MINESMOOTH) in networks with dynamic real-valued node/edge weights. We show that ensuring smooth transitions in the solution is NP-hard even on restricted network structures such as trees. We propose an efficient filtering based framework, called LEGATO. It achieves 3-7 times higher scores (i.e. larger and more significant processes) compared to alternatives on real networks, and above 80% accuracy in discovering realistic "embedded" processes in synthetic networks. In transportation networks, LEGATO discovers processes that conform to existing traffic jams models. Its results in Wikipedia reveal the temporal evolution of information seeking of Internet users.

【Keywords】: Web sites; data mining; graph theory; optimisation; traffic engineering computing; Internet users; LEGATO; MINESMOOTH; NP-hard problem; Wikipedia; dynamic real-valued node/edge weights; evolving network processes mining; information seeking; realistic embedded processes; synthetic networks; traffic jam models; transportation networks; Accuracy; Electronic publishing; Encyclopedias; Heuristic algorithms; Internet; Upper bound; dynamic networks; graph mining; network processes

56. Enumeration of Time Series Motifs of All Lengths.

【Paper Link】【Pages】:547-556

【Authors】: Abdullah Mueen

【Abstract】: Time series motifs are repeated patterns in long and noisy time series. Motifs are typically used to understand the dynamics of the source because repeated patterns with high similarity evidentially rule out the presence of noise. Recently, time series motifs have also been used for clustering, summarization, rule discovery and compression as features. For all such purposes, many high quality motifs of various lengths are desirable and thus, originates the problem of enumerating motifs for a wide range of lengths. Existing algorithms find motifs for a given length. A trivial way to enumerate motifs is to run one of the algorithms for the whole range of lengths. However, such parameter sweep is computationally infeasible for large real datasets. In this paper, we describe an exact algorithm, called MOEN, to enumerate motifs. The algorithm is an order of magnitude faster than the naive algorithm. The algorithm frees us from re-discovering the same motif at different lengths and tuning multiple data-dependent parameters. The speedup comes from using a novel bound on the similarity function across lengths and the algorithm uses only linear space unlike other motif discovery algorithms. We describe three case studies in entomology and activity recognition where MOEN enumerates several high quality motifs.

【Keywords】: data mining; pattern clustering; time series; MOEN exact algorithm; activity recognition; entomology; high quality motifs; large real datasets; linear space; motif discovery algorithms; multiple data-dependent parameter tuning; naive algorithm; parameter sweep; pattern clustering; repeated patterns; rule discovery; similarity function; time series motif enumeration; Algorithm design and analysis; Clustering algorithms; Electroencephalography; Force; Noise measurement; Time series analysis; Upper bound; Distance bound; Enumeration; Time series motif

57. Dominance Programming for Itemset Mining.

【Paper Link】【Pages】:557-566

【Authors】: Benjamin Négrevergne ; Anton Dries ; Tias Guns ; Siegfried Nijssen

【Abstract】: Finding small sets of interesting patterns is an important challenge in pattern mining. In this paper, we argue that several well-known approaches that address this challenge are based on performing pair wise comparisons between patterns. Examples include finding closed patterns, free patterns, relevant subgroups and skyline patterns. Although progress has been made on each of these individual problems, a generic approach for solving these problems (and more) is still lacking. This paper tackles this challenge. It proposes a novel, generic approach for handling pattern mining problems that involve pair wise comparisons between patterns. Our key contributions are the following. First, we propose a novel algebra for programming pattern mining problems. This algebra extends relational algebras in a novel way towards pattern mining. It allows for the generic combination of constraints on individual patterns with dominance relations between patterns. Second, we introduce a modified generic constraint satisfaction system to evaluate these algebraic expressions. Experiments show that this generic approach can indeed effectively identify patterns expressed in the algebra.

【Keywords】: data mining; relational algebra; dominance programming; dominance relations; generic approach; generic constraint satisfaction system; itemset mining; pattern mining problems; pattern pair wise comparison; relational algebra; Algebra; Data mining; Frequency measurement; Generators; Itemsets; Programming

【Paper Link】【Pages】:567-576

【Authors】: Dung T. Nguyen ; Huiyuan Zhang ; Soham Das ; My T. Thai ; Thang N. Dinh

【Abstract】: The least cost influence (LCI) problem, which asks to identify a minimum number of seed users who can eventually influence a large number of users, has become one of the central research topics recently in online social networks (OSNs). However, existing works mostly focused on a single network while users nowadays often join several OSNs. Thus, it is crucial to investigate the influence in multiplex networks, i.e. the influence is diffused across a set of networks via shared users, in order to obtain the best set of seed users.In this paper, we propose a unified framework to represent and analyze the influence diffusion in multiplex networks. More specifically, we tackle the LCI problem in multiplex OSNs by reducing multiplex networks to a single network via various coupling schemes while preserving the most influence propagation properties. Besides the coupling schemes to represent the diffusion process, the framework also includes the influence relay, a new metric to measure the flow of influence inside and between networks. The experiments on both real and synthesized datasets validate the effectiveness of the coupling schemes as well as provide some interesting insights into the process of influence propagation in multiplex networks.

【Keywords】: social networking (online); LCI problem; OSN; coupling schemes; influence diffusion process; influence propagation properties; least cost influence problem; model analysis; model representation; multiplex social networks; online social networks; seed users; shared users; Couplings; Facebook; Logic gates; Multiplexing; Stochastic processes; Twitter

59. Bayesian Discovery of Multiple Bayesian Networks via Transfer Learning.

【Paper Link】【Pages】:577-586

【Authors】: Diane Oyen ; Terran Lane

【Abstract】: Bayesian network structure learning algorithms with limited data are being used in domains such as systems biology and neuroscience to gain insight into the underlying processes that produce observed data. Learning reliable networks from limited data is difficult, therefore transfer learning can improve the robustness of learned networks by leveraging data from related tasks. Existing transfer learning algorithms for Bayesian network structure learning give a single maximum a posteriori estimate of network models. Yet, many other models may be equally likely, and so a more informative result is provided by Bayesian structure discovery. Bayesian structure discovery algorithms estimate posterior probabilities of structural features, such as edges. We present transfer learning for Bayesian structure discovery which allows us to explore the shared and unique structural features among related tasks. Efficient computation requires that our transfer learning objective factors into local calculations, which we prove is given by a broad class of transfer biases. Theoretically, we show the efficiency of our approach. Empirically, we show that compared to single task learning, transfer learning is better able to positively identify true edges. We apply the method to whole-brain neuroimaging data.

【Keywords】: belief networks; data mining; learning (artificial intelligence); maximum likelihood estimation; probability; Bayesian network discovery; Bayesian network structure learning algorithms; Bayesian structure discovery algorithms; limited data; maximum a posteriori estimate; multiple Bayesian networks; network models; neuroscience; posterior probability estimation; single maximum a posteriori estimation; single task learning; system biology; transfer biases; transfer learning algorithms; true edge identification; whole-brain neuroimaging data method; Approximation algorithms; Approximation methods; Bayes methods; Data models; Joints; Periodic structures; Robustness

60. Forecasting Spatiotemporal Impact of Traffic Incidents on Road Networks.

【Paper Link】【Pages】:587-596

【Authors】: Bei Pan ; Ugur Demiryurek ; Cyrus Shahabi ; Chetan Gupta

【Abstract】: The advances in sensor technologies enable real-time collection of high-fidelity spatiotemporal data on transportation networks of major cities. In this paper, using two real-world transportation datasets: 1) incident data and 2) traffic data, we address the problem of predicting and quantifying the impact of traffic incidents. Traffic incidents include any non-recurring events on road networks, including accidents, weather hazard, road construction or work zone closures. By analyzing archived incident data, we classify incidents based on their features (e.g., time, location, type of incident). Subsequently, we model the impact of each incident class on its surrounding traffic by analyzing the archived traffic data at the time and location of the incidents. Consequently, in real-time, if we observe a similar incident (from real-time incident data), we can predict and quantify its impact on the surrounding traffic using our developed models. This information, in turn, can help drivers to effectively avoid impacted areas in real-time. To be useful for such real-time navigation application, and unlike current approaches, we study the dynamic behavior of incidents and model the impact as a quantitative time varying spatial span. In addition to utilizing incident features, we improve our classification approach further by analyzing traffic density around the incident area and the initial behavior of the incident. We evaluated our approach with very large traffic and incident datasets collected from the road networks of Los Angeles County and the results show that we can improve our baseline approach, which solely relies on incident features, by up to 45%.

【Keywords】: data analysis; driver information systems; forecasting theory; pattern classification; road accidents; road traffic; traffic engineering computing; transportation; Los Angeles County; accidents; archived traffic data analysis; dynamic incident behavior; incident datasets; incident features; quantitative time varying spatial span; real-time high-fidelity spatiotemporal data collection; real-time incident data; real-time navigation application; real-world transportation dataset; road construction; road networks; sensor technologies; spatiotemporal impact forecasting; traffic data; traffic datasets; traffic density analysis; traffic incident classification; weather hazard; work zone closures; Accidents; Navigation; Real-time systems; Roads; Time series analysis; Vehicles; impact analysis; intelligent transportation; spatiotemporal data; traffic forecast; traffic incidents

61. Scaling Log-Linear Analysis to High-Dimensional Data.

【Paper Link】【Pages】:597-606

【Authors】: François Petitjean ; Geoffrey I. Webb ; Ann E. Nicholson

【Abstract】: Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We develop an efficient approach to log-linear analysis that scales to hundreds of variables by melding the classical statistical machinery of log-linear analysis with advanced data mining techniques from association discovery and graphical modeling.

【Keywords】: data mining; statistical analysis; advanced data mining techniques; association discovery; classical statistical machinery; data mining task; graphical modeling; high-dimensional data; log-linear analysis scaling; primary statistical approach; Analytical models; Computational modeling; Data mining; Entropy; Lattices; Maximum likelihood estimation; Particle separators; Association Discovery; Data Modeling; High-dimensional Data; Log-linear Analysis

62. Fast Pairwise Query Selection for Large-Scale Active Learning to Rank.

【Paper Link】【Pages】:607-616

【Authors】: Buyue Qian ; Xiang Wang ; Jun Wang ; Hongfei Li ; Nan Cao ; Weifeng Zhi ; Ian Davidson

【Abstract】: Pair wise learning to rank algorithms (such as Rank SVM) teach a machine how to rank objects given a collection of ordered object pairs. However, their accuracy is highly dependent on the abundance of training data. To address this limitation and reduce annotation efforts, the framework of active pair wise learning to rank was introduced recently. However, in such a framework the number of possible query pairs increases quadratic ally with the number of instances. In this work, we present the first scalable pair wise query selection method using a layered (two-step) hashing framework. The first step relevance hashing aims to retrieve the strongly relevant or highly ranked points, and the second step uncertainty hashing is used to nominate pairs whose ranking is uncertain. The proposed framework aims to efficiently reduce the search space of pair wise queries and can be used with any pair wise learning to rank algorithm with a linear ranking function. We evaluate our approach on large-scale real problems and show it has comparable performance to exhaustive search. The experimental results demonstrate the effectiveness of our approach, and validate the efficiency of hashing in accelerating the search of massive pair wise queries.

【Keywords】: file organisation; learning (artificial intelligence); query processing; active pair wise learning; algorithm ranking; fast pairwise query selection; large-scale active learning; layered hashing framework; scalable pair wise query selection method; search space reduction; uncertainty hashing; Accuracy; Artificial neural networks; Databases; Search problems; Training data; Uncertainty; Vectors; Active Learning; Hashing; Learning to Rank

63. Power to the Points: Validating Data Memberships in Clusterings.

【Paper Link】【Pages】:617-626

【Authors】: Parasaran Raman ; Suresh Venkatasubramanian

【Abstract】: In this paper, we present a method to attach affinity scores to the implicit labels of individual points in a clustering. The affinity scores capture the confidence level of the cluster that claims to "own" the point. We demonstrate that these scores accurately capture the quality of the label assigned to the point. We also show further applications of these scores to estimate global measures of clustering quality, as well as accelerate clustering algorithms by orders of magnitude using active selection based on affinity. This method is very general and applies to clusterings derived from any geometric source. It lends itself to easy visualization and can prove useful as part of an interactive visual analytics framework. It is also efficient: assigning an affinity score to a point depends only polynomially on the number of clusters and is independent both of the size and dimensionality of the data. It is based on techniques from the theory of interpolation, coupled with sampling and estimation algorithms from high dimensional computational geometry.

【Keywords】: computational geometry; data analysis; data visualisation; estimation theory; interpolation; pattern clustering; sampling methods; active selection; affinity scores; cluster confidence level; clustering algorithm; clustering quality global measure estimation; data membership validation; estimation algorithm; high dimensional computational geometry; interactive visual analytics framework; interpolation theory; label quality; sampling algorithm; visualization; Clustering algorithms; Data models; Data visualization; Educational institutions; Probabilistic logic; Stability analysis; Standards; Natural Neighbor Interpolation; Power Diagrams; Validating Clusterings

64. Weighted-Object Ensemble Clustering.

【Paper Link】【Pages】:627-636

【Authors】: Ya-Zhou Ren ; Carlotta Domeniconi ; Guoji Zhang ; Guo-Xian Yu

【Abstract】: Ensemble clustering, also known as consensus clustering, aims to generate a stable and robust clustering through the consolidation of multiple base clusterings. In recent years many ensemble clustering methods have been proposed, most of which treat each clustering and each object as equally important. Some approaches make use of weights associated with clusters, or with clusterings, when assembling the different base clusterings. Boosting algorithms developed for classification have also led to the idea of considering weighted objects during the clustering process. However, not much effort has been put towards incorporating weighted objects into the consensus process. To fill this gap, in this paper we propose an approach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as weights associated to objects. We propose three different consensus techniques to leverage the weighted objects. All three reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-art consensus clustering methods and is robust to parameter settings.

【Keywords】: data mining; matrix algebra; pattern clustering; WOEC; coassociation matrix; consensus clustering; graph partitioning; weighted-object ensemble clustering; Bipartite graph; Boosting; Clustering algorithms; Clustering methods; Educational institutions; Partitioning algorithms; Vectors; Ensemble clustering; consensus clustering; graph partition; weighted objects

65. Mining User Lifecycles from Online Community Platforms and their Application to Churn Prediction.

【Paper Link】【Pages】:637-646

【Authors】: Matthew Rowe

【Abstract】: Recent work has studied user development in the domains of both telecommunication and online community platforms, examining how users develop in terms of the company they keep (socially) and the language they use (lexically). Such works afford key insights into user changes along individual dimensions, yet they do not examine how users develop relative to their prior behaviour along multiple dimensions. In this paper we examine how users develop along various properties (in-degree, out-degree, posted terms) in three online community platforms (Facebook, SAP Community Network, and Server Fault) and using three models of user development: (i) isolated lifecycle periods, (ii) historical contrasts, and (iii) community contrasts. We present an approach to mine the lifecycle trajectories of users as a means to characterise user development along the different properties and development models, and demonstrate the utility of such trajectories in predicting churners. We find consistent effects with past work: users tend to reflect the behaviour of the community in early portions of their lifecycles, before then diverging from the community towards the end. We also find that users form sub-communities with whom they communicate and remain within.

【Keywords】: Internet; data mining; social networking (online); Facebook; SAP community network; churn prediction; community contrasts; historical contrasts; isolated lifecycle periods; lifecycle trajectory mining; online community platforms; server fault; telecommunication community platforms; user development; user lifecycle mining; Communities; Entropy; Facebook; Probability distribution; Servers; Trajectory; churn prediction; online communities; social networks; user development; user lifecycles

66. Statistical Selection of Congruent Subspaces for Mining Attributed Graphs.

【Paper Link】【Pages】:647-656

【Authors】: Patricia Iglesias Sánchez ; Emmanuel Müller ; Fabian Laforet ; Fabian Keller ; Klemens Böhm

【Abstract】: Current mining algorithms for attributed graphs exploit dependencies between attribute information and edge structure, referred to as homophily. However, techniques fail if this assumption does not hold for the full attribute space. In multivariate spaces, some attributes have high dependency with the graph structure while others do not show any dependency. Hence, it is important to select congruent subspaces (i.e., subsets of the node attributes) showing dependencies with the graph structure. In this work, we propose a method for the statistical selection of such congruent subspaces. More specifically, we define a measure which assesses the degree of congruence between a set of attributes and the entire graph. We use it as the core of a statistical test, which congruent subspaces must pass. To illustrate its applicability to common graph mining tasks and in order to evaluate our selection scheme, we apply it to community outlier detection. Our selection of congruent subspaces enhances outlier detection by measuring outlier ness scores in selected subspaces only. Experiments on attributed graphs show that our approach outperforms traditional full space approaches and gives way to better outlier detection.

【Keywords】: data mining; graph theory; statistical testing; attribute information; attributed graph mining algorithm; community outlier detection; congruent subspaces; edge structure; full attribute space; graph structure; multivariate spaces; outlier ness scores; statistical selection method; statistical test; Communities; Data mining; Estimation; Image edge detection; Monte Carlo methods; Social network services; Vectors; attributed graphs; homophily; subspace selection

67. Classifying Spam Emails Using Text and Readability Features.

【Paper Link】【Pages】:657-666

【Authors】: Rushdi Shams ; Robert E. Mercer

【Abstract】: Supervised machine learning methods for classifying spam emails are long-established. Most of these methods use either header-based or content-based features. Spammers, however, can bypass these methods easily-especially the ones that deal with header features. In this paper, we report a novel spam classification method that uses features based on email content-language and readability combined with the previously used content-based task features. The features are extracted from four benchmark datasets viz. CSDMC2010, Spam Assassin, Ling Spam, and Enron-Spam. We use five well-known algorithms to induce our spam classifiers: Random Forest (RF), BAGGING, ADABOOSTM1, Support Vector Machine (SVM), and Naïve Bayes (NB). We evaluate the classifier performances and find that BAGGING performs the best. Moreover, its performance surpasses that of a number of state-of-the-art methods proposed in previous studies. Although applied only to English language emails, the results indicate that our method may be an excellent means to classify spam emails in other languages, as well.

【Keywords】: Bayes methods; feature extraction; learning (artificial intelligence); pattern classification; support vector machines; text analysis; unsolicited e-mail; AdaboosTM1; CSDMC2010; English language emails; Enron-spam; Ling spam; NB; Naïve Bayes; RF; SVM; bagging; benchmark datasets; content-based task features; email content-language; feature extraction; header-based features; random forest; readability features; spam assassin; spam classification method; spam classifiers; spam email classification; spammers; supervised machine learning methods; support vector machine; text features; Bagging; HTML; Indexes; Radio frequency; Support vector machines; Unsolicited electronic mail; Spam classification; anti-spam filter; feature importance; machine-learning application; performance evaluation; readability features; text categorization; text features

68. Most-Surely vs. Least-Surely Uncertain.

【Paper Link】【Pages】:667-676

【Authors】: Manali Sharma ; Mustafa Bilgic

【Abstract】: Active learning methods aim to choose the most informative instances to effectively learn a good classifier. Uncertainty sampling, arguably the most frequently utilized active learning strategy, selects instances which are uncertain according to the model. In this paper, we propose a framework that distinguishes between two types of uncertainties: a model is uncertain about an instance due to strong and conflicting evidence (most-surely uncertain) vs. a model is uncertain about an instance because it does not have conclusive evidence (least-surely uncertain). We show that making a distinction between these uncertainties makes a huge difference to the performance of active learning. We provide a mathematical formulation to distinguish between these uncertainties for naive Bayes, logistic regression and support vector machines and empirically evaluate our methods on several real-world datasets.

【Keywords】: Bayes methods; learning (artificial intelligence); pattern classification; regression analysis; support vector machines; uncertain systems; active learning method; active learning strategy; least-surely uncertain; logistic regression; mathematical formulation; most-surely uncertain; naive Bayes method; real-world datasets; support vector machines; uncertainty sampling; Equations; Learning systems; Logistics; Mathematical model; Measurement uncertainty; Support vector machines; Uncertainty; Active learning; uncertainty sampling

69. On Pattern Preserving Graph Generation.

【Paper Link】【Pages】:677-686

【Authors】: Hong-Han Shuai ; De-Nian Yang ; Philip S. Yu ; Chih-Ya Shen ; Ming-Syan Chen

【Abstract】: Real datasets always play an essential role in graph mining and analysis. However, nowadays most available real datasets only support millions of nodes. Therefore, the literature on Big Data analysis utilizes statistical graph generators to generate a massive graph (e.g., billions of nodes) for evaluating the scalability of an algorithm. Nevertheless, current popular statistical graph generators are properly designed to preserve only the statistical metrics, such as the degree distribution, diameter, and clustering coefficient of the original social graphs. Recently, the importance of frequent graph patterns has been recognized in the various works on graph mining, but unfortunately this crucial criterion has not been noticed in the existing graph generators. To address this important need, we make the first attempt to design a Pattern Preserving Graph Generation (PPGG) algorithm to generate a graph including all frequent patterns and three most popular statistical parameters: degree distribution, clustering coefficient, and average vertex degree. The experimental results show that PPGG, which we have released as a free download, is efficient and able to generate a billion-node graph in approximately 10 minutes, much faster than the existing graph generators.

【Keywords】: Big Data; data analysis; data mining; graph theory; pattern clustering; statistical analysis; Big Data analysis; PPGG algorithm; average vertex degree; clustering coefficient; degree distribution; frequent graph pattern; graph analysis; graph mining; pattern preserving graph generation; real datasets; statistical graph generators; statistical metrics; Algorithm design and analysis; Biology; Clustering algorithms; Data mining; Databases; Generators; Histograms; Algorithms; Graph Mining

70. Time Series Classification Using Compression Distance of Recurrence Plots.

【Paper Link】【Pages】:687-696

【Authors】: Diego Furtado Silva ; Vinícius M. A. de Souza ; Gustavo E. A. P. A. Batista

【Abstract】: There is a huge increase of interest for time series methods and techniques. Virtually every piece of information collected from human, natural, and biological processes is susceptible to changes over time, and the study of how these changes occur is a central issue in fully understanding such processes. Among all time series mining tasks, classification is likely to be the most prominent one. In time series classification there is a significant body of empirical research that indicates that k-nearest neighbor rule in the time domain is very effective. However, certain time series features are not easily identified in this domain and a change in representation may reveal some significant and unknown features. In this work, we propose the use of recurrence plots as representation domain for time series classification. Our approach measures the similarity between recurrence plots using Campana-Keogh (CK-1) distance, a Kolmogorov complexity-based distance that uses video compression algorithms to estimate image similarity. We show that recurrence plots allied to CK-1 distance lead to significant improvements in accuracy rates compared to Euclidean distance and Dynamic Time Warping in several data sets. Although recurrence plots cannot provide the best accuracy rates for all data sets, we demonstrate that we can predict ahead of time that our method will outperform the time representation with Euclidean and Dynamic Time Warping distances.

【Keywords】: data compression; data mining; image classification; time series; video coding; CK-1 distance; Campana-Keogh distance; Euclidean distance; Kolmogorov complexity-based distance; dynamic time warping; image similarity estimation; k-nearest neighbor rule; recurrence plot compression distance; similarity measurement; time series classification; time series mining task; video compression algorithms; Accuracy; Complexity theory; Equations; Euclidean distance; Mathematical model; Time series analysis; Training; Time series; classification; dstance measure; recurrence plot

71. Dynamic Pattern Detection with Temporal Consistency and Connectivity Constraints.

【Paper Link】【Pages】:697-706

【Authors】: Skyler Speakman ; Yating Zhang ; Daniel B. Neill

【Abstract】: We explore scalable and accurate dynamic pattern detection methods in graph-based data sets. We apply our proposed Dynamic Subset Scan method to the task of detecting, tracking, and source-tracing contaminant plumes spreading through a water distribution system equipped with noisy, binary sensors. While static patterns affect the same subset of data over a period of time, dynamic patterns may affect different subsets of the data at each time step. These dynamic patterns require a new approach to define and optimize penalized likelihood ratio statistics in the subset scan framework, as well as new computational techniques that scale to large, real-world networks. To address the first concern, we develop new subset scan methods that allow the detected subset of nodes to change over time, while incorporating temporal consistency constraints to reward patterns that do not dramatically change between adjacent time steps. Second, our Additive Graph Scan algorithm allows our novel scan statistic to process small graphs (500 nodes) in 4.1 seconds on average while maintaining an approximation ratio over 99% compared to an exact optimization method, and to scale to large graphs with over 12,000 nodes in 30 minutes on average. Evaluation results across multiple detection, tracking, and source-tracing tasks demonstrate substantial performance gains achieved by the Dynamic Subset Scan approach.

【Keywords】: contamination; data mining; environmental science computing; graph theory; sensor fusion; statistical analysis; additive graph scan algorithm; approximation ratio; connectivity constraints; dynamic pattern detection method; dynamic subset scan method; exact optimization method; graph-based data sets; multiple detection; noisy binary sensors; penalized likelihood ratio statistics; sensor fusion; source-tracing contaminant plumes; static patterns; temporal consistency; temporal consistency constraints; water distribution system; Additives; Equations; Heuristic algorithms; Mathematical model; Optimization; Sensor systems; likelihood ratio statistics; sensor fusion; spatial and subset scan statistics; water distribution systems

72. Noise-Resistant Bicluster Recognition.

【Paper Link】【Pages】:707-716

【Authors】: Huan Sun ; Gengxin Miao ; Xifeng Yan

【Abstract】: Biclustering is crucial in finding co-expressed genes and their associated conditions in gene expression data. While various biclustering algorithms (e.g., combinatorial, probabilistic modelling, and matrix factorization) have been proposed and constantly improved in the past decade, data noise and bicluster overlaps make biclustering a still challenging task. It becomes difficult to further improve biclustering performance, without resorting to a new approach. Inspired by the recent progress in unsupervised feature learning using deep neural networks, in this work, we propose a novel model for biclustering, named Auto Decoder (AD), by relating biclusters to features and leveraging a neural network that is able to automatically learn features from the input data. To suppress severe noise present in gene expression data, we introduce a non-uniform signal recovery mechanism: Instead of reconstructing the whole input data to capture the bicluster patterns, AD weighs the zero and non-zero parts of the input data differently and is more flexible in dealing with different types of noise. AD is also properly regularized to deal with bicluster overlaps. To the best of our knowledge, this is the first biclustering algorithm that leverages neural network techniques to recover overlapped biclusters hidden in noisy gene expression data. We compared our approach with four state-of-the-art biclustering algorithms on both synthetic and real datasets. On three out of the four real datasets, AD significantly outperforms the other approaches. On controlled synthetic datasets, AD performs the best when noise level is beyond 15%.

【Keywords】: biology computing; genetics; neural nets; pattern clustering; pattern recognition; unsupervised learning; AD; AutoDecoder; automatic feature learning; bicluster patterns; biclustering algorithms; co-expressed genes; deep neural networks; noise suppression; noise-resistant bicluster recognition; noisy gene expression data; nonuniform signal recovery mechanism; overlapped bicluster recovery; real datasets; synthetic datasets; unsupervised feature learning; Algorithm design and analysis; Gene expression; Neural networks; Neurons; Noise; Noise level; Robustness; Biclustering; Gene Expression; Neural Network

73. Itemsets for Real-Valued Datasets.

【Paper Link】【Pages】:717-726

【Authors】: Nikolaj Tatti

【Abstract】: Pattern mining is one of the most well-studied sub fields in exploratory data analysis. While there is a significant amount of literature on how to discover and rank item sets efficiently from binary data, there is surprisingly little research done in mining patterns from real-valued data. In this paper we propose a family of quality scores for real-valued item sets. We approach the problem by considering casting the dataset into a binary data and computing the support from this data. This naive approach requires us to select thresholds. To remedy this, instead of selecting one set of thresholds, we treat thresholds as random variables and compute the average support. We show that we can compute this support efficiently, and we also introduce two normalisations, namely comparing the support against the independence assumption and, more generally, against the partition assumption. Our experimental evaluation demonstrates that we can discover statistically significant patterns efficiently.

【Keywords】: data analysis; data mining; random processes; binary data; exploratory data analysis; itemset discovery; itemset ranking; naive approach; normalisations; pattern mining; quality scores; random variables; real-valued datasets; real-valued itemsets; statistically significant pattern discovery; Data mining; Itemsets; Random variables; Reactive power; Standards; Vectors; itemsets; pattern mining; real-valued itemsets

74. Collective Response Spike Prediction for Mutually Interacting Consumers.

【Paper Link】【Pages】:727-736

【Authors】: Rikiya Takahashi ; Hideyuki Mizuta ; Naoki Abe ; Ruby L. Kennedy ; Vincent J. Jeffs ; Ravi Shah ; Robert H. Crites

【Abstract】: Modeling how marketing actions in various channels influence or cause consumer purchase decisions is crucial for marketing decision-making. Marketing campaigns stimulate consumer awareness, interest and help drive interactions such as the browsing of product web pages, ultimately impacting an individual's purchase decision. In addition, some successful campaigns stimulate word-of-mouth and social trends among consumers, and such collective behavior of consumers result in concurrent and correlated responses over a short term. Though each consumer's response should be attributed with both the same individual's experiences and the collective factors, unobservability of most word-of-mouth events makes the estimation challenging. The authors propose a new continuous-time predictive model for time-dependent response rates of each consumer, which can incorporate both the individual and the collective factors without explicit word-of-mouth observations. The individual factor is modeled as staircase functions associated with the experienced events by each consumer, and provides a clear psychological interpretation about how marketing advertising communications impact short-term and mid-term memories of consumers. The collective factor is modeled with aggregate response frequencies for mutually-interacting groups that are automatically estimated from data. The key idea to mine the mutually-interacting groups exists in a three-step estimator, which initially performs a Poisson regression without the collective factor, then does clustering of the residual time-series in the initial regression, and finally performs another Poisson regression involving the collective factor. The proposed collective factor robustly incorporates the underlying trends even when causality from one consumer's event spikes to another consumer's response is weak. High predictive accuracy of the proposed approach is empirically validated using real-world data provided by an online retailer in Europ- .

【Keywords】: advertising; data mining; decision making; pattern clustering; purchasing; retailing; time series; Europe; Poisson regression; aggregate response frequency; collective factor; collective response spike prediction; consumer event spikes; consumer mid-term memory; consumer purchase decision; consumer short-term memory; continuous-time predictive model; individual factor; marketing action modelling; marketing advertising communications; marketing decision-making; mutually interacting consumer; mutually-interacting group mining; online retailer; residual time-series clustering; social trends; staircase functions; three-step estimator; time-dependent response rates; word-of-mouth; Aggregates; Approximation methods; Computational modeling; Market research; Prediction algorithms; Predictive models; Vectors; Continuous-time event prediction; Poisson regression; residual clustering; time-decaying curves

75. Price Information Patterns in Web Search Advertising: An Empirical Case Study on Accommodation Industry.

【Paper Link】【Pages】:737-746

【Authors】: Guanting Tang ; Yupin Yang ; Jian Pei

【Abstract】: Unlike advertising in traditional media, web search advertising content can be easily customized with little cost. In this paper, we apply content analysis and regression models on 11,818 unique ads related to the accommodation industry to empirically investigate how advertisers customize price information in their web search advertising content. To the best of our knowledge, our study is the first of this kind. We find that advertiser characteristics, such as website traffic, product quality, and position in the distribution chain, affect both the amount and forms of price information in its search advertising content. Moreover, the use of price information by an advertiser depends on query characteristics, such as search volume, cost per click ("CPC"), and specific words (e.g., trademark, location, price cue) in queries. Our empirical findings shed new light on how to effectively manage price information in search advertising, and suggest new research opportunities on web search advertising.

【Keywords】: Internet; advertising data processing; pricing; query processing; regression analysis; CPC; Web search advertising content analysis; Website traffic; accommodation industry; advertiser characteristics; cost per click; distribution chain position; price information management; price information patterns; product quality; query; regression models; search volume; Advertising; Educational institutions; Google; Industries; Quality assessment; Trademarks; Web search; content analysis; regression; search advertising

76. Discovering Non-redundant Overlapping Biclusters on Gene Expression Data.

【Paper Link】【Pages】:747-756

【Authors】: Duy Tin Truong ; Roberto Battiti ; Mauro Brunato

【Abstract】: Given a gene expression data matrix where each cell is the expression level of a gene under a certain condition, biclustering is the problem of searching for a subset of genes that co regulate and co express only under a subset of conditions. As some genes can belong to different functional categories, searching for non-redundant overlapping biclusters is an important problem in biclustering. However, most recent algorithms can only either produce disjoint biclusters or redundant biclusters with significant overlap. In other words, these algorithms do not allow users to specify the maximum overlap between the biclusters. In this paper, we propose a novel algorithm which can generate K overlapping biclusters where the maximum overlap between them is below a predefined threshold. Unlike the other approaches which often generate all biclusters at once, our algorithm produces the biclusters sequentially, where each newly generated bicluster is guaranteed to be different from the previous ones but can still overlap with them. The experiments on real datasets confirm that different meaningful overlapping biclusters are successfully discovered. Besides, under the same constraints, our algorithm returns much larger and higher-quality biclusters compared to those of the other state-of-the art algorithms.

【Keywords】: biology computing; data mining; pattern clustering; disjoint biclusters; functional category; gene expression data matrix; nonredundant overlapping bicluster discovery; Biological system modeling; Clustering algorithms; Coherence; Complexity theory; Gene expression; Search problems; gene expression data; non-redundant overlapping biclustering

77. Cox Regression with Correlation Based Regularization for Electronic Health Records.

【Paper Link】【Pages】:757-766

【Authors】: Bhanukiran Vinzamuri ; Chandan K. Reddy

【Abstract】: Survival Regression models play a vital role in analyzing time-to-event data in many practical applications ranging from engineering to economics to healthcare. These models are ideal for prediction in complex data problems where the response is a time-to-event variable. An event is defined as the occurrence of a specific event of interest such as a chronic health condition. Cox regression is one of the most popular survival regression model used in such applications. However, these models have the tendency to over fit the data which is not desirable for healthcare applications because it limits their generalization to other hospital scenarios. In this paper, we address these challenges for the cox regression model. We combine two unique correlation based regularizers with cox regression to handle correlated and grouped features which are commonly seen in many practical problems. The proposed optimization problems are solved efficiently using cyclic coordinate descent and Alternate Direction Method of Multipliers algorithms. We conduct experimental analysis on the performance of these algorithms over several synthetic datasets and electronic health records (EHR) data about heart failure diagnosed patients from a hospital. We demonstrate through our experiments that these regularizers effectively enhance the ability of cox regression to handle correlated features. In addition, we extensively compare our results with other regularized linear and logistic regression algorithms. We validate the goodness of the features selected by these regularized cox regression models using the biomedical literature and different feature selection algorithms.

【Keywords】: cardiology; data analysis; electronic health records; feature selection; hospitals; optimisation; patient diagnosis; regression analysis; EHR data; biomedical literature; chronic health condition; correlation based regularization; cox regression model; cyclic coordinate descent; electronic health records; feature selection algorithm; heart failure diagnosed patients; hospital; logistic regression algorithm; multipliers algorithm alternate direction method; optimization problems; regularized linear regression algorithm; survival regression models; synthetic datasets; time-to-event data analysis; Algorithm design and analysis; Biological system modeling; Equations; Hazards; Kernel; Medical services; Vectors; cox regression; feature selection; healthcare; regularization

78. Constructing Topical Hierarchies in Heterogeneous Information Networks.

【Paper Link】【Pages】:767-776

【Authors】: Chi Wang ; Marina Danilevsky ; Jialu Liu ; Nihit Desai ; Heng Ji ; Jiawei Han

【Abstract】: A digital data collection (e.g., scientific publications, enterprise reports, news, and social media) can often be modeled as a heterogeneous information network, linking text with multiple types of entities. Constructing high-quality concept hierarchies that can represent topics at multiple granularities benefits tasks such as search, information browsing, and pattern mining. In this work we present an algorithm for recursively constructing multi-typed topical hierarchies. Contrary to traditional text-based topic modeling, our approach handles both textual phrases and multiple types of entities by a newly designed clustering and ranking algorithm for heterogeneous network data, as well as mining and ranking topical patterns of different types. Our experiments on datasets from two different domains demonstrate that our algorithm yields high quality, multi-typed topical hierarchies.

【Keywords】: data acquisition; data mining; network theory (graphs); pattern clustering; text analysis; clustering algorithm; digital data collection; heterogeneous information network data; high quality multityped topical hierarchies; high-quality concept hierarchies; multityped topical hierarchies; ranking algorithm; textual phrases; topical pattern mining; topical pattern ranking; Algorithm design and analysis; Clustering algorithms; Data mining; Data models; Distributed databases; Inference algorithms; Query processing; heterogeneous network; information network; topic hierarchy

79. Binary Time-Series Query Framework for Efficient Quantitative Trait Association Study.

【Paper Link】【Pages】:777-786

【Authors】: Hongfei Wang ; Xiang Zhang

【Abstract】: Quantitative trait association study examines the association between quantitative traits and genetic variants. As a promising tool, it has been widely applied to dissect the genetic basis of complex diseases. However, such study usually involves testing trillions of variant-trait pairs and demands intensive computational resources. Recently, several algorithms have been developed to improve its efficiency. In this paper, we propose a framework, Fabrique, which models quantitative trait association study as querying binary time-series and bridges the two seemly different problems. Specifically, in the proposed framework, genetic variants are treated as a database consisting of binary time-series. Finding trait-associated variants is equivalent to finding the nearest neighbors of the trait. For efficient query process, Fabrique partitions and normalizes the binary time-series, and estimates a tight upper bound for each group of time-series to prune the search space. Extensive experimental results demonstrate that Fabrique only needs to search a very small portion of the database to locate the target variants and significantly outperforms the state-of-the-art method. We also show that Fabrique can be applied to other binary time-series query problem in addition to the genetic association study.

【Keywords】: biology computing; genetics; query processing; search problems; time series; Fabrique partitions; binary time-series query framework; complex diseases; genetic association; genetic variants; quantitative trait association study; query process; search space pruning; trait-associated variants; Correlation; Equations; Genetics; IP networks; Indexes; Time series analysis; Upper bound; Lower bound; Pruning; Quantitative Trait Association Study; Time-Series Query; Upper bound

80. Communication-Efficient Distributed Multiple Reference Pattern Matching for M2M Systems.

【Paper Link】【Pages】:787-796

【Authors】: Jui-Pin Wang ; Yu-Chen Lu ; Mi-Yen Yeh ; Shou-De Lin ; Phillip B. Gibbons

【Abstract】: In M2M applications, it is very common to encounter the ad hoc snapshot query that requires fast responses from many local machines in which all the data are distributed. In the scenario when the query is more complex, the communication cost for sending it to all the local machines for processing can be very high. This paper aims to address this issue. Given a reference set of multiple and large-size patterns, we propose an approach to identifying its k nearest and farthest neighbors globally across all the local machines. By decomposing the reference patterns into a multi-resolution representation and using novel distance bound designs, our method guarantees the exact results in a communication-efficient manner. Analytical and empirical studies show that our method outperforms the state-of-the-art methods in saving significant bandwidth usage, especially for large numbers of machines and large-sized reference patterns.

【Keywords】: distributed processing; pattern matching; query processing; M2M applications; M2M systems; ad hoc snapshot query; bandwidth usage; communication-efficient distributed multiple reference pattern matching; distance bound designs; k farthest neighbors; k nearest neighbors; large-sized reference patterns; local machines; machine-to-machine systems; multiresolution representation; reference pattern decomposition; Bandwidth; Couplings; Servers; Time series analysis; Upper bound; Vegetation; Wavelet transforms

81. Sparse K-Means with the l_q(0leq q< 1) Constraint for High-Dimensional Data Clustering.

【Paper Link】【Pages】:797-806

【Authors】: Yu Wang ; Xiangyu Chang ; Rongjian Li ; Zongben Xu

【Abstract】: Sparse clustering, which aims at finding a proper partition of extremely high dimensional data set with fewest relevant features, has been attracted more and more attention. Most researches model the problem through minimizing weighted feature contributions subject to a l₁ constraint. However, the l₀ constraint is the essential constraint for sparse modeling while the l₁ constraint is only a convex relaxation of it. In this article, we bridge the gap between the l₀ constraint and the l₁ constraint through development of two new sparse clustering models, which are the sparse k-means with the l_q(0 <; q <; 1) constraint and the sparse k-means with the l₀ constraint. By proving the certain forms of the optimal solution of particular l_q(0 = q <; 1) non-convex optimizations, two efficient iterative algorithms are proposed. We conclude with experiments on both synthetic data and the Allen Developing on both synthetic data and the l_q(0 = q <; 1) models exhibit the advantages compared with the standard k-mans and sparse k-means with the l₁ constraint.

【Keywords】: concave programming; iterative methods; pattern clustering; convex relaxation; high-dimensional data clustering; iterative algorithms; l₀ constraint; l₁ constraint; l_q constraint; nonconvex optimizations; sparse clustering models; sparse k-means; synthetic data; weighted feature minimization; Conferences; Data mining; 0l_q(0 = q < 1) Constraint; High-Dimensional Clustering; Sparse K-means

82. A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics.

【Paper Link】【Pages】:807-816

【Authors】: Sean Whalen ; Gaurav Pandey

【Abstract】: The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.

【Keywords】: bioinformatics; genomics; pattern classification; proteins; cluster-based meta-learning; comparative analysis; ensemble classifier; ensemble diversity; ensemble methods; ensemble performance; ensemble selection; genetic interaction prediction; genomics; heterogeneous classifiers; meta-learning; protein functions; simple aggregation; Accuracy; Bioinformatics; Diversity reception; Genomics; Proteins; Stacking; Bioinformatics; Ensemble methods; Ensemble selection; Genomics; Stacking; Supervised learning

83. Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing.

【Paper Link】【Pages】:817-826

【Authors】: Joyce Jiyoung Whang ; Piyush Rai ; Inderjit S. Dhillon

【Abstract】: Stochastic block models provide a rich, probabilistic framework for modeling relational data by expressing the objects being modeled in terms of a latent vector representation. This representation can be a latent indicator vector denoting the cluster membership (hard clustering), a vector of cluster membership probabilities (soft clustering), or more generally a real-valued vector (latent space representation). Recently, a new class of overlapping stochastic block models has been proposed where the idea is to allow the objects to have hard memberships in multiple clusters (in form of a latent binary vector). This aspect captures the properties of many real-world networks in domains such as biology and social networks where objects can simultaneously have memberships in multiple clusters owing to the multiple roles they may have. In this paper, we improve upon this model in three key ways: (1) we extend the overlapping stochastic block model to the bipartite graph case which enables us to simultaneously learn the overlapping clustering of two different sets of objects in the graph, the unipartite graph is just a special case of our model, (2) we allow objects (in either set) to not have membership in any cluster by using a relevant object selection mechanism, and (3) we make use of additionally available object features (or a kernel matrix of pair wise object similarities) to further improve the overlapping clustering performance. We do this by explicitly encouraging similar objects to have similar cluster membership vectors. Moreover, using nonparametric Bayesian prior distributions on the key model parameters, we side-step the model selection issues such as selecting the number of clusters a priori. Our model is quite general and can be applied for both overlapping clustering and link prediction tasks in unipartite and bipartite networks (directed/undirected), or for overlapping co-clustering of general binary-valued data. Experiments on synthetic and real-world d- tasets from biology and social networks demonstrate that our model outperforms several state-of-the-art methods.

【Keywords】: Bayes methods; biology computing; graph theory; learning (artificial intelligence); matrix algebra; pattern clustering; social networking (online); stochastic processes; vectors; biology; bipartite graph case; bipartite network; cluster membership probability vector; cluster membership vectors; general binary-valued data overlapping co-clustering; hard clustering; kernel matrix; latent binary vector; latent indicator vector representation; latent space representation; link prediction tasks; nonparametric Bayesian prior distributions; overlapping clustering learning; overlapping stochastic block model; pair wise object similarities; probabilistic framework; real-valued vector; real-world networks; relational data modelling; relevant object selection mechanism; similarity-based smoothing; social networks; soft clustering; unipartite graph; unipartite network; Bayes methods; Bipartite graph; Data models; Kernel; Social network services; Stochastic processes; Vectors; link prediction; nonparametric Bayesian; overlapping clustering; relevance selection; stochastic blockmodel

84. Multi-instance Multi-graph Dual Embedding Learning.

【Paper Link】【Pages】:827-836

【Authors】: Jia Wu ; Xingquan Zhu ; Chengqi Zhang ; Zhihua Cai

【Abstract】: Multi-instance learning concerns about building learning models from a number of labeled instance bags, where each bag consists of instances with unknown labels. A bag is labeled positive if one or more multiple instances inside the bag is positive, and negative otherwise. For all existing multi-instance learning algorithms, they are only applicable to the setting where instances in each bag are represented by a set of well defined feature values. In this paper, we advance the problem to a multi-instance multi-graph setting, where a bag contains a number of instances and graphs in pairs, and the learning objective is to derive classification models from labeled bags, containing both instances and graphs, to predict previously unseen bags with maximum accuracy. To achieve the goal, the main challenge is to properly represent graphs inside each bag and further take advantage of complementary information between instance and graph pairs for learning. In the paper, we propose a Dual Embedding Multi-Instance Multi-Graph Learning (DE-MIMG) algorithm, which employs a dual embedding learning approach to (1) embed instance distributions into the informative sub graphs discovery process, and (2) embed discovered sub graphs into the instance feature selection process. The dual embedding process results in an optimal representation for each bag to provide combined instance and graph information for learning. Experiments and comparisons on real-world multi-instance multi-graph learning tasks demonstrate the algorithm performance.

【Keywords】: data mining; feature selection; graph theory; learning (artificial intelligence); pattern classification; DE-MIMG algorithm; classification models; dual embedding learning approach; dual embedding multiinstance multigraph learning algorithm; dual embedding process; graph information; informative subgraph discovery process; instance feature selection process; labeled instance bags; multiinstance multigraph setting; Bismuth; Computer science; Educational institutions; Kernel; Laplace equations; Linear programming; Vectors; Classification; Embedding; Graph; Multi-graph; Multi-instance

85. TopicSketch: Real-Time Bursty Topic Detection from Twitter.

【Paper Link】【Pages】:837-846

【Authors】: Wei Xie ; Feida Zhu ; Jing Jiang ; Ee-Peng Lim ; Ke Wang

【Abstract】: Twitter has become one of the largest platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modeling and analysis in Twitter, it remains a huge challenge to detect bursty topics in real-time. As existing methods can hardly scale to handle the task with the tweet stream in real-time, we propose in this paper Topic Sketch, a novel sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experiment results show both efficiency and effectiveness of our approach. Especially it is also demonstrated that Topic Sketch can potentially handle hundreds of millions tweets per day which is close to the total number of daily tweets in Twitter and present bursty event in finer-granularity.

【Keywords】: information analysis; social networking (online); TopicSketch; Twitter; real-time bursty topic detection; sketch-based topic model; topic analysis; topic modeling; tweet stream; Acceleration; Equations; Monitoring; Optimization; Real-time systems; Surges; Twitter; TopicSketch; bursty topic; realtime; tweet stream

86. Efficient Learning on Point Sets.

【Paper Link】【Pages】:847-856

【Authors】: Liang Xiong ; Barnabás Póczos ; Jeff G. Schneider

【Abstract】: Recently several methods have been proposed to learn from data that are represented as sets of multidimensional vectors. Such algorithms usually suffer from the high demand of computational resources, making them impractical on large-scale problems. We propose to solve this problem by condensing i.e. reducing the sizes of the sets while maintaining the learning performance. Three methods are examined and evaluated with a wide spectrum of set learning algorithms on several large-scale image data sets. We discover that k-Means can successfully achieve the goal of condensing. In many cases, k-Means condensing can improve the algorithms' speed, space requirements, and surprisingly, learning performances simultaneously.

【Keywords】: data mining; learning (artificial intelligence); algorithm space requirements; algorithm speed; efficient learning; k-means condensing; point sets; Approximation algorithms; Approximation methods; Complexity theory; Feature extraction; Kernel; Training; Vectors; collective data; efficient; fast; image classification; kmeans; large-scale; point set

87. Markov Blanket Feature Selection with Non-faithful Data Distributions.

【Paper Link】【Pages】:857-866

【Authors】: Kui Yu ; Xindong Wu ; Zan Zhang ; Yang Mu ; Hao Wang ; Wei Ding

【Abstract】: In faithful Bayesian networks, the Markov blanket of the class attribute is a unique and minimal feature subset for optimal feature selection. However, little attention has been paid to Markov blanket feature selection in a non-faithful environment which widely exists in the real world. To tackle this issue, in this paper, we deal with non-faithful data distributions and propose the concept of representative sets instead of Markov blankets. With a standard sparse group lasso for selection of features from the representative sets, we design an effective algorithm, SRS, for Markov blanket feature Selection via Representative Sets with non-faithful data distributions. Empirical studies demonstrate that SRS outperforms the state-of-the-art Markov blanket feature selectors and other well-established feature selection methods.

【Keywords】: Markov processes; belief networks; feature selection; Bayesian networks; Markov blanket feature selection; Markov blanket feature selector; SRS; class attribute; nonfaithful data distribution; representative sets; standard sparse group lasso; Algorithm design and analysis; Bayes methods; Joints; Markov processes; Probability distribution; Redundancy; Standards; Faithful Bayesian networks; Feature selection; Markov blankets; Representative sets; Sparse group lasso

88. An Efficient Approach to Updating Closeness Centrality and Average Path Length in Dynamic Networks.

【Paper Link】【Pages】:867-876

【Authors】: Chia-Chen Yen ; Mi-Yen Yeh ; Ming-Syan Chen

【Abstract】: Closeness centrality measures the communication efficiency of a specific vertex within a network while the average path length (APL) measures that of the whole network. Since the nature of these two measurements is based on the computation of all-pair shortest path distances, one can perform the breadth-first search method starting at every vertex and obtain the two measurements. However, as the edge counts in the real-world networks like Facebook increase over time, this naive way is obviously inefficient. In this paper, we proposed CENDY, an efficient approach to updating Closeness centrality and average path length in Dynamic networks when there is an edge insertion or deletion. In CENDY, we derived some theoretical properties to quickly identify a set of vertices whose shortest path changed after an edge update, and then update the closeness centralities of those vertices only as well as the APL of the graph by a few of single-source shortest path computations. We conducted extensive experiments to show that, when compared to the existing methods of computing exact or approximate values, CENDY outperformed others in significantly low update time while providing exact values of the two measurements on various real-world graph datasets.

【Keywords】: computational complexity; directed graphs; network theory (graphs); tree searching; APL measures; CENDY approach; Facebook; all-pair shortest path distances; average path length; breadth-first search method; communication efficiency; dynamic networks; edge counts; edge deletion; edge insertion; real-world graph datasets; single-source shortest path computations; specific vertex; updating closeness centrality; Algorithm design and analysis; Approximation algorithms; Approximation methods; Heuristic algorithms; Joining processes; Length measurement; Time measurement; average path length; closeness centrality; dynamic networks; update algorithm

89. Reconstructing Individual Mobility from Smart Card Transactions: A Space Alignment Approach.

【Paper Link】【Pages】:877-886

【Authors】: Nicholas Jing Yuan ; Yingzi Wang ; Fuzheng Zhang ; Xing Xie ; Guangzhong Sun

【Abstract】: Smart card transactions capture rich information of human mobility and urban dynamics, therefore are of particular interest to urban planners and location-based service providers. However, since most transaction systems are only designated for billing purpose, typically, fine-grained location information, such as the exact boarding and alighting stops of a bus trip, is only partially or not available at all, which blocks deep exploitation of this rich and valuable data at individual level. This paper presents a "space alignment" framework to reconstruct individual mobility history from a large-scale smart card transaction dataset pertaining to a metropolitan city. Specifically, we show that by delicately aligning the monetary space and geospatial space with the temporal space, we are able to extrapolate a series of critical domain specific constraints. Later, these constraints are naturally incorporated into a semi-supervised conditional random field to infer the exact boarding and alighting stops of all transit routes with a surprisingly high accuracy, e.g., given only 10% trips with known alighting/boarding stops, we successfully inferred more than 78% alighting and boarding stops from all unlabeled trips. In addition, we demonstrated that the smart card data enriched by the proposed approach dramatically improved the performance of a conventional method for identifying users' home and work places (with 88% improvement on home detection and 35% improvement on work place detection). The proposed method offers the possibility to mine individual mobility from common public transit transactions, and showcases how uncertain data can be leveraged with domain knowledge and constraints, to support cross-application data mining tasks.

【Keywords】: data mining; smart cards; town and country planning; cross-application data mining tasks; fine-grained location information; geospatial space; individual mobility reconstruction; large-scale smart card transaction dataset; location-based service providers; metropolitan city; monetary space; semisupervised conditional random field; space alignment approach; temporal space; urban planners; Cities and towns; Data mining; Geospatial analysis; Labeling; Legged locomotion; Roads; Smart cards

90. Feature Transformation with Class Conditional Decorrelation.

【Paper Link】【Pages】:887-896

【Authors】: Xu-Yao Zhang ; Kaizhu Huang ; Cheng-Lin Liu

【Abstract】: The well-known feature transformation model of Fisher linear discriminant analysis (FDA) can be decomposed into an equivalent two-step approach: whitening followed by principal component analysis (PCA) in the whitened space. By proving that whitening is the optimal linear transformation to the Euclidean space in the sense of minimum log-determinant divergence, we propose a transformation model called class conditional decor relation (CCD). The objective of CCD is to diagonalize the covariance matrices of different classes simultaneously, which is efficiently optimized using a modified Jacobi method. CCD is effective to find the common principal components among multiple classes. After CCD, the variables become class conditionally uncorrelated, which will benefit the subsequent classification tasks. Combining CCD with the nearest class mean (NCM) classification model can significantly improve the classification accuracy. Experiments on 15 small-scale datasets and one large-scale dataset (with 3755 classes) demonstrate the scalability of CCD for different applications. We also discuss the potential applications of CCD for other problems such as Gaussian mixture models and classifier ensemble learning.

【Keywords】: Jacobian matrices; covariance matrices; decorrelation; optimisation; pattern classification; principal component analysis; CCD; Euclidean space; Fisher linear discriminant analysis; Gaussian mixture model; class conditional decorrelation; classifier ensemble learning; covariance matrices; equivalent two-step approach; feature transformation model; large-scale dataset; minimum log-determinant divergence; modified Jacobi method; nearest class mean classification model; optimal linear transformation; principal component analysis; Charge coupled devices; Covariance matrices; Decorrelation; Indexes; Jacobian matrices; Principal component analysis; Sparse matrices; class conditional decorrelation; feature transformation; simultaneous diagonalization

91. Efficient Learning for Models with DAG-Structured Parameter Constraints.

【Paper Link】【Pages】:897-906

【Authors】: Leon Wenliang Zhong ; James T. Kwok

【Abstract】: In high-dimensional models, hierarchical and structural relationships among features are often used to constrain the search for the more important interactions. These relationships may come from prior knowledge or traditional design principles, such as that low-order effects should have larger contributions than higher-order ones and should be included into the model earlier. However, these structural constraints also make the optimization problem more challenging. In this paper, we propose the use of the alternating direction method of multipliers (ADMM) and accelerated gradient methods. In particular, we show that ADMM can be used to either directly solve the problem or serve as a key building block. Experimental results on a number of synthetic and real-world data sets demonstrate that the proposed algorithm is efficient and flexible. Moreover, the use of the hierarchical relationships consistently improves generalization performance and parameter estimation.

【Keywords】: directed graphs; generalisation (artificial intelligence); gradient methods; learning (artificial intelligence); optimisation; parameter estimation; ADMM; DAG-structured parameter constraints; accelerated gradient methods; alternating direction method of multipliers; design principles; feature hierarchical relationships; feature structural relationships; generalization performance; high-dimensional models; key building block; learning; low-order effects; optimization problem; parameter estimation; structural constraints; Conferences; Data mining; Accelerated gradient methods; Alternating direction method of multipliers; Heredity; Structural sparsity

【Paper Link】【Pages】:907-916

【Authors】: Chuan Zhou ; Peng Zhang ; Jing Guo ; Xingquan Zhu ; Li Guo

【Abstract】: Influence maximization, defined as finding a small subset of nodes that maximizes spread of influence in social networks, is NP-hard under both Linear Threshold (LT) and Independent Cascade (IC) models, where a line of greedy/heuristic algorithms have been proposed. The simple greedy algorithm [14] achieves an approximation ratio of 1-1/e. The advanced CELF algorithm [16], by exploiting the sub modular property of the spread function, runs 700 times faster than the simple greedy algorithm on average. However, CELF is still inefficient [4], as the first iteration calls for N times of spread estimations (N is the number of nodes in networks), which is computationally expensive especially for large networks. To this end, in this paper we derive an upper bound function for the spread function. The bound can be used to reduce the number of Monte-Carlo simulation calls in greedy algorithms, especially in the first iteration of initialization. Based on the upper bound, we propose an efficient Upper Bound based Lazy Forward algorithm (UBLF in short), by incorporating the bound into the CELF algorithm. We test and compare our algorithm with prior algorithms on real-world data sets. Experimental results demonstrate that UBLF, compared with CELF, reduces more than 95% Monte-Carlo simulations and achieves at least 2-5 times speed-raising when the seed set is small.

【Keywords】: Monte Carlo methods; approximation theory; greedy algorithms; optimisation; social networking (online); CELF algorithm; IC model; LT model; Monte-Carlo simulation calls; UBLF; approximation ratio; greedy algorithm; heuristic algorithms; independent cascade model; influence maximization; influential node discovery; linear threshold model; social networks; spread estimations; spread function; submodular property; upper bound based lazy forward algorithm; upper bound function; Approximation algorithms; Computational modeling; Estimation; Greedy algorithms; Integrated circuit modeling; Upper bound; Greedy Algorithms; Independent Cascade Model; Influence Maximization; Social Networks

93. Divide-and-Conquer Anchoring for Near-Separable Nonnegative Matrix Factorization and Completion in High Dimensions.

【Paper Link】【Pages】:917-926

【Authors】: Tianyi Zhou ; Wei Bian ; Dacheng Tao

【Abstract】: Nonnegative matrix factorization (NMF) becomes tractable in polynomial time with unique solution under separability assumption, which postulates all the data points are contained in the conical hull of a few anchor data points. Recently developed linear programming and greedy pursuit methods can pick out the anchors from noisy data and results in a near-separable NMF. But their efficiency could be seriously weakened in high dimensions. In this paper, we show that the anchors can be precisely located from low-dimensional geometry of the data points even when their high dimensional features suffer from serious incompleteness. Our framework, entitled divide-and-conquer anchoring (DCA), divides the high-dimensional anchoring problem into a few cheaper sub-problems seeking anchors of data projections in low-dimensional random spaces, which can be solved in parallel by any near-separable NMF, and combines all the detected low-dimensional anchors via a fast hypothesis testing to identify the original anchors. We further develop two non-iterative anchoring algorithms in 1D and 2D spaces for data in convex hull and conical hull, respectively. These two rapid algorithms in the ultra low dimensions suffice to generate a robust and efficient near-separable NMF for high-dimensional or incomplete data via DCA. Compared to existing methods, two vital advantages of DCA are its scalability for big data, and capability of handling incomplete and high-dimensional noisy data. A rigorous analysis proves that DCA is able to find the correct anchors of a rank-k matrix by solving math cal O(klog k) sub-problems. Finally, we show DCA outperforms state-of-the-art methods on various datasets and tasks.

【Keywords】: data handling; divide and conquer methods; geometry; iterative methods; linear programming; matrix decomposition; conical hull; convex hull; divide-and-conquer anchoring; fast hypothesis testing; greedy pursuit methods; linear programming; low- dimensional geometry; low-dimensional random spaces; near-separable nonnegative matrix factorization; polynomial time; separability assumption; Bismuth; Geometry; Noise; Noise measurement; Robustness; Testing; Vectors; Nonnegative matrix factorization; anchor points; big data; conical hull; divide-and-conquer; matrix completion; random projection; separability assumption

94. A Novel Relational Learning-to-Rank Approach for Topic-Focused Multi-document Summarization.

【Paper Link】【Pages】:927-936

【Authors】: Yadong Zhu ; Yanyan Lan ; Jiafeng Guo ; Pan Du ; Xueqi Cheng

【Abstract】: Topic-focused multi-document summarization aims to produce a summary over a set of documents and conveys the most important aspects of a given topic. Most existing extractive methods view the task as a multi-criteria ranking problem over sentences, where relevance, salience and diversity are three typical requirements. However, diversity is a challenging problem as it involves modeling the relationship between sentences during ranking, where traditional methods usually tackle it in a heuristic or implicit way. In this paper, we propose a novel relational learning-to-rank approach (R-LTR) to solve this problem. Relational learning-to-rank is a new learning framework which further incorporates relationships into traditional learning-to-rank in an elegant way. Specifically, the ranking function is defined as the combination of content-based score of individual sentence, and relation-based score between the current sentence and those already selected. On this basis, we propose to learn the ranking function by minimizing the likelihood loss based on Plackett-Luce model, which can naturally model the sequential ranking procedure of candidate sentences. Stochastic gradient descent is then employed to conduct the learning process, and the summary is predicted by the greedy selection procedure based on the learned ranking function. Finally, we conduct extensive experiments on benchmark data sets TAC2008 and TAC2009. Experimental results show that our approach can significantly outperform the state-of-the-art methods from both quantitative and qualitative aspects.

【Keywords】: document handling; gradient methods; learning (artificial intelligence); stochastic processes; Plackett-Luce model; R-LTR; TAC2008 benchmark data sets; TAC2009 benchmark data sets; candidate sentences; content-based score; greedy selection procedure; heuristic; likelihood loss minimization; multicriteria ranking problem; relation-based score; relational learning-to-rank approach; sequential ranking procedure; stochastic gradient descent; topic-focused multidocument summarization; Computational modeling; Hidden Markov models; Measurement; Stochastic processes; Support vector machines; Training data; Vectors

Short Papers 65

95. Cartification: A Neighborhood Preserving Transformation for Mining High Dimensional Data.

【Paper Link】【Pages】:937-942

【Authors】: Emin Aksehirli ; Bart Goethals ; Emmanuel Müller ; Jilles Vreeken

【Abstract】: The analysis of high dimensional data comes with many intrinsic challenges. In particular, cluster structures become increasingly hard to detect when the data includes dimensions irrelevant to the individual clusters. With increasing dimensionality, distances between pairs of objects become very similar, and hence, meaningless for knowledge discovery. In this paper we propose Cartification, a new transformation to circumvent this problem. We transform each object into an item set, which represents the neighborhood of the object. We do this for multiple views on the data, resulting in multiple neighborhoods per object. This transformation enables us to preserve the essential pair wise-similarities of objects over multiple views, and hence, to improve knowledge discovery in high dimensional data. Our experiments show that frequent item set mining on the certified data outperforms competing clustering approaches on the original data space, including traditional clustering, random projections, principle component analysis, subspace clustering, and clustering ensemble.

【Keywords】: data mining; pattern clustering; cartification; clustering ensemble; high dimensional data mining; knowledge discovery; neighborhood preserving transformation; principle component analysis; random projections; subspace clustering; Algorithm design and analysis; Clustering algorithms; Data mining; Itemsets; Noise; Noise measurement; clustering; frequent itemset mining; high dimensional data; subspace projections; transformation

96. Improved Electricity Load Forecasting via Kernel Spectral Clustering of Smart Meters.

【Paper Link】【Pages】:943-948

【Authors】: Carlos Alzate ; Mathieu Sinn

【Abstract】: This paper explores kernel spectral clustering methods to improve forecasts of aggregated electricity smart meter data. The objective is to cluster the data in such a way that building a forecasting models separately for each cluster and taking the sum of forecasts leads to a better accuracy than building one forecasting model for the total aggregate of all meters. To measure the similarity between time series, we consider wavelet feature extraction and several positive-definite kernels. To forecast the aggregated meter data, we use a periodic autoregressive model with calendar and temperature information as exogenous variable. The data used in the experiments are smart meter recordings from 6,000 residential customers and small-to-medium enterprises collected by the Irish Commission for Energy Regulation (CER). The results show a 20% improvement in forecasting accuracy, where the highest gain is obtained using a kernel with the Spearman's distance. The resulting clusters show distinctive patterns particularly during hours of peak demand.

【Keywords】: feature extraction; load forecasting; pattern clustering; power engineering computing; small-to-medium enterprises; smart meters; time series; wavelet transforms; CER; Irish Commission for Energy Regulation; improved electricity load forecasting; kernel spectral clustering; periodic autoregressive model; positive-definite kernels; residential customers; small-to-medium enterprises; smart meters; time series; wavelet feature extraction; Data models; Forecasting; Kernel; Load modeling; Predictive models; Reactive power; Time series analysis; clustering; disaggregation; load forecasting; smart meter data

97. Large-Scale Elastic Net Regularized Linear Classification SVMs and Logistic Regression.

【Paper Link】【Pages】:949-954

【Authors】: Balamurugan P.

【Abstract】: Elastic Net Regularizers have shown much promise in designing sparse classifiers for linear classification. In this work, we propose an alternating optimization approach to solve the dual problems of elastic net regularized linear classification Support Vector Machines (SVMs) and logistic regression (LR). One of the sub-problems turns out to be a simple projection. The other sub-problem can be solved using dual coordinate descent methods developed for non-sparse L2-regularized linear SVMs and LR, without altering their iteration complexity and convergence properties. Experiments on very large datasets indicate that the proposed dual coordinate descent - projection (DCD-P) methods are fast and achieve comparable generalization performance after the first pass through the data, with extremely sparse models.

【Keywords】: gradient methods; logistics; optimisation; pattern classification; regression analysis; support vector machines; convergence properties; dual coordinate descent methods; dual coordinate descent-projection method; elastic net regularizers; iteration complexity; large-scale elastic net regularized linear classification; logistic regression; nonsparse L2-regularized linear SVM; optimization approach; support vector machines; Accuracy; Convergence; Convex functions; Logistics; Optimization; Support vector machines; Training

98. Influence-Based Network-Oblivious Community Detection.

【Paper Link】【Pages】:955-960

【Authors】: Nicola Barbieri ; Francesco Bonchi ; Giuseppe Manco

【Abstract】: How can we detect communities when the social graphs is not available? We tackle this problem by modeling social contagion from a log of user activity, that is a dataset of tuples (u, i, t) recording the fact that user u "adopted" item i at time t. This is the only input to our problem. We propose a stochastic framework which assumes that item adoptions are governed by un underlying diffusion process over the unobserved social network, and that such diffusion model is based on community-level influence. By fitting the model parameters to the user activity log, we learn the community membership and the level of influence of each user in each community. This allows to identify for each community the "key" users, i.e., the leaders which are most likely to influence the rest of the community to adopt a certain item. The general framework can be instantiated with different diffusion models. In this paper we define two models: the extension to the community level of the classic (discrete time) Independent Cascade model, and a model that focuses on the time delay between adoptions. To the best of our knowledge, this is the first work studying community detection without the network.

【Keywords】: learning (artificial intelligence); social networking (online); stochastic processes; classic independent cascade model; community membership learning; community-level influence; diffusion model; diffusion process; influence-based network-oblivious community detection; item adoptions; social contagion modelling; social network; stochastic framework; user activity log; Adaptation models; Communities; Delays; Peer-to-peer computing; Social network services; Standards; Stochastic processes; community detection; social influence

99. Beyond Boolean Matrix Decompositions: Toward Factor Analysis and Dimensionality Reduction of Ordinal Data.

【Paper Link】【Pages】:961-966

【Authors】: Radim Belohlávek ; Markéta Krmelová

【Abstract】: Boolean matrix factorization (BMF), or decomposition, received a considerable attention in data mining research. In this paper, we argue that research should extend beyond the Boolean case toward more general type of data such as ordinal data. Technically, such extension amounts to replacement of the two-element Boolean algebra utilized in BMF by more general structures, which brings non-trivial challenges. We first present the problem formulation, survey the existing literature, and provide an illustrative example. Second, we present new theorems regarding decompositions of matrices with ordinal data. Third, we propose a new algorithm based on these results along with an experimental evaluation.

【Keywords】: Boolean algebra; data mining; data reduction; matrix decomposition; BMF; Boolean matrix decompositions; Boolean matrix factorization; data mining research; factor analysis; ordinal data dimensionality reduction; two-element Boolean algebra; Algorithm design and analysis; Data mining; Dogs; Educational institutions; Lattices; Matrix decomposition; Vectors; Galois connection; aggregation; concept lattice; factor analysis; matrix decomposition; ordinal data

100. Validating Network Value of Influencers by Means of Explanations.

【Paper Link】【Pages】:967-972

【Authors】: Glenn S. Bevilacqua ; Shealen Clare ; Amit Goyal ; Laks V. S. Lakshmanan

【Abstract】: Recently, there has been significant interest in social influence analysis. One of the central problems in this area is the problem of identifying influencers, such that by convincing these users to perform a certain action (like buying a new product), a large number of other users get influenced to follow the action. The client of such an application is essentially a marketer who would target these influencers for marketing a given new product, say by providing free samples or discounts. It is natural that before committing resources for targeting an influencer the marketer would be interested in validating the influence (or network value) of influencers returned. This requires digging deeper into such analytical questions as: who are their followers, on what actions (or products) they are influential, etc. However, the current approaches to identifying influencers largely work as a black box in this respect. The goal of this paper is to open up the black box, address these questions and provide informative and crisp explanations for validating the network value of influencers. We formulate the problem of providing explanations (called PROXI) as a discrete optimization problem of feature selection. We show that PROXI is not only NP-hard to solve exactly, it is NP-hard to approximate within any reasonable factor. Nevertheless, we show interesting properties of the objective function and develop an intuitive greedy heuristic. We perform detailed experimental analysis on two real world datasets - Twitter and Flixster, and show that our approach is useful in generating concise and insightful explanations of the influence distribution of users and that our greedy algorithm is effective and efficient with respect to several baselines.

【Keywords】: computational complexity; data mining; feature selection; greedy algorithms; heuristic programming; marketing; optimisation; social networking (online); Flixster datasets; NP-hard problem; PROXI; Twitter datasets; black box; data mining; discrete optimization problem; feature selection; influencer identification; intuitive greedy heuristic; marketing; network value; objective function; social influence analysis; Algorithm design and analysis; Blogs; Communities; Greedy algorithms; Motion pictures; TV; Twitter; explanations; influence propagation analysis; social networks; supermodular and submodular functions; viral marketing

101. External Evaluation of Topic Models: A Graph Mining Approach.

【Paper Link】【Pages】:973-978

【Authors】: Hau Chan ; Leman Akoglu

【Abstract】: Given a topic and its top-k most relevant words generated by a topic model, how can we tell whether it is a low-quality or a high-quality topic? Topic models provide a low-dimensional representation of large document corpora, and drive many important applications such as summarization, document segmentation, word-sense disambiguation, etc. Evaluation of topic models is an important issue, since low-quality topics potentially degrade the performance of these applications. In this paper, we develop a graph mining and machine learning approach for the external evaluation of topic models. Based on the graph-centric features we extract from the projection of topic words on the Wikipedia page-links graph, we learn models that can predict the human-perceived quality of topics (based on human judgments), and classify them as high or low quality. Experiments on four real-world corpora show that our approach boosts the prediction performance up to 30% over three baselines of various complexities, and demonstrate the generality of our method to diverse domains. In addition, we provide an interpretation of our models and outline the discriminating characteristics of topic quality.

【Keywords】: Web sites; data mining; feature extraction; graph theory; learning (artificial intelligence); natural language processing; pattern classification; text analysis; Wikipedia page-links graph; document segmentation; graph mining; graph-centric feature extraction; high-quality topic; human-perceived quality; large document corpora; low-dimensional representation; low-quality topic; machine learning approach; summarization; top-k most relevant words; topic model; topic quality; topic word projection; word sense disambiguation; Digital signal processing; Electronic publishing; Encyclopedias; Feature extraction; Internet; Predictive models; graph mining; human evaluation; topic models

102. Multimedia LEGO: Learning Structured Model by Probabilistic Logic Ontology Tree.

【Paper Link】【Pages】:979-984

【Authors】: Shiyu Chang ; Guo-Jun Qi ; Jinhui Tang ; Qi Tian ; Yong Rui ; Thomas S. Huang

【Abstract】: Recent advances in Multimedia research have generated a large collection of concept models, e.g., LSCOM and Media mill 101, which become accessible to other researchers. While most current research effort still focuses on building new concepts from scratch, little effort has been made on constructing new concepts upon the existing models already in the warehouse. To address this issue, we develop a new framework in this paper, termed LEGO, to seamlessly integrate both the new target training examples and the existing primitive concept models. LEGO treats the primitive concept models as a lego toy to potentially construct an unlimited vocabulary of new concepts. Specifically, LEGO first formulates the logic operations to be the lego connectors to combine existing concept models hierarchically in probabilistic logic ontology trees. LEGO then simultaneously incorporates new target training information to efficiently disambiguate the underlying logic tree and correct the error propagation. We present extensive experimental results on a large vehicle domain data set from Image Net, and demonstrate significantly superior performance over existing state-of-the-art approaches which build new concept models from scratch.

【Keywords】: multimedia systems; ontologies (artificial intelligence); probabilistic logic; trees (mathematics); ImageNet; large vehicle domain data set; lego connectors; lego toy; logic operations; multimedia LEGO; primitive concept models; probabilistic logic ontology tree; structured model learning; unlimited vocabulary; Computational modeling; Mathematical model; Multimedia communication; Ontologies; Probabilistic logic; Semantics; Training; Concept recycling; Logical operations; Model warehouse; Multimedia LEGO; Probabilistic logic ontology tree

103. Discriminatively Enhanced Topic Models.

【Paper Link】【Pages】:985-990

【Authors】: Snigdha Chaturvedi ; Hal Daumé III ; Taesun Moon

【Abstract】: This paper proposes a space-efficient, discriminatively enhanced topic model: a V structured topic model with an embedded log-linear component. The discriminative log-linear component reduces the number of parameters to be learnt while outperforming baseline generative models. At the same time, the explanatory power of the generative component is not compromised. We establish its superiority over a purely generative model by applying it to two different ranking tasks: (a) In the first task, we look at the problem of proposing alternative citations given textual and bibliographic evidence. We solve it as a ranking problem in itself and as a platform for further qualitative analysis of convergence of scientific phenomenon. (b) In the second task we address the problem of ranking potential email recipients based on email content and sender information.

【Keywords】: citation analysis; recommender systems; V structured topic model; alternative citations; bibliographic evidence; bibliography recommendation; citation recommendation; discriminatively enhanced topic models; embedded log-linear component; ranking problem; textual evidence; Context modeling; Electronic mail; Hidden Markov models; Hybrid power systems; Mathematical model; Predictive models; Vectors; Log linear models; Probabilistic ranking; Text Mining; Topic Models

104. PerturBoost: Practical Confidential Classifier Learning in the Cloud.

【Paper Link】【Pages】:991-996

【Authors】: Keke Chen ; Shumin Guo

【Abstract】: Mining large data requires intensive computing resources and data mining expertise, which might not be available for many users. With the development of cloud computing and services computing, data mining tasks can now be moved to the cloud or outsourced to third parties to save costs. In this new paradigm, data and model confidentiality becomes the major concern to the data owner. Meanwhile, users are also concerned about the potential tradeoff among costs, model quality, and confidentiality. In this paper, we propose the PerturBoost framework to address the problems in confidential cloud or outsourced learning. PerturBoost combined with the random space perturbation (RASP) method that was also developed by us can effectively protect data confidentiality, model confidentiality, and model quality with low client-side costs. Based on the boosting framework, we develop a number of base learner algorithms that can learn linear classifiers from the RASP-perturbed data. This approach has been evaluated with public datasets. The result shows that the RASP-based PerturBoost can provide model accuracy very close to the classifiers trained with the original data and the AdaBoost method, with high confidentiality guarantee and acceptable costs.

【Keywords】: cloud computing; costing; data mining; learning (artificial intelligence); pattern classification; security of data; AdaBoost method; PerturBoost framework; RASP method; RASP-perturbed data; base learner algorithm; boosting framework; cloud computing; confidential cloud; cost; data confidentiality; data mining; linear classifier learning; model confidentiality; model quality; outsourced learning; practical confidential classifier learning; random space perturbation method; services computing; Accuracy; Cloud computing; Computational modeling; Data mining; Data models; Servers; Vectors

105. A Feature-Enhanced Ranking-Based Classifier for Multimodal Data and Heterogeneous Information Networks.

【Paper Link】【Pages】:997-1002

【Authors】: Scott Deeann Chen ; Ying-Yu Chen ; Jiawei Han ; Pierre Moulin

【Abstract】: We propose a heterogeneous information network mining algorithm: feature-enhanced Rank Class (F-Rank Class). F-Rank Class extends Rank Class to a unified classification framework that can be applied to binary or multiclass classification of unimodal or multimodal data. We experimented on a multimodal document dataset, 2008/9 Wikipedia Selection for Schools. For unimodal classification, F-Rank Class is compared to support vector machines (SVMs). F-Rank Class provides improvements up to 27.3% on the Wikipedia dataset. For multimodal document classification, F-Rank Class shows improvements up to 19.7% in accuracy when compared to SVM-based meta-classifiers. We also study 1) how the structure of the network and 2) how the choice of parameters affect the classification results.

【Keywords】: Web sites; data mining; document handling; pattern classification; F-RankClass; Wikipedia dataset; binary classification; feature-enhanced ranking-based classifier; heterogeneous information network mining algorithm; multiclass classification; multimodal data; multimodal document classification; multimodal document dataset; unified classification framework; unimodal data; Accuracy; Data mining; Educational institutions; Encyclopedias; Feature extraction; Image edge detection; Support vector machines; classification; heterogeneous information network; multimodal; ranking

106. Nonlinear Causal Discovery for High Dimensional Data: A Kernelized Trace Method.

【Paper Link】【Pages】:1003-1008

【Authors】: Zhitang Chen ; Kun Zhang ; Laiwan Chan

【Abstract】: Causal discovery for high-dimensional observations is a useful tool in many fields such as climate analysis and financial market analysis. A linear Trace method has been proposed to identify the causal direction between two linearly coupled high-dimensional observations X and Y. However, in reality, the relations between X and Y are usually nonlinear and consequently the linear Trace method may fail. In this paper, we propose a method to infer the nonlinear causal relations for two high-dimensional observations X and Y. The idea is to map the observations to high dimensional Reproducing Kernel Hilbert Space (RKHS) such that the nonlinear relations become simple linear ones. We show that the linear Trace condition holds for the causal direction but it is violated for the anti-causal direction in RKHS. Based on this theoretical result, we develop a simple algorithm to infer the causal direction for nonlinearly coupled causal pairs. Synthetic data and real world data experiments are conducted to show the effectiveness of our proposed method.

【Keywords】: Hilbert spaces; data handling; RKHS; anticausal direction; high dimensional data; high dimensional reproducing kernel Hilbert space; kernelized trace method; linear Trace method; linearly coupled high-dimensional observations; nonlinear causal discovery; nonlinearly coupled causal pairs; real world data experiments; synthetic data; Accuracy; Covariance matrices; Eigenvalues and eigenfunctions; Electronic mail; Hilbert space; Kernel; Meteorology; high dimensional data; kernel methods; linear Trace method; nonlinear causal discovery

107. Discriminative Link Prediction Using Local Links, Node Features and Community Structure.

【Paper Link】【Pages】:1009-1018

【Authors】: Abir De ; Niloy Ganguly ; Soumen Chakrabarti

【Abstract】: A link prediction (LP) algorithm is given a graph, and has to rank, for each node, other nodes that are candidates for new linkage. LP is strongly motivated by social search and recommendation applications. LP techniques often focus on global properties (graph conductance, hitting or commute times, Katz score) or local properties (Adamic-Adar and many variations, or node feature vectors), but rarely combine these signals. Furthermore, neither of these extremes exploit link densities at the intermediate level of communities. In this paper we describe a discriminative LP algorithm that exploits two new signals. First, a co-clustering algorithm provides community level link density estimates, which are used to qualify observed links with a surprise value. Second, links in the immediate neighborhood of the link to be predicted are interpreted %at face value, but through a local model of node feature similarities. These signals are combined into a discriminative link predictor. We evaluate the new predictor using five diverse data sets that are standard in the literature. We report on significant accuracy boosts compared to standard LP methods (including Adamic-Adar and random walk). Apart from the new predictor, another contribution is a rigorous protocol for benchmarking and reporting LP algorithms, which reveals the regions of strengths and weaknesses of all the predictors studied here, and establishes the new proposal as the most robust.

【Keywords】: recommender systems; social networking (online); coclustering algorithm; community level link density estimates; community structure; discriminative LP algorithm; discriminative link prediction; local links; node features; Accuracy; Communities; Couplings; Motion pictures; Prediction algorithms; Social network services; Vectors; Link prediction; Recommendation; Social network

108. Most Clusters Can Be Retrieved with Short Disjunctive Queries.

【Paper Link】【Pages】:1019-1024

【Authors】: Vinay Deolalikar

【Abstract】: Simple keyword based searches are ubiquitous in today's internet age. It is hard to imagine an information system today that does not permit a simple keyword based search. This method of information retrieval has the obvious benefits of being highly interpretable, and having wide usage. However, a general perception is that keyword search may not be as powerful an information retrieval paradigm as those that utilize data mining technologies. At the same time, the tremendous growth in textual information in various domains has also given impetus to data mining technologies such as document clustering. Document clustering is a powerful technique, having wide applications in enterprise information management (EIM). However, there is a general perception that the clusters it produces are not always easily interpretable. This hampers its usage in certain settings. This leads us to the following question: can we retrieve a cluster (from a corpus) using a keyword search with precision and recall that are reasonable from the point of view of a retrieval system? What is the form of such a keyword search? How many keywords do we require? How do we arrive at these keywords? Not only are these questions natural, they have immediate use in several highly regulated applications in EIM such as eDiscovery and compliance, where document sets must be specified using keywords. In order to answer our question, we construct a framework that uses maximal frequent discriminative item sets. The novelty of our usage of these item sets is that although their definition as frequent item sets is conjunctive, we use them to form a disjunctive query upon the corpus. We then study the results of this query as an information retrieval problem whose target is the cluster. Our study yields a surprising result: most clusters can be retrieved, up to reasonable precision and recall, using a disjunctive query of only three terms. Among other ramifications, this gives us a readily interpretable descrip- ion of a cluster in terms of the disjunctive query that returns it.

【Keywords】: information systems; pattern clustering; query processing; text analysis; EIM; Internet; compliance; data mining technology; document clustering; document sets; e-discovery; enterprise information management; information retrieval method; information system; keyword based searches; maximal frequent discriminative item sets; short disjunctive query; textual information; Clustering algorithms; Data mining; Itemsets; Keyword search; Organizations; Standards; Disjunctive Queries; Document Clustering; Frequent Itemsets; Keyword Search; Retrieval

109. Integrity Verification of Outsourced Frequent Itemset Mining with Deterministic Guarantee.

【Paper Link】【Pages】:1025-1030

【Authors】: Boxiang Dong ; Ruilin Liu ; Wendy Hui Wang

【Abstract】: In this paper, we focus on the problem of result integrity verification for outsourcing of frequent item set mining. We design efficient cryptographic approaches that verify whether the returned frequent item set mining results are correct and complete with deterministic guarantee. The key of our solution is that the service provider constructs cryptographic proofs of the mining results. Both correctness and completeness of the mining results are measured against the proofs. We optimize the verification by minimizing the number of proofs. Our empirical study demonstrates the efficiency and effectiveness of the verification approaches.

【Keywords】: cryptography; data mining; program verification; cryptographic proofs; deterministic guarantee; integrity verification approach; outsourced frequent item set mining; service provider; Complexity theory; Cryptography; Data mining; Itemsets; Optimization; Protocols; Servers; Cloud computing; Data-mining-as-a-service; frequent itemset mining; integrity verification

110. Progression Analysis of Community Strengths in Dynamic Networks.

【Paper Link】【Pages】:1031-1036

【Authors】: Nan Du ; Jing Gao ; Aidong Zhang

【Abstract】: Community formation analysis of dynamic networks has been a hot topic in data mining which has attracted much attention. Recently, there are many studies which focus on discovering communities successively from each snapshot by considering both current and historical information. However, the detected communities are isolated at a certain snapshot, because these approaches ignore important historical or successive information. Different from previous studies which focus on community detection in dynamic networks, we define a new problem of tracking the progression of the community strength - a novel measure that reflects the community robustness and coherence throughout the entire observation period. The proposed community strength analysis provides significant insights into entity properties and relationships in a wide variety of applications. To tackle this problem, we propose a novel two-stage framework: we first identify communities via non-negative matrix factorization, and then calculate the strength of each detected community corresponding to each specific snapshot by solving an optimization problem. Experimental results show that the proposed approach is highly effective in discovering the progression of community strengths and detecting interesting communities.

【Keywords】: data mining; matrix decomposition; optimisation; community strength analysis; community strengths; dynamic networks; nonnegative matrix factorization; progression analysis; Biology; Communities; Entropy; Joining processes; Linear programming; Picture archiving and communication systems; Symmetric matrices; dynamic networks; temporal community analysis

111. Walk 'n' Merge: A Scalable Algorithm for Boolean Tensor Factorization.

【Paper Link】【Pages】:1037-1042

【Authors】: Dóra Erdös ; Pauli Miettinen

【Abstract】: Tensors are becoming increasingly common in data mining, and consequently, tensor factorizations are becoming more important tools for data miners. When the data is binary, it is natural to ask if we can factorize it into binary factors while simultaneously making sure that the reconstructed tensor is still binary. Such factorizations, called Boolean tensor factorizations, can provide improved interpretability and find Boolean structure that is hard to express using normal factorizations. Unfortunately the algorithms for computing Boolean tensor factorizations do not usually scale well. In this paper we present a novel algorithm for finding Boolean CP and Tucker decompositions of large and sparse binary tensors. In our experimental evaluation we show that our algorithm can handle large tensors and accurately reconstructs the latent Boolean structure.

【Keywords】: Boolean algebra; data mining; matrix decomposition; tensors; Boolean tensor factorization; data mining; latent Boolean structure; normal factorizations; reconstructed tensor; scalable algorithm; sparse binary tensors; Additive noise; Encoding; Facebook; Matrix decomposition; Merging; Tensile stress; Boolean tensors; MDL principle; Random walks; Tensor factorizations

112. Online Active Learning with Imbalanced Classes.

【Paper Link】【Pages】:1043-1048

【Authors】: Zahra Ferdowsi ; Rayid Ghani ; Raffaella Settimi

【Abstract】: This paper proposes an online algorithm for active learning that switches between different candidate instance selection strategies (ISS) for classification in imbalanced data sets. This is important for two reasons: 1) many real-world problems have imbalanced class distributions and 2) there is no ISS that always outperforms all the other techniques. We first empirically compare the performance of existing techniques on imbalanced data sets and show that different strategies work better on different data sets and some techniques even hurt compared to random selection. We then propose an unsupervised score to track and predict the performance of individual instance selection techniques, allowing us to select an effective technique without using a holdout set and wasting valuable labeled data. This score is used in a simple online learning approach that switches between different ISS at each iteration. The proposed approach performs better than the best individual strategy available to the online algorithm over data sets in this paper and provides a way to build practical and effective active learning system for imbalanced data sets.

【Keywords】: data analysis; learning (artificial intelligence); ISS; imbalanced data sets classification; instance selection strategies; online active learning; random selection; unsupervised score; Correlation; Educational institutions; Learning systems; Measurement; Prediction algorithms; Training; Uncertainty

113. Efficient Invariant Search for Distributed Information Systems.

【Paper Link】【Pages】:1049-1054

【Authors】: Yong Ge ; Guofei Jiang ; Yuan Ge

【Abstract】: In today's distributed information systems, a large amount of monitoring data such as log files have been collected. These monitoring data at various points of a distributed information system provide unparallel opportunities for us to characterize and track the information system via effectively correlating all monitoring data across the distributed system. Jiang1 proposed a concept named flow intensity to measure the intensity with which the monitoring data reacts to the volume of different user requests. The Autoregressive model with exogenous inputs (ARX) was used to quantify the relationship between each pair of flow intensity measured at various points across distributed systems. If such relationships hold all the time, they are considered as invariants of the underlying systems. Such invariants have been successfully used to characterize complex systems and support various system management tasks, such as system fault detection and localization. However, it is very time-consuming to search the complete set of invariants of large scale systems and existing algorithms are not scalable for thousands of flow intensity measurements. To this end, in this paper, we develop effective pruning techniques based on the identified upper bounds. Accordingly, two efficient algorithms are proposed to search the complete set of invariants based on the pruning techniques. Finally we demonstrate the efficiency and effectiveness of our algorithms with both real-world and synthetic data sets.

【Keywords】: autoregressive processes; distributed processing; information systems; ARX; autoregressive model with exogenous inputs; distributed information systems; flow intensity measurements; information system characterization; information system tracking; invariant search; monitoring data; pruning technique; real-world data set; synthetic data set; Computational modeling; Distributed information systems; Equations; Mathematical model; Monitoring; Time series analysis; Upper bound; ARX Model; AutoRegressive Model; Efficient Search; Invariant

114. Coupled Heterogeneous Association Rule Mining (CHARM): Application Toward Inference of Modulatory Climate Relationships.

【Paper Link】【Pages】:1055-1060

【Authors】: Doel L. Gonzalez ; Saurabh V. Pendse ; Kanchana Padmanabhan ; Michael P. Angus ; Isaac K. Tetteh ; Shashank Srinivas ; Andrea Villanes ; Fredrick H. M. Semazzi ; Vipin Kumar ; Nagiza F. Samatova

【Abstract】: The complex dynamic climate system often exhibits hierarchical modularity of its organization and function. Scientists have spent decades trying to discover and understand the driving mechanisms behind western African Sahel summer rainfall variability, mostly via hypothesis-driven and/or first-principles based research. Their work has furthered theory regarding the connections between various climate patterns, but the key relationships are still not fully understood. We present Coupled Heterogeneous Association Rule Mining (CHARM), a computationally efficient methodology that mines higher-order relationships between these subsystems' anomalous temporal phases with respect to their effect on the system's response. We apply this to climate science data, aiming to infer putative pathways/cascades of modulating events and the modulating signs that collectively define the network of pathways for the rainfall anomaly in the Sahel. Experimental results are consistent with fundamental theories of phenomena in climate science, especially physical processes that best describe sub-regional climate.

【Keywords】: climatology; data mining; geophysics computing; CHARM; climate science data; coupled heterogeneous association rule mining; higher-order relationships mining; modulatory climate relationships inference; rainfall anomaly; Couplings; Data mining; Itemsets; Manganese; Measurement; Meteorology; Oceans; association rules; climate; data coupling; knowledge discovery

115. Leveraging Supervised Label Dependency Propagation for Multi-label Learning.

【Paper Link】【Pages】:1061-1066

【Authors】: Bin Fu ; Guandong Xu ; Zhihai Wang ; Longbing Cao

【Abstract】: Exploiting label dependency is a key challenge in multi-label learning, and current methods solve this problem mainly by training models on the combination of related labels and original features. However, label dependency cannot be exploited dynamically and mutually in this way. Therefore, we propose a novel paradigm of leveraging label dependency in an iterative way. Specifically, each label's prediction will be updated and also propagated to other labels via an random walk with restart process. Meanwhile, the label propagation is implemented as a supervised learning procedure via optimizing a loss function, thus more appropriate label dependency can be learned. Extensive experiments are conducted, and the results demonstrate that our method can achieve considerable improvements in terms of several evaluation metrics.

【Keywords】: learning (artificial intelligence); evaluation metrics; loss function optimization; multilabel learning; random walk; restart process; supervised label dependency propagation; supervised learning procedure; Educational institutions; Feature extraction; Measurement; Prediction algorithms; Predictive models; Training; Vectors; Label dependency; Multi-label learning; Random walk with restart

116. A Mobility Simulation Framework Of Humans With Group Behavior Modeling.

【Paper Link】【Pages】:1067-1072

【Authors】: Anshul Gupta ; Aurosish Mishra ; Satya Gautam Vadlamudi ; P. P. Chakrabarti ; Sudeshna Sarkar ; Tridib Mukherjee ; Nathan Gnanasambandam

【Abstract】: We present a mobility simulation framework that simulates the movement behaviors of people to generate spatiotemporal movement data. There is a growing interest in applications that make use of patterns mined from spatio-temporal data. However, since the availability of actual spatio-temporal movement data in the public domain is limited, it is useful to have simulation frameworks that generate data close to the real life behavior of people, so that data mining techniques can be tested. We argue that modeling group behavior effectively is a key element of any real-life simulation framework, because there are many applications that require the knowledge of groups and events. In this work, we propose generic models to represent individual and group movement behaviors. We present an algorithm that takes various behaviors created using the proposed models, and generates spatio-temporal movement data for as many individuals as needed. Experimental analysis shows the efficacy of the proposed framework handling a broad spectrum of behaviors with high scalability.

【Keywords】: behavioural sciences computing; data mining; data mining techniques; group behavior modeling; group movement behaviors; mobility simulation framework; movement behavior simulation; real-life simulation framework; spatiotemporal movement data; Computational modeling; Data mining; Data models; Educational institutions; Electronic mail; Probability distribution; Roads; group behavior; mobility simulation; spatio-temporal

【Paper Link】【Pages】:1073-1078

【Authors】: Bo Hu ; Mohsen Jamali ; Martin Ester

【Abstract】: Mobile networks enable users to post on social media services (e.g., Twitter) from anywhere and anytime. This new phenomenon led to the emergence of a new line of work of mining the behavior of mobile users taking into account the spatio-temporal aspects of their engagement with online social media. In this paper, we address the problem of recommending the right locations to users at the right time. We claim to propose the first comprehensive model, called STT (Spatio-Temporal Topic), to capture the spatio-temporal aspects of user check-ins in a single probabilistic model for location recommendation. Our proposed generative model does not only captures spatio-temporal aspects of check-ins, but also profiles users. We conduct experiments on real life data sets from Twitter, Go Walla, and Bright kite. We evaluate the effectiveness of STT by evaluating the accuracy of location recommendation. The experimental results show that STT achieves better performance than the state-of-the-art models in the areas of recommender systems as well as topic modeling.

【Keywords】: data mining; mobile computing; recommender systems; social networking (online); Bright kite; Go Walla; STT; Twitter; location recommendation; mobile social media; mobile user behavior mining; single probabilistic model; spatio-temporal topic modeling; user data check-ins; Accuracy; Cities and towns; Data models; Entropy; Indexes; Random variables; Twitter; location recommendation; spatio-temporal; topic model

118. Active Query Driven by Uncertainty and Diversity for Incremental Multi-label Learning.

【Paper Link】【Pages】:1079-1084

【Authors】: Sheng-Jun Huang ; Zhi-Hua Zhou

【Abstract】: In multi-label learning, it is rather expensive to label instances since they are simultaneously associated with multiple labels. Therefore, active learning, which reduces the labeling cost by actively querying the labels of the most valuable data, becomes particularly important for multi-label learning. A strong multi-label active learning algorithm usually consists of two crucial elements: a reasonable criterion to evaluate the gain of queried label, and an effective classification model, based on whose prediction the criterion can be accurately computed. In this paper, we first introduce an effective multi-label classification model by combining label ranking with threshold learning, which is incrementally trained to avoid retraining from scratch after every query. Based on this model, we then propose to exploit both uncertainty and diversity in the instance space as well as the label space, and actively query the instance-label pairs which can improve the classification model most. Experimental results demonstrate the superiority of the proposed approach to state-of-the-art methods.

【Keywords】: learning (artificial intelligence); pattern classification; query processing; active query; classification model; diversity; incremental multilabel learning; instance space; instance-label pairs; label instances; label ranking; label space; labeling cost reduction; multilabel active learning algorithm; reasonable criterion; threshold learning; uncertainty; Computational modeling; Data models; Diversity reception; Measurement uncertainty; Prediction algorithms; Uncertainty; Vectors; active learning; diversity; multi-label learning; uncertainty

119. Accelerating Active Learning with Transfer Learning.

【Paper Link】【Pages】:1085-1090

【Authors】: David C. Kale ; Yan Liu

【Abstract】: Active learning, transfer learning, and related techniques are unified by a core theme: efficient and effective use of available data. Active learning offers scalable solutions for building effective supervised learning models while minimizing annotation effort. Transfer learning utilizes existing labeled data from one task to help learning related tasks for which limited labeled data are available. There has been limited research, however, on how to combine these two techniques. In this paper, we present a simple and principled transfer active learning framework that leverages pre-existing labeled data from related tasks to improve the performance of an active learner. We derive an intuitive bound on generalization error for the classifiers learned by this algorithm that provides insight into the algorithm's behavior and the problem in general. Experimental results using several well-known transfer learning data sets confirm our theoretical analysis and demonstrate the effectiveness of our approach.

【Keywords】: learning (artificial intelligence); pattern classification; active learning; classifiers generalization error; labeled data; supervised learning models; transfer learning; Acceleration; Algorithm design and analysis; Labeling; Query processing; Supervised learning; Training; Upper bound; Active Learning; Learning Theory; Machine Learning; Transfer Learning

120. Statistical Inference of Protein "LEGO Bricks".

【Paper Link】【Pages】:1091-1096

【Authors】: Arun Siddharth Konagurthu ; Lloyd Allison ; David Abramson ; Peter J. Stuckey ; Arthur M. Lesk

【Abstract】: Proteins are biomolecules of life. They fold into a great variety of three-dimensional (3D) shapes. Underlying these folding patterns are many recurrent structural fragments or building blocks (analogous to 'LEGO® bricks'). This paper reports an innovative statistical inference approach to discover a comprehensive dictionary of protein structural building blocks from a large corpus of experimentally determined protein structures. Our approach is built on the Bayesian and information theoretic criterion of minimum message length. To the best of our knowledge, this work is the first systematic and rigorous treatment of a very important data mining problem that arises in the cross-disciplinary area of structural bioinformatics. The quality of the dictionary we find is demonstrated by its explanatory power - any protein within the corpus of known 3D structures can be dissected into successive regions assigned to fragments from this dictionary. This induces a novel one-dimensional representation of three-dimensional protein folding patterns, suitable for application of the rich repertoire of character-string processing algorithms, for rapid identification of folding patterns of newly determined structures. This paper presents the details of the methodology used to infer the dictionary of building blocks, and is supported by illustrative examples to demonstrate its effectiveness and utility.

【Keywords】: Bayes methods; bioinformatics; data mining; dictionaries; information theory; proteins; statistical analysis; 3D shape; 3D structures; Bayesian criterion; biomolecules; character-string processing algorithm; data mining problem; information theoretic criterion; minimum message length; one-dimensional representation; protein LEGO bricks; protein structural building block dictionary; recurrent structural fragments; statistical inference; structural bioinformatics; three-dimensional protein folding patterns; three-dimensional shape; Amino acids; Dictionaries; Educational institutions; Encoding; Equations; Proteins; Three-dimensional displays; information theoretic inference; minimum message length; protein structure

121. A Probabilistic Behavior Model for Discovering Unrecognized Knowledge.

【Paper Link】【Pages】:1097-1102

【Authors】: Takeshi Kurashima ; Tomoharu Iwata ; Noriko Takaya ; Hiroshi Sawada

【Abstract】: Discovering interesting behavior patterns and profiles of users as they interact with E-commerce (EC) sites is an important task for site managers. We propose a probabilistic behavior model for extracting latent classes of items that impact the users' item selections but cannot be inferred from the current knowledge of the managers. The proposed model assumes that the current knowledge is represented by categories of items that are defined in the EC site, and a user selects items depending on both of their categories and latent classes. By estimating latent classes, each of which shows items accessed by users with common interests, we can find interesting factors for explaining user behavior. We evaluate our proposed model using item-access log data observed in an EC site. The results show that our model can accurately predict users' item selection, and actually discover latent classes of items having similar latent characteristic such as "colored design" and "impression" by using item categories such as "coat" and "hat" as the current knowledge of the managers.

【Keywords】: data mining; electronic commerce; probability; EC sites; behavior pattern discovery; e-commerce sites; item latent class extraction; item-access log data; probabilistic behavior model; unrecognized knowledge discovery; user behavior; user item selection; user profile; Data mining; Data models; Entropy; Footwear; Kernel; Predictive models; Probabilistic logic; behavir model; topic model

【Paper Link】【Pages】:1103-1108

【Authors】: Sejeong Kwon ; Meeyoung Cha ; Kyomin Jung ; Wei Chen ; Yajun Wang

【Abstract】: The problem of identifying rumors is of practical importance especially in online social networks, since information can diffuse more rapidly and widely than the offline counterpart. In this paper, we identify characteristics of rumors by examining the following three aspects of diffusion: temporal, structural, and linguistic. For the temporal characteristics, we propose a new periodic time series model that considers daily and external shock cycles, where the model demonstrates that rumor likely have fluctuations over time. We also identify key structural and linguistic differences in the spread of rumors and non-rumors. Our selected features classify rumors with high precision and recall in the range of 87% to 92%, that is higher than other states of the arts on rumor classification.

【Keywords】: social networking (online); time series; daily shock cycles; external shock cycles; linguistic differences; online social media; online social networks; periodic time series model; rumor classification; rumor propagation; structural differences; temporal characteristics; Adaptation models; Electric shock; Mathematical model; Pragmatics; Psychology; Time series analysis; Twitter; Diffusion Network; Rumor; Sentiment Analysis; Social Media; Time Series

123. Group Feature Selection with Streaming Features.

【Paper Link】【Pages】:1109-1114

【Authors】: Hai-Guang Li ; Xindong Wu ; Zhao Li ; Wei Ding

【Abstract】: Group feature selection makes use of structural information among features to discover a meaningful subset of features. Existing group feature selection algorithms only deal with pre-given candidate feature sets and they are incapable of handling streaming features. On the other hand, feature selection algorithms targeted for streaming features can only perform at the individual feature level without considering intrinsic group structures of the features. In this paper, we perform group feature selection with streaming features. We propose to perform feature selection at the group and individual feature levels simultaneously in a manner of a feature stream rather than a pre-given candidate feature set. In our approach, the group structures are fully utilized to reduce the cost of evaluating streaming features. We have extensively evaluated the proposed method. Experimental results have demonstrated that our proposed algorithms statistically outperform state-of-the-art methods of feature selection in terms of classification accuracy.

【Keywords】: feature selection; pattern classification; statistical analysis; GFSSF; group feature selection with streaming features; group structures; intrinsic group structures; streaming feature evaluation; streaming feature handling; structural information; Accuracy; Entropy; Mutual information; Redundancy; Standards; Training; Uncertainty

124. Bayesian Multi-Task Relationship Learning with Link Structure.

【Paper Link】【Pages】:1115-1120

【Authors】: Yingming Li ; Ming Yang ; Zhongang Qi ; Zhongfei (Mark) Zhang

【Abstract】: In this paper, we study the multi-task learning problem with a new perspective of considering the link structure of data and task relationship modeling simultaneously. In particular, we first introduce the Matrix Generalized Inverse Gaussian (MGIG) distribution and define a Matrix Gaussian Matrix Generalized Inverse Gaussian (MG-MGIG) prior. Based on this prior, we propose a novel multi-task learning algorithm, the Bayesian Multi-task Relationship Learning (BMTRL) algorithm. To incorporate the link structure into the framework of BMTRL, we propose link constraints between samples. Through combining the BMTRL algorithm with the link constraints, we propose the Bayesian Multi-task Relationship Learning with Link Constraints (BMTRL-LC) algorithm. To make the computation tractable, we simultaneously use a convex optimization method and sampling techniques. In particular, we adopt two stochastic EM algorithms for BMTRL and BMTRL-LC, respectively. The experimental results on Cora dataset demonstrate the promise of the proposed algorithms.

【Keywords】: Gaussian distribution; convex programming; data mining; data structures; inverse problems; learning (artificial intelligence); matrix algebra; sampling methods; BMTRL-LC algorithm; Bayesian multitask relationship learning with link constraints; Cora dataset; MG-MGIG; convex optimization method; data link structure; matrix Gaussian matrix generalized inverse Gaussian; multitask learning problem; sampling techniques; stochastic EM algorithms; task relationship modeling; Algorithm design and analysis; Bayes methods; Covariance matrices; Gaussian distribution; Kernel; Machine learning algorithms; Optimization; Link Structure; Multi-task Learning; Task Relationship Modeling

125. Quantitative Prediction of Glaucomatous Visual Field Loss from Few Measurements.

【Paper Link】【Pages】:1121-1126

【Authors】: Zenghan Liang ; Ryota Tomioka ; Hiroshi Murata ; Ryo Asaoka ; Kenji Yamanishi

【Abstract】: We propose database-aware regression methods for extrapolation from few measurements in the context of quantitative prognosis. The idea is to leverage a database of patients with similar conditions to increase the effective number of samples when we train a predictive model. Applying the proposed method to a database of glaucoma patients, we were able to predict the disease condition at a future time point significantly more accurately than the conventional patient-wise linear regression approach. In fact, our prediction was 50% more accurate than the conventional approach when three or less measurements were available and with only two measurements at least as accurate as the conventional approach with six measurements. Moreover, the proposed method can provide spatially localized prediction and also the (localized) speed of progression, which are valuable for doctors in making decisions.

【Keywords】: diseases; extrapolation; patient treatment; regression analysis; database-aware regression method; decision making; extrapolation; glaucoma patients database; glaucomatous visual field loss quantitative prediction; quantitative prognosis; Diseases; Extrapolation; Linear regression; Predictive models; Principal component analysis; Vectors; Visualization; Clustering; Extrapolation; Multi-task learning; Quantitative prognosis; Spatio-temporal data

126. On the Feature Discovery for App Usage Prediction in Smartphones.

【Paper Link】【Pages】:1127-1132

【Authors】: Zhung-Xun Liao ; Shou-Chung Li ; Wen-Chih Peng ; Philip S. Yu ; Te-Chuan Liu

【Abstract】: With the increasing number of mobile Apps developed, they are now closely integrated into daily life. In this paper, we develop a framework to predict mobile Apps that are most likely to be used regarding the current device status of a smartphone. Such an Apps usage prediction framework is a crucial prerequisite for fast App launching, intelligent user experience, and power management of smartphones. By analyzing real App usage log data, we discover two kinds of features: The Explicit Feature (EF) from sensing readings of built-in sensors, and the Implicit Feature (IF) from App usage relations. The IF feature is derived by constructing the proposed App Usage Graph (abbreviated as AUG) that models App usage transitions. In light of AUG, we are able to discover usage relations among Apps. Since users may have different usage behaviors on their smartphones, we further propose one personalized feature selection algorithm. We explore minimum description length (MDL) from the training data and select those features which need less length to describe the training data. The personalized feature selection can successfully reduce the log size and the prediction time. Finally, we adopt the kNN classification model to predict Apps usage. Note that through the features selected by the proposed personalized feature selection algorithm, we only need to keep these features, which in turn reduces the prediction time and avoids the curse of dimensionality when using the kNN classifier. The results based on a real dataset demonstrate the effectiveness of the proposed framework and show the predictive capability for App usage prediction.

【Keywords】: feature selection; graph theory; mobile computing; pattern classification; power aware computing; smart phones; AUG; App usage graph; App usage log data analysis; App usage transitions; IF feature; MDL; built-in sensors; curse of dimensionality; explicit feature; fast App launching; feature discovery; implicit feature; intelligent user experience; kNN classification model; kNN classifier; log size reduction; minimum description length; mobile App usage prediction framework; personalized feature selection algorithm; prediction time reduction; smartphones; training data; Equations; Prediction algorithms; Sensors; Smart phones; Testing; Training; Training data; Mobile Application; Usage prediction; classification; mobile Apps

127. How Many Zombies Around You?

【Paper Link】【Pages】:1133-1138

【Authors】: Hongfu Liu ; Yuchao Zhang ; Hao Lin ; Junjie Wu ; Zhiang Wu ; Xu Zhang

【Abstract】: Recent years have witnessed the explosive growth of online social media. Weibo, a famous "Chinese Twitter", has attracted over half billion users in less than four years. Among them are zombie users or bogus users, who are seemingly active common users but actually marionettes manipulated by intelligent software for economic interests. To probe such users thus becomes critically important for a healthy Weibo, but the existing studies along this line are still in initial stage due to the serious lack of labeled zombies and the limited attributes for user profiling. In light of this, in this paper, we figure out a commercial way for training set labeling, and propose a two-stage cascading model called ProZombie for zombie user recognition. ProZombie decomposes the training/predicting process into fast and refined phases in cascade, which greatly improves the modeling efficiency without sacrificing the accuracy. Moreover, 35 attributes including 16 newly proposed ones are employed for a panoramic description of Weibo users. Experiments on real-world labeled Weibo users demonstrate the effectiveness and efficiency of ProZombie. More interestingly, two case studies based on ProZombie successfully unveil the zombies hidden around common users, and their impact to information propagation on Weibo. To our best knowledge, this study is among the first to quantify these interesting observations on Weibo.

【Keywords】: data mining; social networking (online); Chinese Twitter; ProZombie; Weibo user panoramic description; active common users; bogus users; intelligent software; labeled Weibo users; online social media; predicting process; training process; training set labeling; two-stage cascading model; user profiling; zombie user recognition; Accuracy; Computational modeling; Economics; Media; Predictive models; Read only memory; Training; Cascading Model; Social Media; Weibo; Zombie

128. Hibernating Process: Modelling Mobile Calls at Multiple Scales.

【Paper Link】【Pages】:1139-1144

【Authors】: Siyuan Liu ; Lei Li ; Rammaya Krishnan

【Abstract】: Do mobile phone calls at larger granularities behave in the same pattern as in smaller ones? How can we forecast the distribution of a whole month's phone calls with only one day's observation? There are many models developed to interpret large scale social graphs. However, all of the existing models focus on graph at one time scale. Many dynamical behaviors were either ignored, or handled at one scale. In particular new users might join or current users quit social networks at any time. In this paper, we propose HiP, a novel model to capture longitudinal behaviors in modeling degree distribution of evolving social graphs. We analyze a large scale phone call dataset using HiP, and compare with several previous models in literature. Our model is able to fit phone call distribution at multiple scales with 30% to 75% improvement over the best existing method on each scale.

【Keywords】: graph theory; mobile computing; social sciences computing; hibernating process; large scale phone call dataset; large scale social graphs; social graphs; social networks; Data models; Hip; Mobile communication; Mobile computing; Mobile handsets; Parametric statistics; Social network services; Mobile phone call graph; churning behavior; heavy tailed distribution; non-parametric model

129. On Good and Fair Paper-Reviewer Assignment.

【Paper Link】【Pages】:1145-1150

【Authors】: Cheng Long ; Raymond Chi-Wing Wong ; Yu Peng ; Liangliang Ye

【Abstract】: Peer review has become the most common practice for judging papers submitted to a conference for decades. An extremely important task involved in peer review is to assign submitted papers to reviewers with appropriate expertise which is referred to as paper-reviewer assignment. In this paper, we study the paper-reviewer assignment problem from both the goodness aspect and the fairness aspect. For the goodness aspect, we propose to maximize the topic coverage of the paper-reviewer assignment. This objective is new and the problem based on this objective is shown to be NP-hard. To solve this problem efficiently, we design an approximate algorithm which gives a 1/3-approximation. For the fairness aspect, we perform a detailed study on conflict-of-interest (COI) types and discuss several issues related to using COI, which, we hope, can raise some open discussions among researchers on the COI study. Finally, we conducted experiments on real datasets which verified the effectiveness of our algorithm and also revealed some interesting results of COI.

【Keywords】: computational complexity; optimisation; query processing; text analysis; 1/3-approximation; COI types; NP-hard problem; approximate algorithm; conflict-of-interest types; fair paper-reviewer assignment paper; good paper-reviewer assignment; peer review; query; text document; Algorithm design and analysis; Approximation algorithms; Approximation methods; Bipartite graph; Data mining; Greedy algorithms; Linear programming; COI study; Paper reviewer assignment; topic coverage

130. Community Detection in Networks with Node Attributes.

【Paper Link】【Pages】:1151-1156

【Authors】: Jaewon Yang ; Julian J. McAuley ; Jure Leskovec

【Abstract】: Community detection algorithms are fundamental tools that allow us to uncover organizational principles in networks. When detecting communities, there are two possible sources of information one can use: the network structure, and the features and attributes of nodes. Even though communities form around nodes that have common edges and common attributes, typically, algorithms have only focused on one of these two data modalities: community detection algorithms traditionally focus only on the network structure, while clustering algorithms mostly consider only node attributes. In this paper, we develop Communities from Edge Structure and Node Attributes (CESNA), an accurate and scalable algorithm for detecting overlapping communities in networks with node attributes. CESNA statistically models the interaction between the network structure and the node attributes, which leads to more accurate community detection as well as improved robustness in the presence of noise in the network structure. CESNA has a linear runtime in the network size and is able to process networks an order of magnitude larger than comparable approaches. Last, CESNA also helps with the interpretation of detected communities by finding relevant node attributes for each community.

【Keywords】: network theory (graphs); pattern clustering; social networking (online); statistical analysis; CESNA; clustering algorithms; community detection algorithms; community from edge structure and node attributes; data modalities; network size; network structure; organizational principles; overlapping community detection; statistical modelling; Communities; Electronic publishing; Encyclopedias; Facebook; Image edge detection; Logistics; Community detection; Network communities; Node attributes; Overlapping community detection

【Paper Link】【Pages】:1157-1162

【Authors】: Meenakshi Mishra ; Jun Huan

【Abstract】: Multitask learning has been thoroughly proven to improve the generalization performance given a set of related tasks. Most multitask learning algorithm assume that all tasks are related. However, if all the tasks are not related, negative transfer of information occurs amongst the tasks, and the performance of traditional multitask learning algorithm worsens. Thus, we design an algorithm that simultaneously groups the related tasks and trains only the related task together. There are different approaches to train the related tasks in multi-task learning based on which information is shared across the tasks. These approaches either assume that the parameters of each of the tasks are situated close together, or assume that there is a common underlying latent space in the features of the tasks that is related. Most multi-task learning algorithm use either regularization method or matrix-variate priors. In our algorithm, the related tasks are tied together by a set of common features selected by each tasks. Thus, to train the related tasks together, we use spike and slab prior to select a common set of features for the related tasks, and a mixture of gaussians prior to select the set of related tasks. For validation, the developed algorithm is tested on toxicity prediction and hand written digit recognition data sets. The results show a significant improvement over multitask learning with feature selection for larger number of tasks. Further, the developed algorithm is also compared against another state of the art algorithm that similarly groups the related tasks together and proven to be better and more accurate.

【Keywords】: Gaussian processes; feature extraction; handwritten character recognition; learning (artificial intelligence); mixture models; pattern classification; toxicology; Gaussian mixture; feature selection; generalization performance; hand written digit recognition data sets; information sharing; matrix-variate prior; multitask learning algorithm; regularization method; slab; spike; toxicity prediction; Bayes methods; Chemicals; Equations; Mathematical model; Slabs; Training; Vectors; Expectation Propagation; Groups of Tasks; Mixture of Gaussian Prior; Multi-task Learning; Spike and Slab Prior

132. Network Hypothesis Testing Using Mixed Kronecker Product Graph Models.

【Paper Link】【Pages】:1163-1168

【Authors】: Sebastián Moreno ; Jennifer Neville

【Abstract】: The recent interest in networks-social, physical, communication, information, etc.-has fueled a great deal of research on the analysis and modeling of graphs. However, many of the analyses have focused on a single large network (e.g., a sub network sampled from Facebook). Although several studies have compared networks from different domains or samples, they largely focus on empirical exploration of network similarities rather than explicit tests of hypotheses. This is in part due to a lack of statistical methods to determine whether two large networks are likely to have been drawn from the same underlying graph distribution. Research on across-network hypothesis testing methods has been limited by (i) difficulties associated with obtaining a set of networks to reason about the underlying graph distribution, and (ii) limitations of current statistical models of graphs that make it difficult to represent variations across networks. In this paper, we exploit the recent development of mixed-Kronecker Product Graph Models, which accurately capture the natural variation in real world graphs, to develop a model-based approach for hypothesis testing in networks.

【Keywords】: graph theory; network theory (graphs); statistical testing; across-network hypothesis testing methods; graph analysis; graph distribution; graph modeling; large networks; mixed-Kronecker product graph models; statistical methods; Data models; Electronic mail; Facebook; Sociology; Statistics; Testing; Training; Network science; graph models; hypothesis testing

133. Graph Partitioning Change Detection Using Tree-Based Clustering.

【Paper Link】【Pages】:1169-1174

【Authors】: Sho-Ichi Sato ; Kenji Yamanishi

【Abstract】: We are concerned with the issue of detecting changes of graph partitioning structures from a graph sequence. We call this issue GPCD(graph partitioning change detection). The graph partitioning structures may represent network communities. Hence GPCD is important in that it leads to discovery of important events which cause changes of network communities. We introduce a new algorithm for GPCD, denoted as TREE. The key ideas are: 1) we employ probabilistic trees to represent probabilistic models of graph partitioning structures. 2) We then reduce GPCD into the issue of detecting changes of trees on the basis of the minimum description length (MDL) principle. 3) By taking the cost of changes into consideration, we realize significantly less false alarm rates for change detection than the baseline method called Graph Scope. We empirically demonstrate that TREE is able to detect changes more accurately than Graph Scope.

【Keywords】: graph theory; pattern clustering; probability; trees (mathematics); GPCD; GraphScope; MDL principle; baseline method; event discovery; false alarm rates; graph partitioning change detection; graph partitioning structures; graph sequence; minimum description length principle; network communities; probabilistic models; probabilistic trees; tree-based clustering; Bipartite graph; Communities; Complexity theory; Educational institutions; Encoding; Partitioning algorithms; Receivers; dynamic model selection; graph partitioning change detection; minimum description length; tree structures

134. SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model.

【Paper Link】【Pages】:1175-1180

【Authors】: Pavel Senin ; Sergey Malinchik

【Abstract】: In this paper, we propose a novel method for discovering characteristic patterns in a time series called SAX-VSM. This method is based on two existing techniques - Symbolic Aggregate approximation and Vector Space Model. SAX-VSM automatically discovers and ranks time series patterns by their "importance" to the class, which not only facilitates well-performing classification procedure, but also provides an interpretable class generalization. The accuracy of the method, as shown through experimental evaluation, is at the level of the current state of the art. While being relatively computationally expensive within a learning phase, our method provides fast, precise, and interpretable classification.

【Keywords】: computational complexity; data mining; learning (artificial intelligence); pattern classification; symbol manipulation; time series; SAX-VSM; automatic time series pattern discovery; automatic time series pattern ranking; characteristic pattern discovery; interpretable class generalization; interpretable time series classification; symbolic aggregate approximation; time series pattern ranking; vector space model; Accuracy; Approximation algorithms; Approximation methods; Euclidean distance; Time series analysis; Training; Vectors; classification algorithms; time series analysis

135. Clustering on Multiple Incomplete Datasets via Collective Kernel Learning.

【Paper Link】【Pages】:1181-1186

【Authors】: Weixiang Shao ; Xiaoxiao Shi ; Philip S. Yu

【Abstract】: Multiple datasets containing different types of features may be available for a given task. For instance, users' profiles can be used to group users for recommendation systems. In addition, a model can also use users' historical behaviors and credit history to group users. Each dataset contains different information and suffices for learning. A number of clustering algorithms on multiple datasets were proposed during the past few years. These algorithms assume that at least one dataset is complete. So far as we know, all the previous methods will not be applicable if there is no complete dataset available. However, in reality, there are many situations where no dataset is complete. As in building a recommendation system, some new users may not have profiles or historical behaviors, while some may not have credit history. Hence, no available dataset is complete. In order to solve this problem, we propose an approach called Collective Kernel Learning to infer hidden sample similarity from multiple incomplete datasets. The idea is to collectively completes the kernel matrices of incomplete datasets by optimizing the alignment of shared instances of the datasets. Furthermore, a clustering algorithm is proposed based on the kernel matrix. The experiments on both synthetic and real datasets demonstrate the effectiveness of the proposed approach. The proposed clustering algorithm outperforms the comparison algorithms by as much as two times in normalized mutual information.

【Keywords】: learning (artificial intelligence); pattern clustering; recommender systems; collective kernel learning; credit history; incomplete dataset kernel matrices; multiple incomplete dataset clustering; normalized mutual information; real dataset; recommendation systems; synthetic dataset; user historical behavior; user profiles; Algorithm design and analysis; Clustering algorithms; Correlation; Equations; Kernel; Laplace equations; Optimization

136. MLI: An API for Distributed Machine Learning.

【Paper Link】【Pages】:1187-1192

【Authors】: Evan R. Sparks ; Ameet Talwalkar ; Virginia Smith ; Jey Kottalam ; Xinghao Pan ; Joseph E. Gonzalez ; Michael J. Franklin ; Michael I. Jordan ; Tim Kraska

【Abstract】: MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

【Keywords】: application program interfaces; distributed algorithms; learning (artificial intelligence); API; MLI; application programming interface; data-centric computing; distributed algorithm; distributed machine learning; high-performance algorithm; Computational modeling; Logistics; MATLAB; Mathematical model; Sparks; Vectors; distributed computing; machine learning; programming interface

137. Co-ClusterD: A Distributed Framework for Data Co-Clustering with Sequential Updates.

【Paper Link】【Pages】:1193-1198

【Authors】: Sen Su ; Xiang Cheng ; Lixin Gao ; Jiangtao Yin

【Abstract】: Co-clustering is a powerful data mining tool for co-occurrence and dyadic data. As data sets become increasingly large, the scalability of co-clustering becomes more and more important. In this paper, we propose two approaches to parallelize co-clustering with sequential updates in a distributed environment. Based on these two approaches, we present a new distributed framework, Co-ClusterD, that supports efficient implementations of co-clustering algorithms with sequential updates. We design and implement Co-ClusterD, and show its efficiency through two co-clustering algorithms: fast nonnegative matrix tri-factorization (FNMTF) and information theoretic co-clustering (ITCC). We evaluate our framework on both a local cluster of machines and the Amazon EC2 cloud. Our evaluation shows that co-clustering algorithms implemented in Co-ClusterD can achieve better results and run faster than their traditional concurrent counterparts.

【Keywords】: data mining; distributed processing; information theory; matrix decomposition; pattern clustering; Amazon EC2 cloud; Co-ClusterD; FNMTF; ITCC; coclustering parallelization approach; data coclustering algorithm; distributed environment; distributed framework; dyadic data; fast nonnegative matrix trifactorization; information theoretic coclustering; sequential updates; Algorithm design and analysis; Clustering algorithms; Convergence; Distributed databases; Linear programming; Scalability; Synchronization; Cloud Computing; Co-Clustering; Concurrent Updates; Distributed Framework; Sequential Updates

138. Non-negative Multiple Tensor Factorization.

【Paper Link】【Pages】:1199-1204

【Authors】: Koh Takeuchi ; Ryota Tomioka ; Katsuhiko Ishiguro ; Akisato Kimura ; Hiroshi Sawada

【Abstract】: Non-negative Tensor Factorization (NTF) is a widely used technique for decomposing a non-negative value tensor into sparse and reasonably interpretable factors. However, NTF performs poorly when the tensor is extremely sparse, which is often the case with real-world data and higher-order tensors. In this paper, we propose Non-negative Multiple Tensor Factorization (NMTF), which factorizes the target tensor and auxiliary tensors simultaneously. Auxiliary data tensors compensate for the sparseness of the target data tensor. The factors of the auxiliary tensors also allow us to examine the target data from several different aspects. We experimentally confirm that NMTF performs better than NTF in terms of reconstructing the given data. Furthermore, we demonstrate that the proposed NMTF can successfully extract spatio-temporal patterns of people's daily life such as leisure, drinking, and shopping activity by analyzing several tensors extracted from online review data sets.

【Keywords】: data analysis; matrix decomposition; tensors; auxiliary data tensors; data analysis; higher-order tensors; nonnegative multiple tensor factorization; target data tensor sparseness; Business; Geology; Matrix decomposition; Motion pictures; Probabilistic logic; Sparse matrices; Tensile stress; Machine Learning; Non-negative Tensor Factorization; Spatio-Temporal Pattern

139. Multiclass Semi-Supervised Boosting Using Similarity Learning.

【Paper Link】【Pages】:1205-1210

【Authors】: Jafar Tanha ; Mohammad Javad Saberian ; Maarten van Someren

【Abstract】: In this paper, we consider the multiclass semi-supervised classification problem. A boosting algorithm is proposed to solve the multiclass problem directly. The proposed multiclass approach uses a new multiclass loss function, which includes two terms. The first term is the cost of the multiclass margin and the second term is a regularization term on unlabeled data. The regularization term is used to minimize the inconsistency between the pair wise similarity and the classifier predictions. It assigns the soft labels weighted with the similarity between unlabeled and labeled examples. We then derive a boosting algorithm, named CD-MSSBoost, from the proposed loss function using coordinate gradient descent. The derived algorithm is further used for learning optimal similarity function for a given data. Our experiments on a number of UCI datasets show that CD-MSSBoost outperforms the state-of-the-art methods to multiclass semi-supervised learning.

【Keywords】: learning (artificial intelligence); pattern classification; CD-MSSBoost; UCI datasets; coordinate gradient descent; multiclass loss function; multiclass margin; multiclass semisupervised boosting algorithm; multiclass semisupervised classification problem; multiclass semisupervised learning; optimal similarity function learning; regularization term; unlabeled data; Algorithm design and analysis; Boosting; Optimization; Prediction algorithms; Semisupervised learning; Training; Boosting; Multiclass classification; Semi-Supervised Learning; Similarity learning

140. Mining Dependent Frequent Serial Episodes from Uncertain Sequence Data.

【Paper Link】【Pages】:1211-1216

【Authors】: Li Wan ; Ling Chen ; Chengqi Zhang

【Abstract】: In this paper, we focus on the problem of mining Probabilistic Dependent Frequent Serial Episodes (P-DFSEs) from uncertain sequence data. By observing that the frequentness probability of an episode in an uncertain sequence is a Markov Chain imbeddable variable, we first propose an Embeded Markov Chain-based algorithm that efficiently computes the frequentness probability of an episode by projecting the probability space into a set of limited partitions. To further improve the computation efficiency, we devise an optimized approach that prunes candidate episodes early by estimating the upper bound of their frequentness probabilities.

【Keywords】: Markov processes; data mining; P-DFSE mining; embeded Markov chain-based algorithm; episode frequentness probability; probabilistic dependent frequent serial episodes mining; probability space; uncertain sequence data; Automata; Data mining; Electromagnetic compatibility; Heuristic algorithms; Markov processes; Probabilistic logic

141. Efficient and Scalable Information Geometry Metric Learning.

【Paper Link】【Pages】:1217-1222

【Authors】: Wei Wang ; Bao-Gang Hu ; Zengfu Wang

【Abstract】: Information Geometry Metric Learning (IGML) is shown to be an effective algorithm for distance metric learning. In this paper, we attempt to alleviate two limitations of IGML: (A) the time complexity of IGML increases rapidly for high dimensional data, (B) IGML has to transform the input low rank kernel into a full-rank one since it is undefined for singular matrices. To this end, two novel algorithms, referred to as Efficient Information Geometry Metric Learning (EIGML) and Scalable Information Geometry Metric Learning (SIGML), are proposed. EIGML scales linearly with the dimensionality, resulting in significantly reduced computational complexity. As for SIGML, it is proven to have a range-space preserving property. Following this property, SIGML is found to be capable of handling both full-rank and low-rank kernels. Additionally, the geometric information from data is further exploited in SIGML. In contrast to most existing metric learning methods, both EIGML and SIGML have closed-form solutions and can be efficiently optimized. Experimental results on various data sets demonstrate that the proposed methods outperform the state-of-the-art metric learning algorithms.

【Keywords】: computational complexity; learning (artificial intelligence); matrix algebra; EIGML; SIGML; computational complexity reduction; distance metric learning; efficient information geometry metric learning; full-rank kernel; high dimensional data; low rank kernel; range-space preserving property; scalable information geometry metric learning; singular matrices; time complexity; Computational complexity; Covariance matrices; Eigenvalues and eigenfunctions; Information geometry; Kernel; Learning systems; Measurement; Information Geometry Metric Learning; Mahalanobis distance learning; closed-form solution; range-space preserving

142. Exploring Patient Risk Groups with Incomplete Knowledge.

【Paper Link】【Pages】:1223-1228

【Authors】: Xiang Wang ; Fei Wang ; Jun Wang ; Buyue Qian ; Jianying Hu

【Abstract】: Patient risk stratification, which aims to stratify a patient cohort into a set of homogeneous groups according to some risk evaluation criteria, is an important task in modern medical informatics. Good risk stratification is the key to good personalized care plan design and delivery. The typical procedure for risk stratification is to first identify a set of risk-relevant medical features (also called risk factors), and then construct a predictive model to estimate the risk scores for individual patients. However, due to the heterogeneity of patients' clinical conditions, the risk factors and their importance vary across different patient groups. Therefore a better approach is to first segment the patient cohort into a set of homogeneous groups with consistent clinical conditions, namely risk groups, and then develop group-specific risk prediction models. In this paper, we propose RISGAL (RISk Group Analysis), a novel semi-supervised learning framework for patient risk group exploration. Our method segments a patient similarity graph into a set of risk groups such that some risk groups are in alignment with (incomplete) prior knowledge from the domain experts while the remaining groups reveal new knowledge from the data. Our method is validated on public benchmark datasets as well as a real electronic medical record database to identify risk groups from a set of potential Congestive Heart Failure (CHF) patients.

【Keywords】: electronic health records; graph theory; learning (artificial intelligence); medical computing; patient care; risk analysis; CHF; RISGAL; congestive heart failure patients; domain experts; electronic medical record database; group-specific risk prediction models; homogeneous groups; incomplete knowledge; medical informatics; patient clinical condition heterogeneity; patient cohort; patient risk group exploration; patient risk stratification; patient similarity graph; personalized care plan design; risk evaluation criteria; risk factors; risk group analysis; risk scores; risk-relevant medical features; semisupervised learning framework; Accuracy; Benchmark testing; Clustering algorithms; Diseases; Heart; Medical diagnostic imaging; Semisupervised learning; Electronic Medical Records; Patient Risk Stratification; Risk Group Analysis; Semi-Supervised Learning

143. Classification-Based Clustering Evaluation.

【Paper Link】【Pages】:1229-1234

【Authors】: John S. Whissell ; Charles L. A. Clarke

【Abstract】: The evaluation of clustering quality has proven to be a difficult task. While it is generally agreed that application specific human assessment can provide a reasonable gold standard for clustering evaluation, the use of human assessors is not practical in many real situations. As a result, machine computable internal clustering quality measures (CQMs) are often used in the evaluation process. However, CQMs have their own drawbacks. Despite their extensive use in clustering research and applications, many CQMs have been shown to lack generality. In this paper we present a new CQM with general applicability. The basis of our CQM is a pattern recognition view of clustering's purpose: the unsupervised prediction of behavior from populations. This purpose translates naturally into our new classifier based CQM which we refer to as in formativeness. We show that in formativeness can satisfy core CQM axioms defined in prior research. Additionally, we provide experimental support, showing that in formativeness can outperform many established CQMs by detecting a larger variety of meaningful structures across a range of synthetic datasets, while at the same time exhibiting good performance on each individual dataset. Our results indicate that in formativeness provides a highly general and effective CQM.

【Keywords】: human factors; pattern classification; pattern clustering; CQM axioms; application-specific human assessment; classification-based clustering evaluation; clustering quality evaluation; human assessors; machine computable internal clustering quality measures; pattern recognition; synthetic datasets; unsupervised behavior prediction; Clustering algorithms; Estimation; Euclidean distance; Labeling; Radio frequency; Sociology; Statistics; clustering method

144. Efficient Online Sequence Prediction with Side Information.

【Paper Link】【Pages】:1235-1240

【Authors】: Han Xiao ; Claudia Eckert

【Abstract】: Sequence prediction is a key task in machine learning and data mining. It involves predicting the next symbol in a sequence given its previous symbols. Our motivating application is predicting the execution path of a process on an operating system in real-time. In this case, each symbol in the sequence represents a system call accompanied with arguments and a return value. We propose a novel online algorithm for predicting the next system call by leveraging both context and side information. The online update of our algorithm is efficient in terms of time cost and memory consumption. Experiments on real-world data sets showed that our method outperforms state-of-the-art online sequence prediction methods in both accuracy and efficiency, and incorporation of side information does significantly improve the predictive accuracy.

【Keywords】: data mining; learning (artificial intelligence); operating systems (computers); arguments; context information; data mining; machine learning; memory consumption; online algorithm; online sequence prediction; operating system; process execution path; return value; side information; system call; time cost; Accuracy; Context; Error analysis; Memory management; Prediction algorithms; Predictive models; Vectors; online learning; scalability; sequence predictio; system trace

145. Multilabel Consensus Classification.

【Paper Link】【Pages】:1241-1246

【Authors】: Sihong Xie ; Xiangnan Kong ; Jing Gao ; Wei Fan ; Philip S. Yu

【Abstract】: In the era of big data, a large amount of noisy and incomplete data can be collected from multiple sources for prediction tasks. Combining multiple models or data sources helps to counteract the effects of low data quality and the bias of any single model or data source, and thus can improve the robustness and the performance of predictive models. Out of privacy, storage and bandwidth considerations, in certain circumstances one has to combine the predictions from multiple models or data sources without accessing the raw data. Consensus-based prediction combination algorithms are effective for such situations. However, current research on prediction combination focuses on the single label setting, where an instance can have one and only one label. Nonetheless, data nowadays are usually multilabeled, such that more than one label have to be predicted at the same time. Direct applications of existing prediction combination methods to multilabel settings can lead to degenerated performance. In this paper, we address the challenges of combining predictions from multiple multilabel classifiers and propose two novel algorithms, MLCM-r (MultiLabel Consensus Maximization for ranking) and MLCM-a (MLCM for microAUC). These algorithms can capture label correlations that are common in multilabel classifications, and optimize corresponding performance metrics. Experimental results on popular multilabel classification tasks verify the theoretical analysis and effectiveness of the proposed methods.

【Keywords】: Big Data; learning (artificial intelligence); optimisation; pattern classification; MLCM for microAUC; MLCM-a; MLCM-r; bandwidth considerations; big data; consensus-based prediction combination algorithms; data quality; data sources; incomplete data; label correlations; multilabel classifiers; multilabel consensus classification; multilabel consensus maximization for ranking; noisy data; performance metrics; prediction tasks; predictive model robustness; single label setting; storage considerations; Algorithm design and analysis; Bipartite graph; Correlation; Data models; Measurement; Prediction algorithms; Predictive models; ensemble; multilabel classification

146. Sampling Heterogeneous Networks.

【Paper Link】【Pages】:1247-1252

【Authors】: Cheng-Lun Yang ; Perng-Hwa Kung ; Cheng-Te Li ; Chun-An Chen ; Shou-De Lin

【Abstract】: Online social networks are mainly characterized by large-scale and heterogeneous semantic relationships. Unfortunately, for online social network services such as Facebook or Twitter, it is very difficult to obtain the fully observed network without privilege to access the data internally. To address the above needs, social network sampling is a means that aims at identifying a representative subgraph that preserves certain properties of the network, given the information of any instance in the network is unknown before being sampled. This study tackles heterogeneous network sampling by considering the conditional dependency of node types and link types, where we design a property, Relational Profile, to account such characterization. We further propose a sampling method to preserve this property. Lastly, we propose to evaluate our model from three different angles. First, we show that the proposed sampling method can more faithfully preserve the Relational Profile. Second, we evaluate the usefulness of the Relational Profile showing such information is beneficial for link prediction tasks. Finally, we evaluate whether the networks sampled by our method can be used to train more accurate prediction models comparing to networks produced by other methods.

【Keywords】: graph theory; sampling methods; social networking (online); Facebook; Twitter; heterogeneous network sampling; heterogeneous semantic relationships; large-scale semantic relationships; link prediction tasks; link type conditional dependency; node type conditional dependency; online social network services; relational profile; representative subgraph identification; social network sampling; Equations; Mathematical model; Patents; Predictive models; Sampling methods; Semantics; Social network services; heterogeneous networks; network prediction; network sampling

147. A Model for Discovering Correlations of Ubiquitous Things.

【Paper Link】【Pages】:1253-1258

【Authors】: Lina Yao ; Quan Z. Sheng ; Byron J. Gao ; Anne H. H. Ngu ; Xue Li

【Abstract】: With recent advances in radio-frequency identification (RFID), wireless sensor networks, and Web services, physical things are becoming an integral part of the emerging ubiquitous Web. Correlation discovery for ubiquitous things is critical for many important applications such as things search, recommendation, annotation, classification, clustering, composition, and management. In this paper, we propose a novel approach for discovering things correlation based on user, temporal, and spatial information captured from usage events of things. In particular, we use a spatio-temporal graph and a social graph to model things usage contextual information and user-thing relationships respectively. Then, we apply random walks with restart on these graphs to compute correlations among things. This correlation analysis lays a solid foundation and contributes to improved effectiveness in things management. To demonstrate the utility of our approach, we perform a systematic case study and comprehensive experiments on things annotation.

【Keywords】: graph theory; radiofrequency identification; ubiquitous computing; RFID; Web services; correlation analysis; correlation discovery; radiofrequency identification; social graph; spatio-temporal graph; ubiquitous Web; ubiquitous things; wireless sensor networks; Correlation; Educational institutions; Equations; Feature extraction; Radiofrequency identification; Testing; Ubiquitous things; correlation discovery; random walk with restart

148. Efficiently Mining Top-K High Utility Sequential Patterns.

【Paper Link】【Pages】:1259-1264

【Authors】: Junfu Yin ; Zhigang Zheng ; Longbing Cao ; Yin Song ; Wei Wei

【Abstract】: High utility sequential pattern mining is an emerging topic in the data mining community. Compared to the classic frequent sequence mining, the utility framework provides more informative and actionable knowledge since the utility of a sequence indicates business value and impact. However, the introduction of "utility" makes the problem fundamentally different from the frequency-based pattern mining framework and brings about dramatic challenges. Although the existing high utility sequential pattern mining algorithms can discover all the patterns satisfying a given minimum utility, it is often difficult for users to set a proper minimum utility. A too small value may produce thousands of patterns, whereas a too big one may lead to no findings. In this paper, we propose a novel framework called top-k high utility sequential pattern mining to tackle this critical problem. Accordingly, an efficient algorithm, Top-k high Utility Sequence (TUS for short) mining, is designed to identify top-k high utility sequential patterns without minimum utility. In addition, three effective features are introduced to handle the efficiency problem, including two strategies for raising the threshold and one pruning for filtering unpromising items. Our experiments are conducted on both synthetic and real datasets. The results show that TUS incorporating the efficiency-enhanced strategies demonstrates impressive performance without missing any high utility sequential patterns.

【Keywords】: business data processing; data mining; information filtering; utility theory; TUS mining; business value; data mining community; real datasets; synthetic datasets; top-k high utility sequential pattern mining; unpromising item filtering; Algorithm design and analysis; Business; Data mining; Itemsets; Sequences; Sorting; High utility sequential pattern mining; Top-K sequential pattern mining

149. Efficient Proper Length Time Series Motif Discovery.

【Paper Link】【Pages】:1265-1270

【Authors】: Sorrachai Yingchareonthawornchai ; Haemwaan Sivaraks ; Thanawin Rakthanmanon ; Chotirat Ann Ratanamahatana

【Abstract】: As one of the most essential data mining tasks, finding frequently occurring patterns, i.e., motif discovery, has drawn a lot of attention in the past decade. Despite successes in speedup of motif discovery algorithms, most of the existing algorithms still require predefined parameters. The critical and most cumbersome one is time series motif length since it is difficult to manually determine the proper length of the motifs-even for the domain experts. In addition, with variability in the motif lengths, ranking among these motifs becomes another major problem. In this work, we propose a novel algorithm using compression ratio as a heuristic to discover meaningful motifs in proper lengths. The ranking of these various length motifs relies on an ability to compress time series by its own motif as a hypothesis. Furthermore, other than being an anytime algorithm, our experimental evaluation also demonstrates that our proposed method outperforms existing works in various domains both in terms of speed and accuracy.

【Keywords】: data mining; time series; anytime algorithm; compression ratio; data mining task; motif ranking; proper length time series motif discovery; Accuracy; Algorithm design and analysis; Clustering algorithms; Data mining; Data models; Time complexity; Time series analysis; motif discovery; proper length motif; time series mining

150. On Anomalous Hotspot Discovery in Graph Streams.

【Paper Link】【Pages】:1271-1276

【Authors】: Weiren Yu ; Charu C. Aggarwal ; Shuai Ma ; Haixun Wang

【Abstract】: Network streams have become ubiquitous in recent years because of many dynamic applications. Such streams may show localized regions of activity and evolution because of anomalous events. This paper will present methods for dynamically determining anomalous hot spots from network streams. These are localized regions of sudden activity or change in the underlying network. We will design a localized principal component analysis algorithm, which can continuously maintain the information about the changes in the different neighborhoods of the network. We will use a fast incremental eigenvector update algorithm based on von Mises iterations in a lazy way in order to efficiently maintain local correlation information. This is used to discover local change hotspots in dynamic streams. We will finally present an experimental study to demonstrate the effectiveness and efficiency of our approach.

【Keywords】: eigenvalues and eigenfunctions; graph theory; network theory (graphs); principal component analysis; anomalous events; anomalous hotspot discovery; dynamic applications; dynamic streams; fast incremental eigenvector update algorithm; graph streams; local change hotspot discovery; local correlation information; localized principal component analysis algorithm; network streams; von Mises iterations; Algorithm design and analysis; Correlation; Eigenvalues and eigenfunctions; Image edge detection; Motion pictures; Time-frequency analysis; Vectors; anomaly detection; graph streams

【Paper Link】【Pages】:1277-1282

【Authors】: Guangxiang Zeng ; Ping Luo ; Enhong Chen ; Min Wang

【Abstract】: This study addresses the problem of inferring users' employment affiliation information from social activities. It is motivated by the applications which need to monitoring and analyzing the social activities of the employees from a given company, especially their social tracks related to the work and business. It definitely helps to better understand their needs and opinions towards certain business area, so that the account sales targeting these customers in the given company can adjust the sales strategies accordingly. Specifically, in this task we are given a snapshot of a social network and some labeled social users who are the employees of a given company. Our goal is to identify more users from the same company. We formulate this problem as a task of classifying nodes over a graph, and develop a Supervised Label Propagation model. It naturally incorporates the rich set of features for social activities, models the networking effect by label propagation, and learns the feature weights so that the labels are propagated to the right users. To validate its effectiveness, we show our case studies on identifying the employees of "China Telecom" and "China Unicom" from Sina Weibo. The experimental results show that our method significantly outperforms the compared baseline ones.

【Keywords】: learning (artificial intelligence); social networking (online); China Telecom; China Unicom; social network; social user activities; supervised label propagation model; Companies; Equations; Media; Social network services; Telecommunications; Vectors

152. Transfer Learning across Cancers on DNA Copy Number Variation Analysis.

【Paper Link】【Pages】:1283-1288

【Authors】: Huanan Zhang ; Ze Tian ; Rui Kuang

【Abstract】: DNA copy number variations (CNVs) are prevalent in all types of tumors. It is still a challenge to study how CNVs play a role in driving tumorgenic mechanisms that are either universal or specific in different cancer types. To address the problem, we introduce a transfer learning framework to discover common CNVs shared across different tumor types as well as CNVs specific to each tumor type from genome-wide CNV data measured by array CGH and SNP genotyping array. The proposed model, namely Transfer Learning with Fused LASSO (TLFL), detects latent CNV components from multiple CNV datasets of different tumor types to distinguish the CNVs that are common across the datasets and those that are specific in each dataset. Both the common and type-specific CNVs are detected as latent components in matrix factorization coupled with fused LASSO on adjacent CNV probe features. TLFL considers the common latent components underlying the multiple datasets to transfer knowledge across different tumor types. In simulations and experiments on real cancer CNV datasets, TLFL detected better latent components that can be used as features to improve classification of patient samples in each individual dataset compared with the model without the knowledge transfer. In cross-dataset analysis on bladder cancer and cross-domain analysis on breast cancer and ovarian cancer, TLFL also learned latent CNV components that are both predictive of tumor stages and correlate with known cancer genes.

【Keywords】: DNA; cancer; learning (artificial intelligence); matrix decomposition; medical computing; pattern classification; tumours; CNVs; DNA copy number variation analysis; SNP genotyping array; TLFL; adjacent CNV probe features; array CGH; bladder cancer; breast cancer; cancer genes; common latent components; cross-dataset analysis; cross-domain analysis; genome-wide CNV data; latent components; matrix factorization; ovarian cancer; patient sample classification; transfer learning with fused LASSO; tumorgenic mechanisms; Arrays; Biological cells; Cancer; Genomics; Hidden Markov models; Probes; Tumors; Cancer Genomics; DNA Copy Number; Fused LASSO Components; Transfer Learning

【Paper Link】【Pages】:1289-1294

【Authors】: Jiawei Zhang ; Xiangnan Kong ; Philip S. Yu

【Abstract】: Nowadsys, many new users are keeping joining in the online social networks every day and these new users usually have very few social connections and very sparse auxiliary information in the network. Prediction social links for new users is very important. Different from conventional link prediction problems, link prediction for new users is more challenging due to the lack of information from the new users in the network. Meanwhile, in recent years, users are usually involved in multiple social networks simultaneously to enjoy the specific services offered by different social networks. The shared users of multiple networks can act as the "anchors" aligned the networks they participate in. In this paper, we propose a link prediction method called SCAN-PS (Supervised Cross Aligned Networks link prediction with Personalized Sampling), to solve the social link prediction problem for new users. SCAN-PS can use information transferred from both the existing active users in the target network and other source networks through aligned accounts. In addition, SCAN-PS could solve the cold start problem when information of these new users is total absent in the target network. Extensive experiments conducted on two real-world aligned heterogeneous social networks demonstrate that SCAN-PS can perform well in predicting social links for new users.

【Keywords】: data mining; social networking (online); SCAN-PS; aligned heterogeneous social networks; data mining; online social networks; social links; supervised cross aligned networks link prediction with personalized sampling; Cultural differences; Diversity reception; Feature extraction; Prediction methods; Twitter; Vectors; Data Mining; Link Prediction

154. Combating Sub-Clusters Effect in Imbalanced Classification.

【Paper Link】【Pages】:1295-1300

【Authors】: Yang Zhao ; Abhishek K. Shrivastava

【Abstract】: Approaches to imbalanced classification problem usually focus on rebalancing the class sizes, neglecting the effect of hidden structure within the majority class. The aim of this paper is to highlight the effect of sub-clusters within the majority class on detecting minority class instances, and handle imbalanced classification by learning the structure in the data. We propose a decomposition based approach to two-class imbalanced classification problem. This approach works by first learning the hidden structure of the majority class using an unsupervised learning algorithm. Thus, transforming the classification problem into several classification sub-problems. The base classifier is constructed on each sub-problem. The ensemble is tuned to increase its sensitivity towards minority class. The proposed approach overcomes the limitations of conventional classifiers on imbalanced problem, and combats imbalance by learning hidden structure in the majority class, which is neglected in most existing works. We demonstrate the performance of the proposed approach on several datasets.

【Keywords】: pattern classification; pattern clustering; unsupervised learning; base classifier; class size rebalancing; decomposition-based approach; ensemble learning; hidden structure learning; majority class; minority class instance detection; subcluster effect; two-class imbalanced classification problem; unsupervised learning algorithm; Accuracy; Classification algorithms; Clustering algorithms; Correlation; Sensitivity; Support vector machines; Training; clustering; latent structure; supervised learning; within-class imbalance

155. Influence and Profit: Two Sides of the Coin.

【Paper Link】【Pages】:1301-1306

【Authors】: Yuqing Zhu ; Zaixin Lu ; Yuanjun Bi ; Weili Wu ; Yiwei Jiang ; Deying Li

【Abstract】: Influence maximization problem is to find a set of seeds in social networks such that the cascade influence is maximized. Traditional models assume all nodes are willing to spread the influence once they are influenced, and they ignore the disparity between influence and profit of a product. In this paper by considering the role that price plays in viral marketing, we propose price related (PR) frame that contains PR-I and PR-L models for classic IC and LT models respectively, which is a pioneer work. We find that influence and profit are like two sides of the coin, high price hinders the influence propagation and to enlarge the influence some sacrifice on profit is inevitable. We propose Balanced Influence and Profit (BIP) maximization problem. We prove the NP-hardness of BIP maximization under PR-I and PR-L model. Unlike influence maximization, the BIP objective function is not monotone. Despite the non-monotony, we show BIP objective function is sub modular under certain conditions. Two unbudgeted greedy algorithms separately are devised. We conduct simulations on real-world datasets and evaluate the superiority of our algorithms over existing ones.

【Keywords】: computational complexity; greedy algorithms; marketing data processing; optimisation; pricing; profitability; social networking (online); BIP objective function; IC models; LT models; NP-hardness problem; PR-I model; PR-L models; balanced influence and profit maximization problem; cascade influence maximization problem; coin; independent cascade model; influence propagation; linear threshold model; price related frame; social networks; unbudgeted greedy algorithms; viral marketing; Computational modeling; Educational institutions; Integrated circuit modeling; Linear programming; Manufacturing; Social network services; IC model; Influence maximization; LT model; profit maximization; social networks

156. Constrained Clustering: Effective Constraint Propagation with Imperfect Oracles.

【Paper Link】【Pages】:1307-1312

【Authors】: Xiatian Zhu ; Chen Change Loy ; Shaogang Gong

【Abstract】: While spectral clustering is usually an unsupervised operation, there are circumstances in which we have prior belief that pairs of samples should (or should not) be assigned with the same cluster. Constrained spectral clustering aims to exploit this prior belief as constraint (or weak supervision) to influence the cluster formation so as to obtain a structure more closely resembling human perception. Two important issues remain open: (1) how to propagate sparse constraints effectively, (2) how to handle ill-conditioned/noisy constraints generated by imperfect oracles. In this paper we present a unified framework to address the above issues. Specifically, in contrast to existing constrained spectral clustering approaches that blindly rely on all features for constructing the spectral, our approach searches for neighbours driven by discriminative feature selection for more effective constraint diffusion. Crucially, we formulate a novel data-driven filtering approach to handle the noisy constraint problem, which has been unrealistically ignored in constrained spectral clustering literature.

【Keywords】: constraint handling; feature selection; information filtering; pattern clustering; cluster formation; constrained spectral clustering; constraint diffusion; data-driven filtering approach; discriminative feature selection; ill-conditioned-noisy constraint handling; imperfect oracles; sparse constraint propagation; weak supervision; Clustering methods; Equations; Mathematical model; Noise measurement; Optimization; Training; Vegetation; Constrained clustering; constraint propagation; feature selection; imperfect oracles; spectral clustering

【Paper Link】【Pages】:1313-1318

【Authors】: Honglei Zhuang ; Yihan Sun ; Jie Tang ; Jialin Zhang ; Xiaoming Sun

【Abstract】: Social influence and influence diffusion has been widely studied in online social networks. However, most existing works on influence diffusion focus on static networks. In this paper, we study the problem of maximizing influence diffusion in a dynamic social network. Specifically, the network changes over time and the changes can be only observed by periodically probing some nodes for the update of their connections. Our goal then is to probe a subset of nodes in a social network so that the actual influence diffusion process in the network can be best uncovered with the probing nodes. We propose a novel algorithm to approximate the optimal solution. The algorithm, through probing a small portion of the network, minimizes the possible error between the observed network and the real network. We evaluate the proposed algorithm on both synthetic and real large networks. Experimental results show that our proposed algorithm achieves a better performance than several alternative algorithms.

【Keywords】: directed graphs; human factors; social networking (online); dynamic social networks; influence diffusion maximization; influence maximization; online social networks; probing nodes; social influence; static networks; Algorithm design and analysis; Approximation algorithms; Estimation; Heuristic algorithms; Probes; Twitter; dynamic social networks; influence maximization

158. Adaptive Model Tree for Streaming Data.

【Paper Link】【Pages】:1319-1324

【Authors】: Anca Maria Zimmer ; Michael Kurze ; Thomas Seidl

【Abstract】: With an ever-growing availability of data streams the interest in and need for efficient techniques dealing with such data increases. A major challenge in this context is the accurate online prediction of continuous values in the presence of concept drift. In this paper, we introduce a new adaptive model tree (AMT), designed to incrementally learn from the data stream, adapt to the changes, and to perform real time accurate predictions at anytime. To deal with sub models lying in different subspaces, we propose a new model clustering algorithm able to identify subspace models, and use it for computing splits in the input space. Compared to state of the art, our AMT allows for oblique splits, delivering more compact and accurate models.

【Keywords】: learning (artificial intelligence); pattern clustering; trees (mathematics); AMT; adaptive model tree; concept drift; data streaming; model clustering algorithm; oblique splits; online prediction; subspace model identification; Adaptation models; Clustering algorithms; Computational modeling; Data models; Impurities; Predictive models; Support vector machines; prediction; regression tree; streaming data

159. Structural-Context Similarities for Uncertain Graphs.

【Paper Link】【Pages】:1325-1330

【Authors】: Zhaonian Zou ; Jianzhong Li

【Abstract】: Structural-context similarities between vertices in graphs, such as the Jaccard similarity, the Dice similarity, and the cosine similarity, play important roles in a number of graph data analysis techniques. However, uncertainty is inherent in massive graph data, and therefore the classical definitions of structural-context similarities on exact graphs don't make sense on uncertain graphs. In this paper, we propose a generic definition of structural-context similarity for uncertain graphs. Since it is computationally prohibitive to compute the similarity between two vertices of an uncertain graph directly by its definition, we investigate two efficient approaches to computing similarities, namely the polynomial-time exact algorithms and the linear-time approximation algorithms. The experimental results on real uncertain graphs verify the effectiveness of the proposed structural-context similarities as well as the accuracy and efficiency of the proposed evaluation algorithms.

【Keywords】: approximation theory; computational complexity; data analysis; graph theory; Dice similarity; Jaccard similarity; cosine similarity; graph data analysis techniques; linear-time approximation algorithm; polynomial-time exact algorithm; structural-context similarities; uncertain graph; Accuracy; Approximation algorithms; Approximation methods; Data analysis; Databases; Joints; Uncertainty; Dice similarity; Jaccard similarity; cosine similarity; structural-context similarity; uncertain graph