12. SDM 2012: Anaheim, California, USA

Proceedings of the Twelfth SIAM International Conference on Data Mining, Anaheim, California, USA, April 26-28, 2012. SIAM / Omnipress 【DBLP Link】

Paper Num: 98 || Session Num: 12

Session CP1 - Applications - Climate and Geography 5

1. Detecting and Tracking Coordinated Groups in Dense, Systematically Moving, Crowds.

【Paper Link】 【Pages】:1-11

【Authors】: James Rosswog ; Kanad Ghose

【Abstract】: We address the problem of detecting and tracking clusters of moving objects in very noisy environments. Monitoring a crowded football stadium for small groups of individuals acting suspiciously is an example instance of this problem. In this example the vast majority of individuals are not part of a suspicious group and are considered noise. Existing spatio-temporal clustering algorithms are only capable of detecting small clusters in extreme noise when the noise objects are moving randomly. In reality, as in the example cited, the noise objects move systematically rather than randomly. The members of the suspicious groups attempt to mimic the behaviors of the crowd in order to blend in and avoid detection. This significantly exacerbates the problem of detecting the true clusters. We propose the use of Support Vector Machines (SVMs) to differentiate the true clusters and their members from the systematically moving noise objects. Our technique utilizes the relational history of the moving objects, implicitly tracked in a relationship graph, and an SVM to increase the accuracy of the clustering algorithm. A modified DBSCAN algorithm is then used to discover clusters of highly related objects from the relationship graph. We evaluate our technique experimentally on several data sets of mobile objects. The experiments show that our technique is able to accurately and efficiently identify groups of suspicious individuals in dense crowds.

【Keywords】:
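For readers unfamiliar with the density-based step this paper builds on, the following is a minimal sketch of standard DBSCAN over 2-D points with Euclidean distance. It is not the authors' modified relationship-graph variant; the function name and parameters are illustrative only.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE               # tentatively noise; may become border
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:                        # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster         # border point: absorbed, not expanded
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs_j = neighbors(j)
            if len(nbrs_j) >= min_pts:      # core point: keep expanding
                seeds.extend(nbrs_j)
        cluster += 1
    return labels
```

The paper's contribution replaces the raw spatial distance here with affinities from a relational-history graph, so that systematically moving noise is not absorbed into clusters.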

2. Large-Scale Nonparametric Estimation of Vehicle Travel Time Distributions.

【Paper Link】 【Pages】:12-23

【Authors】: Rikiya Takahashi ; Takayuki Osogami ; Tetsuro Morimura

【Abstract】: Fitting distributions of travel-time in vehicle traffic is an important application of spatio-temporal data mining. While regression methods to forecast the expected travel-time are standard approaches of travel-time prediction, we need to estimate distributions of the travel-time when using state-of-the-art risk-sensitive route recommendation systems. We introduce a novel nonparametric density estimator of travel-time for each road or link. The new estimator consists of basis functions modeled as mixtures of gamma or log-normal density functions, a sparse link similarity matrix given as an approximate diffusion kernel on a link connectivity graph, and importance weights for each link. Unlike the existing nonparametric methods that are computationally intensive, the new estimator is stably applicable to large datasets, because the basis functions and the importance weights are globally optimized with a fast convex clustering algorithm. Experimental results using real probe-car datasets show advantages of the new nonparametric estimator over parametric regression methods.

【Keywords】:
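A gamma-mixture basis function of the kind the abstract describes can be sketched as follows. This is a generic mixture-of-gammas density, not the paper's optimized estimator; the component list format is an assumption for illustration.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma density; one building block of a travel-time basis function."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

def mixture_pdf(x, components):
    """components: list of (weight, shape, scale) with weights summing to 1."""
    return sum(w * gamma_pdf(x, k, s) for w, k, s in components)
```

In the paper's setting, each link's density would be a weighted combination of such basis functions, with the weights learned by convex clustering.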

3. Drought Detection of the Last Century: An MRF-based Approach.

【Paper Link】 【Pages】:24-34

【Authors】: Qiang Fu ; Arindam Banerjee ; Stefan Liess ; Peter K. Snyder

【Abstract】: Droughts are one of the most damaging climate-related hazards. The late 1960s Sahel drought in Africa and the North American Dust Bowl of the 1930s are two examples of severe droughts that had a major impact on society and the environment. Due to the importance of understanding droughts, we consider the problem of their detection based on gridded datasets of precipitation. We formulate the problem as one of finding the most likely configuration of a Markov Random Field and propose an efficient inference algorithm. We apply this algorithm to the Climate Research Unit precipitation dataset spanning 106 years. The empirical results show that the algorithm successfully identifies the major droughts of the twentieth century in different regions of the world.

【Keywords】:
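To make the "most likely configuration of an MRF" concrete, here is exact MAP inference for a two-state (normal/drought) chain by dynamic programming. The paper's MRF is grid-structured and needs an approximate solver; a temporal chain, where this Viterbi-style recursion is exact, is a deliberate simplification.

```python
def chain_map(node_cost, smooth):
    """MAP states of a 2-state chain MRF by dynamic programming.
    node_cost[t][s]: cost of state s at time t (low cost = likely);
    smooth: penalty for switching state between consecutive steps."""
    T = len(node_cost)
    best = [node_cost[0][:]]          # best[t][s]: min cost ending in state s
    back = []                         # backpointers for path recovery
    for t in range(1, T):
        row, ptr = [], []
        for s in (0, 1):
            cands = [best[-1][p] + (smooth if p != s else 0.0) for p in (0, 1)]
            p = min((0, 1), key=lambda q: cands[q])
            row.append(cands[p] + node_cost[t][s])
            ptr.append(p)
        best.append(row)
        back.append(ptr)
    s = min((0, 1), key=lambda q: best[-1][q])
    path = [s]
    for ptr in reversed(back):        # trace the optimal path backwards
        s = ptr[s]
        path.append(s)
    return path[::-1]
```

The smoothing term is what distinguishes an MRF formulation from per-timestep thresholding: it suppresses one-step blips and yields contiguous drought episodes.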

4. Toward Data-driven, Semi-automatic Inference of Phenomenological Physical Models: Application to Eastern Sahel Rainfall.

【Paper Link】 【Pages】:35-46

【Authors】: Saurabh V. Pendse ; Isaac K. Tetteh ; Fredrick H. M. Semazzi ; Vipin Kumar ; Nagiza F. Samatova

【Abstract】: First-principles based predictive understanding of complex, dynamic physical phenomena, such as regional precipitation or hurricane intensity and frequency, is quite limited due to the lack of complete phenomenological models underlying their physics. To address this gap, hypothesis-driven, manually-constructed, conceptual hurricane models and models for regional-scale precipitation extremes have been emerging. To complement both approaches, we propose a methodology for data-driven, semi-automatic inference of plausible phenomenological models and apply it to derive the model for eastern Sahel rainfall, an important factor for socioeconomic growth and development of this region. At its core, our methodology derives cause-effect relationships using the Lasso multivariate regression model and quantifies the compound effect that the complex interplay among the key predictors at their prominent temporal phases has on the response (rainfall). Specifically, we propose methods for (a) detecting and ranking predictors' prominent temporal phases, (b) optimizing the regularization penalty, (c) assessing predictor statistical significance, (d) performing impact analysis of data normalization on model inference, and (e) calculating the Expected Causality Impact (ECI) score to quantify impact analysis. The culmination of this study is the plausible phenomenological model of the eastern Sahel seasonal rainfall and quantified key climate drivers involved in the rainfall variability at different time lags. To the best of our knowledge, this is the first phenomenological model of this phenomenon; several of its components are consistent with the known evidence from the literature.

【Keywords】:

5. Sparse Group Lasso: Consistency and Climate Applications.

【Paper Link】 【Pages】:47-58

【Authors】: Soumyadeep Chatterjee ; Karsten Steinhaeuser ; Arindam Banerjee ; Snigdhansu Chatterjee ; Auroop R. Ganguly

【Abstract】: The design of statistical predictive models for climate data gives rise to some unique challenges due to the high dimensionality and spatio-temporal nature of the datasets, which dictate that models should exhibit parsimony in variable selection. Recently, a class of methods that promote structured sparsity in the model, well suited to this task, has been developed. In this paper, we prove theoretical statistical consistency of estimators with tree-structured norm regularizers. We consider one particular model, the Sparse Group Lasso (SGL), to construct predictors of land climate using ocean climate variables. Our experimental results demonstrate that the SGL model provides better predictive performance than the current state-of-the-art, remains climatologically interpretable, and is robust in its variable selection.

【Keywords】:

Session CP2 - Clustering 5

6. The Multi-Set Stream Clustering Problem.

【Paper Link】 【Pages】:59-69

【Authors】: Charu C. Aggarwal

【Abstract】: The problem of clustering has been widely studied by the data mining community because of its applications to a wide variety of problems in the context of customer segmentation, electronic commerce and learning. The problem of clustering is generally presented as one of clustering individual instances of data records. In many applications, we have a collection of multiple sets of records. Each such set is essentially a database of records, and each database may possibly contain a different number of records. It is desirable to cluster these sets on the basis of the similarity of their underlying data distributions. Thus, this problem may also be understood as that of clustering sets of data sets, as opposed to clustering sets of instances. The problem is especially challenging when the data sets are not available at one time, but are presented in the form of out-of-order and mixed streams, in which the records from different data sets do not arrive in any particular order, but are mixed with one another. In this paper, we present a first approach to the problem with the use of anchor-based summarization. We present experimental results for the effectiveness and efficiency of the approach on a number of real data sets.

【Keywords】:

7. Stratification Based Hierarchical Clustering Over a Deep Web Data Source.

【Paper Link】 【Pages】:70-81

【Authors】: Tantan Liu ; Gagan Agrawal

【Abstract】: This paper focuses on the problem of clustering data from a hidden or a deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. Unlike in existing sampling-based methods, sampling costs, and not computation or memory costs, are the dominant consideration in designing the sampling technique. We have developed a new methodology for addressing the clustering problem on the deep web. Our work includes three new ideas, which are a method for stratifying a deep web data source, an algorithm for hierarchical clustering based on stratified sampling, and a two-phase technique for sampling, which includes representative sampling in the first phase, and sampling focusing on the boundary points between the clusters in the second phase. We have evaluated our approach using two synthetic and one real data set. Our experiments show that each of the three ideas we have introduced leads to significant improvements in accuracy and efficiency of clustering a hidden data source. Specifically, we improve the accuracy of the clusters obtained (measured by average distance to centers) by up to 20% over the existing approach. Viewed another way, our method can achieve the same accuracy with up to 25% fewer samples, thus reducing the sampling cost.

【Keywords】:

8. Cluster-Aware Compression with Provable K-means Preservation.

【Paper Link】 【Pages】:82-93

【Authors】: Nikolaos M. Freris ; Michail Vlachos ; Deepak S. Turaga

【Abstract】: This work rigorously explores the design of cluster-preserving compression schemes for high-dimensional data. We focus on the K-means algorithm and identify conditions under which running the algorithm on the compressed data yields the same clustering outcome as on the original. The compression is performed using single and multi-bit minimum mean square error quantization schemes as well as a given clustering assignment of the original data. We provide theoretical guarantees on post-quantization cluster preservation under certain conditions on the cluster structure, and propose an additional data transformation that can ensure cluster preservation unconditionally; this transformation is invertible and thus induces virtually no distortion on the compressed data. In addition, we provide an efficient scheme for multi-bit allocation, per cluster and data dimension, which enables a trade-off between high compression efficiency and low data distortion. Our experimental studies highlight that the suggested scheme accurately preserves the clusters formed in all cases, while incurring minimal distortion on the data shapes. Our results can find many applications, e.g., (a) in clustering, analysis and distribution of massive datasets, where the proposed data compression can boost performance while providing provable guarantees on the clustering result, and (b) in cloud computing services, where the optional transformation provides a data-hiding functionality in addition to preserving the K-means clustering outcome.

【Keywords】:
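A toy stand-in for the single-bit quantization idea: split one dimension at its mean and reconstruct each half by its conditional mean, which is the MSE-optimal pair of levels for that fixed threshold. This is only an illustration of one-bit quantization, not the paper's per-cluster MMSE scheme or its preservation conditions.

```python
def one_bit_quantize(values):
    """Replace each value by one of two reconstruction levels:
    the conditional mean of its side of the overall mean."""
    mu = sum(values) / len(values)
    lo = [v for v in values if v < mu]
    hi = [v for v in values if v >= mu]
    lo_level = sum(lo) / len(lo) if lo else mu
    hi_level = sum(hi) / len(hi) if hi else mu
    return [lo_level if v < mu else hi_level for v in values]
```

On well-separated data the two levels land near the cluster centers, so nearest-center assignments survive quantization; the paper characterizes exactly when this kind of preservation is guaranteed.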

9. Supervised Clustering of Label Ranking Data.

【Paper Link】 【Pages】:94-105

【Authors】: Mihajlo Grbovic ; Nemanja Djuric ; Slobodan Vucetic

【Abstract】: In this paper we study supervised clustering in the context of label ranking data. Segmentation of such complex data has many potential real-world applications. For example, in target marketing, the goal is to cluster customers in the feature space by taking into consideration the assigned, potentially incomplete product preferences, such that the preferences of instances within a cluster are more similar than the preferences of customers in the other clusters. We establish several heuristic baselines for this application that make use of well-known algorithms such as K-means, and propose a principled algorithm specifically tailored for this type of clustering. It is based on the Plackett-Luce (PL) probabilistic ranking model. Each cluster is represented as a union of Voronoi cells defined by a set of prototypes and is assigned a set of PL label scores that determine the cluster-specific label ranking. The unknown cluster PL parameters and prototype positions are determined using a supervised learning technique. Cluster membership and ranking for a new instance are determined by its nearest prototype. The proposed algorithms were empirically evaluated on synthetic and real-life label ranking data. The PL-based method was superior to the heuristically-based supervised clustering approaches. The proposed PL-based algorithm was also evaluated on the task of label ranking prediction. The results showed that it is highly competitive with the state-of-the-art label ranking algorithms, and that it is particularly accurate on data with partial rankings.

【Keywords】:

10. Symmetric Nonnegative Matrix Factorization for Graph Clustering.

【Paper Link】 【Pages】:106-117

【Authors】: Da Kuang ; Haesun Park ; Chris H. Q. Ding

【Abstract】: Nonnegative matrix factorization (NMF) provides a lower rank approximation of a nonnegative matrix, and has been successfully used as a clustering method. In this paper, we offer some conceptual understanding for the capabilities and shortcomings of NMF as a clustering method. Then, we propose Symmetric NMF (SymNMF) as a general framework for graph clustering, which inherits the advantages of NMF by enforcing nonnegativity on the clustering assignment matrix. Unlike NMF, however, SymNMF is based on a similarity measure between data points, and factorizes a symmetric matrix containing pairwise similarity values (not necessarily nonnegative). We compare SymNMF with the widely-used spectral clustering methods, and give an intuitive explanation of why SymNMF captures the cluster structure embedded in the graph representation more naturally. In addition, we develop a Newton-like algorithm that exploits second-order information efficiently, so as to show the feasibility of SymNMF as a practical framework for graph clustering. Our experiments on artificial graph data, text data, and image data demonstrate the substantially enhanced clustering quality of SymNMF over spectral clustering and NMF. Therefore, SymNMF is able to achieve better clustering results on both linear and nonlinear manifolds, and serves as a potential basis for many extensions and applications.

【Keywords】:

Session CP3 - Social Media 5

11. Feature Selection with Linked Data in Social Media.

【Paper Link】 【Pages】:118-128

【Authors】: Jiliang Tang ; Huan Liu

【Abstract】: Feature selection is widely used in preparing high-dimensional data for effective data mining. Increasingly popular social media data presents new challenges to feature selection. Social media data consists of (1) traditional high-dimensional, attribute-value data such as posts, tweets, comments, and images, and (2) linked data that describes the relationships between social media users, as well as between users and their posts. The nature of social media also determines that its data is massive, noisy, and incomplete, which exacerbates the already challenging problem of feature selection. In this paper, we illustrate the differences between attribute-value data and social media data, investigate whether linked data can be exploited in a new feature selection framework by taking advantage of social science theories, extensively evaluate the effects of user-user and user-post relationships manifested in linked data on feature selection, and discuss some research issues for future work.

【Keywords】:

12. Microscopic Social Influence.

【Paper Link】 【Pages】:129-140

【Authors】: Ting Wang ; Mudhakar Srivatsa ; Dakshi Agrawal ; Ling Liu

【Abstract】: Social influences, the phenomena that one individual's actions can induce similar behaviors among his/her friends via their social ties, have been widely observed in socially networked systems. While most existing work focuses on studying general, macro-level influence (e.g., diffusion), it is equally important to understand social influence at microscopic scales (i.e., at the granularity of single individuals, actions, and time-stamps), which may benefit a range of applications. We propose μSI, a microscopic social-influence model wherein: individuals' actions are modeled as temporary interactions between a social network (formed by individuals) and an object network (formed by targets of actions); one individual's actions influence his/her friends in a dynamic, network-wise manner (i.e., dependent on both social and object networks). We develop for μSI a suite of novel inference tools that enable us to answer questions of the form: How may one interaction trigger another? More importantly, when and where may a new interaction be observed? We carefully address the computational challenges of inference over such semantically rich models by dynamically identifying sub-domains of interest and varying the precision of solutions over different sub-domains. We demonstrate the breadth and generality of μSI using two seemingly disparate applications. In the context of a social tagging service, we show how it can help improve the accuracy and freshness of resource recommendation; in the context of a mobile phone call service, we show how it can help improve the efficiency of the paging operation.

【Keywords】:

13. HAR: Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search.

【Paper Link】 【Pages】:141-152

【Authors】: Xutao Li ; Michael K. Ng ; Yunming Ye

【Abstract】: In this paper, we propose a framework HAR to study the hub and authority scores of objects, and the relevance scores of relations in multi-relational data for query search. The basic idea of our framework is to consider a random walk in multi-relational data, and study in such random walk, limiting probabilities of relations for relevance scores, and of objects for hub scores and authority scores. The main contribution of this paper is to (i) propose a framework (HAR) that can compute the hub, authority and relevance scores by solving limiting probabilities arising from multi-relational data, and can incorporate input query vectors to handle query-specific search; (ii) show existence and uniqueness of such limiting probabilities so that they can be used for query search effectively; and (iii) develop an iterative algorithm to solve a set of tensor (multivariate polynomial) equations to obtain such probabilities. Extensive experimental results on TREC and DBLP data sets suggest that the proposed method is very effective in obtaining relevant results to the querying inputs. In the comparison, we find that the performance of HAR is better than those of HITS, SALSA and TOPHITS.

【Keywords】:
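HAR generalizes hub/authority scoring to multi-relational tensors; for orientation, here is the classic single-relation HITS iteration it extends (and against which the paper compares). This is standard HITS, not the paper's tensor-equation solver.

```python
def hits(edges, nodes, iters=20):
    """Hub/authority scores on a directed graph given as an edge list.
    Each round: authorities aggregate incoming hub mass, hubs aggregate
    outgoing authority mass, each normalized to unit l2 norm."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        for n in nodes:
            auth[n] = sum(hub[u] for u, v in edges if v == n)
        norm = sum(a * a for a in auth.values()) ** 0.5
        for n in nodes:
            auth[n] /= norm
        for n in nodes:
            hub[n] = sum(auth[v] for u, v in edges if u == n)
        norm = sum(h * h for h in hub.values()) ** 0.5
        for n in nodes:
            hub[n] /= norm
    return hub, auth
```

In HAR the single adjacency matrix is replaced by a tensor over relation types, and a third score (relevance, over relations) is computed alongside hubs and authorities.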

14. Evaluating Event Credibility on Twitter.

【Paper Link】 【Pages】:153-164

【Authors】: Manish Gupta ; Peixiang Zhao ; Jiawei Han

【Abstract】: Though Twitter acts as a real-time news source, with people acting as sensors and sending event updates from all over the world, rumors spread via Twitter have been noted to cause considerable damage. Given a set of popular Twitter events along with related users and tweets, we study the problem of automatically assessing the credibility of such events. We propose a credibility analysis approach enhanced with event graph-based optimization to solve the problem. First we experiment by performing PageRank-like credibility propagation on a multi-typed network consisting of events, tweets, and users. Further, within each iteration, we enhance the basic trust analysis by updating event credibility scores using regularization on a new graph of events. Our experiments using events extracted from two tweet feed datasets, each with millions of tweets, show that our event graph optimization approach outperforms the basic credibility analysis approach. Also, our methods are significantly more accurate (∼86%) than the decision tree classifier approach (∼72%).

【Keywords】:
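For reference, "PageRank-like credibility propagation" builds on the plain PageRank iteration sketched below, here on an ordinary directed graph. The paper runs an analogous propagation over a multi-typed event/tweet/user network; the adjacency-dict format and damping value here are illustrative defaults.

```python
def pagerank(out_links, damping=0.85, iters=60):
    """Power-iteration PageRank. out_links: dict node -> list of targets.
    Dangling nodes spread their mass uniformly."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}   # teleport mass
        for u in nodes:
            targets = out_links[u] or nodes           # dangling-node fix
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank
```

In the credibility setting, the propagated quantity is a trust score rather than link popularity, and the per-iteration regularization step on the event graph is the paper's addition.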

15. Multi-skill Collaborative Teams based on Densest Subgraphs.

【Paper Link】 【Pages】:165-176

【Authors】: Amita Gajewar ; Atish Das Sarma

【Abstract】: We consider the problem of identifying a team of skilled individuals for collaboration, in the presence of a social network, with the goal to maximize the collaborative compatibility of the team. Each node in the social network is associated with skills, and edge-weights specify affinity between respective nodes. We measure the collaborative-compatibility objective as the density of the induced subgraph on selected nodes. This problem is NP-hard even when the team requires individuals of only one skill. We present a 3-approximation algorithm for the single-skill team formation problem. We show the same approximation can be extended to a special case of multiple skills. Our problem generalizes the formulation studied by Lappas et al. [KDD ′09] who measure team compatibility in terms of diameter or spanning tree. The experimental results show that the density-based algorithms outperform the diameter-based objective on several metrics.

【Keywords】:
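The density objective here is the classic densest-subgraph measure |E|/|V|. A minimal sketch of Charikar's greedy peeling, which 2-approximates the unconstrained problem, is below; the paper's 3-approximation adds skill constraints on top of this style of algorithm, which this sketch does not handle.

```python
def densest_subgraph(adj):
    """Greedy peeling: repeatedly delete a minimum-degree vertex and
    remember the densest intermediate subgraph (density = |E|/|V|).
    adj: dict vertex -> set of neighbors (undirected, unweighted)."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    best, best_density = set(adj), -1.0
    while adj:
        edges = sum(len(ns) for ns in adj.values()) // 2
        density = edges / len(adj)
        if density > best_density:
            best_density, best = density, set(adj)
        v = min(adj, key=lambda u: len(adj[u]))   # peel a min-degree vertex
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return best, best_density
```

Intuitively, low-degree members contribute little affinity to the team, so peeling them can only raise (or briefly lower, then raise) the density of what remains.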

Session CP4 - Multi-Source and Multi-Task 5

16. Heterogeneous Data Fusion via Space Alignment Using Nonmetric Multidimensional Scaling.

【Paper Link】 【Pages】:177-188

【Authors】: Jaegul Choo ; Shawn Bohn ; Grant Nakamura ; Amanda M. White ; Haesun Park

【Abstract】: Heterogeneous data sets are typically represented in different feature spaces, making it difficult to analyze relationships spanning different data sets even when they are semantically related. Data fusion via space alignment can address this problem by integrating multiple data sets lying in different spaces into one common space. Given a set of reference correspondence data that share the same semantic meaning across different spaces, space alignment attempts to place the corresponding reference data as close together as possible, and accordingly, the entire data sets are aligned in a common space. Space alignment involves optimizing two potentially conflicting criteria: minimum deformation of the original relationships and maximum alignment between the different spaces. To solve this problem, we provide a novel graph embedding framework for space alignment, which converts each data set into a graph and assigns zero distance between reference correspondence pairs, resulting in a single graph. We propose a graph embedding method for fusion based on nonmetric multidimensional scaling (MDS). Its use of rank order rather than distance allows nonmetric MDS to effectively handle both deformation and alignment. Experiments using parallel data sets demonstrate that our approach works well in comparison to existing methods such as constrained Laplacian eigenmaps, Procrustes analysis, and tensor decomposition. We also present standard cross-domain information retrieval tests as well as interesting visualization examples using space alignment.

【Keywords】:

17. Heterogeneous Datasets Representation and Learning using Diffusion Maps and Laplacian Pyramids.

【Paper Link】 【Pages】:189-199

【Authors】: Neta Rabin ; Ronald R. Coifman

【Abstract】: The diffusion maps together with the geometric harmonics provide a method for describing the geometry of high dimensional data and for extending these descriptions to new data points and to functions, which are defined on the data. This method suffers from two limitations. First, even though real-life data is often heterogeneous, the assumption in diffusion maps is that the attributes of the processed dataset are comparable. Second, application of the geometric harmonics requires careful setting of the extension scale and condition number. In this paper, we propose a method for representing and learning heterogeneous datasets by using diffusion maps for unifying and embedding heterogeneous datasets and by replacing the geometric harmonics with the Laplacian pyramid extension. Experimental results on three benchmark datasets demonstrate how the learning process becomes straightforward when the constructed representation smoothly parameterizes the task-related function.

【Keywords】:

18. A Bayesian Nonparametric Joint Factor Model for Learning Shared and Individual Subspaces from Multiple Data Sources.

【Paper Link】 【Pages】:200-211

【Authors】: Sunil Kumar Gupta ; Dinh Quoc Phung ; Svetha Venkatesh

【Abstract】: Joint analysis of multiple data sources is becoming increasingly popular in transfer learning, multi-task learning and cross-domain data mining. One promising approach to model the data jointly is through learning the shared and individual factor subspaces. However, the performance of this approach depends on the subspace dimensionalities and the level of sharing, which need to be specified a priori. To this end, we propose a nonparametric joint factor analysis framework for modeling multiple related data sources. Our model utilizes the hierarchical beta process as a nonparametric prior to automatically infer the number of shared and individual factors. For posterior inference, we provide a Gibbs sampling scheme using auxiliary variables. The effectiveness of the proposed framework is validated through its application on two real world problems: transfer learning in text and image retrieval.

【Keywords】:

19. Adaptive Multi-task Sparse Learning with an Application to fMRI Study.

【Paper Link】 【Pages】:212-223

【Authors】: Xi Chen ; Jingrui He ; Rick Lawrence ; Jaime G. Carbonell

【Abstract】: In this paper, we consider the multi-task sparse learning problem under the assumption that the dimensionality diverges with the sample size. The traditional l1/l2 multi-task lasso does not enjoy the oracle property unless a rather strong condition is enforced. Inspired by adaptive lasso, we propose a multi-stage procedure, adaptive multi-task lasso, to simultaneously conduct model estimation and variable selection across different tasks. Motivated by adaptive elastic-net, we further propose the adaptive multi-task elastic-net by adding another quadratic penalty to address the problem of collinearity. When the number of tasks is fixed, under weak assumptions, we establish the asymptotic oracle property for the proposed adaptive multi-task sparse learning methods including both adaptive multi-task lasso and elastic-net. In addition to the desirable asymptotic property, we show by simulations that adaptive sparse learning methods also achieve much improved finite sample performance. As a case study, we apply adaptive multi-task elastic-net to a cognitive science problem, where one wants to discover a compact semantic basis for predicting fMRI images. We show that adaptive multi-task sparse learning methods achieve superior performance and provide some insights into how the brain represents meanings of words.

【Keywords】:
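The core idea of the adaptive lasso that this paper builds on can be seen in one line: each coefficient gets its own data-dependent threshold, so large initial estimates are shrunk less. The sketch below is the standard coordinate-wise form for an orthonormal design, a toy illustration rather than the paper's multi-task estimator; the initial-estimate input and gamma default are illustrative.

```python
import math

def adaptive_soft_threshold(beta_init, lam, gamma=1.0):
    """Adaptive soft-thresholding: threshold lam / |b|**gamma per
    coefficient, so confident (large) estimates are barely penalized
    while small ones are driven exactly to zero."""
    out = []
    for b in beta_init:
        if b == 0.0:
            out.append(0.0)
            continue
        t = lam / abs(b) ** gamma
        out.append(math.copysign(max(abs(b) - t, 0.0), b))
    return out
```

This coefficient-specific weighting is what gives the adaptive variants their oracle property, which the plain l1/l2 multi-task lasso lacks without strong conditions.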

20. Learning from Heterogeneous Sources via Gradient Boosting Consensus.

【Paper Link】 【Pages】:224-235

【Authors】: Xiaoxiao Shi ; Jean-François Paiement ; David Grangier ; Philip S. Yu

【Abstract】: Multiple data sources containing different types of features may be available for a given task. For instance, users' profiles can be used to build recommendation systems. In addition, a model can also use users' historical behaviors and social networks to infer users' interests on related products. We argue that it is desirable to collectively use any available multiple heterogeneous data sources in order to build effective learning models. We call this framework heterogeneous learning. In our proposed setting, data sources can include (i) non-overlapping features, (ii) non-overlapping instances, and (iii) multiple networks (i.e. graphs) that connect instances. In this paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding learning model from gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) There should be consensus among the predictions of overlapping instances (if any) from different data sources; (2) Connected instances in graph datasets may have similar predictions. The objective function is solved by stochastic gradient boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources, and deemphasize the noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number recognition and terrorist attack detection tasks. We observe that the proposed model can improve out-of-sample error rate by as much as 80%.

【Keywords】:

Session CP5 - Pattern Mining 5

21. Slim: Directly Mining Descriptive Patterns.

【Paper Link】 【Pages】:236-247

【Authors】: Koen Smets ; Jilles Vreeken

【Abstract】: Mining small, useful, and high-quality sets of patterns has recently become an important topic in data mining. The standard approach is to first mine many candidates, and then to select a good subset. However, the pattern explosion generates such enormous amounts of candidates that by post-processing it is virtually impossible to analyse dense or large databases in any detail. We introduce Slim, an any-time algorithm for mining high-quality sets of itemsets directly from data. We use MDL to identify the best set of itemsets as that set that describes the data best. To approximate this optimum, we iteratively use the current solution to determine what itemset would provide most gain—estimating quality using an accurate heuristic. Without requiring a pre-mined candidate collection, Slim is parameter-free in both theory and practice. Experiments show we mine high-quality pattern sets; while evaluating orders-of-magnitude fewer candidates than our closest competitor, Krimp, we obtain much better compression ratios—closely approximating the locally-optimal strategy. Classification experiments independently verify we characterise data very well.

【Keywords】:

22. MARBLES: Mining Association Rules Buried in Long Event Sequences.

【Paper Link】 【Pages】:248-259

【Authors】: Boris Cule ; Nikolaj Tatti ; Bart Goethals

【Abstract】: Sequential pattern discovery is a well-studied field in data mining. Episodes are sequential patterns that describe events that often occur in the vicinity of each other. Episodes can impose restrictions on the order of the events, which makes them a versatile technique for describing complex patterns in the sequence. Most of the research on episodes deals with special cases such as serial and parallel episodes, while discovering general episodes is surprisingly understudied. This is particularly true when it comes to discovering association rules between them. In this paper we propose an algorithm that mines association rules between two general episodes. On top of the traditional definitions of frequency and confidence, we introduce two novel confidence measures for the rules. The major challenge in mining these association rules is pattern explosion. To limit the output, we aim to eliminate all redundant rules. We define the class of closed association rules, and show that this class contains all non-redundant output. To make the algorithm efficient, we use further pruning steps along the way. First, we generate only free and closed frequent episodes from which we create candidate rules; we then speed up the evaluation of the rules, and finally prune the remaining non-closed rules from the output.

【Keywords】:

23. Mining Patterns in Networks using Homomorphism.

【Paper Link】 【Pages】:260-271

【Authors】: Anton Dries ; Siegfried Nijssen

【Abstract】: In recent years many algorithms have been developed for finding patterns in graphs and networks. A disadvantage of these algorithms is that they use subgraph isomorphism to determine the support of a graph pattern; subgraph isomorphism is a well-known NP-complete problem. In this paper, we propose an alternative approach which mines tree patterns in networks by using subgraph homomorphism. The advantage of homomorphism is that it can be computed in polynomial time, which allows us to develop an algorithm that mines tree patterns in arbitrary graphs in incremental polynomial time. Homomorphism, however, entails two problems not found when using isomorphism: (1) two patterns of different size can be equivalent; (2) patterns of unbounded size can be frequent. In this paper we formalize these problems and study solutions that easily fit within our algorithm.
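
The polynomial-time support computation behind this approach can be illustrated for tree patterns: counting homomorphisms of a rooted tree into a graph reduces to a product-of-sums recursion over the tree. A minimal sketch (the counting primitive only, not the authors' incremental-polynomial mining algorithm):

```python
def count_tree_homomorphisms(tree_children, root, graph_adj, nodes):
    """Count homomorphisms from a rooted tree pattern into a graph.

    tree_children: dict mapping a pattern node to its list of children
    root: the root of the pattern tree
    graph_adj: dict mapping a graph node to the set of its neighbours
    nodes: iterable of all graph nodes
    """
    def hom(v, u):
        # number of homomorphisms of the subtree rooted at v, given v -> u
        total = 1
        for c in tree_children.get(v, []):
            # each child must map to some neighbour of u
            s = sum(hom(c, w) for w in graph_adj[u])
            if s == 0:
                return 0
            total *= s
        return total

    return sum(hom(root, u) for u in nodes)
```

Unlike isomorphism, two distinct pattern nodes may map to the same graph node, which is exactly what makes the recursion factor into independent per-child sums.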

【Keywords】:

24. Scalable Induction of Probabilistic Real-Time Automata Using Maximum Frequent Pattern Based Clustering.

【Paper Link】 【Pages】:272-283

【Authors】: Jana Schmidt ; Sonja Ansorge ; Stefan Kramer

【Abstract】: The paper presents a scalable method for learning probabilistic real-time automata (PRTAs), a new type of model that captures the dynamics of multi-dimensional event logs. In multi-dimensional event logs, events are described by several features instead of only one symbol. Moreover, it is not clear up front which events occur in an event log. The learning method to find a PRTA that models such an event log is based on the state merging of a prefix tree acceptor, which is guided by a clustering to determine the states of the automaton. To make the overall approach scalable, an online clustering method based on maximum frequent patterns (MFPs) is used. The approach is evaluated on a synthetic, a biological and a medical data set. The results show that the induction of automata using MFP-based clustering yields easy-to-understand and stable automata and, most importantly, scales to large data sets.

【Keywords】:

25. Class Relevant Pattern Mining in Output-Polynomial Time.

【Paper Link】 【Pages】:284-294

【Authors】: Henrik Grosskreutz

【Abstract】: The set of so-called relevant patterns is a subset of all itemsets particularly suited for pattern-based classification tasks. So far, no efficient algorithm has been developed for computing the set of relevant patterns: all existing solutions have a worst-case complexity which is exponential in the size of the input and output. In this paper, we investigate new properties of the relevant patterns and develop, thereupon, the first algorithm whose runtime is polynomial in the size of the input and output. As we show in the experimental section, this result is not only of theoretical interest but also of practical importance, often reducing the search space by orders of magnitude.

【Keywords】:

Session CP6 - Time Series and Sequence Analysis 5

26. Simplex Distributions for Embedding Data Matrices over Time.

【Paper Link】 【Pages】:295-306

【Authors】: Kristian Kersting ; Mirwaes Wahabzada ; Christoph Römer ; Christian Thurau ; Agim Ballvora ; Uwe Rascher ; Jens Leon ; Christian Bauckhage ; Lutz Plümer

【Abstract】: Early stress recognition is of great relevance in precision plant protection. Pre-symptomatic water stress detection is of particular interest, ultimately helping to meet the challenge of “How to feed a hungry world?”. Due to climate change, this is of considerable political and public interest. Due to its large-scale and temporal nature, e.g., when monitoring plants using hyper-spectral imaging, and the demand for physically meaningful results, it presents unique computational problems in scale and interpretability. However, big data matrices over time also arise in several other real-life applications such as stock market monitoring where a business sector is characterized by the ups and downs of each of its companies per year or topic monitoring of document collections. Therefore, we consider the general problem of embedding data matrices into Euclidean space over time without making any assumption on the generating distribution of each matrix. To do so, we represent all data samples by means of convex combinations of only a few extreme ones, computable in linear time. On the simplex spanned by the extremes, there are then natural candidates for distributions inducing distances between and in turn embeddings of the data matrices. We evaluate our method across several domains, including synthetic, text, and financial data as well as a large-scale dataset on water stress detection in plants with more than 3 billion matrix entries. The results demonstrate that the embeddings are meaningful and fast to compute. The stress detection results were validated by a domain expert and conform to existing plant physiological knowledge.

【Keywords】:

27. Transformation Based Ensembles for Time Series Classification.

【Paper Link】 【Pages】:307-318

【Authors】: Anthony Bagnall ; Luke M. Davis ; Jon Hills ; Jason Lines

【Abstract】: Until recently, the vast majority of data mining time series classification (TSC) research has focused on alternative distance measures for 1-Nearest Neighbour (1-NN) classifiers based on either the raw data, or on compressions or smoothing of the raw data. Despite the extensive evidence in favour of 1-NN classifiers with Euclidean or Dynamic Time Warping distance, there has also been a flurry of recent research publications proposing classification algorithms for TSC. Generally, these classifiers describe different ways of incorporating summary measures in the time domain into more complex classifiers. Our hypothesis is that the easiest way to gain improvement on TSC problems is simply to transform into an alternative data space where the discriminatory features are more easily detected. To test our hypothesis, we perform a range of benchmarking experiments in the time domain, before evaluating nearest neighbour classifiers on data transformed into the power spectrum, the autocorrelation function, and the principal component space. We demonstrate that on some problems there is dramatic improvement in the accuracy of classifiers built on the transformed data over classifiers built in the time domain, but that there is also a wide variance in accuracy for a particular classifier built on different data transforms. To overcome this variability, we propose a simple transformation based ensemble, then demonstrate that it improves performance and reduces the variability of classifiers built in the time domain only. Our advice to a practitioner with a real world TSC problem is to try transforms before developing a complex classifier; it is the easiest way to get a potentially large increase in accuracy, and may provide further insights into the underlying relationships that characterise the problem.
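
As a toy illustration of the idea (assumed helper functions, not the paper's exact experimental setup), one can build 1-NN classifiers on the raw series, the power spectrum, and the autocorrelation function, and combine them by majority vote:

```python
import numpy as np

def power_spectrum(x):
    # magnitude of the FFT: discards phase, keeps periodicity information
    return np.abs(np.fft.rfft(x - x.mean()))

def autocorrelation(x, max_lag):
    x = x - x.mean()
    denom = (x * x).sum()
    return np.array([(x[:-k] @ x[k:]) / denom for k in range(1, max_lag + 1)])

def one_nn(train_X, train_y, q):
    # 1-nearest-neighbour prediction under Euclidean distance
    d = [np.linalg.norm(t - q) for t in train_X]
    return train_y[int(np.argmin(d))]

def ensemble_predict(train, labels, query, transforms):
    # majority vote of 1-NN classifiers, one per data representation
    votes = [one_nn([f(s) for s in train], labels, f(query)) for f in transforms]
    return max(set(votes), key=votes.count)
```

Because the spectrum and the autocorrelation are phase-invariant, a phase-shifted query still lands near its own class in those spaces even when the time-domain distance is misleading.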

【Keywords】:

28. Mining Compressing Sequential Patterns.

【Paper Link】 【Pages】:319-330

【Authors】: Hoang Thanh Lam ; Fabian Moerchen ; Dmitriy Fradkin ; Toon Calders

【Abstract】: Compression based pattern mining has been successfully applied to many data mining tasks. We propose an approach based on the minimum description length principle to extract sequential patterns that compress a database of sequences well. We show that mining compressing patterns is NP-Hard and belongs to the class of inapproximable problems. We propose two heuristic algorithms for mining compressing patterns. The first uses a two-phase approach similar to Krimp for itemset data. To overcome the performance penalty of the required candidate generation, we propose GoKrimp, an effective greedy algorithm that directly mines compressing patterns. We conduct an empirical study on six real-life datasets to compare the proposed algorithms by run time, compressibility, and classification accuracy using the patterns found as features for SVM classifiers.

【Keywords】:

29. Beam Methods for the Profile Hidden Markov Model.

【Paper Link】 【Pages】:331-342

【Authors】: Samuel J. Blasiak ; Huzefa Rangwala ; Kathryn B. Laskey

【Abstract】: The Profile Hidden Markov Model (PHMM) is commonly used to represent biological sequences. We present a method for transforming the Profile HMM into an equivalent standard HMM where each transition is associated with a single emission. Using this transformation, we develop a beam method, which includes a novel variational adaptation of the infinite-HMM beam sampling technique, to create a fast inference algorithm. We evaluate our algorithm on both synthetic data and protein sequence datasets, showing that our beam method can lead to considerable improvements in runtime while maintaining the model's ability to concisely represent sequences.

【Keywords】:

30. Optimal Distance Estimation Between Compressed Data Series.

【Paper Link】 【Pages】:343-354

【Authors】: Nikolaos M. Freris ; Michail Vlachos ; Suleyman Serdar Kozat

【Abstract】: Most real-world data contain repeated or periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate complete orthogonal basis (e.g., Fourier, Wavelets, Karhunen-Loeve expansion or Principal Components). In the face of ever-increasing data repositories and given that most mining operations are distance-based, it is vital to perform accurate distance estimation directly on the compressed data. However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on the distance based on the available information. In particular, we consider the problem where a distinct set of coefficients is maintained for each sequence, and the L2-norm of the compression error is recorded. We establish the properties of optimal solutions, and leverage the theoretical analysis to develop a fast algorithm to obtain an exact solution to the problem. The suggested solution provides the tightest provable estimation of the L2-norm or the correlation, and executes at least two orders of magnitude faster than a numerical solution based on convex optimization. The contributions of this work extend beyond the purview of periodic data, as our methods are applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.
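
The setting can be made concrete with a deliberately simpler sketch: each series stores its largest Fourier coefficients plus the L2-norm of the discarded part, and the triangle inequality then yields valid lower and upper distance bounds (looser than the paper's provably tightest ones):

```python
import numpy as np

def compress(x, k):
    """Keep the k largest-magnitude FFT coefficients and the error norm."""
    X = np.fft.fft(x) / np.sqrt(len(x))   # unitary scaling so Parseval holds
    idx = np.argsort(np.abs(X))[-k:]
    kept = {int(i): X[i] for i in idx}
    err2 = (np.abs(X) ** 2).sum() - sum(abs(v) ** 2 for v in kept.values())
    return kept, np.sqrt(max(0.0, err2))

def distance_bounds(kept_x, err_x, kept_y, err_y):
    # exact contribution of coefficients kept by either series (a missing
    # coefficient is treated as zero; the stored error norms bound the rest)
    keys = set(kept_x) | set(kept_y)
    d_hat = np.sqrt(sum(abs(kept_x.get(i, 0) - kept_y.get(i, 0)) ** 2
                        for i in keys))
    lower = max(0.0, d_hat - err_x - err_y)
    upper = d_hat + err_x + err_y
    return lower, upper
```

Writing each series as its reconstruction plus a residual of known norm, the triangle inequality guarantees the true L2 distance always falls inside the returned interval.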

【Keywords】:

Session CP7 - Kernels and Classification 5

31. Multi-Objective Multi-Label Classification.

【Paper Link】 【Pages】:355-366

【Authors】: Chuan Shi ; Xiangnan Kong ; Philip S. Yu ; Bai Wang

【Abstract】: Multi-label classification refers to the task of predicting potentially multiple labels for a given instance. Conventional multi-label classification approaches focus on the single objective setting, where the learning algorithm optimizes over a single performance criterion (e.g. Ranking Loss) or a heuristic function. The basic assumption is that the optimization over one single objective can improve the overall performance of multi-label classification and meet the requirements of various applications. However, in many real applications, an optimal multi-label classifier may need to consider the tradeoffs among multiple conflicting objectives, such as minimizing Hamming Loss and maximizing Micro F1. In this paper, we study the problem of multi-objective multi-label classification and propose a novel solution (called Moml) to optimize over multiple objectives simultaneously. Note that optimization objectives may be conflicting, thus one cannot identify a single solution that is optimal on all objectives. Our Moml algorithm finds a set of non-dominated solutions which are optimal according to the different tradeoffs of the multiple objectives. So users can flexibly construct various combined predictive models from the solution set, which helps to provide more meaningful classification results in different application scenarios. Empirical studies on real-world tasks demonstrate that Moml can effectively boost the overall performance of multi-label classification, not limited to the optimization objectives.
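
The core selection step, filtering candidate solutions down to the non-dominated (Pareto-optimal) set, can be sketched as follows for loss vectors where every objective is minimized (toy data, not the paper's optimization procedure itself):

```python
def non_dominated(points):
    """Return the Pareto front of a list of loss tuples (all minimized).

    A point p is dominated if some other point q is at least as good on
    every objective and differs from p on at least one.
    """
    return [p for p in points
            if not any(q != p and all(qi <= pi for qi, pi in zip(q, p))
                       for q in points)]
```

Each surviving tuple represents a different tradeoff among the objectives, which is what lets a user pick the combined model suited to their application.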

【Keywords】:

32. Bayesian Supervised Multilabel Learning with Coupled Embedding and Classification.

【Paper Link】 【Pages】:367-378

【Authors】: Mehmet Gönen

【Abstract】: Coupled training of dimensionality reduction and classification has previously been proposed to improve the prediction performance for single-label problems. Following this line of research, in this paper, we introduce a novel Bayesian supervised multilabel learning method that combines linear dimensionality reduction with linear binary classification. We present a deterministic variational approximation approach to learn the proposed probabilistic model for multilabel classification. We perform experiments on four benchmark multilabel learning data sets by comparing our method with four baseline linear dimensionality reduction algorithms. Experiments show that the proposed approach achieves good performance values in terms of Hamming loss, macro F1, and micro F1 on held-out test data. The low-dimensional embeddings obtained by our method are also very useful for exploratory data analysis.

【Keywords】:

33. Subtree Replacement in Decision Tree Simplification.

【Paper Link】 【Pages】:379-390

【Authors】: Salvatore Ruggieri

【Abstract】: The current availability of efficient algorithms for decision tree induction makes intricate post-processing techniques worth investigating for both efficiency and effectiveness. We study the simplification operator of subtree replacement, also known as grafting, originally implemented in the C4.5 system. We present a parametric bottom-up algorithm integrating grafting with the standard pruning operator, and analyze its complexity in terms of the number of nodes visited. Immediate instances of the parametric algorithm include extensions of error based, reduced error, minimum error, and pessimistic error pruning. Experimental results show that the computational cost of grafting is paid off by statistically significant smaller trees without accuracy loss.

【Keywords】:

34. A Distributed Kernel Summation Framework for General-Dimension Machine Learning.

【Paper Link】 【Pages】:391-402

【Authors】: Dongryeol Lee ; Richard W. Vuduc ; Alexander G. Gray

【Abstract】: Kernel summations are a ubiquitous key computational bottleneck in many data analysis methods. In this paper, we attempt to marry, for the first time, the best relevant techniques in parallel computing, where kernel summations are in low dimensions, with the best general-dimension algorithms from the machine learning literature. We provide the first distributed implementation of a kernel summation framework that can utilize: 1) various types of deterministic and probabilistic approximations that may be suitable for low and high-dimensional problems with a large number of data points; 2) any multi-dimensional binary tree using both distributed memory and shared memory parallelism; 3) a dynamic load balancing scheme to adjust work imbalances during the computation. Our hybrid MPI/OpenMP codebase has wide applicability in providing a general framework to accelerate the computation of many popular machine learning methods. Our experiments show scalability results for kernel density estimation on a synthetic ten-dimensional dataset containing over one billion points and a subset of the Sloan Digital Sky Survey data on up to 6,144 cores.
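
The bottleneck in question is the plain double loop below; the paper's contribution lies in accelerating this sum with tree-based approximation and hybrid MPI/OpenMP parallelism, neither of which is shown in this naive sketch:

```python
import numpy as np

def kernel_summation(queries, refs, h):
    """Naive O(|Q| * |R|) Gaussian kernel summation (a kernel density estimate).

    queries, refs: arrays of shape (n_points, n_dims); h: the bandwidth.
    """
    out = np.empty(len(queries))
    for i, q in enumerate(queries):
        d2 = ((refs - q) ** 2).sum(axis=1)          # squared distances to all refs
        out[i] = np.exp(-d2 / (2.0 * h * h)).sum() / len(refs)
    return out
```

With a billion reference points, every query costs a full pass over the data, which is why deterministic and probabilistic approximations with error guarantees become essential.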

【Keywords】:

35. Kernelized Probabilistic Matrix Factorization: Exploiting Graphs and Side Information.

【Paper Link】 【Pages】:403-414

【Authors】: Tinghui Zhou ; Hanhuai Shan ; Arindam Banerjee ; Guillermo Sapiro

【Abstract】: We propose a new matrix completion algorithm—Kernelized Probabilistic Matrix Factorization (KPMF), which effectively incorporates external side information into the matrix factorization process. Unlike Probabilistic Matrix Factorization (PMF) [14], which assumes an independent latent vector for each row (and each column) with Gaussian priors, KPMF works with latent vectors spanning all rows (and columns) with Gaussian Process (GP) priors. Hence, KPMF explicitly captures the underlying (nonlinear) covariance structures across rows and columns. This crucial difference greatly boosts the performance of KPMF when appropriate side information, e.g., users' social network in recommender systems, is incorporated. Furthermore, GP priors allow the KPMF model to fill in a row that is entirely missing in the original matrix based on the side information alone, which is not feasible for the standard PMF formulation. In our paper, we mainly work on the matrix completion problem with a graph among the rows and/or columns as side information, but the proposed framework can be easily used with other types of side information as well. Finally, we demonstrate the efficacy of KPMF through two different applications: 1) recommender systems and 2) image restoration.

【Keywords】:

Session CP8 - Social Media and Graphs 5

36. On Dynamic Link Inference in Heterogeneous Networks.

【Paper Link】 【Pages】:415-426

【Authors】: Charu C. Aggarwal ; Yan Xie ; Philip S. Yu

【Abstract】: Network and linked data have become quite prevalent in recent years because of the ubiquity of the web and social media applications, which are inherently network oriented. Such networks are massive, dynamic, contain a lot of content, and may evolve over time in terms of the underlying structure. In this paper, we will study the problem of dynamic link inference in temporal and heterogeneous information networks. The problem of dynamic link inference is extremely challenging in massive and heterogeneous information networks because of the challenges associated with the dynamic nature of the network, and the different types of nodes and attributes in it. Both the topology and type information need to be used effectively for the link inference process. We propose an effective two-level scheme which makes efficient macro- and micro-decisions for combining structure and content in a dynamic and time-sensitive way. The time-sensitive nature of the links is leveraged in order to perform effective link prediction. We illustrate the effectiveness of our technique over a number of real data sets.

【Keywords】:

37. A Framework for the Evaluation and Management of Network Centrality.

【Paper Link】 【Pages】:427-438

【Authors】: Vatche Ishakian ; Dóra Erdös ; Evimaria Terzi ; Azer Bestavros

【Abstract】: Network-analysis literature is rich in node-centrality measures that quantify the centrality of a node as a function of the (shortest) paths of the network that go through it. Existing work focuses on defining instances of such measures and designing algorithms for the specific combinatorial problems that arise for each instance. In this work, we propose a unifying definition of centrality that subsumes all path-counting based centrality definitions: e.g., stress, betweenness or paths centrality. We also define a generic algorithm for computing this generalized centrality measure for every node and every group of nodes in the network. Next, we define two optimization problems: k-Group Centrality Maximization and k-Edge Centrality Boosting. In the former, the task is to identify the subset of k nodes that have the largest group centrality. In the latter, the goal is to identify up to k edges to add to the network so that the centrality of a node is maximized. We show that both of these problems can be solved efficiently for arbitrary centrality definitions using our general framework. In a thorough experimental evaluation we show the practical utility of our framework and the efficacy of our algorithms.

【Keywords】:

38. PICS: Parameter-free Identification of Cohesive Subgroups in Large Attributed Graphs.

【Paper Link】 【Pages】:439-450

【Authors】: Leman Akoglu ; Hanghang Tong ; Brendan Meeder ; Christos Faloutsos

【Abstract】: Given a graph with node attributes, how can we find meaningful patterns such as clusters, bridges, and outliers? Attributed graphs appear in real world in the form of social networks with user interests, gene interaction networks with gene expression information, phone call networks with customer demographics, and many others. In effect, we want to group the nodes into clusters with similar connectivity and homogeneous attributes. Most existing graph clustering algorithms either consider only the connectivity structure of the graph and ignore the node attributes, or require several user-defined parameters such as the number of clusters. We propose PICS, a novel, parameter-free method for mining attributed graphs. Two key advantages of our method are that (1) it requires no user-specified parameters such as the number of clusters and similarity functions, and (2) its running time scales linearly with total graph and attribute size. Our experiments show that PICS reveals meaningful and insightful patterns and outliers in both synthetic and real datasets, including call networks, political books, political blogs, and collections from Twitter and YouTube which have more than 70K nodes and 30K attributes.

【Keywords】:

39. Structural Analysis in Multi-Relational Social Networks.

【Paper Link】 【Pages】:451-462

【Authors】: Bing Tian Dai ; Freddy Chong Tat Chua ; Ee-Peng Lim

【Abstract】: Modern social networks often consist of multiple relations among individuals. Understanding the structure of such multi-relational networks is essential. In sociology, one way of structural analysis is to identify different positions and roles using blockmodels. In this paper, we generalize stochastic blockmodels to Generalized Stochastic Blockmodels (GSBM) for performing positional and role analysis on multi-relational networks. Our GSBM generalizes many different kinds of Multivariate Probability Distribution Function (MVPDF) to model different kinds of multi-relational networks. In particular, we propose to use multivariate Poisson distribution for multi-relational social networks. Our experiments show that GSBM is able to identify the structures for both synthetic and real world network data. These structures can further be used for predicting relationships between individuals.

【Keywords】:

40. Influence Blocking Maximization in Social Networks under the Competitive Linear Threshold Model.

【Paper Link】 【Pages】:463-474

【Authors】: Xinran He ; Guojie Song ; Wei Chen ; Qingye Jiang

【Abstract】: In many real-world situations, different and often opposite opinions, innovations, or products are competing with one another for their social influence in a networked society. In this paper, we study competitive influence propagation in social networks under the competitive linear threshold (CLT) model, an extension to the classic linear threshold model. Under the CLT model, we focus on the problem that one entity tries to block the influence propagation of its competing entity as much as possible by strategically selecting a number of seed nodes that could initiate its own influence propagation. We call this problem the influence blocking maximization (IBM) problem. We prove that the objective function of IBM in the CLT model is submodular, and thus a greedy algorithm could achieve a 1 − 1/e approximation ratio. However, the greedy algorithm requires Monte-Carlo simulations of competitive influence propagation, which makes the algorithm inefficient. We design an efficient algorithm CLDAG, which utilizes the properties of the CLT model, to address this issue. We conduct extensive simulations of CLDAG, the greedy algorithm, and other baseline algorithms on real-world and synthetic datasets. Our results show that CLDAG is able to provide the best accuracy, on par with the greedy algorithm and often better than other algorithms, while it is two orders of magnitude faster than the greedy algorithm.
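
The greedy baseline the paper compares against can be sketched generically: for any monotone submodular objective, repeatedly adding the element with the largest marginal gain achieves the 1 − 1/e guarantee. A minimal sketch with a hypothetical `value` oracle (in the IBM setting this would be a Monte-Carlo estimate of blocked influence, the expensive part CLDAG avoids):

```python
def greedy_seed_selection(candidates, k, value):
    """Greedy (1 - 1/e)-approximation for a monotone submodular set function.

    candidates: set of eligible seed nodes; k: number of seeds to pick;
    value(S): estimated objective for seed set S (hypothetical oracle).
    """
    seeds = set()
    for _ in range(k):
        base = value(seeds)
        # pick the candidate with the largest marginal gain over the current set
        gain = {v: value(seeds | {v}) - base for v in candidates - seeds}
        seeds.add(max(gain, key=gain.get))
    return seeds
```

Each round calls the oracle once per remaining candidate, which is exactly why simulation-based objectives make the plain greedy algorithm slow in practice.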

【Keywords】:

Session CP9 - Feature Selection, Networks and Prediction 5

41. Feature Selection over Distributed Data Streams through Convex Optimization.

【Paper Link】 【Pages】:475-484

【Authors】: Jacob Kogan

【Abstract】: Monitoring data streams in a distributed system has attracted considerable interest in recent years. The task of feature selection (e.g., by monitoring the information gain of various features) requires a very high communication overhead when addressed using straightforward centralized algorithms. While most of the existing algorithms deal with monitoring simple aggregated values such as frequency of occurrence of stream items, motivated by recent contributions based on geometric ideas, we present an alternative approach. The proposed approach enables monitoring values of an arbitrary threshold function over distributed data streams through constraints applied separately on each stream. We report numerical experiments on real-world data that detect instances where communication between nodes is required, and compare the approach and the results to those recently reported in the literature.

【Keywords】:

42. Feature Selection "Tomography" - Illustrating that Optimal Feature Filtering is Hopelessly Ungeneralizable.

【Paper Link】 【Pages】:485-493

【Authors】: George Forman

【Abstract】: Feature filtering methods are often used in text classification and other high-dimensional domains to quickly score each feature independently and pass only the best to the learning algorithm. The panoply of available methods grows over the years, with frequent research publications touting new functions that seem to yield superior learning: variants on Information Gain, Chi Squared, Mutual Information, and others. This slow generate-and-test search in the literature is usually counted as progress towards finding superior filter methods. But this is illusory. We provide a new empirical method to reveal the feature preference surface for a given dataset and classifier: cross-validating with an additional feature whose noise characteristics are controlled. By evaluating for every sensitivity and specificity level—though computationally expensive—we determine the expected benefit of adding a feature with a given characteristic. We desire features with high expected gain. Ideally this preference surface would be easily generalizable across many datasets and depend on just a few parameters. However, by visualizing the preference surface under different conditions while holding other factors constant, we demonstrate graphically that it depends on more factors than are ever considered in feature selection papers: training set size and class distribution, classifier model and parameters, and—even holding all of these constant—the individual target concept itself. Thus, the ongoing sequence of published papers in this area will not yield an optimal feature selection function of significant generality.
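
As a concrete instance of the filter scores under discussion, Information Gain for a binary feature against a binary class can be computed directly from the 2x2 contingency counts (the standard textbook formula, not code from the paper):

```python
import math

def entropy(pos, neg):
    """Shannon entropy (bits) of a two-outcome distribution given counts."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(tp, fp, fn, tn):
    """Score a binary feature: class entropy minus conditional entropy,
    from the counts (feature present/absent) x (class positive/negative)."""
    n = tp + fp + fn + tn
    prior = entropy(tp + fn, fp + tn)
    with_feature = (tp + fp) / n * entropy(tp, fp)
    without_feature = (fn + tn) / n * entropy(fn, tn)
    return prior - with_feature - without_feature
```

A perfectly predictive feature scores the full class entropy, while an independent feature scores zero; the paper's point is that ranking features by any such fixed function cannot be optimal across all datasets and classifiers.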

【Keywords】:

43. Sampling Strategies to Evaluate the Performance of Unknown Predictors.

【Paper Link】 【Pages】:494-505

【Authors】: Hamed Valizadegan ; Saeed Amizadeh ; Milos Hauskrecht

【Abstract】: The focus of this paper is on how to select a small sample of examples for labeling that can help us to evaluate many different classification models unknown at the time of sampling. We are particularly interested in studying the sampling strategies for problems in which the prevalence of the two classes is highly biased toward one of the classes. The evaluation measures of interest we want to estimate as accurately as possible are those obtained from the contingency table. We provide a careful theoretical analysis on sensitivity, specificity, and precision and show how sampling strategies should be adapted to the rate of skewness in data in order to effectively compute the three aforementioned evaluation measures.

【Keywords】:

44. A Bayesian Markov-switching Model for Sparse Dynamic Network Estimation.

【Paper Link】 【Pages】:506-515

【Authors】: Huijing Jiang ; Aurelie C. Lozano ; Fei Liu

【Abstract】: Inferring Dynamic Bayesian Networks (DBNs) from multivariate time series data is a key step towards the understanding of complex systems as it reveals important dependency relationships underlying such systems. Most of the traditional approaches assume a “static” DBN. Yet in many relevant applications, such as those arising in biology and social sciences, the dependency structures may vary over time. In this paper, we introduce a sparse Markov-switching vector autoregressive model to capture the structural changes in the dependency relationships over time. Our approach accounts for such structural changes via a set of latent state variables, which are modeled by a discrete-time discrete-state Markov process. Assuming that the underlying structures are sparse, we estimate the networks at each state through the hierarchical Bayesian group Lasso, so as to efficiently capture dependencies with lags greater than one time unit. For computation, we develop an efficient algorithm based on the Expectation-Maximization method. We demonstrate the strength of our approach through simulation studies and a real data set concerning climate change.

【Keywords】:

45. Learning Hierarchical Relationships among Partially Ordered Objects with Heterogeneous Attributes and Links.

【Paper Link】 【Pages】:516-527

【Authors】: Chi Wang ; Jiawei Han ; Qi Li ; Xiang Li ; Wen-Pin Lin ; Heng Ji

【Abstract】: Objects linking with many other objects in an information network may imply various semantic relationships. Uncovering such knowledge is essential for role discovery, data cleaning, and better organization of information networks, especially when the semantically meaningful relationships are hidden or mingled with noisy links and attributes. In this paper we study a generic form of relationship along which objects can form a treelike structure, a pervasive structure in various domains. We formalize the problem of uncovering hierarchical relationships in a supervised setting. In general, local features of object attributes, their interaction patterns, as well as rules and constraints for knowledge propagation can be used to infer such relationships. Existing approaches, designed for specific applications, either cannot handle dependency rules together with local features, or cannot leverage labeled data to differentiate their importance. In this study, we propose a discriminative undirected graphical model. It integrates a wide range of features and rules by defining potential functions with simple forms. These functions are also summarized and categorized. Our experiments on three quite different domains demonstrate how to apply the method to encode domain knowledge. The efficacy is measured with both traditional and our newly designed metrics in the evaluation of discovered tree structures.

【Keywords】:

Session CP10 - Transfer Learning 4

46. Transfer Learning of Distance Metrics by Cross-Domain Metric Sampling across Heterogeneous Spaces.

【Paper Link】 【Pages】:528-539

【Authors】: Guo-Jun Qi ; Charu C. Aggarwal ; Thomas S. Huang

【Abstract】: The problem of transfer learning has recently been of great interest in a variety of machine learning applications. In this paper, we examine a new angle to the transfer learning problem, where we examine the problem of distance function learning. Specifically, we focus on the problem of how our knowledge of distance functions in one domain can be transferred to a new domain. A good semantic understanding of the feature space is critical in providing the domain specific understanding for setting up good distance functions. Unfortunately, not all domains have feature representations which are equally interpretable. For example, in some domains such as text, the semantics of the feature representation are clear, as a result of which it is easy for a domain expert to set up distance functions for specific kinds of semantics. In the case of image data, the features are semantically harder to interpret, and it is harder to set up distance functions, especially for particular semantic criteria. In this paper, we focus on the problem of transfer learning as a way to close the semantic gap between different domains, and show how to use correspondence information between two domains in order to set up distance functions for the semantically more challenging domain.

【Keywords】:

47. Dual Transfer Learning.

【Paper Link】 【Pages】:540-551

【Authors】: Mingsheng Long ; Jianmin Wang ; Guiguang Ding ; Wei Cheng ; Xiang Zhang ; Wei Wang

【Abstract】: Transfer learning aims to leverage the knowledge in the source domain to facilitate the learning tasks in the target domain. It has attracted extensive research interest recently due to its effectiveness in a wide range of applications. The general idea of the existing methods is to utilize the common latent structure shared across domains as the bridge for knowledge transfer. These methods usually model the common latent structure by using either the marginal distribution or the conditional distribution. However, without exploring the duality between these two distributions, these single-bridge methods may not achieve optimal capability of knowledge transfer. In this paper, we propose a novel approach, Dual Transfer Learning (DTL), which simultaneously learns the marginal and conditional distributions, and exploits the duality between them in a principled way. The key idea behind DTL is that learning one distribution can help to learn the other. This duality property leads to mutual reinforcement when adapting both distributions across domains to transfer knowledge. The proposed method is formulated as an optimization problem based on joint nonnegative matrix tri-factorization (NMTF). The two distributions are learned from the decomposed latent factors that exhibit the duality property. An efficient alternating minimization algorithm is developed to solve the optimization problem with a convergence guarantee. Extensive experimental results demonstrate that DTL is more effective than alternative transfer learning methods.
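
The nonnegative matrix tri-factorization (NMTF) building block of DTL can be illustrated with a minimal single-matrix sketch using standard multiplicative updates. This is a generic NMTF, not the authors' joint dual formulation; the function name, hyperparameters, and toy data are all illustrative.

```python
import numpy as np

def nmtf(X, k, l, iters=200, eps=1e-9):
    """Nonnegative matrix tri-factorization X ~= F @ S @ G.T via
    multiplicative updates. F: n x k, S: k x l, G: m x l."""
    rng = np.random.default_rng(0)
    n, m = X.shape
    F = rng.random((n, k))
    S = rng.random((k, l))
    G = rng.random((m, l))
    for _ in range(iters):
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
    return F, S, G

# toy check: the reconstruction error falls well below the norm of X
X = np.random.default_rng(1).random((20, 15))
F, S, G = nmtf(X, 4, 3)
err = np.linalg.norm(X - F @ S @ G.T)
```

In DTL, two such factorizations (for the marginal and conditional distributions) share latent factors, which is where the mutual reinforcement comes from.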

【Keywords】:

48. Transfer Significant Subgraphs across Graph Databases.

【Paper Link】 【Pages】:552-563

【Authors】: Xiaoxiao Shi ; Xiangnan Kong ; Philip S. Yu

【Abstract】: A key step of graph classification is to identify informative subgraphs that encode label information. For instance, in drug efficacy prediction, the drugs (chemical compounds) effective against the same disease usually contain similar chemical subgraphs that are effective in controlling the disease. One can then use such chemical subgraphs to identify effective drugs. We call these subgraphs significant subgraphs. In this paper, the aim is to utilize the significant subgraphs from related graph datasets to help label graphs of the target dataset. For example, we utilize breast cancer drug data, and transfer the anti-cancer subgraphs to help label another set of drug data against lung cancer. To do so, we propose a Bayesian-based transfer learning model. The key idea is to first evaluate the similarity between the target and source datasets by estimating the degree to which they share their significant subgraphs. This dataset similarity is then used to judiciously select significant subgraphs from datasets similar (related) to the target dataset. An optimization problem is devised to maximize the likelihood that the selected subgraphs are significant in the target dataset. The objective function is further proven to have the antimonotone property, which can help prune the search space significantly. Sixteen sets of experiments show that the proposed algorithm can effectively reduce the error rates by as much as 40%. More importantly, it is 10 times faster than the comparison models, which include unsupervised and supervised significant subgraph mining algorithms.

【Keywords】:

49. Transfer Topic Modeling with Ease and Scalability.

【Paper Link】 【Pages】:564-575

【Authors】: Jeon-Hyung Kang ; Jun Ma ; Yan Liu

【Abstract】: The increasing volume of short texts generated on social media sites, such as Twitter or Facebook, creates a great demand for effective and efficient topic modeling approaches. While latent Dirichlet allocation (LDA) can be applied, it is not optimal due to its weakness in handling short texts with fast-changing topics and scalability concerns. In this paper, we propose a transfer learning approach that utilizes abundant labeled documents from other domains (such as Yahoo! News or Wikipedia) to improve topic modeling, with better model fitting and result interpretation. Specifically, we develop Transfer Hierarchical LDA (thLDA) model, which incorporates the label information from other domains via informative priors. In addition, we develop a parallel implementation of our model for large-scale applications. We demonstrate the effectiveness of our thLDA model on both a microblogging dataset and standard text collections including AP and RCV1 datasets.

【Keywords】:

Session CP11 - Applications - Healthcare and Networks 4

50. SOR: Scalable Orthogonal Regression for Low-Redundancy Feature Selection and its Healthcare Applications.

【Paper Link】 【Pages】:576-587

【Authors】: Dijun Luo ; Fei Wang ; Jimeng Sun ; Marianthi Markatou ; Jianying Hu ; Shahram Ebadollahi

【Abstract】: As more clinical information of increasing diversity becomes available for analysis, a large number of features can be constructed and leveraged for predictive modeling. Feature selection is a classic analytic component that faces new challenges due to these new applications: How to handle a diverse set of high-dimensional features? How to select features with high predictive power but low redundant information? How to design methods that can select globally optimal features with a theoretical guarantee? How to incorporate and extend an existing knowledge-driven approach? In this paper, we present Scalable Orthogonal Regression (SOR), an optimization-based feature selection method with the following novelties: 1) Scalability: SOR achieves nearly linear scale-up with respect to the number of input features and the number of samples; 2) Optimality: SOR is formulated as an alternating convex optimization problem with theoretical convergence and a global optimality guarantee; 3) Low redundancy: thanks to the orthogonality objective, SOR is designed specifically to select less redundant features without sacrificing quality; 4) Extendability: SOR can enhance an existing set of preselected features by adding additional features that complement the existing feature set while still having strong predictive power. We present evaluation results showing that SOR consistently outperforms state-of-the-art feature selection methods in a range of quality metrics on several real-world data sets. We demonstrate a case study of a large-scale clinical application for predicting early onset of Heart Failure (HF) using real Electronic Health Records (EHR) data of over 10K patients spanning over 7 years. Leveraging SOR, we are able to construct accurate and robust predictive models and derive potential clinical insights.

【Keywords】:

51. Mining Massive Archives of Mice Sounds with Symbolized Representations.

【Paper Link】 【Pages】:588-599

【Authors】: Jesin Zakaria ; Sarah Rotschafer ; Abdullah Mueen ; Khaleel Razak ; Eamonn J. Keogh

【Abstract】: Many animals produce long sequences of vocalizations best described as “songs.” In some animals, such as crickets and frogs, these songs are relatively simple and repetitive chirps or trills. However, animals as diverse as whales, bats, birds and even the humble mice considered here produce intricate and complex songs. These songs are worthy of study in their own right. For example, the study of bird songs has helped to cast light on various questions in the nature vs. nurture debate. However, there is a particular reason why the study of mice songs can benefit mankind. The house mouse (Mus musculus) has long been an important model organism in biology and medicine, and it is by far the most commonly used genetically altered laboratory mammal to address human diseases. While there have been significant recent efforts to analyze mice songs, advances in sensor technology have created a situation where our ability to collect data far outstrips our ability to analyze it. In this work we argue that the time is ripe for archives of mice songs to fall into the purview of data mining. We show a novel technique for mining mice vocalizations directly in the visual (spectrogram) space that practitioners currently use. Working in this space allows us to bring an arsenal of data mining tools to bear on this important domain, including similarity search, classification, motif discovery and contrast set mining.

【Keywords】:

52. IntruMine: Mining Intruders in Untrustworthy Data of Cyber-physical Systems.

【Paper Link】 【Pages】:600-611

【Authors】: Lu An Tang ; Quanquan Gu ; Xiao Yu ; Jiawei Han ; Thomas F. La Porta ; Alice Leung ; Tarek F. Abdelzaher ; Lance M. Kaplan

【Abstract】: A Cyber-Physical System (CPS) integrates physical (i.e., sensor) devices with cyber (i.e., informational) components to form a situation-aware system that responds intelligently to dynamic changes in the real world. It has wide applications in scenarios such as traffic control, environment monitoring, and battlefield surveillance. This study investigates the specific problem of intruder mining in CPS: with a large number of sensors deployed in a designated area, the task is the real-time detection of intruders who enter the area, based on untrustworthy data. We propose a method called IntruMine to detect and verify the intruders. IntruMine constructs monitoring graphs to model the relationships between sensors and possible intruders, and computes the position and energy of each intruder from the link information in these monitoring graphs. Finally, a confidence rating is calculated for each potential detection, reducing false positives in the results. IntruMine is a generalized approach: two classical methods of intruder detection can be seen as special cases of IntruMine under certain conditions. We conduct extensive experiments to evaluate the performance of IntruMine on both synthetic and real datasets, and the experimental results show that IntruMine has better effectiveness and efficiency than existing methods.

【Keywords】:

53. Robust Reputation-Based Ranking on Bipartite Rating Networks.

【Paper Link】 【Pages】:612-623

【Authors】: Rong-Hua Li ; Jeffrey Xu Yu ; Xin Huang ; Hong Cheng

【Abstract】: With the growth of the Internet and e-commerce, bipartite rating networks are ubiquitous. In such bipartite rating networks, there exist two types of entities: users and objects, where users give ratings to objects. A fundamental problem in such networks is how to rank the objects by users' ratings. Although it has been extensively studied in the past decade, the existing algorithms either cannot guarantee convergence or are not robust to spammers. In this paper, we propose six new reputation-based algorithms, where a user's reputation is determined by the aggregated difference between the user's ratings and the corresponding objects' rankings. We prove that all of our algorithms converge to a unique fixed point. The time and space complexity of our algorithms are linear w.r.t. the size of the graph, so they scale to large datasets. Moreover, our algorithms are robust to spamming users. We evaluate our algorithms using three real datasets. The experimental results confirm the effectiveness, efficiency, and robustness of our algorithms.
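
One round-trip of the reputation idea, ranking objects by reputation-weighted ratings and then shrinking each user's reputation by their aggregated rating error, can be sketched as follows. This is a simplified illustrative variant, not one of the paper's six algorithms; the damping factor `lam` and the clipping are assumptions.

```python
import numpy as np

def reputation_rank(R, mask, lam=0.3, iters=50):
    """Alternate between ranking objects by reputation-weighted
    ratings and updating user reputations from rating errors.
    R: users x objects ratings in [0, 1]; mask: 1 where rated."""
    c = np.ones(R.shape[0])                                 # user reputations
    for _ in range(iters):
        w = mask * c[:, None]
        r = (w * R).sum(0) / np.maximum(w.sum(0), 1e-12)    # object rankings
        err = (mask * np.abs(R - r)).sum(1) / np.maximum(mask.sum(1), 1)
        c = np.clip(1.0 - lam * err, 0.0, 1.0)              # penalize deviation
    return r, c

# two consistent users and one spammer who rates everything 1.0
R = np.array([[0.9, 0.1, 0.5],
              [0.8, 0.2, 0.4],
              [1.0, 1.0, 1.0]])
mask = np.ones_like(R)
r, c = reputation_rank(R, mask)
```

Because the spammer's ratings deviate most from the consensus rankings, their reputation, and hence their influence on the rankings, shrinks over the iterations.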

【Keywords】:

Short Presentations 45

54. Event Detection in Social Streams.

【Paper Link】 【Pages】:624-635

【Authors】: Charu C. Aggarwal ; Karthik Subbian

【Abstract】: Social networks generate a large amount of text content over time because of continuous interaction between participants. The mining of such social streams is more challenging than traditional text streams, because of the presence of both text content and implicit network structure within the stream. The problem of event detection is also closely related to clustering, because the events can only be inferred from aggregate trend changes in the stream. In this paper, we will study the two related problems of clustering and event detection in social streams. We will study both the supervised and unsupervised case for the event detection problem. We present experimental results illustrating the effectiveness of incorporating network structure in event discovery over purely content-based methods.

【Keywords】:

55. On Influential Node Discovery in Dynamic Social Networks.

【Paper Link】 【Pages】:636-647

【Authors】: Charu C. Aggarwal ; Shuyang Lin ; Philip S. Yu

【Abstract】: The problem of maximizing influence spread has been widely studied in social networks, because of its tremendous number of applications in determining critical points in a social network for information dissemination. All the techniques proposed in the literature are inherently static in nature, designed for social networks with a fixed set of links. However, many forms of social interactions are transient in nature, with relatively short periods of interaction. Any influence spread may happen only during the period of interaction, and the probability of spread is a function of the corresponding interaction time. Furthermore, such interactions are quite fluid and evolving, as a result of which the topology of the underlying network may change rapidly, as new interactions form and others terminate. In such cases, it may be desirable to determine the influential nodes based on the dynamic interaction patterns. Alternatively, one may wish to discover the most likely starting points for a given infection pattern. We propose methods that can be used both for optimizing information spread and for backward tracing of the source of influence spread. We present experimental results illustrating the effectiveness of our approach on a number of real data sets.

【Keywords】:

56. Query-based Biclustering using Formal Concept Analysis.

【Paper Link】 【Pages】:648-659

【Authors】: Faris Alqadah ; Joel S. Bader ; Rajul Anand ; Chandan K. Reddy

【Abstract】: Biclustering methods have proven to be critical tools in the exploratory analysis of high-dimensional data including information networks, microarray experiments, and bag-of-words data. However, most biclustering methods fail to answer specific questions of interest and do not incorporate prior knowledge and expertise from the user. To this end, query-based biclustering algorithms, recently developed in the context of microarray data, utilize a set of seed genes provided by the user, which are assumed to be tightly co-expressed or functionally related, to prune the search space and guide the biclustering algorithm. In this paper, a novel Query-Based Bi-Clustering algorithm, QBBC, is proposed through a new formulation that combines the advantages of low-variance biclustering techniques and Formal Concept Analysis. We prove that statistical dispersion measures that are order-preserving induce an ordering on the set of biclusters in the data. In turn, this ordering is exploited to form query-based biclusters in an efficient manner. Our novel approach provides a mechanism to generalize query-based biclustering to sparse high-dimensional data such as information networks and bag-of-words data. Moreover, the proposed framework takes a local approach to query-based biclustering, as opposed to the global approaches that previous algorithms have employed. Experimental results indicate that this local approach often produces higher-quality and more precise biclusters than the state-of-the-art query-based methods. In addition, our results on the performance evaluation illustrate the efficiency and scalability of QBBC compared to full biclustering approaches and other existing query-based approaches.

【Keywords】:

57. Granger Causality Analysis in Irregular Time Series.

【Paper Link】 【Pages】:660-671

【Authors】: Mohammad Taha Bahadori ; Yan Liu

【Abstract】: Learning temporal causal structures between time series is one of the key tools for analyzing time series data. In many real-world applications, we are confronted with irregular time series, whose observations are not sampled at equally spaced time stamps. The irregularity in sampling intervals violates the basic assumptions behind many models for structure learning. In this paper, we propose a nonparametric generalization of the Granger graphical models called Generalized Lasso Granger (GLG) to uncover the temporal dependencies from irregular time series. Via theoretical analysis and extensive experiments, we verify the effectiveness of our model. Furthermore, we apply GLG to a dataset of δ18O oxygen-isotope records in Asia and achieve promising results in discovering moisture transportation patterns over an 800-year period.
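
For regularly sampled series, the Lasso-Granger idea that GLG generalizes can be sketched as an L1-regularized regression of a target on lagged values (GLG's kernel-weighted handling of irregular time stamps is omitted here). All names, parameter values, and the toy data are illustrative.

```python
import numpy as np

def lasso_ista(X, y, lam=0.05, iters=500):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1 via ISTA."""
    w = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(iters):
        z = w - X.T @ (X @ w - y) / L      # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return w

def lasso_granger(series, target, maxlag=2, lam=0.05):
    """Regress target[t] on all lagged series values; a nonzero
    coefficient for series j at lag k suggests Granger causality."""
    T, d = series.shape
    X = np.hstack([series[maxlag - k:T - k] for k in range(1, maxlag + 1)])
    w = lasso_ista(X, target[maxlag:], lam)
    return w.reshape(maxlag, d)            # row k-1: coefficients at lag k

# toy example: series 0 drives the target at lag 1, series 1 is pure noise
rng = np.random.default_rng(0)
s = rng.standard_normal((300, 2))
tgt = np.zeros(300)
tgt[1:] = 0.9 * s[:-1, 0] + 0.1 * rng.standard_normal(299)
W = lasso_granger(s, tgt)
```

The sparsity pattern of `W` then serves as the estimated temporal dependency structure.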

【Keywords】:

58. Clustering Based on Yukawa Potential.

【Paper Link】 【Pages】:672-683

【Authors】: Xue Bai ; Zezhen Lin ; Yun Xiong ; Yangyong Zhu

【Abstract】: Clustering is a common natural phenomenon. In the microcosm, the nucleus is formed by the aggregation of nuclear particles through strong interactions, which can be described by the Yukawa potential. Inspired by this clustering phenomenon, we propose a novel dynamic clustering algorithm based on the Yukawa potential (Yupc). Each data object is regarded as a particle following the basic rules of movement in the Yukawa potential field. After several time intervals, similar objects gradually aggregate together and form clear clusters. Yupc neither relies on any assumption about the data distribution, nor prescribes any specific number of clusters. Natural clusters of different shapes, densities, sizes, numbers, and distributions can be detected by Yupc, reflecting the intrinsic structure of the original data set. In addition, we propose a framework to automatically find appropriate parameters for Yupc. Experiments performed on synthetic and real-world data show that this approach outperforms existing algorithms, especially on data sets with arbitrary kinds of clusters.

【Keywords】:

59. Deterministic CUR for Improved Large-Scale Data Analysis: An Empirical Study.

【Paper Link】 【Pages】:684-695

【Authors】: Christian Thurau ; Kristian Kersting ; Christian Bauckhage

【Abstract】: Low-rank approximations computed from selected rows and columns of a given data matrix have attracted considerable attention lately. They have been proposed as an alternative to the SVD because they naturally lead to interpretable decompositions, which have been shown to be successful in applications such as fraud detection, fMRI segmentation, and collaborative filtering. The CUR decomposition of large matrices, for example, samples rows and columns according to a probability distribution that depends on the Euclidean norm of rows or columns or on other measures of statistical leverage. At the same time, there are various deterministic approaches that do not resort to sampling and were found to often yield factorizations of superior quality with respect to reconstruction accuracy. However, these are hardly applicable to large matrices as they typically suffer from high computational costs. Consequently, many practitioners in the field of data mining have abandoned deterministic approaches in favor of randomized ones when dealing with today's large-scale data sets. In this paper, we empirically disprove this prejudice. We do so by introducing a novel, linear-time, deterministic CUR approach that adopts the recently introduced Simplex Volume Maximization approach for column selection. The latter has already been proven successful for NMF-like decompositions of matrices with billions of entries. Our exhaustive empirical study on more than 30 synthetic and real-world data sets demonstrates that it is also beneficial for CUR-like decompositions. Compared to other deterministic CUR-like methods, it provides comparable reconstruction quality but operates much faster, so that it easily scales to matrices with billions of elements. Compared to sampling-based methods, it provides competitive reconstruction quality while staying in the same run-time complexity class.
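
The CUR form itself is easy to state: with chosen column and row submatrices C and R, the mixing matrix is U = C⁺AR⁺. The sketch below uses a naive deterministic largest-norm selection in place of Simplex Volume Maximization, purely for illustration.

```python
import numpy as np

def cur(A, c, r):
    """CUR decomposition A ~= C @ U @ R with a naive deterministic
    selector: keep the c columns / r rows of largest Euclidean norm.
    (The paper's Simplex Volume Maximization is a stronger selector.)"""
    J = np.argsort(-np.linalg.norm(A, axis=0))[:c]   # column indices
    I = np.argsort(-np.linalg.norm(A, axis=1))[:r]   # row indices
    C, R = A[:, J], A[I, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# an exactly rank-2 matrix is reconstructed exactly from 2 columns / 2 rows
rng = np.random.default_rng(2)
A = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
C, U, R = cur(A, 2, 2)
```

When the selected columns and rows span the column and row spaces of A (as in this rank-2 example), the reconstruction is exact; the quality of a selector is how close it gets for matrices of higher effective rank.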

【Keywords】:

60. Combining Active Learning and Dynamic Dimensionality Reduction.

【Paper Link】 【Pages】:696-707

【Authors】: Mustafa Bilgic

【Abstract】: To date, many active learning techniques have been developed for acquiring labels when training data is limited. However, an important aspect of the problem has often been neglected or just mentioned in passing: the curse of dimensionality. Yet, the curse of dimensionality poses even greater challenges in the case of limited data, which is precisely the setup for active learning. Reducing the dimensions is not a trivial task, however, as the correct number of dimensions depends on a number of factors including the training data size, the number of classes, the discriminative power of the features, and the underlying classification model. Moreover, active learning is typically applied in an iterative manner where the number of labels is smaller in the earlier iterations compared to the later ones. We propose an adaptive dimensionality reduction technique that determines the appropriate number of dimensions for each active learning iteration, utilizing the labeled and unlabeled data effectively to learn more accurate models. Extensive experiments comparing various approaches and parameter settings show that the proposed method improves performance drastically on three real-world text classification tasks.

【Keywords】:

【Paper Link】 【Pages】:708-719

【Authors】: Jidong Chen ; Wentao Wu ; Hang Guo ; Wei Wang

【Abstract】: With the fast growth of disk capacity in personal computers, keyword search over personal data (a.k.a. desktop search) is becoming increasingly important. Nonetheless, desktop search has been shown to be more challenging than traditional Web search. Modern commercial Web search engines heavily rely on structural information (i.e., hyperlinks between Web pages) to rank their search results. However, such information is not available in the setting of desktop search. Therefore, state-of-the-art desktop search systems such as Google Desktop Search usually leverage pure text-based ranking approaches (e.g., TF-IDF), which often fail to give promising rankings due to the misinterpretation of user intention. We observed that in desktop search, the semantics of keyword queries are often context-aware, i.e., they are related to the current activity state (e.g., writing a paper, navigating a website, etc.) of the user. In this paper, we present a novel context-aware search framework that takes this activity information into consideration. Specifically, we use a Hidden Markov Model (HMM) to capture the relationships between the user's access actions (e.g., opening/closing files, sending/receiving emails, etc.) and activity states. The model is learned from the user's past access history and is used to predict the user's current activity upon the submission of a keyword query. We further propose a ranking scheme that incorporates this predicted context information. Experimental evaluation demonstrates both the effectiveness of the proposed context-aware search method and the enhancement to the user's search experience.

【Keywords】:

62. Mining Social Dependencies in Dynamic Interaction Networks.

【Paper Link】 【Pages】:720-731

【Authors】: Freddy Chong Tat Chua ; Hady Wirawan Lauw ; Ee-Peng Lim

【Abstract】: User-to-user interactions have become ubiquitous in Web 2.0. Users exchange emails, post on newsgroups, tag web pages, co-author papers, etc. Through these interactions, users co-produce or co-adopt content items (e.g., words in emails, tags in social bookmarking sites). We model such dynamic interactions as a user interaction network, which relates users, interactions, and content items over time. After some interactions, a user may produce content that is more similar to those produced by other users previously. We term this effect social dependency, and we seek to mine from such networks the degree to which a user may be socially dependent on another user over time. We propose a Decay Topic Model to model the evolution of a user's preferences for content items at the topic level, as well as a Social Dependency Metric that quantifies the extent of social dependency based on interactions and content changes. Our experiments on two user interaction networks induced from real-life datasets show the effectiveness of our approach.

【Keywords】:

63. Detecting Irregularly Shaped Significant Spatial and Spatio-Temporal Clusters.

【Paper Link】 【Pages】:732-743

【Authors】: Weishan Dong ; Xin Zhang ; Li Li ; Changhua Sun ; Lei Shi ; Wei Sun

【Abstract】: Detecting significant overdensity or underdensity clusters in spatio-temporal data is critical for many real-world applications. Most existing approaches are designed to deal with regularly shaped clusters such as circular, elliptic, and rectangular ones, but cannot work well on irregularly shaped clusters. In this paper, we propose GridScan, a grid-based approach for detecting irregularly shaped spatial clusters. In GridScan, a cluster is asymptotically described by a set of connected grid cells and is computed by a fast greedy region-growing algorithm that merges clusters as it proceeds. The time complexity of GridScan is linear in the number of grid cells, making it scalable to very large datasets. A prospective spatio-temporal cluster detection approach, GridScan-Pro, is also proposed by extending GridScan. Experiments and a case study in an epidemic scenario demonstrate that our approaches greatly outperform existing ones in terms of accuracy, efficiency, and scalability.

【Keywords】:

64. Contextual Collaborative Filtering via Hierarchical Matrix Factorization.

【Paper Link】 【Pages】:744-755

【Authors】: ErHeng Zhong ; Wei Fan ; Qiang Yang

【Abstract】: Matrix factorization (MF) has been demonstrated to be one of the most competitive techniques for collaborative filtering. However, state-of-the-art MF methods do not consider contextual information, where ratings can be generated under different environments. For example, users select items under various situations, such as happy mood vs. sad, mobile vs. stationary, movies vs. books, etc. Under different contexts, the preferences of users are inherently different. The problem is that MF methods uniformly decompose the rating matrix, and thus are unable to factorize for different contexts. To amend this problem and improve recommendation accuracy, we introduce a “hierarchical” factorization model that considers the local context when performing matrix factorization. The intuition is that, as ratings are generated in heterogeneous environments, certain user and item pairs tend to be more similar to each other than others, and hence ought to receive more collaborative information from each other. To take the contextual information into consideration, the proposed “contextual collaborative filtering” approach splits the rating matrix hierarchically by grouping similar users and items together, and factorizes each sub-matrix locally under different contexts. By building an ensemble model, the approach further avoids over-fitting with less parameter tuning. We analyze and demonstrate that the proposed method is a model-averaging gradient boosting model, and that its error rate can be bounded. Experimental results show that it outperforms three state-of-the-art algorithms on a number of real-world datasets (MovieLens, Netflix, etc.). The source code and datasets are available for download.

【Keywords】:

65. Active Learning with Monotonicity Constraints.

【Paper Link】 【Pages】:756-767

【Authors】: Nicola Barile ; Ad Feelders

【Abstract】: In many applications of data mining it is known beforehand that the response variable should be increasing (or decreasing) in the attributes. We propose two algorithms to exploit such monotonicity constraints for active learning in ordinal classification in two different settings. The basis of our approach is the observation that if the class label of an object is given, then the monotonicity constraints may allow the labels of other objects to be inferred. For instance, from knowing that loan applicant a is rejected, it can be concluded that all applicants that score worse than a on all criteria should be rejected as well. We propose two heuristics to select good query points. These heuristics make a selection based on a point's potential to determine the labels of other points. The algorithms, each implemented with the proposed heuristics, are evaluated on artificial and real data sets to study their performance. We conclude that exploitation of monotonicity constraints can be very beneficial in active learning.
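
The label-inference step in the loan example above can be sketched directly. This single-pass propagation is illustrative only: the function names and two-criterion toy data are assumptions, and a full implementation would iterate the pass and add the query-selection heuristics the paper proposes.

```python
import numpy as np

def dominates(a, b):
    """a scores at least as well as b on every criterion."""
    return bool(np.all(a >= b))

def infer_labels(X, labeled):
    """Single-pass label propagation under monotonicity: a rejection
    forces rejection of every dominated point; an acceptance forces
    acceptance of every dominating point. labeled: {index: 0 or 1}."""
    inferred = dict(labeled)
    for i, y in labeled.items():
        for j in range(len(X)):
            if j in inferred:
                continue
            if y == 0 and dominates(X[i], X[j]):
                inferred[j] = 0        # j scores worse than a rejected case
            elif y == 1 and dominates(X[j], X[i]):
                inferred[j] = 1        # j scores better than an accepted case
    return inferred

# loan applicants scored on (income, credit history); applicant 0 is rejected
X = np.array([[3, 4], [2, 2], [5, 5], [1, 4]])
labels = infer_labels(X, {0: 0})
```

A good query point is then one whose label, once obtained, would force the labels of many other points, which is exactly what the proposed heuristics estimate.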

【Keywords】:

66. Pseudo Cold Start Link Prediction with Multiple Sources in Social Networks.

【Paper Link】 【Pages】:768-779

【Authors】: Liang Ge ; Aidong Zhang

【Abstract】: Link prediction is an important task in social networks and data mining for understanding the mechanisms by which social networks form and evolve. Most link prediction research assumes that either a snapshot of the social network or a social network with some missing links is available, and therefore approaches the problem by exploring the topological structure of the network using only one source of information. However, in many application domains, a number of auxiliary sources of information are available in addition to the social network of interest. In this work, we introduce pseudo cold start link prediction with multiple sources as the problem of predicting the structure of a social network when only a small subgraph of the network is known and multiple heterogeneous information sources are available. We propose a two-phase supervised method: the first phase generates an efficient feature selection scheme to find the best feature from multiple sources for predicting the structure of the social network; in the second phase, we propose a regularization method to control the risk of over-fitting induced by the first phase. We assess our method empirically over a large data collection obtained from YouTube. The extensive experimental evaluations confirm the effectiveness of our approach.

【Keywords】:

67. Discovering Context-aware Influential Objects.

【Paper Link】 【Pages】:780-791

【Authors】: Yangpai Liu ; Huiping Cao ; Yifan Hao ; Peng Han ; Xinda Zeng

【Abstract】: It is very helpful for a user to get a moderate amount of information highly related to his/her immediate context (e.g., location, time, discussion topics) during the exploration of digital object collections (e.g., articles, web pages, blogs). For instance, in investigating a research topic, a researcher may be very interested in finding articles that are most related to the articles he/she already read on this topic, which we consider as “context” in this paper. To facilitate users' exploration, we introduce the problem of discovering Context-aware Influential Objects (CIO) from a collection of digital objects with influence relationships. Although there is a large amount of work on detecting the degree of direct influence between objects, i.e., how strongly one object influences others, very few works utilize such direct influence to find influential objects for a context. To discover CIOs for a context consisting of several objects of a user's interest, the first challenge is to meaningfully measure the collective influence of an object over a context, considering both direct influence and indirectly derived influence, which is not taken into consideration by most “query by example” approaches. We propose an aggregation framework to formulate the collective influence among objects by leveraging both direct and indirect influence. The second challenge is to discover CIOs efficiently. We present three approaches to calculate the collective influence of an object over a context from an influence graph. In particular, the first approach utilizes the breadth-first-search paradigm; the other approaches make use of the topological sorting of graph nodes and perform context-aware search using push and pull mechanisms. We show experimental results on real datasets to demonstrate the effectiveness and efficiency of the proposed methodologies.

【Keywords】:
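The topological-sorting approach mentioned in the abstract can be illustrated with a toy aggregation rule. The sketch below is our own simplification (summing products of edge weights over all paths from an object to any context object, assuming a DAG of direct influence), not the authors' exact formulation; all names are ours:

```python
from collections import defaultdict
from graphlib import TopologicalSorter

def collective_influence(edges, context):
    """Sum-of-path-products influence of every node on a context set,
    computed in one pass over a reversed topological order of a DAG.
    edges: dict (u, v) -> direct influence weight of u on v."""
    succ = defaultdict(list)
    preds = defaultdict(set)
    for (u, v), w in edges.items():
        succ[u].append((v, w))
        preds[v].add(u)
    order = list(TopologicalSorter(preds).static_order())
    influence = defaultdict(float)
    # Process sinks first, so each node's successors are already scored.
    for u in reversed(order):
        influence[u] = sum(w * (1.0 if v in context else 0.0) + w * influence[v]
                           for v, w in succ[u])
    return dict(influence)
```

With edges a→b (0.5), b→c (0.4), a→c (0.2) and context {c}, a's collective influence is 0.2 (direct) + 0.5·0.4 (through b) = 0.4.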

68. Monitoring and Mining Insect Sounds in Visual Space.

【Paper Link】 【Pages】:792-803

【Authors】: Yuan Hao ; Bilson J. L. Campana ; Eamonn J. Keogh

【Abstract】: Monitoring animals by the sounds they produce is an important and challenging task, whether the application is outdoors in a natural habitat, or in the controlled environment of a laboratory setting. In the former case the density and diversity of animal sounds can act as a measure of biodiversity. In the latter case, researchers often create control and treatment groups of animals, expose them to different interventions, and test for different outcomes. One possible manifestation of different outcomes may be changes in the bioacoustics of the animals. With such a plethora of important applications, there have been significant efforts to build bioacoustic classification tools. However, we argue that most current tools are severely limited. They often require the careful tuning of many parameters (and thus huge amounts of training data), they are too computationally expensive for deployment in resource-limited sensors, they are specialized for a very small group of species, or they are simply not accurate enough to be useful. In this work we introduce a novel bioacoustic recognition/classification framework that mitigates or solves all of the above problems. We propose to classify animal sounds in the visual space, by treating the texture of their spectrograms as an acoustic fingerprint using a recently introduced parameter-free texture measure as a distance measure. We further show that by searching for the most representative acoustic fingerprint we can significantly outperform other techniques in terms of speed and accuracy.

【Keywords】:

69. Image Mining of Historical Manuscripts to Establish Provenance.

【Paper Link】 【Pages】:804-815

【Authors】: Bing Hu ; Thanawin Rakthanmanon ; Bilson J. L. Campana ; Abdullah Mueen ; Eamonn J. Keogh

【Abstract】: The recent digitization of more than twenty million books has been led by initiatives from countries wishing to preserve their cultural heritage and by commercial endeavors, such as the Google Print Library Project. Within a few years a significant fraction of the world's books will be online. For millions of intact books and tens of millions of loose pages, the provenance of the manuscripts may be in doubt or completely unknown, thus denying historians an understanding of the context of the content. In some cases it may be possible for human experts to regain the provenance by examining linguistic, cultural and/or stylistic clues. However, such experts are rare and this investigation is clearly a time-consuming process. One technique used by experts to establish provenance is the examination of the ornate initial letters appearing in the questioned manuscript. By comparing the initial letters in the manuscript to annotated initial letters whose origin is known, the provenance can be determined. In this work we show for the first time that we can reproduce this ability with a computer algorithm. We leverage a recently introduced technique to measure texture similarity and show that it can recognize initial letters with an accuracy that rivals or exceeds human performance. A brute force implementation of this measure would require several years to process a single large book; however, we introduce a novel lower bound that allows us to process the books in minutes.

【Keywords】:

70. RP-growth: Top-k Mining of Relevant Patterns with Minimum Support Raising.

【Paper Link】 【Pages】:816-827

【Authors】: Yoshitaka Kameya ; Taisuke Sato

【Abstract】: One practical inconvenience in frequent pattern mining is that it often yields a flood of common or uninformative patterns, and thus we should carefully adjust the minimum support. To alleviate this inconvenience, based on FP-growth, this paper proposes RP-growth, an efficient algorithm for top-k mining of discriminative patterns which are highly relevant to the class of interest. RP-growth conducts a branch-and-bound search using anti-monotonic upper bounds of the relevance scores such as F-score and χ², and the pruning in branch-and-bound search is successfully translated to minimum support raising, a standard, easy-to-implement pruning strategy for top-k mining. Furthermore, by introducing the notion called weakness and an additional, aggressive pruning strategy based on weakness, RP-growth efficiently finds k patterns of wide variety and high relevance to the class of interest. Experimental results on text classification exhibit the efficiency and the usefulness of RP-growth.

【Keywords】:
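The threshold-raising idea can be sketched generically: hold the k best results in a min-heap and raise the pruning threshold to the heap minimum whenever the heap is full. This is only the skeleton of minimum support raising, not RP-growth itself (no FP-tree, no relevance-score upper bounds); names are ours:

```python
import heapq

def topk_with_threshold_raising(candidates, k, score):
    """Generic top-k search with threshold raising: once k results are
    held, the worst of them becomes the new pruning threshold, so any
    candidate scoring at or below it is skipped (in RP-growth, whole
    branches whose upper bound falls below it would be pruned)."""
    heap = []          # min-heap of the k best (score, item) pairs so far
    threshold = 0.0    # current "minimum support"-style threshold
    pruned = 0
    for c in candidates:
        s = score(c)
        if len(heap) == k and s <= threshold:
            pruned += 1
            continue
        heapq.heappush(heap, (s, c))
        if len(heap) > k:
            heapq.heappop(heap)
        if len(heap) == k:
            threshold = heap[0][0]   # raise the threshold
    return sorted(heap, reverse=True), pruned
```

In the pattern-mining setting the same raise would prune entire search branches, which is where the speedup comes from.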

71. Fast Random Walk Graph Kernel.

【Paper Link】 【Pages】:828-838

【Authors】: U. Kang ; Hanghang Tong ; Jimeng Sun

【Abstract】: Random walk graph kernel has been used as an important tool for various data mining tasks including classification and similarity computation. Despite its usefulness, however, it suffers from the expensive computational cost which is at least O(n³) or O(m²) for graphs with n nodes and m edges. In this paper, we propose ARK, a set of fast algorithms for random walk graph kernel computation. ARK is based on the observation that real graphs have much lower intrinsic ranks, compared with the orders of the graphs. ARK exploits the low rank structure to quickly compute random walk graph kernels in O(n²) or O(m) time. Experimental results show that our method is up to 97,865× faster than the existing algorithms, while providing more than 91.3% of the accuracy.

【Keywords】:
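The low-rank idea can be sketched with dense eigendecompositions: approximating each adjacency matrix by its top-r eigenpairs reduces the (n1·n2)-dimensional linear solve of the random walk kernel to r·r terms via the Woodbury identity. This is an illustrative dense-NumPy sketch with uniform start/stop distributions, not ARK's actual algorithms, and it is exact when r covers all eigenvalues:

```python
import numpy as np

def rw_kernel_exact(A1, A2, c):
    """Random walk graph kernel q^T (I - c A1⊗A2)^{-1} p with uniform p, q."""
    n1, n2 = len(A1), len(A2)
    Ax = np.kron(A1, A2)
    p = np.full(n1 * n2, 1.0 / (n1 * n2))
    q = np.ones(n1 * n2)
    return q @ np.linalg.solve(np.eye(n1 * n2) - c * Ax, p)

def rw_kernel_lowrank(A1, A2, c, r):
    """Low-rank version: rank-r eigendecompositions of A1, A2 turn the
    big solve into an r x r correction term (Woodbury identity)."""
    def top_eig(A, r):
        lam, U = np.linalg.eigh(A)
        idx = np.argsort(np.abs(lam))[::-1][:r]
        return lam[idx], U[:, idx]
    lam1, U1 = top_eig(A1, r)
    lam2, U2 = top_eig(A2, r)
    n1, n2 = len(A1), len(A2)
    p1 = np.full(n1, 1.0 / n1); p2 = np.full(n2, 1.0 / n2)
    q1 = np.ones(n1); q2 = np.ones(n2)
    # (I - c UΛU^T)^{-1} = I - U diag(1/(1 - 1/(c λ_i λ_j))) U^T
    L = c * np.outer(lam1, lam2)
    D = np.zeros_like(L)
    nz = L != 0.0                      # zero eigen-directions contribute nothing
    D[nz] = 1.0 / (1.0 - 1.0 / L[nz])
    aq, ap = U1.T @ q1, U1.T @ p1
    bq, bp = U2.T @ q2, U2.T @ p2
    return (q1 @ p1) * (q2 @ p2) - (aq * ap) @ D @ (bq * bp)
```

With full rank (r = n) the two functions agree to machine precision; ARK's point is that truncating r far below n keeps the answer close while the cost drops dramatically.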

72. Tracking Spatio-Temporal Diffusion in Climate Data.

【Paper Link】 【Pages】:839-850

【Authors】: Jaya Kawale ; Aditya Pal ; Rob Fatland

【Abstract】: A forest canopy forms a critical platform for complex interactions between the vegetation and the atmosphere boundary layer and is considered a crucial piece for environmental scientists in their understanding of the ecosystem and its response to climate change. Microfronts represent a class of these interactions characterized by a moving mass of air that introduces fluctuations in ambient temperature and humidity on small spatial and temporal scales. In this paper, we present a joint spatio-temporal hidden Markov model that simultaneously incorporates neighborhood dependencies in space and time. We show that our approach can trace the diffusion of microfronts more effectively than several baseline methods on sensor data from a Brazilian rainforest and a synthetically generated dataset.

【Keywords】:

73. Group Sparsity in Nonnegative Matrix Factorization.

【Paper Link】 【Pages】:851-862

【Authors】: Jingu Kim ; Renato D. C. Monteiro ; Haesun Park

【Abstract】: A recent challenge in data analysis for science and engineering is that data are often represented in a structured way. In particular, many data mining tasks have to deal with group-structured prior information, where features or data items are organized into groups. In this paper, we develop group sparsity regularization methods for nonnegative matrix factorization (NMF). NMF is an effective data mining tool that has been widely adopted in text mining, bioinformatics, and clustering, but a principled approach to incorporating group information into NMF has been lacking in the literature. Motivated by an observation that features or data items within a group are expected to share the same sparsity pattern in their latent factor representation, we propose mixed-norm regularization to promote group sparsity in the factor matrices of NMF. Group sparsity improves the interpretation of latent factors. Efficient convex optimization methods for dealing with the mixed-norm term are presented along with computational comparisons between them. Application examples of the proposed method in factor recovery, semi-supervised clustering, and multilingual text analysis are demonstrated.

【Keywords】:
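A mixed-norm group-sparsity regularizer can be illustrated with the l1,2 case: the sum, over groups, of the Euclidean norm of each group's coefficients, which pushes entire groups to zero together. The paper considers a family of such mixed norms within NMF; this sketch (names ours) computes only the penalty term, not the factorization updates:

```python
import math

def group_l1_2(H, groups):
    """l_{1,2} mixed norm of a coefficient matrix H (list of rows):
    sum over groups of row indices of the Frobenius norm of that
    group's rows. Minimizing it zeroes out whole groups at once."""
    total = 0.0
    for g in groups:
        total += math.sqrt(sum(H[i][j] ** 2
                               for i in g for j in range(len(H[i]))))
    return total
```

For H with rows [3, 4], [0, 0], [1, 0] and groups {0, 1} and {2}, the value is √(9+16+0+0) + √1 = 6: the first group is charged as a unit, so zeroing its two rows together would remove the whole 5.0 term.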

74. Global Linear Neighborhoods for Efficient Label Propagation.

【Paper Link】 【Pages】:863-872

【Authors】: Ze Tian ; Rui Kuang

【Abstract】: Graph-based semi-supervised learning improves classification by combining labeled and unlabeled data through label propagation. It was shown that the sparse representation of a graph by weighted local neighbors provides a better similarity measure between data points for label propagation. However, selecting local neighbors can lead to disjoint components and incorrect neighbors in the graph, and thus, fail to capture the underlying global structure. In this paper, we propose to learn a nonnegative low-rank graph to capture global linear neighborhoods, under the assumption that each data point can be linearly reconstructed from weighted combinations of its direct neighbors and reachable indirect neighbors. The global linear neighborhoods utilize information from both direct and indirect neighbors to preserve the global cluster structures, while the low-rank property retains a compressed representation of the graph. An efficient algorithm based on a multiplicative update rule is designed to learn a nonnegative low-rank factorization matrix minimizing the neighborhood reconstruction error. Large scale simulations and experiments on UCI datasets and high-dimensional gene expression datasets showed that label propagation based on global linear neighborhoods captures the global cluster structures better and achieves more accurate classification results.

【Keywords】:
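The label propagation step that the learned graph feeds into has a standard closed form, F = (1 − α)(I − αW)⁻¹Y. A minimal sketch on a given row-stochastic weight matrix W (this is generic label propagation, not the paper's graph-learning algorithm; names are ours):

```python
import numpy as np

def propagate_labels(W, y, alpha=0.9):
    """Closed-form label propagation: F = (1 - alpha)(I - alpha W)^{-1} Y.
    W: row-stochastic similarity matrix; y: class index per point, -1 if
    unlabeled. Returns the predicted class for every point."""
    n = len(y)
    classes = sorted(c for c in set(y) if c != -1)
    Y = np.zeros((n, len(classes)))
    for i, c in enumerate(y):
        if c != -1:
            Y[i, classes.index(c)] = 1.0
    F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * W, Y)
    return np.array([classes[j] for j in F.argmax(axis=1)])
```

On a graph with two disconnected triangles and one labeled node per triangle, every node inherits the label of its own component.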

75. Generalized Similarity Kernels for Efficient Sequence Classification.

【Paper Link】 【Pages】:873-882

【Authors】: Pavel P. Kuksa ; Imdadullah Khan ; Vladimir Pavlovic

【Abstract】: String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. They often exhibit state-of-the-art performance on tasks such as document topic elucidation, music genre classification, protein superfamily and fold prediction. However, typical string kernel methods rely on symbolic Hamming-distance based matching which may not necessarily reflect the underlying (e.g., physical) similarity between sequence fragments. In this work we propose a novel computational framework that uses general similarity metrics S(·, ·) and distance-preserving embeddings with string kernels to improve sequence classification. In particular, we consider two approaches that allow one either to incorporate non-Hamming similarity S(·, ·) into similarity evaluation by matching only the features that are similar according to S(·, ·) or to retain actual (approximate) similarity/distance scores in similarity evaluation. An embedding step, a distance-preserving bit-string mapping, is used to effectively capture similarity between otherwise symbolically different sequence elements. We show that it is possible to retain computational efficiency of string kernels while using this more “precise” measure of similarity. We then demonstrate that on a number of sequence classification tasks such as music, and biological sequence classification, the new method can substantially improve upon state-of-the-art string kernel baselines.

【Keywords】:

76. Detecting Extreme Rank Anomalous Collections.

【Paper Link】 【Pages】:883-894

【Authors】: Hanbo Dai ; Feida Zhu ; Ee-Peng Lim ; HweeHwa Pang

【Abstract】: Anomaly or outlier detection has a wide range of applications, including fraud and spam detection. Most existing studies focus on detecting point anomalies, i.e., individual, isolated entities. However, there is an increasing number of applications in which anomalies do not occur individually, but in small collections. Unlike the majority, entities in an anomalous collection tend to share certain extreme behavioral traits. The knowledge essential in understanding why and how the set of entities becomes outliers would only be revealed by examining them at the collection level. A good example is web spammers adopting common spamming techniques. To discover this kind of anomalous collections, we introduce a novel definition of anomaly, called Extreme Rank Anomalous Collection. We propose a statistical model to quantify the anomalousness of such a collection, and present an exact as well as a heuristic algorithm for finding top-K extreme rank anomalous collections. We apply the algorithms on real Web spam data to detect spamming sites, and on IMDB data to detect unusual actor groups. Our algorithms achieve higher precision compared to existing spam and anomaly detection methods. More importantly, our approach succeeds in finding meaningful anomalous collections in both datasets.

【Keywords】:

77. Visualizing Variable-Length Time Series Motifs.

【Paper Link】 【Pages】:895-906

【Authors】: Yuan Li ; Jessica Lin ; Tim Oates

【Abstract】: The problem of time series motif discovery has received a lot of attention from researchers in the past decade. Most existing work on finding time series motifs requires that the length of the motifs be known in advance. However, such information is not always available. In addition, motifs of different lengths may co-exist in a time series dataset. In this work, we develop a motif visualization system based on grammar induction. We demonstrate that grammar induction in time series can effectively identify repeated patterns without prior knowledge of their lengths. The motifs discovered by the visualization system are variable-length in two ways. Not only can the inter-motif subsequences have variable lengths, the intra-motif subsequences are also not restricted to identical lengths—a unique property that is desirable, but has not been seen in the literature.

【Keywords】:

78. Which Distance Metric is Right: An Evolutionary K-Means View.

【Paper Link】 【Pages】:907-918

【Authors】: Chuanren Liu ; Tianming Hu ; Yong Ge ; Hui Xiong

【Abstract】: It is well known that the distance metric plays an important role in the clustering process. Indeed, many clustering problems can be treated as an optimization problem of a criterion function defined over one distance metric. While many distance metrics have been developed, it is not clear how these distance metrics impact the clustering/optimization process. To that end, in this paper, we study the impact of a set of popular cosine-based distance metrics on K-means clustering. Specifically, by revealing the common order-preserving property, we first show that K-means has exactly the same cluster assignment for these metrics during the E-step. Next, by both theoretical and empirical studies, we prove that the cluster centroid is a good approximator of their respective optimal centers in the M-step. As such, we identify a problem with K-means: it cannot differentiate these metrics. To explore the nature of these metrics, we propose an evolutionary K-means framework that integrates K-means and genetic algorithms. This framework not only enables inspection of arbitrary distance metrics, but also can be used to investigate different formulations of the optimization problem. Finally, this framework is used in extensive experiments on real-world data sets. The results validate our theoretical findings on the characteristics and interrelationships of these metrics. Most importantly, this paper furthers our understanding of the impact of the distance metrics on the optimization process of K-means.

【Keywords】:
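The order-preserving property is easy to demonstrate: any metric that is a monotone decreasing function of cosine similarity yields the same nearest-center assignment in the E-step. A small sketch comparing cosine distance with angular distance (function names and the toy data are ours):

```python
import math

def assign(points, centers, dist):
    """E-step: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

def angular(x, y):
    # arccos is monotone decreasing in cosine similarity,
    # so the ordering of distances (and hence the argmin) is preserved.
    return math.acos(max(-1.0, min(1.0, 1.0 - cosine(x, y))))
```

Both metrics produce identical assignments on any data, which is exactly why plain K-means cannot tell such metrics apart.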

79. Constructing Training Sets for Outlier Detection.

【Paper Link】 【Pages】:919-929

【Authors】: Li-Ping Liu ; Xiaoli Fern

【Abstract】: Outlier detection often works in an unsupervised manner due to the difficulty of obtaining enough training data. Since outliers are rare, one has to label a very large dataset to include enough outliers in the training set, with which classifiers could sufficiently learn the concept of outliers. Labeling a large training set is costly for most applications. However, we could just label suspected instances identified by unsupervised methods. In this way, the number of instances to be labeled could be greatly reduced. Based on this idea, we propose CISO, an algorithm Constructing training set by Identifying Suspected Outliers. In this algorithm, instances in a pool are first ranked by an unsupervised outlier detection algorithm. Then, suspected instances are selected and hand-labeled, and all remaining instances receive the label of inlier. As such, all instances in the pool are labeled and used in the training set. We also propose Budgeted CISO (BCISO), with which the user can set a fixed budget for labeling. Experiments show that both algorithms achieve good performance compared to other methods when the same amount of labeling effort is used.

【Keywords】:

80. A Flexible Open-Source Toolbox for Scalable Complex Graph Analysis.

【Paper Link】 【Pages】:930-941

【Authors】: Adam Lugowski ; David M. Alber ; Aydin Buluç ; John R. Gilbert ; Steven P. Reinhardt ; Yun Teng ; Andrew Waranis

【Abstract】: The Knowledge Discovery Toolbox (KDT) enables domain experts to perform complex analyses of huge datasets on supercomputers using a high-level language without grappling with the difficulties of writing parallel code, calling parallel libraries, or becoming a graph expert. KDT provides a flexible Python interface to a small set of high-level graph operations; composing a few of these operations is often sufficient for a specific analysis. Scalability and performance are delivered by linking to a state-of-the-art back-end compute engine that scales from laptops to large HPC clusters. KDT delivers very competitive performance from a general-purpose, reusable library for graphs on the order of 10 billion edges and greater. We demonstrate speedup of 1 and 2 orders of magnitude over PBGL and Pegasus, respectively, on some tasks. Examples from simple use cases and key graph-analytic benchmarks illustrate the productivity and performance realized by KDT users. Semantic graph abstractions provide both flexibility and high performance for real-world use cases. Graph-algorithm researchers benefit from the ability to develop algorithms quickly using KDT's graph and underlying matrix abstractions for distributed memory. KDT is available as open-source code to foster experimentation.

【Keywords】:
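KDT's composition of high-level graph operations rests on its back end expressing traversals as sparse matrix operations over semirings. A toy illustration of that idea in plain Python (KDT itself delegates to a parallel C++ engine; the function name and dict-of-lists representation here are ours):

```python
def bfs_levels(adj, source):
    """Breadth-first levels computed as repeated 'matrix-vector products'
    over a boolean semiring: the next frontier is the OR, over current
    frontier nodes, of their out-neighbors, masked by visited nodes.
    adj: dict node -> list of out-neighbors."""
    level = {source: 0}
    frontier = {source}
    t = 0
    while frontier:
        t += 1
        # one semiring SpMV step: expand the frontier, drop visited nodes
        nxt = {v for u in frontier for v in adj.get(u, [])} - level.keys()
        for v in nxt:
            level[v] = t
        frontier = nxt
    return level
```

On a parallel back end the set comprehension becomes a distributed sparse matrix-vector multiply, which is what lets the same high-level code scale from laptops to clusters.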

81. Fast Robustness Estimation in Large Social Graphs: Communities and Anomaly Detection.

【Paper Link】 【Pages】:942-953

【Authors】: Fragkiskos D. Malliaros ; Vasileios Megalooikonomou ; Christos Faloutsos

【Abstract】: Given a large social graph, like a scientific collaboration network, what can we say about its robustness? Can we estimate a robustness index for a graph quickly? If the graph evolves over time, how do these properties change? In this work, we try to answer the above questions by studying the expansion properties of large social graphs. First, we present a measure which characterizes the robustness properties of a graph, and serves as a global measure of the community structure (or lack thereof). We study how these properties change over time and we show how to spot outliers and anomalies over time. We apply our method on several diverse real networks with millions of nodes. We also show how to compute our measure efficiently by exploiting the special spectral properties of real-world networks.

【Keywords】:

82. On Finding Joint Subspace Boolean Matrix Factorizations.

【Paper Link】 【Pages】:954-965

【Authors】: Pauli Miettinen

【Abstract】: Finding latent factors of the data using matrix factorizations is a tried-and-tested approach in data mining. But finding shared factors over multiple matrices is a more novel problem. Specifically, given two matrices, we want to find a set of factors shared by these two matrices and sets of factors specific to each matrix. Not only does such a decomposition reveal what is common between the two matrices, it also eliminates the need to explain that common part twice, thus concentrating the non-shared factors to uniquely specific parts of the data. This paper studies a problem called Joint Subspace Boolean Matrix Factorization asking exactly that: a set of shared factors and sets of specific factors. Furthermore, the matrix factorization is based on Boolean arithmetic. This restricts the presented approach to binary matrices. The benefits, however, include much sparser factor matrices and greater interpretability of the results. The paper presents three algorithms for finding the Joint Subspace Boolean Matrix Factorization, an MDL-based method for selecting the subspaces' dimensionality, and a thorough experimental evaluation of the proposed algorithms.

【Keywords】:

83. Generalized Optimization Framework for Graph-based Semi-supervised Learning.

【Paper Link】 【Pages】:966-974

【Authors】: Marina Sokol ; Konstantin Avrachenkov ; Paulo Gonçalves ; Alexey Mishenin

【Abstract】: We develop a generalized optimization framework for graph-based semi-supervised learning. The framework gives as particular cases the Standard Laplacian, Normalized Laplacian and PageRank based methods. We also provide a new probabilistic interpretation based on random walks and characterize the limiting behaviour of the methods. The random walk based interpretation allows us to explain differences between the performances of methods with different smoothing kernels. It appears that the PageRank based method is robust with respect to the choice of the regularization parameter and the labelled data. We illustrate our theoretical results with two realistic datasets, presenting different challenges: the Les Miserables characters social network and the Wikipedia hyper-link graph. The graph-based semi-supervised learning classifies the Wikipedia articles with very good precision and perfect recall employing only the information about the hyper-text links.

【Keywords】:

84. A Tree-Based Kernel for Graphs.

【Paper Link】 【Pages】:975-986

【Authors】: Giovanni Da San Martino ; Nicolò Navarin ; Alessandro Sperduti

【Abstract】: This paper proposes a new tree-based kernel for graphs. Graphs are decomposed into multisets of ordered Directed Acyclic Graphs (DAGs) and a family of kernels computed by application of tree kernels extended to the DAG domain. We focus our attention on the efficient development of one member of this family. A technique for speeding up the computation is given, as well as theoretical bounds and practical evidence of its feasibility. State of the art results on various benchmark datasets prove the effectiveness of our approach.

【Keywords】:

85. Density-based Projected Clustering over High Dimensional Data Streams.

【Paper Link】 【Pages】:987-998

【Authors】: Irene Ntoutsi ; Arthur Zimek ; Themis Palpanas ; Peer Kröger ; Hans-Peter Kriegel

【Abstract】: Clustering of high dimensional data streams is an important problem in many application domains, a prominent example being network monitoring. Several approaches have lately been proposed for independently solving the different aspects of the problem. There exist methods for clustering over full dimensional streams and methods for finding clusters in subspaces of high dimensional static data. Yet only a few approaches have been proposed so far which tackle both the stream and the high dimensionality aspects of the problem simultaneously. In this work, we propose a new density-based projected clustering algorithm, HDDSTREAM, for high dimensional data streams. Our algorithm summarizes both the data points and the dimensions where these points are grouped together and maintains these summaries online, as new points arrive over time and old points expire due to ageing. Our experimental results illustrate the effectiveness and the efficiency of HDDSTREAM and also demonstrate that it could serve as a trigger for detecting drastic changes in the underlying stream population, like bursts of network attacks.

【Keywords】:

86. A Novel Approximation to Dynamic Time Warping allows Anytime Clustering of Massive Time Series Datasets.

【Paper Link】 【Pages】:999-1010

【Authors】: Qiang Zhu ; Gustavo E. A. P. A. Batista ; Thanawin Rakthanmanon ; Eamonn J. Keogh

【Abstract】: Given the ubiquity of time series data, the data mining community has spent significant time investigating the best time series similarity measure to use for various tasks and domains. After more than a decade of extensive efforts, there is increasing evidence that Dynamic Time Warping (DTW) is very difficult to beat. Given that, recent efforts have focused on making the intrinsically slow DTW algorithm faster. For the similarity-search task, an important subroutine in many data mining algorithms, significant progress has been made by replacing the vast majority of expensive DTW calculations with cheap-to-compute lower bound calculations. However, these lower bound based optimizations do not directly apply to clustering, and thus for some realistic problems, clustering with DTW can take days or weeks. In this work, we show that we can mitigate this untenable lethargy by casting DTW clustering as an anytime algorithm. At the heart of our algorithm is a novel data-adaptive approximation to DTW which can be quickly computed, and which produces approximations to DTW that are much better than the best currently known linear-time approximations. We demonstrate our ideas on real world problems showing that we can get virtually all the accuracy of a batch DTW clustering algorithm in a fraction of the time.

【Keywords】:
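The cheap-to-compute lower bounds mentioned for similarity search can be illustrated with LB_Keogh, a classic DTW lower bound: the candidate is compared against an upper/lower envelope of the query, and the result never exceeds the band-constrained DTW distance. A sketch with a naive banded DTW for comparison (this is the standard bound, not the paper's new anytime approximation):

```python
import math

def lb_keogh(query, candidate, r):
    """LB_Keogh lower bound on squared DTW distance with a Sakoe-Chiba
    band of radius r: sum of squared excursions of the candidate outside
    the query's running min/max envelope."""
    total = 0.0
    for i, c in enumerate(candidate):
        window = query[max(0, i - r):i + r + 1]
        lo, hi = min(window), max(window)
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (lo - c) ** 2
    return total

def dtw_sq(a, b, r):
    """Naive squared-distance DTW restricted to the same band, for
    checking the bound."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            D[i][j] = (a[i - 1] - b[j - 1]) ** 2 + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

In similarity search, candidates whose LB_Keogh already exceeds the best-so-far distance are discarded without ever running the quadratic DTW; the paper's observation is that clustering needs a different trick, since such pruning does not transfer directly.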

87. Nearest-neighbor Search on a Time Budget via Max-margin Tree.

【Paper Link】 【Pages】:1011-1022

【Authors】: Parikshit Ram ; Dongryeol Lee ; Alexander G. Gray

【Abstract】: Many high-profile applications pose high-dimensional nearest-neighbor search problems. Yet, it still remains difficult to achieve fast query times for state-of-the-art approaches which use multidimensional trees for either exact or approximate search, possibly in combination with hashing approaches. Moreover, a number of these applications only have a limited amount of time to answer nearest-neighbor queries. However, we observe empirically that the correct neighbor is often found early within the tree-search process, while the bulk of the time is spent on verifying its correctness. Motivated by this, we propose an algorithm for finding the best neighbor given any particular time limit, and develop a new data structure, the max-margin tree, to achieve accurate results even with small time budgets. Max-margin trees perform better in the limited-time setting than current commonly-used data structures such as the kd-tree and more recently developed data structures like the RP-tree.

【Keywords】:

88. Efficient Clustering of Metagenomic Sequences using Locality Sensitive Hashing.

【Paper Link】 【Pages】:1023-1034

【Authors】: Zeehasham Rasheed ; Huzefa Rangwala ; Daniel Barbará

【Abstract】: The new generation of genomic technologies has allowed researchers to determine the collective DNA of organisms (e.g., microbes) co-existing as communities across the ecosystem (e.g., within the human host). There is a need for computational approaches to analyze and annotate the large volumes of available sequence data from such microbial communities (metagenomes). In this paper, we developed an efficient and accurate metagenome clustering approach that uses the locality sensitive hashing (LSH) technique to reduce the computational cost associated with comparing sequences. We introduce the use of fixed-length, gapless subsequences for improving the sensitivity of the LSH-based similarity function. We evaluate the performance of our algorithm on two metagenome datasets associated with microbes existing across different human skin locations. Our empirical results show the strength of the developed approach in comparison to three state-of-the-art sequence clustering algorithms with regard to computational efficiency and clustering quality. We also demonstrate the practical significance of the developed clustering algorithm by using it to compare bacterial diversity and structure across different skin locations.

【Keywords】:
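The LSH-over-subsequences idea can be sketched with MinHash signatures over a sequence's set of fixed-length, gapless subsequences (k-mers): similar sequences then agree on many signature slots, so candidate pairs can be found without all-pairs alignment. This illustrative sketch does not reproduce the paper's actual hash family or parameters:

```python
import hashlib

def kmers(seq, k):
    """Set of fixed-length, gapless subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq, k, num_hashes):
    """MinHash signature over the k-mer set: one minimum per salted
    hash. The fraction of agreeing slots between two signatures
    estimates the Jaccard similarity of the underlying k-mer sets."""
    sig = []
    for h in range(num_hashes):
        salt = str(h).encode()
        sig.append(min(hashlib.sha1(salt + km.encode()).hexdigest()
                       for km in kmers(seq, k)))
    return sig

def signature_similarity(s1, s2):
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)
```

In an LSH clustering pipeline, signatures are banded into buckets so that only sequences colliding in some bucket are compared exactly.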

89. Balancing Prediction and Recommendation Accuracy: Hierarchical Latent Factors for Preference Data.

【Paper Link】 【Pages】:1035-1046

【Authors】: Nicola Barbieri ; Giuseppe Manco ; Riccardo Ortale ; Ettore Ritacco

【Abstract】: Recent works in Recommender Systems (RS) have investigated the relationships between the prediction accuracy, i.e. the ability of a RS to minimize a cost function (for instance the RMSE measure) in estimating users' preferences, and the accuracy of the recommendation list provided to users. State-of-the-art recommendation algorithms, which focus on the minimization of RMSE, have been shown to achieve weak results from the recommendation accuracy perspective, and vice versa. In this work we present a novel Bayesian probabilistic hierarchical approach for users' preference data, which is designed to overcome the limitation of current methodologies and thus to meet both prediction and recommendation accuracy. According to the generative semantics of this technique, each user is modeled as a random mixture over latent factors, which identify user community interests. Each individual user community is then modeled as a mixture of topics, which capture the preferences of the members on a set of items. We provide two different formalizations of the basic hierarchical model: BH-Forced focuses on rating prediction, while BH-Free models both the popularity of items and the distribution over item ratings. The combined modeling of item popularity and rating provides a powerful framework for the generation of highly accurate recommendations. An extensive evaluation over two popular benchmark datasets reveals the effectiveness and the quality of the proposed algorithms, showing that BH-Free realizes the most satisfactory compromise between prediction and recommendation accuracy with respect to several state-of-the-art competitors.

【Keywords】:

90. On Evaluation of Outlier Rankings and Outlier Scores.

【Paper Link】 【Pages】:1047-1058

【Authors】: Erich Schubert ; Remigius Wojdanowski ; Arthur Zimek ; Hans-Peter Kriegel

【Abstract】: Outlier detection research is currently focusing on the development of new methods and on improving the computation time for these methods. Evaluation, however, is rather heuristic, often considering just precision in the top k results or using the area under the ROC curve. These evaluation procedures do not allow for assessment of similarity between methods. Judging the similarity of or correlation between two rankings of outlier scores is an important question in itself but it is also an essential step towards meaningfully building outlier detection ensembles, where this aspect has been completely ignored so far. In this study, our generalized view of evaluation methods allows us both to evaluate the performance of existing methods and to compare different methods w.r.t. their detection performance. Our new evaluation framework takes into consideration the class imbalance problem and offers new insights on similarity and redundancy of existing outlier detection methods. As a result, the design of effective ensemble methods for outlier detection is considerably enhanced.

【Keywords】:

91. Regularized Structured Output Learning with Partial Labels.

【Paper Link】 【Pages】:1059-1070

【Authors】: Sundararajan Sellamanickam ; Charu Tiwari ; Sathiya Keerthi Selvaraj

【Abstract】: We consider the problem of learning structured output probabilistic models with training examples having partial labels. Partial label scenarios arise commonly in web applications such as taxonomy (hierarchical) classification, multi-label classification and information extraction from web pages. For example, label information may be available only at the internal node level (not at the leaf level) for some pages in a taxonomy classification problem. In a multi-label classification problem, it may be available only for some of the classes (in each example). Similarly, in a sequence learning problem, we may have label information only for some nodes in the training sequences. Conventionally, the marginal likelihood maximization technique has been used to solve these problems. In such a solution, unlabeled examples and any side information like the expected label distribution (or correlation in a multi-label setting) of the unlabeled part are not used. We solve these problems by incorporating entropy and label distribution or correlation regularizations along with marginal likelihood. Entropy and label distribution regularizations have been used previously in semi-supervised learning with fully unlabeled examples. In this paper we develop probabilistic taxonomy and multi-label classifier models, and provide the ideas needed for expanding their usage to the partial label scenario. Experiments on real-life taxonomy and multi-label learning problems show that significant improvements in accuracy are achieved by incorporating these regularizations, when most of the examples are only partially labeled.

【Keywords】:

92. The Similarity Between Stochastic Kronecker and Chung-Lu Graph Models.

Paper Link】 【Pages】:1071-1082

【Authors】: C. Seshadhri ; Ali Pinar ; Tamara G. Kolda

【Abstract】: The analysis of massive graphs is now becoming a very important part of science and industrial research. This has led to the construction of a large variety of graph models, each with its own advantages. The Stochastic Kronecker Graph (SKG) model has been chosen by the Graph500 steering committee to create supercomputer benchmarks for graph algorithms. The major reasons for this are its easy parallelization and ability to mirror real data. Although SKG is easy to implement, there is little understanding of the properties and behavior of this model. We show that the parallel variant of the edge-configuration model given by Chung and Lu (referred to as CL) is notably similar to the SKG model. The graph properties of an SKG are extremely close to those of a CL graph generated with the appropriate parameters. Indeed, the final probability matrix used by SKG is almost identical to that of a CL model. This implies that the graph distribution represented by SKG is almost the same as that given by a CL model. We also show that when it comes to fitting real data, CL performs as well as SKG based on empirical studies of graph properties. CL has the added benefit of a trivially simple fitting procedure and exactly matching the degree distribution. Our results suggest that users of the SKG model should consider the CL model because of its similar properties, simpler structure, and ability to fit a wider range of degree distributions. At the very least, CL is a good control model to compare against.
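For reference, the Chung-Lu (CL) model the abstract compares against can be generated in a few lines: each edge appears independently with probability proportional to the product of its endpoints' target degrees. This is the standard textbook formulation, not code from the paper (which concerns a parallel variant):

```python
import numpy as np

def chung_lu_graph(w, rng=None):
    """Generate an undirected Chung-Lu random graph.

    Each edge (i, j), i < j, appears independently with probability
    min(1, w[i] * w[j] / sum(w)), so node i's expected degree is
    approximately w[i]. Returns the edge list.
    """
    rng = np.random.default_rng(rng)
    w = np.asarray(w, dtype=float)
    total = w.sum()
    n = len(w)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < min(1.0, w[i] * w[j] / total):
                edges.append((i, j))
    return edges
```

The quadratic loop is only for clarity; large-scale generators instead sample edges in time proportional to the edge count, which is part of why CL parallelizes as easily as SKG.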

【Keywords】:

93. WIGM: Discovery of Subgraph Patterns in a Large Weighted Graph.

Paper Link】 【Pages】:1083-1094

【Authors】: Jiong Yang ; Wei Su ; Shirong Li ; Mehmet M. Dalkiliç

【Abstract】: Many research areas have begun representing massive data sets as very large graphs, and graph mining has accordingly been an active research area in recent years. Most graph mining research focuses on mining unweighted graphs. However, weighted graphs are actually more common: the weight on an edge may represent the likelihood (or a logarithmic transformation of the likelihood) of the edge's existence, or the strength of the edge, as is common in many biological networks. In this paper, a weighted subgraph pattern model is proposed to capture the importance of a subgraph pattern, and our aim is to find these patterns in a large weighted graph. Two related problems are studied: (1) discovering all patterns with respect to a given minimum weight threshold, and (2) finding the k patterns with the highest weights. Weighted subgraph patterns do not possess the anti-monotonic property, so most existing subgraph mining methods cannot be directly applied. Fortunately, the 1-extension property is identified, so that a bounded search can be achieved. A novel weighted graph mining algorithm, named WIGM, is devised based on the 1-extension property. Last but not least, real and synthetic data sets are used to show the effectiveness and efficiency of the proposed models and algorithms.

【Keywords】:

94. Legislative Prediction via Random Walks over a Heterogeneous Graph.

Paper Link】 【Pages】:1095-1106

【Authors】: Jun Wang ; Kush R. Varshney ; Aleksandra Mojsilovic

【Abstract】: In this article, we propose a random walk-based model to predict legislators' votes on a set of bills. In particular, we first convert roll call data, i.e. the recorded votes and the corresponding deliberative bodies, to a heterogeneous graph, where both the legislators and bills are treated as vertices. Three types of weighted edges are then computed accordingly, representing legislators' social and political relations, bills' semantic similarity, and legislator-bill vote relations. Through performing two-stage random walks over this heterogeneous graph, we can estimate legislative votes on past and future bills. We apply this proposed method on real legislative roll call data of the United States Congress and compare to state-of-the-art approaches. The experimental results demonstrate the superior performance and unique prediction power of the proposed model.
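A one-step version of the vote-propagation idea can be sketched as follows. The paper's actual model performs two-stage random walks over a heterogeneous legislator-bill graph with three edge types; this simplified sketch propagates known votes through a legislator-similarity matrix only:

```python
import numpy as np

def predict_votes(S_leg, V_known):
    """Predict missing votes via one step of a random walk.

    S_leg   : (n, n) nonnegative legislator-legislator similarities.
    V_known : (n, m) known votes, +1 = yea, -1 = nay, 0 = unknown.
    Returns a real-valued score matrix; sign(score) gives the
    predicted vote for each (legislator, bill) pair.
    """
    # Row-normalize similarities into a stochastic transition matrix.
    P = S_leg / S_leg.sum(axis=1, keepdims=True)
    # Each legislator inherits the similarity-weighted votes of peers.
    return P @ V_known
```

Iterating this propagation (or composing walks through bill-similarity edges as well) moves the sketch closer in spirit to the multi-stage walks the abstract describes.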

【Keywords】:

95. An Iterative and Re-weighting Framework for Rejection and Uncertainty Resolution in Crowdsourcing.

Paper Link】 【Pages】:1107-1118

【Authors】: Sihong Xie ; Wei Fan ; Philip S. Yu

【Abstract】: In practical applications of crowdsourcing, labelers may be uncertain about, or refuse to label (reject), a particular instance due to its inherent difficulty, and each labeler may be given a different set of instances in big-dataset applications. These issues lead to missing and uncertain labels, and existing crowdsourcing methods have limited capabilities when both problems are present. In this paper, we propose an Iterative Re-weighted Consensus Maximization framework to address the missing and uncertain label problem. The intuitive idea is to use an iterative framework to estimate each labeler's hidden competence, formulated as a spectral clustering problem in the functional space, in order to minimize the overall loss given missing and uncertain information. One main advantage of the proposed method over state-of-the-art Bayesian model averaging approaches is that it uncovers the intrinsic consistency among different sets of answers and mines the best possible ground truth. Formal analysis demonstrates that the proposed framework has lower generalization error than the widely adopted majority voting techniques for crowdsourcing. Experimental studies show that the proposed framework outperforms state-of-the-art baselines on several benchmark datasets.
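The iterative re-weighting intuition (estimate a consensus, then weight each labeler by agreement with it) can be illustrated with a simplified sketch. This is a generic alternating-update stand-in, not the paper's spectral consensus-maximization formulation:

```python
import numpy as np

def iterative_weighted_vote(labels, n_iter=10):
    """Consensus from noisy labelers with missing answers.

    labels : (n_labelers, n_items) in {-1, +1}, with 0 = missing/rejected.
    Alternates between a weighted-vote consensus and re-estimating each
    labeler's weight as Laplace-smoothed agreement with that consensus.
    Returns (consensus, labeler_weights).
    """
    w = np.ones(labels.shape[0])
    consensus = np.sign(w @ labels)
    for _ in range(n_iter):
        consensus = np.sign(w @ labels)          # weighted vote per item
        answered = labels != 0
        agree = (labels == consensus) & answered
        # Smoothed accuracy: labelers who agree with the consensus count more.
        w = (agree.sum(axis=1) + 1) / (answered.sum(axis=1) + 2)
    return consensus, w
```

Because weights depend on the consensus and vice versa, a labeler who systematically disagrees is progressively down-weighted, whereas plain majority voting would keep counting that labeler at full strength.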

【Keywords】:

96. Citation Prediction in Heterogeneous Bibliographic Networks.

Paper Link】 【Pages】:1119-1130

【Authors】: Xiao Yu ; Quanquan Gu ; Mianwei Zhou ; Jiawei Han

【Abstract】: To reveal information hidden in the link space of bibliographic networks, link analysis has been studied from different perspectives in recent years. In this paper, we address a novel problem, namely citation prediction: given information about the authors, topics, target publication venue, and time of a query paper, find and predict the citation relationships between that paper and a set of previous papers. Considering the gigantic number of relevant papers, the loosely connected citation network structure, and the highly skewed citation relation distribution, citation prediction is more challenging than other link prediction problems studied before. By building a meta-path based prediction model on a topic-discriminative search space, we propose a two-phase citation probability learning approach to predict citation relationships effectively and efficiently. Experiments performed on a real-world dataset with comprehensive measurements demonstrate that our framework has substantial advantages over commonly used link prediction approaches in predicting citation relations in bibliographic networks.

【Keywords】:

97. Mining Multi-Label Data Streams Using Ensemble-Based Active Learning.

Paper Link】 【Pages】:1131-1140

【Authors】: Peng Wang ; Peng Zhang ; Li Guo

【Abstract】: Data stream classification has drawn increasing attention from the data mining community in recent years, and a large number of stream classification models have been proposed. However, most existing models focus solely on mining single-label data streams; mining multi-label data streams has not been fully addressed yet. Although some recent work has touched on the multi-label stream mining problem, it does not consider the expensive labeling cost, which prevents its use in real-world applications. To this end, we study the challenging problem of mining multi-label data streams with limited labeling resources. Specifically, we propose an ensemble-based active learning framework to handle the large volume of stream data, the expensive labeling cost, and concept drift on multi-label data streams. Experiments on both synthetic and real-world data sets demonstrate the performance of the proposed method.

【Keywords】:

98. Feature Selection for High-dimensional Integrated Data.

Paper Link】 【Pages】:1141-1150

【Authors】: Charles Zheng ; Scott Schwartz ; Robert S. Chapkin ; Raymond J. Carroll ; Ivan Ivanov

【Abstract】: Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors Xt are dependent on the multidimensional variate Y, while the remaining predictors constitute a “noise set” Xu independent of Y. Using Monte Carlo simulations, we investigate the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine “empirical bounds” on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods with respect to a recent infant intestinal gene expression and metagenomics dataset.
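The thresholding method the abstract evaluates can be illustrated generically: keep only those predictors whose sample correlation with some component of Y exceeds a cutoff. The function name and cutoff `tau` here are illustrative, not from the paper:

```python
import numpy as np

def threshold_select(X, Y, tau):
    """Correlation-thresholding feature selection.

    X : (n, p) predictors, Y : (n, q) multidimensional response.
    Returns indices of X's columns whose maximum absolute Pearson
    correlation with any column of Y exceeds tau.
    """
    # Center and unit-normalize each column, so X.T @ Y gives correlations.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xn = Xc / np.linalg.norm(Xc, axis=0)
    Yn = Yc / np.linalg.norm(Yc, axis=0)
    corr = np.abs(Xn.T @ Yn)                 # (p, q) correlation magnitudes
    return np.where(corr.max(axis=1) > tau)[0]
```

Features in the independent "noise set" have correlations concentrated near zero, so for a suitable tau they are screened out while the truly dependent subset survives; choosing tau in small samples is precisely where the empirical bounds the abstract mentions come in.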

【Keywords】: