14. ICDM 2014:Shenzhen, China

2014 IEEE International Conference on Data Mining, ICDM 2014, Shenzhen, China, December 14-17, 2014. IEEE 【DBLP Link】

Paper Num: 143 || Session Num: 2

Regular Papers 71
Short Papers 72

Regular Papers 71

1. Discriminative Learning on Exemplary Patterns of Sequential Numerical Data.

【Paper Link】【Pages】:1-10

【Abstract】: One of the effective methodologies for time series classification is to identify informative subsequence patterns in time series and exploit them as discriminative features. Previous studies on this methodology have achieved promising results using a small number of individually selected patterns. However, there remain difficulties in finding a set of related patterns or patterns of a minor class, which can be critical in real-world applications. In this paper, we exploit the sparse learning technique for the support vector machine (SVM) to identify informative and exemplary patterns. We first present a representation of time series as a vector of distances to exemplary patterns. It allows a structural SVM to handle distance space data and function as the nearest neighbor classifier, the combination of which is known to be highly competitive in time series classification. We then extend the zero-norm approximation method for the structural SVM, which can eliminate non-essential patterns from the classification model. The resulting model makes predictions by a simple modified nearest neighbor rule, yet has a strong mathematical support for empirical risk minimization and feature selection. We conduct an empirical study on real-world behavior and sequential data to evaluate the effectiveness of the proposed method and graphically examine the exemplary patterns.

【Keywords】: approximation theory; data handling; feature selection; learning (artificial intelligence); pattern classification; support vector machines; time series; classification model; discriminative features; discriminative learning; distance space data handling; exemplary patterns; feature selection; informative subsequence patterns; nearest neighbor classifier; nearest neighbor rule; risk minimization; sequential numerical data; sparse learning technique; structural SVM; support vector machine; time series classification; zero-norm approximation method; Approximation methods; Mathematical model; Support vector machines; Time series analysis; Training; Transforms; Vectors; Data cleaning; Nearest neighbor classifier; Sequence template transform; Structural SVM; Time series classification

2. Orthogonal Matching Pursuit for Sparse Quantile Regression.

【Paper Link】【Pages】:11-19

【Authors】: Aleksandr Y. Aravkin ; Aurelie C. Lozano ; Ronny Luss ; Prabhanjan Kambadur

【Abstract】: We consider new formulations and methods for sparse quantile regression in the high-dimensional setting. Quantile regression plays an important role in many data mining applications, including outlier-robust exploratory analysis in gene selection. In addition, the sparsity consideration in quantile regression enables the exploration of the entire conditional distribution of the response variable given the predictors and therefore yields a more comprehensive view of the important predictors. We propose a generalized Orthogonal Matching Pursuit algorithm for variable selection, taking the misfit loss to be either the traditional quantile loss or a smooth version we call quantile Huber, and compare the resulting greedy approaches with convex sparsity-regularized formulations. We apply a recently proposed interior point methodology to efficiently solve all formulations, provide theoretical guarantees of consistent estimation, and demonstrate the performance of our approach using empirical studies of simulated and genomic datasets.

【Keywords】: greedy algorithms; pattern matching; regression analysis; conditional distribution; data mining; gene selection; generalized orthogonal matching pursuit algorithm; genomic datasets; greedy approaches; interior point methodology; outlier-robust exploratory analysis; quantile Huber; simulated datasets; sparse quantile regression; variable selection; Convergence; IP networks; MATLAB; Matching pursuit algorithms; Servers; Sparse matrices; Vectors

3. Quick Detection of High-Degree Entities in Large Directed Networks.

【Paper Link】【Pages】:20-29

【Authors】: Konstantin Avrachenkov ; Nelly Litvak ; Liudmila Ostroumova Prokhorenkova ; Eugenia Suyargulova

【Abstract】: In this paper we address the problem of quick detection of high-degree entities in large online social networks. Practical importance of this problem is attested by a large number of companies that continuously collect and update statistics about popular entities, usually using the degree of an entity as an approximation of its popularity. We suggest a simple, efficient, and easy to implement two-stage randomized algorithm that provides highly accurate solutions to this problem. For instance, our algorithm needs only one thousand API requests in order to find the top-100 most followed users, with more than 90% precision, in the online social network Twitter with approximately a billion of registered users. Our algorithm significantly outperforms existing methods and serves many different purposes such as finding the most popular users or the most popular interest groups in social networks. An important contribution of this work is the analysis of the proposed algorithm using Extreme Value Theory - a branch of probability that studies extreme events and properties of largest order statistics in random samples. Using this theory we derive an accurate prediction for the algorithm's performance and show that the number of API requests for finding the top-k most popular entities is sub linear in the number of entities. Moreover, we formally show that the high variability of the entities, expressed through heavy-tailed distributions, is the reason for the algorithm's efficiency. We quantify this phenomenon in a rigorous mathematical way.

【Keywords】: application program interfaces; randomised algorithms; social networking (online); API requests; approximation; extreme value theory; high-degree entities; large directed networks; online social network Twitter; online social networks; quick detection; randomized algorithm; statistics; top-k most popular entities; Algorithm design and analysis; Approximation algorithms; Indexes; Peer-to-peer computing; Prediction algorithms; Twitter; high-degree entities; randomized algorithm; social networks; sublinear complexity

4. Inferring Uncertain Trajectories from Partial Observations.

【Paper Link】【Pages】:30-39

【Authors】: Prithu Banerjee ; Sayan Ranu ; Sriram Raghavan

【Abstract】: The explosion in the availability of GPS-enabled devices has resulted in an abundance of trajectory data. In reality, however, majority of these trajectories are collected at a low sampling rate and only provide partial observations on their actually traversed routes. Consequently, they are mired with uncertainty. In this paper, we develop a technique called Infer Tra to infer uncertain trajectories from network-constrained partial observations. Rather than predicting the most likely route, the inferred uncertain trajectory takes the form of an edge-weighted graph and summarizes all probable routes in a holistic manner. For trajectory inference, Infer Tra employs Gibbs sampling by learning a Network Mobility Model (NMM) from a database of historical trajectories. Extensive experiments on real trajectory databases show that the graph-based approach of Infer Tra is up to 50% more accurate, 20 times faster, and immensely more versatile than state-of-the-art techniques.

【Keywords】: Markov processes; directed graphs; inference mechanisms; learning (artificial intelligence); mobile computing; network theory (graphs); uncertainty handling; GPS-enabled device; Gibbs sampling; InferTra; NMM; edge weighted graph; network constrained partial observation; network mobility model; trajectory inference; uncertain trajectory data; Databases; Joining processes; Joints; Random variables; Roads; Trajectory; Uncertainty; network mobility model; partial observations; road network; uncertain trajectory

5. Tensor-Based Multi-view Feature Selection with Applications to Brain Diseases.

【Paper Link】【Pages】:40-49

【Authors】: Bokai Cao ; Lifang He ; Xiangnan Kong ; Philip S. Yu ; Zhifeng Hao ; Ann B. Ragin

【Abstract】: In the era of big data, we can easily access information from multiple views which may be obtained from different sources or feature subsets. Generally, different views provide complementary information for learning tasks. Thus, multi-view learning can facilitate the learning process and is prevalent in a wide range of application domains. For example, in medical science, measurements from a series of medical examinations are documented for each subject, including clinical, imaging, immunologic, serologic and cognitive measures which are obtained from multiple sources. Specifically, for brain diagnosis, we can have different quantitative analysis which can be seen as different feature subsets of a subject. It is desirable to combine all these features in an effective way for disease diagnosis. However, some measurements from less relevant medical examinations can introduce irrelevant information which can even be exaggerated after view combinations. Feature selection should therefore be incorporated in the process of multi-view learning. In this paper, we explore tensor product to bring different views together in a joint space, and present a dual method of tensor-based multi-view feature selection DUAL-TMFS based on the idea of support vector machine recursive feature elimination. Experiments conducted on datasets derived from neurological disorder demonstrate the features selected by our proposed method yield better classification performance and are relevant to disease diagnosis.

【Keywords】: Big Data; brain; diseases; feature selection; learning (artificial intelligence); medical administrative data processing; medical computing; neurophysiology; patient diagnosis; support vector machines; DUAL-TMFS; big data; brain diagnosis; brain diseases; clinical measures; cognitive measures; disease diagnosis; feature subsets; imaging measures; immunologic measures; medical examinations; medical science; multiview learning; neurological disorder; serologic measures; support vector machine recursive feature elimination; tensor-based multi-view feature selection; tensor-based multiview feature selection; Correlation; Diseases; Medical diagnostic imaging; Support vector machines; Tensile stress; Vectors; brain diseases; feature selection; multi-view learning; tensor

6. Collective Prediction of Multiple Types of Links in Heterogeneous Information Networks.

【Paper Link】【Pages】:50-59

【Authors】: Bokai Cao ; Xiangnan Kong ; Philip S. Yu

【Abstract】: Link prediction has become an important and active research topic in recent years, which is prevalent in many real-world applications. Current research on link prediction focuses on predicting one single type of links, such as friendship links in social networks, or predicting multiple types of links independently. However, many real-world networks involve more than one type of links, and different types of links are not independent, but related with complex dependencies among them. In such networks, the prediction tasks for different types of links are also correlated and the links of different types should be predicted collectively. In this paper, we study the problem of collective prediction of multiple types of links in heterogeneous information networks. To address this problem, we introduce the linkage homophily principle and design a relatedness measure, called RM, between different types of objects to compute the existence probability of a link. We also extend conventional proximity measures to heterogeneous links. Furthermore, we propose an iterative framework for heterogeneous collective link prediction, called HCLP, to predict multiple types of links collectively by exploiting diverse and complex linkage information in heterogeneous information networks. Empirical studies on real-world tasks demonstrate that the proposed collective link prediction approach can effectively boost link prediction performances in heterogeneous information networks.

【Keywords】: information networks; probability; RM; collective link prediction; friendship links; heterogeneous information networks; link existence probability; linkage homophily principle; proximity measures; relatedness measure; social networks; Chemical compounds; Chemicals; Correlation; Couplings; Diseases; Drugs; Semantics; collective link prediction; heterogeneous information networks; meta path

7. Factorized Similarity Learning in Networks.

【Paper Link】【Pages】:60-69

【Authors】: Shiyu Chang ; Guo-Jun Qi ; Charu C. Aggarwal ; Jiayu Zhou ; Meng Wang ; Thomas S. Huang

【Abstract】: The problem of similarity learning is relevant to many data mining applications, such as recommender systems, classification, and retrieval. This problem is particularly challenging in the context of networks, which contain different aspects such as the topological structure, content, and user supervision. These different aspects need to be combined effectively, in order to create a holistic similarity function. In particular, while most similarity learning methods in networks such as Sim Rank utilize the topological structure, the user supervision and content are rarely considered. In this paper, a Factorized Similarity Learning (FSL) is proposed to integrate the link, node content, and user supervision into an uniform framework. This is learned by using matrix factorization, and the final similarities are approximated by the span of low rank matrices. The proposed framework is further extended to a noise-tolerant version by adopting a hinge-loss alternatively. To facilitate efficient computation on large scale data, a parallel extension is developed. Experiments are conducted on the DBLP and CoRA datasets. The results show that FSL is robust, efficient, and outperforms the state-of-the-art.

【Keywords】: data mining; information retrieval; learning (artificial intelligence); matrix decomposition; pattern classification; recommender systems; CoRA dataset; DBLP dataset; FSL; data mining application; factorized similarity learning; holistic similarity function; low rank matrices; matrix factorization; parallel extension; sim rank; similarity learning method; Equations; Linear programming; Noise measurement; Optimization; Silicon; Sparse matrices; Content; Link; Network similarity; Supervised matrix factorization; Supervision

8. Learning Local Semantic Distances with Limited Supervision.

【Paper Link】【Pages】:70-79

【Authors】: Shiyu Chang ; Charu C. Aggarwal ; Thomas S. Huang

【Abstract】: Recent advances in distance function learning have demonstrated that learning a good distance metric can greatly improve the performance in a wide variety of tasks in data mining and web search. A major problem in such scenarios is the limited labeled knowledge available for learning the user intentions. Furthermore, distances are inherently local, where a single global distance function may not capture the distance structure well. A challenge here is that local distance learning is even harder when the labeled information available is limited, because the distance function varies with data locality. To address these issues, we propose a local metric learning algorithm termed Local Semantic Sensing (LSS), which augments the small amount of labeled data with unlabeled data in order to learn the semantic information in the manifold structure, and then integrated with supervised intentional knowledge in a local way. We present results in a retrieval application, which show that the approach significantly outperforms other state-of-the-art methods in the literature.

【Keywords】: Internet; data mining; information retrieval; learning (artificial intelligence); search engines; LSS; Web search; data locality; data mining; distance function learning; distance metric; local metric learning algorithm; local semantic distance learning; local semantic sensing; manifold structure; retrieval application; single global distance function; supervised intentional knowledge; user intentions; Context; Data mining; Manifolds; Measurement; Semantics; Symmetric matrices; Vectors; Instance based; Metric learning; Semantic Aware; Semi-supervised; Similarity learning

【Paper Link】【Pages】:80-89

【Authors】: Po-Ta Chen ; Feng Chen ; Zhen Qian

【Abstract】: Real-time road traffic congestion monitoring is an important and challenging problem. Most existing monitoring approaches require the deployment of infrastructure sensors or large-scale probe vehicles. Their installation is often expensive and temporal-spatial coverage is limited. Probe vehicle data are oftentimes noisy on urban arterials, and therefore insufficient to provide accurate congestion estimation. This paper presents a novel social-media based approach to traffic congestion monitoring, in which pedestrians, drivers, and passengers a retreated as human sensors and their posted tweets in Twitter as observations of nearby ongoing traffic conditions. There are three technical challenges for road traffic monitoring based on Twitter, namely: 1) language ambiguity in the usage of traffic related terms, 2) uncertainty and low resolution of geographic location mentions, and 3) interactions between traffic-related events such as accidents and congestion. We propose a topic modeling based language model to address the first challenge and a collaborative inference model based on probabilistic soft logic (PSL) to address the second and third challenges. We present a unified statistical framework that combines those two models based on hinge loss Markov random fields (HLMRFs). In order to address the computational challenges incurred by the non-analytical integral of latent variables (factors) and the MAP estimation of a large number of location-dependent traffic congestion variables, we propose a fast approximate inference algorithm based on maximization expectation (ME) and the alternating directed method of multipliers (ADMM). Extensive evaluations over a variety of metrics on real world Twitter and INRIX probe speed datasets in two U.S. Major cities demonstrate the efficiency and effectiveness of our proposed approach.

【Keywords】: Markov processes; inference mechanisms; natural language processing; pedestrians; probabilistic logic; road traffic; social networking (online); statistical analysis; traffic engineering computing; HLMRF; INRIX probe speed datasets; MAP estimation; ME; PSL; Twitter; US major cities; alternating directed method of multipliers; approximate inference algorithm; collaborative inference model; drivers; geographic location mention uncertainty; hinge-loss Markov random fields; human sensors; infrastructure sensors; language ambiguity; large-scale probe vehicles; latent variables nonanalytical integral; location-dependent traffic congestion variables; low resolution geographic location mention; maximization expectation; passengers; pedestrians; probabilistic soft logic; road traffic congestion monitoring; social-media based approach; topic modeling based language model; traffic related terms; traffic-related events; unified statistical framework; urban arterials; Accidents; Media; Monitoring; Roads; Sensors; Twitter; Vehicles; Markov Random Fields; Social Media; Traffic Congestion Monitoring

10. LorSLIM: Low Rank Sparse Linear Methods for Top-N Recommendations.

【Paper Link】【Pages】:90-99

【Authors】: Yao Cheng ; Li'ang Yin ; Yong Yu

【Abstract】: In this paper, we notice that sparse and low-rank structures arise in the context of many collaborative filtering applications where the underlying graphs have block-diagonal adjacency matrices. Therefore, we propose a novel Sparse and Low-Rank Linear Method (Lor SLIM) to capture such structures and apply this model to improve the accuracy of the Top-N recommendation. Precisely, a sparse and low-rank aggregation coefficient matrix W is learned from Lor SLIM by solving an l1-norm and nuclear norm regularized optimization problem. We also develop an efficient alternating augmented Lagrangian method (ADMM) to solve the optimization problem. A comprehensive set of experiments is conducted to evaluate the performance of Lor SLIM. The experimental results demonstrate the superior recommendation quality of the proposed algorithm in comparison with current state-of-the-art methods.

【Keywords】: collaborative filtering; matrix algebra; optimisation; recommender systems; ADMM; Lor SLIM; LorSLIM; alternating augmented Lagrangian method; block-diagonal adjacency matrices; collaborative filtering applications; l₁-norm; low rank sparse linear methods; low-rank aggregation coefficient matrix; low-rank linear method; low-rank structures; nuclear norm regularized optimization problem; top-n recommendations; Collaboration; Equations; Mathematical model; Optimization; Recommender systems; Sparse matrices; Vectors; ADMM; L_1-norm regularization; Sparse and Low-Rank Linear Method; Top-N Recommender Systems; nuclear norm regularization

11. Detecting Flow Anomalies in Distributed Systems.

【Paper Link】【Pages】:100-109

【Authors】: Freddy Chong Tat Chua ; Ee-Peng Lim ; Bernardo A. Huberman

【Abstract】: Deep within the networks of distributed systems, one often finds anomalies that affect their efficiency and performance. These anomalies are difficult to detect because the distributed systems may not have sufficient sensors to monitor the flow of traffic within the interconnected nodes of the networks. Without early detection and making corrections, these anomalies may aggravate over time and could possibly cause disastrous outcomes in the system in the unforeseeable future. Using only coarse-grained information from the two end points of network flows, we propose a network transmission model and a localization algorithm, to detect the location of anomalies and rank them using a proposed metric within distributed systems. We evaluate our approach on passengers' records of an urbanized city's public transportation system and correlate our findings with passengers' postings on social media micro blogs. Our experiments show that the metric derived using our localization algorithm gives a better ranking of anomalies as compared to standard deviation measures from statistical models. Our case studies also demonstrate that transportation events reported in social media micro blogs matches the locations of our detect anomalies, suggesting that our algorithm performs well in locating the anomalies within distributed systems.

【Keywords】: distributed processing; public transport; social networking (online); anomaly location detection; coarse-grained information; distributed systems; flow anomaly detection; localization algorithm; network transmission model; social media microblogs; traffic flow monitoring; urbanized city public transportation system; Computer networks; Data models; Histograms; Joining processes; Mathematical model; Sensors; Transportation

12. Low-Rank Common Subspace for Multi-view Learning.

【Paper Link】【Pages】:110-119

【Authors】: Zhengming Ding ; Yun Fu

【Abstract】: Multi-view data is very popular in real-world applications, as different view-points and various types of sensors help to better represent data when fused across views or modalities. Samples from different views of the same class are less similar than those with the same view but different class. We consider a more general case that prior view information of testing data is inaccessible in multi-view learning. Traditional multi-view learning algorithms were designed to obtain multiple view-specific linear projections and would fail without this prior information available. That was because they assumed the probe and gallery views were known in advance, so the correct view-specific projections were to be applied in order to better learn low-dimensional features. To address this, we propose a Low-Rank Common Subspace (LRCS) for multi-view data analysis, which seeks a common low-rank linear projection to mitigate the semantic gap among different views. The low-rank common projection is able to capture compatible intrinsic information across different views and also well-align the within-class samples from different views. Furthermore, with a low-rank constraint on the view-specific projected data and that transformed by the common subspace, the within-class samples from multiple views would concentrate together. Different from the traditional supervised multi-view algorithms, our LRCS works in a weakly supervised way, where only the view information gets observed. Such a common projection can make our model more flexible when dealing with the problem of lacking prior view information of testing data. Two scenarios of experiments, robust subspace learning and transfer learning, are conducted to evaluate our algorithm. Experimental results on several multi-view datasets reveal that our proposed method outperforms state-of-the-art, even when compared with some supervised learning methods.

【Keywords】: data analysis; learning (artificial intelligence); LRCS; low-rank common subspace; multiview data analysis; multiview learning; Algorithm design and analysis; Face; Linear programming; Noise; Probes; Robustness; Testing; Multi-view; common subspace; low-rank

13. Sparse Real Estate Ranking with Online User Reviews and Offline Moving Behaviors.

【Paper Link】【Pages】:120-129

【Authors】: Yanjie Fu ; Yong Ge ; Yu Zheng ; Zijun Yao ; Yanchi Liu ; Hui Xiong ; Jing Yuan

【Abstract】: Ranking residential real estates based on investment values can provide decision making support for home buyers and thus plays an important role in estate marketplace. In this paper, we aim to develop methods for ranking estates based on investment values by mining users' opinions about estates from online user reviews and offline moving behaviors (e.g., Taxi traces, smart card transactions, check-ins). While a variety of features could be extracted from these data, these features are Interco related and redundant. Thus, selecting good features and integrating the feature selection into the fitting of a ranking model are essential. To this end, in this paper, we first strategically mine the fine-grained discrminative features from user reviews and moving behaviors, and then propose a probabilistic sparse pair wise ranking method for estates. Specifically, we first extract the explicit features from online user reviews which express users' opinions about point of interests (POIs) near an estate. We also mine the implicit features from offline moving behaviors from multiple perspectives (e.g., Direction, volume, velocity, heterogeneity, topic, popularity, etc.). Then we learn an estate ranking predictor by combining a pair wise ranking objective and a sparsity regularization in a unified probabilistic framework. And we develop an effective solution for the optimization problem. Finally, we conduct a comprehensive performance evaluation with real world estate related data, and the experimental results demonstrate the competitive performance of both features and the proposed model.

【Keywords】: data mining; investment; optimisation; probability; real estate data processing; estate marketplace; estate ranking predictor; feature selection; investment value; offline moving behavior; online user review; opinion mining; optimization problem; pairwise ranking objective; probabilistic sparse pairwise ranking method; residential real estate; sparse real estate ranking; sparsity regularization; Data mining; Feature extraction; Investment; Mobile communication; Smart cards; Trajectory; Offline Moving Behaviors; Online User Reviews; Residential Real Estate; Sparse Ranking

14. Finding the Optimal Subspace for Clustering.

【Paper Link】【Pages】:130-139

【Authors】: Sebastian Goebl ; Xiao He ; Claudia Plant ; Christian Böhm

【Abstract】: The ability to simplify and categorize things is one of the most important elements of human thought, understanding, and learning. The corresponding explorative data analysis techniques -- dimensionality reduction and clustering -- have initially been studied by our community as two separate research topics. Later algorithms like CLIQUE, ORCLUS, 4C, etc. Performed clustering and dimensionality reduction in a joint, alternating process to find clusters residing in low-dimensional subspaces. Such a low-dimensional representation is extremely useful, because it allows us to visualize the relationships between the various objects of a cluster. However, previous methods of subspace, correlation or projected clustering determine an individual subspace for each cluster. In this paper, we demonstrate that it is even much more valuable to find clusters in one common low-dimensional subspace, because then we can study not only the intra-cluster but also the inter-cluster relationships of objects, and the relationships of the whole clusters to each other. We develop the mathematical foundation ORT (Optimal Rigid Transform) to determine an arbitrarily-oriented subspace, suitable for a given cluster structure. Based on ORT, we propose FOSSCLU (Finding the Optimal Sub Space for Clustering), a new iterative clustering algorithm. Our extensive experiments demonstrate that FOSSCLU outperforms the previous methods even in both aspects: clustering and dimensionality reduction.

【Keywords】: data analysis; iterative methods; pattern clustering; FOSSCLU; ORT; data analysis; dimensionality reduction; finding the optimal sub space for clustering; iterative clustering algorithm; low-dimensional representation; mathematical foundation; optimal rigid transform; optimal subspace; Clustering algorithms; Covariance matrices; Eigenvalues and eigenfunctions; Matrix decomposition; Noise; Principal component analysis; Transforms; Joint Subspace Clustering

15. A Collaborative Kalman Filter for Time-Evolving Dyadic Processes.

【Paper Link】【Pages】:140-149

【Authors】: San Gultekin ; John Paisley

【Abstract】: We present the collaborative Kalman filter (CKF), a dynamic model for collaborative filtering and related factorization models. Using the matrix factorization approach to collaborative filtering, the CKF accounts for time evolution by modeling each low-dimensional latent embedding as a multidimensional Brownian motion. Each observation is a random variable whose distribution is parameterized by the dot product of the relevant Brownian motions at that moment in time. This is naturally interpreted as a Kalman filter with multiple interacting state space vectors. We also present a method for learning a dynamically evolving drift parameter for each location by modeling it as a geometric Brownian motion. We handle posterior intractability via a mean-field variational approximation, which also preserves tractability for downstream calculations in a manner similar to the Kalman filter. We evaluate the model on several large datasets, providing quantitative evaluation on the 10 million Movie lens and 100 million Netflix datasets and qualitative evaluation on a set of 39 million stock returns divided across roughly 6,500 companies from the years 1962-2014.

【Keywords】: Brownian motion; Kalman filters; approximation theory; collaborative filtering; learning (artificial intelligence); matrix decomposition; CKF; Movie lens datasets; Netflix datasets; collaborative Kalman filter; dot product; drift parameter; geometric Brownian motion; low-dimensional latent modelling; matrix factorization approach; mean-field variational approximation; multidimensional Brownian motion; multiple interacting state space vectors; posterior intractability; random variable; time-evolving dyadic processes; Approximation methods; Collaboration; Data models; Equations; Kalman filters; Mathematical model; Vectors

16. SNOC: Streaming Network Node Classification.

【Paper Link】【Pages】:150-159

【Authors】: Ting Guo ; Xingquan Zhu ; Jian Pei ; Chengqi Zhang

【Abstract】: Many real-world networks are featured with dynamic changes, such as new nodes and edges, and modification of the node content. Because changes are continuously introduced to the network in a streaming fashion, we refer to such dynamic networks as streaming networks. In this paper, we propose a new classification method for streaming networks, namely streaming network node classification (SNOC). For streaming networks, the essential challenge is to properly capture the dynamic changes of the node content and node interactions to support node classification. While streaming networks are dynamically evolving, for a short temporal period, a subset of salient features are essentially tied to the network content and structures, and therefore can be used to characterize the network for classification. To achieve this goal, we propose to carry out streaming network feature selection (SNF) from the network, and use selected features as gauge to classify unlabeled nodes. A Laplacian based quality criterion is proposed to guide the node classification, where the Laplacian matrix is generated based on node labels and structures. Node classification is achieved by finding the class that results in the minimal gauging value with respect to the selected features. By frequently updating the features selected from the network, node classification can quickly adapt to the changes in the network for maximal performance gain. Experiments demonstrate that SNOC is able to capture changes in network structures and node content, and outperforms baseline approaches with significant performance gain.

【Keywords】: feature selection; matrix algebra; pattern classification; Laplacian based quality criterion; Laplacian matrix; SNF; SNOC; dynamic network; network content; real-world network; salient feature; streaming fashion; streaming network feature selection; streaming network node classification; Accuracy; Data mining; Educational institutions; Laplace equations; Linear programming; Optimization; Vectors; Classification; Dynamic; Feature Selection; Network

17. Dual-Domain Hierarchical Classification of Phonetic Time Series.

【Paper Link】【Pages】:160-169

【Authors】: Hossein Hamooni ; Abdullah Mueen

【Abstract】: Phonemes are the smallest units of sound produced by a human being. Automatic classification of phonemes is a well-researched topic in linguistics due to its potential for robust speech recognition. With the recent advancement of phonetic segmentation algorithms, it is now possible to generate datasets of millions of phonemes automatically. Phoneme classification on such datasets is a challenging data mining task because of the large number of classes (over a hundred) and complexities of the existing methods. In this paper, we introduce the phoneme classification problem as a data mining task. We propose a dual-domain (time and frequency) hierarchical classification algorithm. Our method uses a Dynamic Time Warping (DTW) based classifier in the top layers and time-frequency features in the lower layer. We cross-validate our method on phonemes from three online dictionaries and achieved up to 35% improvement in classification compared to existing techniques. We provide case studies on classifying accented phonemes and speaker invariant phoneme classification.

【Keywords】: data mining; linguistics; signal classification; speech recognition; time series; DTW based classifier; automatic classification; data mining; datasets; dual-domain hierarchical classification; dynamic time warping; frequency hierarchical classification algorithm; linguistics; online dictionaries; phonetic segmentation algorithms; phonetic time series; robust speech recognition; sound units; speaker invariant phoneme classification; time hierarchical classification algorithm; time-frequency features; Accuracy; Dictionaries; Robustness; Speech; Speech recognition; Standards; Time series analysis; Big data; Phoneme classification; time series mining

18. Sequence Classification Based on Delta-Free Sequential Patterns.

【Paper Link】【Pages】:170-179

【Authors】: Pierre Holat ; Marc Plantevit ; Chedy Raïssi ; Nadi Tomeh ; Thierry Charnois ; Bruno Crémilleux

【Abstract】: Sequential pattern mining is one of the most studied and challenging tasks in data mining. However, the extension of well-known methods from many other classical patterns to sequences is not a trivial task. In this paper we study the notion of δ-freeness for sequences. While this notion has extensively been discussed for itemsets, this work is the first to extend it to sequences. We define an efficient algorithm devoted to the extraction of δ-free sequential patterns. Furthermore, we show the advantage of the δ-free sequences and highlight their importance when building sequence classifiers, and we show how they can be used to address the feature selection problem in statistical classifiers, as well as to build symbolic classifiers which optimizes both accuracy and earliness of predictions.

【Keywords】: data mining; feature selection; pattern classification; statistical analysis; δ-free sequential patterns; δ-freeness; data mining; delta-free sequential patterns; feature selection problem; itemsets; sequence classification; sequential pattern mining; statistical classifiers; symbolic classifiers; Accuracy; Data mining; Feature extraction; Generators; Itemsets; Runtime; early classification; feature selection; free patterns; sequence mining; text classification

【Paper Link】【Pages】:180-189

【Authors】: Xia Hu ; Jiliang Tang ; Huiji Gao ; Huan Liu

【Abstract】: Social media is a popular platform for spammers to unfairly overwhelm normal users with unwanted or fake content via social networking. The spammers significantly hinder the use of social media systems for effective information dissemination and sharing. Different from the spammers in traditional platforms such as email and the Web, spammers in social media can easily connect with each other, sometimes without mutual consent. They collude with each other to imitate normal users by quickly accumulating a large number of "human" friends. In addition, content information in social media is noisy and unstructured. It is infeasible to directly apply traditional spammer detection methods in social media. Understanding and detecting deception has been extensively studied in traditional sociology and social sciences. Motivated by psychological findings in physical world, we investigate whether sentiment analysis can help spammer detection in online social media. In particular, we first conduct an exploratory study to analyze the sentiment differences between spammers and normal users, and then present an optimization formulation that incorporates sentiment information into a novel social spammer detection framework. Experimental results on real-world social media datasets show the superior performance of the proposed framework by harnessing sentiment analysis for social spammer detection.

【Keywords】: information dissemination; optimisation; psychology; social networking (online); unsolicited e-mail; information dissemination; information sharing; online social media; optimization formulation; psychological findings; sentiment information; social networking; social sciences; social spammer detection; sociology; Computational modeling; Data models; Laplace equations; Media; Sentiment analysis; Twitter

【Paper Link】【Pages】:190-199

【Authors】: Yan Hua ; Shuhui Wang ; Siyuan Liu ; Qingming Huang ; Anni Cai

【Abstract】: With the explosive growth of web data, effective and efficient technologies are in urgent needs for retrieving semantically relevant contents of heterogeneous modalities. Previous studies construct global transformations to project the heterogeneous data into a measurable subspace. However, global projections cannot appropriately adapt to diverse contents, and the naturally existing multi-level semantic relation in web data is ignored. We study the problem of semantic coherent retrieval, where documents from different modalities should be ranked by the semantic relevance to the queries. Accordingly, we propose TINA, a correlation learning method by Adaptive Hierarchical Semantic Aggregation. First, by joint modeling of content and ontology similarities, we build a semantic hierarchy to measure multi-level semantic relevance. Second, with a set of local linear projections aggregated by gating functions, we optimize the structure risk objective function that involves semantic coherence measurement, local projection consistency and the complexity penalty of local projections. Therefore, semantic coherence and a better bias-variance trade-off can be achieved by TINA. Extensive experiments on widely used NUS-WIDE and ICML-Challenge datasets demonstrate that TINA outperforms state-of-the-art, and achieves better adaptation to the multi-level semantic relation and content divergence.

【Keywords】: Internet; content-based retrieval; learning (artificial intelligence); relevance feedback; ICML-challenge dataset; NUS-WIDE dataset; TINA; Web data; adaptive hierarchical semantic aggregation; bias-variance trade-off; complexity penalty; content divergence; content similarity; correlation learning method; cross-modal correlation learning; gating function; global transformation; heterogeneous data; heterogeneous modality; joint modeling; local linear projection; local projection consistency; multilevel semantic relation; multilevel semantic relevance; ontology similarity; semantic coherence measurement; semantic coherent retrieval; semantic hierarchy; semantically relevant content retrieval; structure risk objective function; Adaptation models; Coherence; Correlation; Ontologies; Semantics; Training; Visualization; Cross-modal retrieval; Local correlation learning; Semantic hierarchy

21. Diverse Power Iteration Embeddings and Its Applications.

【Paper Link】【Pages】:200-209

【Authors】: Hao Huang ; Shinjae Yoo ; Dantong Yu ; Hong Qin

【Abstract】: Spectral Embedding is one of the most effective dimension reduction algorithms in data mining. However, its computation complexity has to be mitigated in order to apply it for real-world large scale data analysis. Many researches have been focusing on developing approximate spectral embeddings which are more efficient, but meanwhile far less effective. This paper proposes Diverse Power Iteration Embeddings (DPIE), which not only retains the similar efficiency of power iteration methods but also produces a series of diverse and more effective embedding vectors. We test this novel method by applying it to various data mining applications (e.g. Clustering, anomaly detection and feature selection) and evaluating their performance improvements. The experimental results show our proposed DPIE is more effective than popular spectral approximation methods, and obtains the similar quality of classic spectral embedding derived from eigen-decompositions. Moreover it is extremely fast on big data applications. For example in terms of clustering result, DPIE achieves as good as 95% of classic spectral clustering on the complex datasets but 4000+ times faster in limited memory environment.

【Keywords】: Big Data; computational complexity; data analysis; data mining; eigenvalues and eigenfunctions; pattern clustering; vectors; DPIE; approximate spectral embeddings; big data applications; computation complexity; data mining; dimension reduction algorithms; diverse power iteration embeddings; eigendecompositions; embedding vectors; large scale data analysis; spectral clustering; Approximation algorithms; Clustering algorithms; Complexity theory; Data mining; Eigenvalues and eigenfunctions; Equations; Vectors

22. Noise-Resistant Unsupervised Feature Selection via Multi-perspective Correlations.

【Paper Link】【Pages】:210-219

【Authors】: Hao Huang ; Shinjae Yoo ; Dantong Yu ; Hong Qin

【Abstract】: Unsupervised feature selection is an important issue for high dimensional dataset analysis. However popular methods are susceptible to noisy instances (observations) or noisy features. We propose a noise-resistant feature selection algorithm by capturing multi-perspective correlations. Our proposed approach, called Noise-Resistant Unsupervised Feature Selection (NRFS), is based on multi-perspective correlation that reflects the importance of feature with respect to noise-resistant representative instances and various global trends from spectral decomposition. In this way, the model concisely captures a wide variety of local patterns. Experimental results demonstrate the effectiveness of our algorithm.

【Keywords】: data analysis; feature selection; NRFS; global trends; high dimensional dataset analysis; local patterns; multiperspective correlations; noise-resistant feature selection algorithm; noise-resistant representative instances; noise-resistant unsupervised feature selection; spectral decomposition; Bismuth; Clustering algorithms; Correlation; Equations; Manifolds; Mathematical model; Noise measurement

23. Technology Prospecting for High Tech Companies through Patent Mining.

【Paper Link】【Pages】:220-229

【Authors】: Bo Jin ; Yong Ge ; Hengshu Zhu ; Li Guo ; Hui Xiong ; Chao Zhang

【Abstract】: Technology prospecting is a process to evaluate the potential business values of high tech companies from the technology perspective. In this paper, we provide a new view-angle to understand technology prospecting by studying the evolving distributions of technologies in the companies. Specifically, we first exploit topic models to learn technological context in the form of probabilistic distributions of assignees and locations from large-scale patent documents. Then, we develop a matching solution to measure the relationships between patent topics and the description documents of technology terms. In this way, we can obtain the distribution of technologies for each company. In addition, we are able to assess the technology prospecting of a company by a designed indicator, which allows to compare the levels of discrepancies between the emerging technology distributions available as Garner Hype Cycles and the distribution of technologies of the company. Finally, experimental results on real-world patent data show the effectiveness of our approach for technology prospecting.

【Keywords】: data mining; innovation management; patents; pattern matching; statistical distributions; technology management; Garner Hype Cycles; business values; high tech companies; large-scale patent documents; patent topics; probabilistic distributions; real-world patent data mining; technology distribution; technology prospecting; Companies; Electronic publishing; Encyclopedias; Internet; Patents; Hype Cycle; Patent Mining; Technology Prospecting; Topic Modeling

24. News Credibility Evaluation on Microblog with a Hierarchical Propagation Model.

【Paper Link】【Pages】:230-239

【Authors】: Zhiwei Jin ; Juan Cao ; Yu-Gang Jiang ; Yongdong Zhang

【Abstract】: Benefiting from its openness, collaboration and real-time features, Micro blog has become one of the most important news communication media in modern society. However, it is also filled with fake news. Without verification, such information could spread promptly through social network and result in serious consequences. To evaluate news credibility on Micro blog, we propose a hierarchical propagation model. We detect sub-events within a news event to describe its detailed aspects. Thus, for a news event, a three-layer credibility network consisting of event, sub-events and messages can represent it from different scale and reveal vital information for credibility evaluation. After linking these entities with their semantic and social associations, the credibility value of each entity is propagated on this network to achieve the final evaluation result. By formulating this propagation process as a graph optimization problem, we provide a globally optimal solution with an iterative algorithm. Experiments conducted on two real-world datasets show that the proposed model boosts the accuracy by more than 6% and the F-score by more than 16% over a baseline method.

【Keywords】: information analysis; iterative methods; optimisation; social networking (online); graph optimization problem; hierarchical propagation model; iterative algorithm; microblog; news communication media; news credibility evaluation; social network; three-layer credibility network; Clustering algorithms; Feature extraction; Media; Optimization; Semantics; Symmetric matrices; Vectors; Microblog; Social media credibility; news credibility; rumor detection

25. Neural Conditional Energy Models for Multi-label Classification.

【Paper Link】【Pages】:240-249

【Authors】: How Jing ; Shou-De Lin

【Abstract】: Multi-label classification (MLC) is a type of structured output prediction problems where a given instance can be associated to more than one labels at a time. From the probabilistic point of view, a model predicts a set of labels y given an input vector v by learning a conditional distribution p(y|v). This paper presents a powerful model called a Neural Conditional Energy Model (NCEM) to solve MLC. The model can be viewed as a hybrid deterministic-stochastic network of which we use a deterministic neural network to transform the input data, before contributing to the energy landscape of v, y, and a single stochastic hidden layer h. Non-linear transformation given by the neural network makes our model more expressive and more capable of capturing complex relations between input and output, and using deterministic neurons facilitates exact inference. We present an efficient learning algorithm that is simple to implement. We conduct extensive experiments on 15 real-world datasets from wide variety of domains with various evaluation metrics to confirm that NCEM is significantly superior to current state-of-the-art models most of the time based on pair-wise t-test at 5% significance level. The MATLAB source code to replicate our experiments are available at https://github.com/Kublai-Jing/NCEM.

【Keywords】: learning (artificial intelligence); mathematics computing; neural nets; pattern classification; probability; stochastic processes; vectors; MLC; Matlab source code; NCEM; conditional distribution; deterministic neural network; multilabel classification; neural conditional energy models; nonlinear transformation; probabilistic point of view; stochastic hidden layer; vector; Computational modeling; Correlation; Data models; Mathematical model; Stochastic processes; Training; Vectors; Multi-Label Classification; Probabilistic Modeling

26. A Transfer Probabilistic Collective Factorization Model to Handle Sparse Data in Collaborative Filtering.

【Paper Link】【Pages】:250-259

【Authors】: How Jing ; An-Chun Liang ; Shou-De Lin ; Yu Tsao

【Abstract】: Data Sparsity incurs serious concern in collaborative filtering (CF). This issue is especially critical for newly launched CF applications where observed ratings are too scarce to learn a good model to predict missing values. There could be, however, information from other related domains which are with relatively denser data that can be utilized. This paper proposes a transfer-learning based approach that exploits probabilistic matrix factorization model trained with variational expectation-maximization (VIM) to resolve data sparsity by using information from multiple auxiliary domains. We conduct experiments on several data combination and report significant improvements over state-of-the-art transfer-based models for collaborative filtering. The results also show that our framework is the only solution that can achieve acceptable performance when each user has only one single rating. The code of our model is available at https://github.com/Kublai-Jing/TIC https://github.com/Kublai-Jing/TIC.

【Keywords】: collaborative filtering; data handling; expectation-maximisation algorithm; learning (artificial intelligence); matrix decomposition; variational techniques; VIM; collaborative filtering; data sparsity; probabilistic matrix factorization model; sparse data handling; transfer probabilistic collective factorization model; transfer-learning based approach; variational expectation-maximization; Adaptation models; Collaboration; Data models; Equations; Mathematical model; Motion pictures; Probabilistic logic; Collaborative Filtering; Data Sparsity; Probabilistic Modeling

27. An Examination of Multivariate Time Series Hashing with Applications to Health Care.

【Paper Link】【Pages】:260-269

【Authors】: David C. Kale ; Dian Gong ; Zhengping Che ; Yan Liu ; Gérard G. Medioni ; Randall C. Wetzel ; Patrick Ross

【Abstract】: As large-scale multivariate time series data become increasingly common in application domains, such as health care and traffic analysis, researchers are challenged to build efficient tools to analyze it and provide useful insights. Similarity search, as a basic operator for many machine learning and data mining algorithms, has been extensively studied before, leading to several efficient solutions. However, similarity search for multivariate time series data is intrinsically challenging because (1) there is no conclusive agreement on what is a good similarity metric for multivariate time series data and (2) calculating similarity scores between two time series is often computationally expensive. In this paper, we address this problem by applying a generalized hashing framework, namely kernelized locality sensitive hashing, to accelerate time series similarity search with a series of representative similarity metrics. Experiment results on three large-scale clinical data sets demonstrate the effectiveness of the proposed approach.

【Keywords】: data mining; health care; learning (artificial intelligence); time series; data mining algorithms; health care; kernelized locality sensitive hashing; large-scale clinical data sets; large-scale multivariate time series data; machine learning; multivariate time series hashing; representative similarity metrics; time series similarity search; Databases; Euclidean distance; Kernel; Time measurement; Time series analysis; Vectors; alignment; dynamic time warping; hashing; kernel methods; nearest neighbor; search; similarity; time series

28. Probabilistic Latent Document Network Embedding.

【Paper Link】【Pages】:270-279

【Authors】: Tuan M. V. Le ; Hady Wirawan Lauw

【Abstract】: A document network refers to a data type that can be represented as a graph of vertices, where each vertex is associated with a text document. Examples of such a data type include hyperlinked Web pages, academic publications with citations, and user profiles in social networks. Such data have very high-dimensional representations, in terms of text as well as network connectivity. In this paper, we study the problem of embedding, or finding a low-dimensional representation of a document network that "preserves" the data as much as possible. These embedded representations are useful for various applications driven by dimensionality reduction, such as visualization or feature selection. While previous works in embedding have mostly focused on either the textual aspect or the network aspect, we advocate a holistic approach by finding a unified low-rank representation for both aspects. Moreover, to lend semantic interpretability to the low-rank representation, we further propose to integrate topic modeling and embedding within a joint model. The gist is to join the various representations of a document (words, links, topics, and coordinates) within a generative model, and to estimate the hidden representations through MAP estimation. We validate our model on real-life document networks, showing that it outperforms comparable baselines comprehensively on objective evaluation metrics.

【Keywords】: document handling; embedded systems; feature selection; graph theory; probability; MAP estimation; data preservation; dimensionality reduction; embedding problem; feature selection; low-dimensional representation; probabilistic latent document network embedding; real-life document networks; semantic interpretability; unified low-rank representation; Data visualization; Educational institutions; Joints; Mathematical model; Nickel; Semantics; Visualization; dimensionality reduction; document network; embedding; generative model; topic modeling; visualization

29. Locating POS Terminals from Credit Card Transactions.

【Paper Link】【Pages】:280-289

【Authors】: Chao Li ; Jia Chen ; Jun Luo

【Abstract】: Credit card is a popular payment method and the transaction data keeps track of purchasing activities in people's daily lives. Extracting location of people's activities is an important task in many data mining problems because it may greatly help improve user experience and the service provided to people. Locating people from credit card transactions is equivalent to determining the location of every POS terminal where a payment takes place. This is however not an easy task because the locations of terminals are not usually provided to the credit card issuing companies and only a few terminals can be unambiguously located through map service by providing the merchants' names. In this paper, we propose a system to infer the locations of POS terminals using transaction data and map service. We first construct a transaction graph where the nodes are POS terminals. We then propose a two phase algorithm to find out uncertain and unknown locations of the terminals. In the first phase, we try to eliminate the uncertainty of POS terminals with multiple candidate locations. We show this problem is NP-hard and then give an effective heuristic algorithm to solve it. In the second phase, we compute the locations of unknown POS terminals by propagating the locations of known ones with spatial-temporal constraints. The algorithm is evaluated using a real-world credit card transaction data set and the result is promising for business applications.

【Keywords】: computational complexity; credit transactions; data mining; graph theory; purchasing; NP-hard; POS terminal locating; business applications; credit card transactions; data mining; location extraction; map service; payment method; purchasing activities; spatial-temporal constraints; transaction data; transaction graph; two phase algorithm; Companies; Credit cards; Data mining; Electronic mail; Trajectory; Uncertainty; POS; credit card transaction; location

30. Detecting Campaign Promoters on Twitter Using Markov Random Fields.

【Paper Link】【Pages】:290-299

【Authors】: Huayi Li ; Arjun Mukherjee ; Bing Liu ; Rachel Kornfield ; Sherry Emery

【Abstract】: As social media is becoming an increasingly important source of public information, companies, organizations and individuals are actively using social media platforms to promote their products, services, ideas and ideologies. Unlike promotional campaigns on TV or other traditional mass media platforms, campaigns on social media often appear in stealth modes. Campaign promoters often try to influence people's behaviors/opinions/decisions in a latent manner such that the readers are not aware that the messages they see are strategic campaign posts aimed at persuading them to buy target products/services. Readers take such campaign posts as just organic posts from the general public. It is thus important to discover such campaigns, their promoter accounts and how the campaigns are organized and executed as it can uncover the dynamics of Internet marketing. This discovery is clearly useful for competitors and also the general public. However, so far little work has been done to solve this problem. In this paper, we study this important problem in the context of the Twitter platform. Given a set of tweets streamed from Twitter based on a set of keywords representing a particular topic, the proposed technique aims to identify user accounts that are involved in promotion. We formulate the problem as a relational classification problem and solve it using typed Markov Random Fields (T-MRF), which is proposed as a generalization of the classic Markov Random Fields. Our experiments are carried out using three real-life datasets from the health science domain related to smoking. Such campaigns are interesting to health scientists, government health agencies and related businesses for obvious reasons. Our results show that the proposed method is highly effective.

【Keywords】: Internet; Markov processes; marketing data processing; pattern classification; random processes; social networking (online); Internet marketing; T-MRF; Twitter; campaign promoter detection; health science domain; public information; relational classification problem; social media platforms; strategic campaign posts; typed Markov random fields; Belief propagation; Context; Markov random fields; Media; Random variables; Twitter; Uniform resource locators; Campaign Promoter; Markov Random Fields

31. LRBM: A Restricted Boltzmann Machine Based Approach for Representation Learning on Linked Data.

【Paper Link】【Pages】:300-309

【Authors】: Kang Li ; Jing Gao ; Suxin Guo ; Nan Du ; Xiaoyi Li ; Aidong Zhang

【Abstract】: Linked data consist of both node attributes, e.g., Preferences, posts and degrees, and links which describe the connections between nodes. They have been widely used to represent various network systems, such as social networks, biological networks and etc. Knowledge discovery on linked data is of great importance to many real applications. One of the major challenges of learning linked data is how to effectively and efficiently extract useful information from both node attributes and links in linked data. Current studies on this topic either use selected topological statistics to represent network structures, or linearly map node attributes and network structures to a shared latent feature space. However, while approaches based on statistics may miss critical patterns in network structure, approaches based on linear mappings may not be sufficient to capture the non-linear characteristics of nodes and links. To handle the challenge, we propose, to our knowledge, the first deep learning method to learn from linked data. A restricted Boltzmann machine model named LRBM is developed for representation learning on linked data. In LRBM, we aim to extract the latent feature representation of each node from both node attributes and network structures, non-linearly map each pair of nodes to the links, and use hidden units to control the mapping. The details of how to adapt LRBM for link prediction and node classification on linked data have also been presented. In the experiments, we test the performance of LRBM as well as other baselines on link prediction and node classification. Overall, the extensive experimental evaluations confirm the effectiveness of the proposed LRBM model in mining linked data.

【Keywords】: Boltzmann machines; data mining; knowledge representation; learning (artificial intelligence); semantic Web; LRBM; Linked Data mining; deep learning method; knowledge discovery; latent feature representation; representation learning; restricted Boltzmann machine model; Data mining; Data models; Feature extraction; Probability distribution; Receivers; Social network services; Tensile stress; Deep Learning; Linked Data; Representation Learning; Restricted Boltzmann Machine

32. Early Classification of Ongoing Observation.

【Paper Link】【Pages】:310-319

【Authors】: Kang Li ; Sheng Li ; Yun Fu

【Abstract】: This work focuses on early classification of ongoing observation of the object, which is beneficial for a number of applications that require time-critical decision making. We propose an approach for discovering two key aspects of multivariate time series (m.t.s.) observation, (1) Temporal Dynamics and (2) Sequential Cues. The key idea is that m.t.s. Observation can be represented as an instantiation of a Multivariate Marked Point-Process (Multi-MPP). Each variable characterizes the temporal dynamics of a particular feature event of an object, where both timing and strength information of that feature event are preserved. To make this model computationally practical, we introduce the Multilevel-Discretized Marked Point-Process (MD-MPP) model which can ensure a good piece-wise stationary property both in the time-domain and mark-space while preserving dynamics as much as possible. Based on this model, another important temporal patterns of early classification, sequential cues among variables, becomes formalizable. We construct a probabilistic suffix tree to represent sequential patterns among features in terms of Variable order Markov Model (VMM). The effectiveness of our approach is evaluated on three experimental scenarios. Our method achieves superior performance for early classification of ongoing m.t.s. Observation data.

【Keywords】: Markov processes; data mining; decision making; learning (artificial intelligence); mathematics computing; pattern classification; time series; MD-MPP; Multi-MPP; VMM; data mining; early classification; machine learning; multilevel-discretized marked point-process; multivariate marked point-process; multivariate time series observation; sequential cues; temporal dynamics; time-critical decision making; variable order Markov model; Computational modeling; Correlation; Detectors; Heuristic algorithms; Stochastic processes; Time series analysis; Training; Early Classification; Sequential Cue; Temporal Dynamics; Time Series

33. Identifying Recurrent and Unknown Performance Issues.

【Paper Link】【Pages】:320-329

【Authors】: Meng-Hui Lim ; Jian-Guang Lou ; Hongyu Zhang ; Qiang Fu ; Andrew Beng Jin Teoh ; Qingwei Lin ; Rui Ding ; Dongmei Zhang

【Abstract】: For a large-scale software system, especially an online service system, when a performance issue occurs, it is desirable to check whether this issue has occurred before. If there are past similar issues, a known remedy could be applied. Otherwise, a new troubleshooting process may have to be initiated. The symptom of a performance issue can be characterized by a set of metrics. Due to the sophisticated nature of software systems, manual diagnosis of performance issues based on metric data is typically expensive and laborious. In this paper, we propose a Hidden Markov Random Field (HMRF) based approach to automatic identification of recurrent and unknown performance issues. We formulate the problem of issue identification as a HMRF-based clustering problem. Our approach incorporates the learning of metric discretization thresholds and the optimization of issue clustering. Based on the learned thresholds and cluster centroids, we can achieve accurate identification of recurrent issues and unknown issues. Experimental evaluations on an open benchmark and a large-scale industrial production system show that our approach is effective and outperforms the related state-of-the-art approaches.

【Keywords】: hidden Markov models; pattern clustering; HMRF-based clustering problem; automatic identification; cluster centroids; hidden Markov random field; issue clustering; issue identification; large-scale industrial production system; large-scale software system; metric data; metric discretization thresholds; online service system; troubleshooting process; unknown performance issues; Clustering algorithms; Fingerprint recognition; Hidden Markov models; Measurement; Monitoring; Production systems; Vectors; Issue identification; automated diagnosis; duplication detection; metrics; performance

34. Steering Information Diffusion Dynamically against User Attention Limitation.

【Paper Link】【Pages】:330-339

【Authors】: Shuyang Lin ; Qingbo Hu ; Fengjiao Wang ; Philip S. Yu

【Abstract】: As viral marketing in online social networks flourishes recently, a lot of attention has been drawn to the study of influence maximization in social networks. However, most works in influence maximization have overlooked the important role that social network providers (websites) play in the diffusion processes. Viral marketing campaigns are usually sold by websites as services to their clients. The websites can not only select initial sets of users to start diffusion processes, but can also have impacts throughout the diffusion processes by deciding when the information should be brought to the attention of individual users. This is especially true when user attention is limited, and the websites have to notify users about an item to bring it into the attention of users. In this paper, we study the diffusion of information from the perspective of social network websites. We propose a novel push-driven cascade (PDC) model, which emphasizes the role of websites during the diffusion of information. In the PDC model, the website "pushes" items to bring them to the attention of users, and whether a user is interested in an item is decided by her preference and the social influence from her friends. Analogous to the influence maximization problem on the traditional information diffusion models, we propose a dynamic influence maximization problem on the PDC model, which is defined as a sequential decision making problem for the website. We show that the problem can be formalized as a Markov sequential decision problem, and there exists a deterministic Markovian policy that is an optimal solution for the problem. We develop an AO algorithm that finds the optimal solution for the problem, and a heuristic online search algorithm, which has similar effectiveness, but is significantly more efficient. We evaluate the proposed algorithms on various real-world datasets, and find them significantly outperform the baselines.

【Keywords】: Markov processes; information dissemination; marketing data processing; optimisation; search problems; social networking (online); AO* algorithm; Markov sequential decision problem; Markovian policy; PDC model; heuristic online search algorithm; influence maximization problem; information diffusion; online social networks; optimal solution; push-driven cascade model; sequential decision making problem; social network Web sites; social network providers; user attention limitation; viral marketing campaigns; Diffusion processes; Heuristic algorithms; History; Markov processes; Mathematical model; Twitter; influence maximization; information diffusion; push-driven cascade; user attention

35. UnTangle: Visual Mining for Data with Uncertain Multi-labels via Triangle Map.

【Paper Link】【Pages】:340-349

【Authors】: Yu-Ru Lin ; Nan Cao ; David Gotz ; Lu Lu

【Abstract】: Data with multiple uncertain labels are common in many situations. For examples, a movie may be associated with multiple genres with different levels of confidence, and a protein sequence may be probabilistically assigned to several structural subcategories. Despite their ubiquity, the problem of visualizing uncertain labels has not been adequately addressed. Existing approaches often either discard the uncertainty information, or map the data to a low-dimensional subspace where their associations with multiple labels are obscured. In this paper, we propose a novel visual mining technique, UnTangle, for visualizing uncertain multi-labels. In our proposed visualization, data items are placed inside a web of connected triangles, with labels assigned to the triangle vertices such that nearby labels are more relevant to each other. The positions of the data items are determined based on the probabilistic associations between items and labels. UnTangle provides both (a) an automatic label placement algorithm, and (b) adaptive interaction mechanisms that allow users to control the label positioning for different visual queries. Our work makes a unique contribution by providing an effective way to investigate the relationship between data items and their uncertain labels, as well as the relationships among labels. Our user study suggests that the visualization effectively helps users discover emergent patterns and compare the nuances of uncertainty information in the data labels.

【Keywords】: data mining; data visualisation; UnTangle; adaptive interaction mechanisms; automatic label placement algorithm; data items; data mining; emergent pattern discovery; label positioning control; low-dimensional subspace; multiple genres; multiple uncertain labels; probabilistic associations; protein sequence; triangle map; uncertain multilabel visualization; uncertain multilabels; uncertainty information; visual mining; visual queries; visualizing uncertain labels; Data mining; Data visualization; Distributed databases; Motion pictures; Probabilistic logic; Uncertainty; Visualization; multi-labels; probablistic labels; ternary plot; uncertainty data; visual mining

【Paper Link】【Pages】:350-359

【Authors】: Qi Liu ; Zheng Dong ; Chuanren Liu ; Xing Xie ; Enhong Chen ; Hui Xiong

【Abstract】: The emergence of social networks has provided opportunities for both targeted marketing and viral marketing. By concentrating the efforts on a few key customers, targeted marketing could make the promotion of the items (products) much easier and more cost-effective. On the other hand, viral marketing aims at finding a set of individuals (seeds) to maximize the word-of-mouth propagation of an item. However, these two marketing strategies can only exploit some specific characteristics of the social networks, and the problem of how to combine them together to build a better, stronger business is still open. To that end, in this paper, we propose a general approach for integrated marketing. Specifically, to market a given item, we first generate the item-specific candidate users by a recommendation algorithm, and then select the typical users who have the best balanced utility scores and consumption/social entropy. Next, treating typical users as targeted customers, we study the problem of maximizing information awareness in viral marketing with these constrained targets. Along this line, we define it as a constrained coverage maximization problem, and propose three solutions: GMIC, LMIC and QMIC. Finally, extensive experimental results on real-world datasets demonstrate that our integrated marketing approach could outperform the methods that consider only targeted marketing or viral marketing.

【Keywords】: feature selection; marketing data processing; optimisation; recommender systems; social networking (online); GMIC; LMIC; QMIC; constrained coverage maximization problem; recommendation algorithm; social marketing; social network; targeted marketing; user selection; viral marketing; Collaboration; Computational modeling; Cultural differences; Entropy; Greedy algorithms; Linear programming; Social network services; Recommendation; Social Marketing; Targeted Marketing; Viral Marketing

37. Exploiting Heterogeneous Human Mobility Patterns for Intelligent Bus Routing.

【Paper Link】【Pages】:360-369

【Authors】: Yanchi Liu ; Chuanren Liu ; Nicholas Jing Yuan ; Lian Duan ; Yanjie Fu ; Hui Xiong ; Songhua Xu ; Junjie Wu

【Abstract】: Optimal planning for public transportation is one of the keys to sustainable development and better quality of life in urban areas. Compared to private transportation, public transportation uses road space more efficiently and produces fewer accidents and emissions. In this paper, we focus on the identification and optimization of flawed bus routes to improve utilization efficiency of public transportation services, according to people's real demand for public transportation. To this end, we first provide an integrated mobility pattern analysis between the location traces of taxicabs and the mobility records in bus transactions. Based on mobility patterns, we propose a localized transportation mode choice model, with which we can accurately predict the bus travel demand for different bus routing. This model is then used for bus routing optimization which aims to convert as many people from private transportation to public transportation as possible given budget constraints on the bus route modification. We also leverage the model to identify region pairs with flawed bus routes, which are effectively optimized using our approach. To validate the effectiveness of the proposed methods, extensive studies are performed on real world data collected in Beijing which contains 19 million taxi trips and 10 million bus trips.

【Keywords】: intelligent transportation systems; optimisation; public transport; road vehicles; sustainable development; vehicle routing; budget constraint; bus route modification; bus routing optimization; bus transaction; bus travel demand; flawed bus route; heterogeneous human mobility pattern; integrated mobility pattern analysis; intelligent bus routing; localized transportation mode choice model; mobility record; optimal planning; private transportation; public transportation services; quality of life; road space; sustainable development; taxicabs; utilization efficiency; Bridges; Cities and towns; Optimization; Roads; Routing; Vectors; bus routing; human mobility pattern; transportation

38. Anomaly Detection Using the Poisson Process Limit for Extremes.

【Paper Link】【Pages】:370-379

【Authors】: Stijn Luca ; Peter Karsmakers ; Bart Vanrumste

【Abstract】: Anomaly detection starts from a model of normal behavior and classifies departures from this model as anomalies. This paper introduces a statistical non-parametric approach for anomaly detection that is based on a multivariate extension of the Poisson point process model for univariate extremes. The method is demonstrated on both a synthetic and a real-world data set, the latter being an unbalanced data set of acceleration data collected from movements of 7 pediatric patients suffering from epilepsy that is previously studied in [1]. The positive predictive values could be improved with an increase up to 12.9% (and a mean of 7%) while the sensitivity scores stayed unaltered. The proposed method was also shown to outperform an one-class SVM classifier. Because the Poisson point process model of extremes is able to combine information on the number of excesses over a fixed threshold with that on the excess values, a powerful model to detect anomalies is obtained that can be of high value in many applications.

【Keywords】: nonparametric statistics; security of data; stochastic processes; Poisson point process model; anomaly detection; epilepsy; fixed threshold; multivariate extension; one-class SVM classifier; pediatric patients; positive predictive values; real-world data set; sensitivity scores; statistical nonparametric approach; synthetic data set; unbalanced data set; univariate extremes; Brain modeling; Data models; Estimation; Hidden Markov models; Kernel; Testing; Vectors; Poisson point process; anomaly detection; extreme value statistics; semi-supervised; unbalanced data

39. Ratable Aspects over Sentiments: Predicting Ratings for Unrated Reviews.

【Paper Link】【Pages】:380-389

【Authors】: Wenjuan Luo ; Fuzhen Zhuang ; Xiaohu Cheng ; Qing He ; Zhongzhi Shi

【Abstract】: Most existing rat able aspect generating methods for aspect mining focus on identifying and rating aspects of reviews with overall ratings, while huge amount of unrated reviews are beyond their ability. This drawback motivates the research problem in this paper: predicting aspect ratings and overall ratings for unrated reviews. To solve this problem, we novelly propose a topic model based on Latent Dirichlet Allocation with indirect supervision. Compared with the previous bag-of-words representation of review documents, we utilize the quad-tuples of (head, modifier, rating, entity) to explicitly model the associations between modifiers and ratings. Specifically, our solution for aspect mining in unrated reviews is decomposed into three steps. Firstly, rat able aspects are generated over sentiments from training reviews with overall ratings. Afterwards, inference of aspect identification and rating for unrated reviews are provided. Finally, overall ratings are predicted for unrated reviews. Under this framework, aspect and sentiment associations are captured in the form of joint probabilities through a generative process. The effectiveness of our approach is testified on a real-world dataset crawled from Trip Advisor http://www.tripadvisor.com/, and extensive experiments show that our method significantly outperforms state-of-the-art methods.

【Keywords】: data mining; document handling; aspect mining; bag-of-words representation; latent Dirichlet allocation; ratable aspect generating methods; review documents; sentiment associations; Equations; Feature extraction; Gaussian distribution; Inference algorithms; Mathematical model; Numerical models; Training; Aspect Identification; Aspect Rating Prediction; Overall Rating Prediction

40. Fast and Exact Monitoring of Co-Evolving Data Streams.

【Paper Link】【Pages】:390-399

【Authors】: Yasuko Matsubara ; Yasushi Sakurai ; Naonori Ueda ; Masatoshi Yoshikawa

【Abstract】: Given a huge stream of multiple co-evolving sequences, such as motion capture and web-click logs, how can we find meaningful patterns and spot anomalies? Our aim is to monitor data streams statistically, and find sub sequences that have the characteristics of a given hidden Markov model (HMM). For example, consider an online web-click stream, where massive amounts of access logs of millions of users are continuously generated every second. So how can we find meaningful building blocks and typical access patterns such as weekday/weekend patterns, and also, detect anomalies and intrusions? In this paper, we propose Stream Scan, a fast and exact algorithm for monitoring multiple co-evolving data streams. Our method has the following advantages: (a) it is effective, leading to novel discoveries and surprising outliers, (b) it is exact, and we theoretically prove that Stream Scan guarantees the exactness of the output, (c) it is fast, and requires O (1) time and space per time-tick. Our experiments on 67GB of real data illustrate that Stream Scan does indeed detect the qualifying subsequence patterns correctly and that it can offer great improvements in speed (up to 479,000 times) over its competitors.

【Keywords】: computational complexity; hidden Markov models; media streaming; HMM; StreamScan; access pattern detection; co-evolving data stream monitoring; hidden Markov model; Algorithm design and analysis; Heuristic algorithms; Hidden Markov models; Legged locomotion; Monitoring; Periodic structures; Viterbi algorithm; Co-evolving data streams; Hidden Markov models

41. Ternary Matrix Factorization.

【Paper Link】【Pages】:400-409

【Authors】: Samuel Maurus ; Claudia Plant

【Abstract】: Can we learn from the unknown? Logical data sets of the ternary kind are often found in information systems. They contain unknown as well as true/false values. An unknown value may represent a missing entry (lost or indeterminable) or something with meaning, like a "Don't Know" response in a questionnaire. In this paper we introduce an effectively- and efficiently-superior algorithm for reducing the dimensionality of logical data (categorical data in general) in the context of a new data mining challenge: Ternary Matrix Factorization (TMF). For a ternary data matrix, TMF exploits ternary logic to produce a basis matrix (which holds the major patterns in the data) and a usage matrix (which maps patterns to original observations). Both matrices are interpretable, and their ternary matrix product approximates the original matrix. TMF has applications in 1) finding targeted structure in ternary data, 2) imputing values through pattern-discovery in highly-incomplete categorical data sets, and 3) solving instances of its encapsulated Binary Matrix Factorization (BMF) problem. Our elegant algorithm Faster (Fast Ternary Matrix Factorization) has linear run-time complexity with respect to the dimensions of the data set and is parameter-robust. Experiments on synthetic and real-world data sets show that we are able to efficiently and effectively outperform state-of-the-art techniques in all three TMF applications.

【Keywords】: data mining; matrix decomposition; ternary logic; FasTer; TMF; basis matrix; data mining; encapsulated BMF problem; encapsulated binary matrix factorization problem; fast ternary matrix factorization; highly-incomplete categorical data sets; logical data dimensionality reduction; pattern-discovery; ternary data matrix; ternary logic; usage matrix; Complexity theory; Context; Matrix decomposition; Multivalued logic; Optimization; Uncertainty; Vectors; Three-valued logic; dimensionality reduction; imputation; matrix factorization; missing values; ternary data

42. Classification by CUT: Clearance under Threshold.

【Paper Link】【Pages】:410-419

【Authors】: Ryan McBride ; Ke Wang ; Wenyuan Li

【Abstract】: Identifying bad objects hidden amidst many good objects is important for public safety and decision-making. These problems are complicated in that the cost of leaving a bad object unidentified may not be specified easily, making it difficult to apply existing cost-sensitive classification that depends on knowing a cost matrix or cost distribution. A compelling case for this "illusive cost" issue is presented in our project of identifying contaminated transformers with an industrial partner. To address this problem, we present an alternative formulation of cost-sensitive classification, Clearance Under Threshold (CUT) Classification. Given a training set, CUT classification is to partition the attribute space such that a partition is cleared if the probability of a future object in this partition being bad is less than a user-specified threshold. The goal is to clear many low-risk objects so that users can more effectively target high-risk objects. We present a solution to this problem and evaluate it on a case study for clearing contaminated transformers and on public benchmarks from UC Irvine's Machine Learning Repository. According to the experiments, our algorithms performed far better than the baselines derived from previous classification approaches.

【Keywords】: decision making; learning (artificial intelligence); object recognition; pattern classification; probability; CUT classification; UC Irvine machine learning repository; clearance under threshold classification; contaminated transformers; cost distribution; cost matrix; cost-sensitive classification; decision making; future object probability; hidden object identification; high-risk objects; illusive cost issue; public benchmarks; public safety; Decision trees; Educational institutions; Oil insulation; Power transformer insulation; Sociology; Training; Transmission line matrix methods; Classification; Classification for Imbalanced Data

43. Modeling Adoptions and the Stages of the Diffusion of Innovations.

【Paper Link】【Pages】:420-429

【Authors】: Yasir Mehmood ; Nicola Barbieri ; Francesco Bonchi

【Abstract】: We study the data mining problem of modeling adoptions and the stages of the diffusion of an innovation. For our aim we propose a stochastic model which decomposes a diffusion trace (sequence of adoptions) in an ordered sequence of stages, where each stage is intuitively built around two dimensions: users and relative speed at which adoptions happen. Each stage is characterized by a specific rate of adoption and it involves different users to different extent, while the sequentiality in the diffusion is guaranteed by constraining the transition probabilities among stages. An empirical evaluation on synthetic and real-world adoption logs shows the effectiveness of the proposed framework in summarizing the adoption process, enabling several analysis tasks such as the identification of adopter categories, clustering and characterization of diffusion traces, and prediction of which users will adopt an item in the next future.

【Keywords】: data mining; probability; stochastic processes; adopter category; adoption process; data mining problem; diffusion trace; innovation diffusion; modeling adoption; real-world adoption; stochastic model; synthetic adoption; task analysis; transition probability; Data models; Hidden Markov models; Mathematical model; Social network services; Sociology; Statistics; Technological innovation

44. Stock Market Prediction from WSJ: Text Mining via Sparse Matrix Factorization.

【Paper Link】【Pages】:430-439

【Authors】: Felix Ming Fai Wong ; Zhenming Liu ; Mung Chiang

【Abstract】: We revisit the problem of predicting directional movements of stock prices based on news articles: here our algorithm uses daily articles from The Wall Street Journal to predict the closing stock prices on the same day. We propose a unified latent space model to characterize the "co-movements" between stock prices and news articles. Unlike many existing approaches, our new model is able to simultaneously leverage the correlations: (a) among stock prices, (b) among news articles, and (c) between stock prices and news articles. Thus, our model is able to make daily predictions on more than 500 stocks (most of which are not even mentioned in any news article) while having low complexity. We carry out extensive back testing on trading strategies based on our algorithm. The result shows that our model has substantially better accuracy rate (55.7%) compared to many widely used algorithms. The return (56%) and Sharpe ratio due to a trading strategy based on our model are also much higher than baseline indices.

【Keywords】: data mining; electronic trading; matrix decomposition; sparse matrices; stock markets; text analysis; Sharpe ratio; The Wall Street Journal; WSJ; directional movement; news article; sparse matrix factorization; stock market prediction; stock price; text mining; trading strategy; unified latent space model; Accuracy; Correlation; Educational institutions; Optimization; Prediction algorithms; Predictive models; Vectors; computational finance; sparse optimization; text mining

45. A Scalable Method for Exact Sampling from Kronecker Family Models.

【Paper Link】【Pages】:440-449

【Authors】: Sebastián Moreno ; Joseph J. Pfeiffer III ; Jennifer Neville ; Sergey Kirshner

【Abstract】: The recent interest in modeling complex networks has fueled the development of generative graph models, such as Kronecker Product Graph Model (KPGM) and mixed KPGM (mKPGM). The Kronecker family of models are appealing because of their elegant fractal structure, as well as their ability to capture important network characteristics such as degree, diameter, and (in the case of mKPGM) clustering and population variance. In addition, scalable sampling algorithms for KPGMs made the analysis of large-scale, sparse networks feasible for the first time. In this work, we show that the scalable sampling methods, in contrast to prior belief, do not in fact sample from the underlying KPGM distribution and often result in sampling graphs that are very unlikely. To address this issue, we develop a new representation that exploits the structure of Kronecker models and facilitates the development of novel grouped sampling methods that are provably correct. In this paper, we outline efficient algorithms to sample from mKPGMs and KPGMs based on these ideas. Notably, our mKPGM algorithm is the first available scalable sampling method for this model and our KPGM algorithm is both faster and more accurate than previous scalable methods. We conduct both theoretical analysis and empirical evaluation to demonstrate the strengths of our algorithms and show that we can sample a network with 75 million edges in 87 seconds on a single processor.

【Keywords】: complex networks; graph theory; sampling methods; Kronecker family models; Kronecker product graph model; complex networks; exact sampling; fractal structure; generative graph models; mKPGM; mixed KPGM; scalable sampling methods; Algorithm design and analysis; Fractals; Indexes; Probability distribution; Time complexity

46. Time Series Join on Subsequence Correlation.

【Paper Link】【Pages】:450-459

【Authors】: Abdullah Mueen ; Hossein Hamooni ; Trilce Estrada

【Abstract】: We consider the problem of joining two long time series based on their most correlated segments. Two time series can be joined at any locations and for arbitrary length. Such join locations and length provide useful knowledge about the synchrony of the two time series and have applications in many domains including environmental monitoring, patient monitoring and power monitoring. However, join on correlation is a computationally expensive task, specially when the time series are large. The naive algorithm requires O (n4) computation where n is the length of the time series. We propose an algorithm, named Jocor, that uses two algorithmic techniques to tackle the complexity. First, the algorithm reuses the computation by caching sufficient statistics and second, the algorithm prunes unnecessary correlation computation by admissible heuristics. The algorithm runs orders of magnitude faster than the naive algorithm and enables us to join long time series as well as many small time series. We propose a variant of Jocor for fast approximation and an extension to a GPU-based parallel method to bring down the running-time to interactive level for analytics applications. We show three independent uses of time series join on correlation which are made possible by our algorithm.

【Keywords】: approximation theory; graphics processing units; mathematics computing; parallel processing; statistical analysis; time series; GPU-based parallel method; Jocor; approximation; arbitrary length; environmental monitoring; naive algorithm; patient monitoring; power monitoring; statistics; subsequence correlation; time series joining; Computer science; Correlation; Educational institutions; Equations; Euclidean distance; Standards; Time series analysis; Alignment; Correlation; Matching; Similarity; Time Series

47. Locally Estimating Core Numbers.

【Paper Link】【Pages】:460-469

【Authors】: Michael P. O'Brien ; Blair D. Sullivan

【Abstract】: Graphs are a powerful way to model interactions and relationships in data from a wide variety of application domains. In this setting, entities represented by vertices at the 'center' of the graph are often more important than those associated with vertices on the 'fringes'. For example, central nodes tend to be more critical in the spread of information or disease and play an important role in clustering/community formation. Identifying such 'core' vertices has recently received additional attention in the context of network experiments, which analyze the response when a random subset of vertices are exposed to a treatment (e.g. Inoculation, free product samples, etc). Specifically, the likelihood of having many central vertices in any exposure subset can have a significant impact on the experiment. We focus on using k-cores and core numbers to measure the extent to which a vertex is central in a graph. Existing algorithms for computing the core number of a vertex require the entire graph as input, an unrealistic scenario in many real world applications. Moreover, in the context of network experiments, the sub graph induced by the treated vertices is only known in a probabilistic sense. We introduce a new method for estimating the core number based only on the properties of the graph within a region of radius δ around the vertex, and prove an asymptotic error bound of our estimator on random graphs. Further, we empirically validate the accuracy of our estimator for small values of δ on a representative corpus of real data sets. Finally, we evaluate the impact of improved local estimation on an open problem in network experimentation posed by Ugander et al.

【Keywords】: computational complexity; graph theory; network theory (graphs); set theory; application domains; asymptotic error bound; central nodes; central vertices; clustering formation; community formation; core vertices; data interactions; data relationships; empirical analysis; exposure subset; graph fringes; graph properties; graph vertices; k-cores; local core number estimation; network experiments; probability; radius region; random graphs; random vertex subset; real data sets; real world applications; response analysis; subgraphs; time complexity; unrealistic scenario; Accuracy; Communities; Computational complexity; Data models; Equations; Estimation; Upper bound; core numbers; graph algorithms; k-cores; network experiments

48. Dynamic Time Warping Averaging of Time Series Allows Faster and More Accurate Classification.

【Paper Link】【Pages】:470-479

【Authors】: François Petitjean ; Germain Forestier ; Geoffrey I. Webb ; Ann E. Nicholson ; Yanping Chen ; Eamonn J. Keogh

【Abstract】: Recent years have seen significant progress in improving both the efficiency and effectiveness of time series classification. However, because the best solution is typically the Nearest Neighbor algorithm with the relatively expensive Dynamic Time Warping as the distance measure, successful deployments on resource constrained devices remain elusive. Moreover, the recent explosion of interest in wearable devices, which typically have limited computational resources, has created a growing need for very efficient classification algorithms. A commonly used technique to glean the benefits of the Nearest Neighbor algorithm, without inheriting its undesirable time complexity, is to use the Nearest Centroid algorithm. However, because of the unique properties of (most) time series data, the centroid typically does not resemble any of the instances, an unintuitive and underappreciated fact. In this work we show that we can exploit a recent result to allow meaningful averaging of 'warped' times series, and that this result allows us to create ultra-efficient Nearest 'Centroid' classifiers that are at least as accurate as their more lethargic Nearest Neighbor cousins.

【Keywords】: computational complexity; pattern classification; time series; dynamic time warping averaging; nearest centroid classifiers; nearest neighbor algorithm; resource constrained devices; time complexity; time series classification; warped time series averaging; wearable devices; Accuracy; Artificial neural networks; Classification algorithms; Heuristic algorithms; Prototypes; Time series analysis; Training; Dynamic Time Warping; Nearest Neighbor; Nearest centroid; Time series averaging; Time series classification

49. A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data.

【Paper Link】【Pages】:480-489

【Authors】: François Petitjean ; Lloyd Allison ; Geoffrey I. Webb

【Abstract】: Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of χ² tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches -- statistical efficiency -- 2) controls for the risk of false discoveries as well as statistical approaches -- high precision - and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer -- computational efficiency.

【Keywords】: data mining; statistical analysis; high-dimensional data; information-theoretic approach; log-linear analysis; statistical efficiency; Computational modeling; Data models; Encoding; Head; Maximum likelihood estimation; Particle separators; Standards; Association discovery; Data modeling; Graphical models; High-dimensional data; Information theory; Log-linear Analysis; Statistical inference

50. Composite Likelihood Data Augmentation for Within-Network Statistical Relational Learning.

【Paper Link】【Pages】:490-499

【Authors】: Joseph J. Pfeiffer III ; Jennifer Neville ; Paul N. Bennett

【Abstract】: The prevalence of datasets that can be represented as networks has recently fueled a great deal of work in the area of Relational Machine Learning (RML). Due to the statistical correlations between linked nodes in the network, many RML methods focus on predicting node features (i.e., labels) using the network relationships. However, many domains are comprised of a single, partially-labeled network. Thus, relational versions of Expectation Maximization (i.e., R-EM), which jointly learn parameters and infer the missing labels, can outperform methods that learn parameters from the labeled data and apply them for inference on the unlabeled nodes. Although R-EM methods can significantly improve predictive performance in networks that are densely labeled, they do not achieve the same gains in sparsely labeled networks and can perform worse than RML methods. In this work, we show the fixed-point methods that R-EM uses for approximate learning and inference result in errors that prevent convergence in sparsely labeled networks. We then propose two methods that do not experience this problem. First, we develop a Relational Stochastic EM (R-SEM) method, which uses stochastic parameters that are not as susceptible to approximation errors. Then we develop a Relational Data Augmentation (R-DA) method, which integrates over a range of stochastic parameter values for inference. R-SEM and R-DA can use any collective RML algorithm for learning and inference in partially labeled networks. We analyze their performance with two RML learners over four real world datasets, and show that they outperform independent learning, RML and R-EM -- particularly in sparsely labeled networks.

【Keywords】: approximation theory; data handling; expectation-maximisation algorithm; learning (artificial intelligence); network theory (graphs); stochastic processes; R-DA method; R-EM methods; R-SEM method; RML methods; approximate learning; approximation errors; collective RML algorithm; composite likelihood data augmentation; expectation maximization; fixed-point methods; linked nodes; network relationships; node feature prediction; partially-labeled network; predictive performance; relational data augmentation method; relational inference; relational machine learning; relational stochastic EM method; relational versions; sparsely labeled networks; statistical correlations; stochastic parameters; within-network statistical relational learning; Approximation algorithms; Approximation methods; Estimation; Inference algorithms; Joints; Markov processes; Data Augmentation; Mixing Rate; Nonlinear Dynamical Systems; Statistical Relational Learning

51. Structural Bregman Distance Functions Learning to Rank with Self-Reinforcement.

【Paper Link】【Pages】:500-509

【Authors】: Te Pi ; Xi Li ; Zhongfei Zhang

【Abstract】: Learning to rank is an important task for many data mining applications. Essentially, the goal of learning to rank is to learn an appropriate similarity or distance metric to determine the relevance relationships among data points. However, most of the existing approaches for distance metric learning are limited in three aspects. First, they often assume a fixed form of distance metric for the entire input space. Second, the assumed distance functions are often computationally expensive or even intractable to learn for high dimensional data, such as Mahalanobis distance. Third, most of these approaches lack robustness to noisily labeled data, which is pervasive in many real-world applications. In this paper, we study learning to rank as a problem of distance metric learning to address the above three problems. We choose Bregman distance as the target distance function, due to its general functional form as a generalization of a wide class of distance functions, and its capacity of exploiting complicated nonlinear patterns underlying the data. Under the framework of structural SVM, we formulate the problem of learning Bregman distance functions for ranking as a QP problem by a nonparametric approach, and present an effective algorithm. Furthermore, we propose a self-reinforcement scheme that adaptively differentiates each data point in the role of learning to secure the robustness. We emphasize that the proposed method SBLR-S (Structural Bregman distance functions Learning to Rank with Self-reinforcement) is more general than the conventional distance metric learning approaches, and is able to handle high dimensional data as well as noisily labeled data. The experiments of data ranking on real-world datasets show the superiority of this method to the state-of-the-art literature.

【Keywords】: data mining; generalisation (artificial intelligence); learning (artificial intelligence); nonparametric statistics; support vector machines; Bregman distance; Bregman distance function learning problem; Mahalanobis distance; QP problem; SBLR-S method; complicated nonlinear patterns; data mining applications; data points; distance functions; distance metric learning approach; nonparametric approach; self-reinforcement; self-reinforcement scheme; structural Bregman distance function learning; structural Bregman distance functions learning to rank with self-reinforcement method; structural SVM; Data models; Kernel; Measurement; Optimization; Robustness; Support vector machines; Training; Bregman distance; distance metric learning; learning to rank; structural learning

52. Metric Factorization for Exploratory Analysis of Complex Data.

【Paper Link】【Pages】:510-519

【Authors】: Claudia Plant

【Abstract】: How to explore complex data? Often, several representations for each data object are available, the data are described by attributes of heterogeneous data type and/or each data object is characterized by many features. It is difficult to choose a suitable similarity measure and an appropriate data mining technique to get an unbiased overview on the information contained in complex data. In this paper, we introduce Metric Factorization as a novel data mining task. The goal of Metric Factorization is to discover the major alternative views of complex data. Our novel algorithm MF extends matrix factorization techniques to support metric data. We do not need to choose a single similarity measure but can just input any available metric. Metric Factorization builds automatically interesting basis spaces from a large variety of input metrics. Due to metric properties, the basis spaces can be further explored with standard techniques like Multidimensional Scaling. We relate the Metric Factorization task to data compression and demonstrate how ideas from information theory (Minimum Description Length principle) make the parametrization of MF optional. We further introduce the idea of landmark points to effectively compress and thus support large data sets. Extensive experiments demonstrate the benefits of our approach.

【Keywords】: data analysis; data compression; data mining; information theory; matrix decomposition; MF optional parametrization; data compression; data mining technique; heterogeneous data type; information theory; metric factorization; minimum description length principle; multidimensional scaling; similarity measure; Data mining; Encoding; Feature extraction; Image color analysis; Measurement; Optimization; Standards; MDL; Matrix factorization; metric data

53. Online Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction.

【Paper Link】【Pages】:520-529

【Authors】: Zhi Qiao ; Peng Zhang ; Wenjia Niu ; Chuan Zhou ; Peng Wang ; Li Guo

【Abstract】: Max-margin matrix factorization (M3F) has been popularly applied to collaborative filtering for personalized recommendations. The nonparametric M3F model represents the latest progress of the M3F methods, which can auto-select the number of factors by using nonparametric techniques. However, existing non-parametric M3F methods assume a collection of user rating data can be fully obtained before training, and they are inapplicable for on-the-fly recommender systems where user rating data arrive continuously. In this paper, we present a new efficient online nonparametric 3F model for flexible recommendation. Specifically, we design an online nonparametric M3F model (OnM3F) based on the online Passive-Aggressive learning and solve the corresponding optimization problem by using the online stochastic gradient descent. Empirical studies on two large real-world data sets verify the effectiveness of the proposed method.

【Keywords】: collaborative filtering; learning (artificial intelligence); matrix decomposition; recommender systems; M3F model; OnM3F model; collaborative filtering; collaborative prediction; nonparametric M3F methods; online nonparametric max-margin matrix factorization; online passive-aggressive learning; online stochastic gradient descent; personalized recommendation system; user rating data collection; Adaptation models; Bayes methods; Collaboration; Computational modeling; Data models; Learning systems; Optimization; Collaborative Prediction; Nonparametric Max-Margin Matrix Factorization; Online Learning

54. Diffusion Archaeology for Diffusion Progression History Reconstruction.

【Paper Link】【Pages】:530-539

【Authors】: Emre Sefer ; Carl Kingsford

【Abstract】: Diffusion through graphs can be used to model many real-world process, such as the spread of diseases, social network memes, computer viruses, or water contaminants. Often, a real-world diffusion cannot be directly observed while it is occurring -- perhaps it is not noticed until some time has passed, continuous monitoring is too costly, or privacy concerns limit data access. This leads to the need to reconstruct how the present state of the diffusion came to be from partial diffusion data. Here, we tackle the problem of reconstructing a diffusion history from one or more snapshots of the diffusion state. This ability can be invaluable to learn when certain computer nodes are infected or which people are the initial disease spreaders to control future diffusions. We formulate this problem over discrete-time SEIRS-type diffusion models in terms of maximum likelihood. We design methods that are based on sub modularity and a novel prize-collecting dominating-set vertex cover (PCDSVC) relaxation that can identify likely diffusion steps with some provable performance guarantees. Our methods are the first to be able to reconstruct complete diffusion histories accurately in real and simulated situations. As a special case, they can also identify the initial spreaders better than existing methods for that problem. Our results for both meme and contaminant diffusion show that the partial diffusion data problem can be overcome with proper modeling and methods, and that hidden temporal characteristics of diffusion can be predicted from limited data.

【Keywords】: data handling; diffusion; discrete time systems; graph theory; maximum likelihood estimation; PCDSVC relaxation; contaminant diffusion; continuous monitoring; data access; diffusion archaeology; diffusion history reconstruction; diffusion progression history reconstruction; diffusion state; discrete-time SEIRS-type diffusion model; disease spreader; graph; maximum likelihood; partial diffusion data problem; performance guarantee; prize-collecting dominating-set vertex cover relaxation; real-world diffusion; real-world process; temporal characteristics; Approximation methods; Computational modeling; Computers; History; Integrated circuit modeling; Mathematical model; Silicon; diffusion; epidemics; history

55. Privacy-Preserving Personalized Recommendation: An Instance-Based Approach via Differential Privacy.

【Paper Link】【Pages】:540-549

【Authors】: Yilin Shen ; Hongxia Jin

【Abstract】: Recommender systems become increasingly popular and widely applied nowadays. The release of users' private data is required to provide users accurate recommendations, yet this has been shown to put users at risk. Unfortunately, existing privacy-preserving methods are either developed under trusted server settings with impractical private recommender systems or lack of strong privacy guarantees. In this paper, we develop the first lightweight and provably private solution for personalized recommendation, under untrusted server settings. In this novel setting, users' private data is obfuscated before leaving their private devices, giving users greater control on their data and service providers less responsibility on privacy protections. More importantly, our approach enables the existing recommender systems (with no changes needed) to directly use perturbed data, rendering our solution very desirable in practice. We develop our data perturbation approach on differential privacy, the state-of-the-art privacy model with lightweight computation and strong but provable privacy guarantees. In order to achieve useful and feasible perturbations, we first design a novel relaxed admissible mechanism enabling the injection of flexible instance-based noises. Using this novel mechanism, our data perturbation approach, incorporating the noise calibration and learning techniques, obtains perturbed user data with both theoretical privacy and utility guarantees. Our empirical evaluation on large-scale real-world datasets not only shows its high recommendation accuracy but also illustrates the negligible computational overhead on both personal computers and smart phones. As such, we are able to meet two contradictory goals, privacy preservation and recommendation accuracy. This practical technology helps to gain user adoption with strong privacy protection and benefit companies with high-quality personalized services on perturbed user data.

【Keywords】: calibration; data privacy; personal computing; recommender systems; trusted computing; computational overhead; data perturbation; differential privacy; high quality personalized services; noise calibration; perturbed user data; privacy preservation; privacy protections; privacy-preserving methods; privacy-preserving personalized recommendation; private recommender systems; provable privacy guarantees; recommendation accuracy; smart phones; strong privacy protection; theoretical privacy; untrusted server settings; user adoption; user private data; utility guarantees; Aggregates; Data privacy; Noise; Privacy; Sensitivity; Servers; Vectors; Data Perturbation; Differential Privacy; Learning and Optimization; Probabilistic Analysis; Recommender System

56. Mining Contentious Documents Using an Unsupervised Topic Model Based Approach.

【Paper Link】【Pages】:550-559

【Authors】: Amine Trabelsi ; Osmar R. Zaïane

【Abstract】: This work proposes an unsupervised method intended to enhance the quality of opinion mining in contentious text. It presents a Joint Topic Viewpoint (JTV) probabilistic model to analyse the underlying divergent arguing expressions that may be present in a collection of contentious documents. It extends the original Latent Dirichlet Allocation (LDA), which makes it domain and thesaurus-independent, e.g., does not rely on Word Net coverage. The conceived JTV has the potential of automatically carrying the tasks of extracting associated terms denoting an arguing expression, according to the hidden topics it discusses and the embedded viewpoint it voices. Furthermore, JTV's structure enables the unsupervised grouping of obtained arguing expressions according to their viewpoints, using a constrained clustering approach. Experiments are conducted on three types of contentious documents: polls, online debates and editorials. The qualitative and quantitative analysis of the experimental results show the effectiveness of our model to handle six different contentious issues when compared to a state-of-the-art method. Moreover, the ability to automatically generate distinctive and informative patterns of arguing expressions is demonstrated.

【Keywords】: data mining; document handling; probability; JTV probabilistic model; LDA; Word Net coverage; arguing expression; contentious document mining; editorials; joint topic viewpoint probabilistic model; latent Dirichlet allocation; online debates; opinion mining; polls; qualitative analysis; quantitative analysis; unsupervised grouping; unsupervised method; unsupervised topic model based approach; Data mining; Data models; Editorials; Government; Insurance; Joints; Medical services

57. Efficient Integrity Verification for Outsourced Collaborative Filtering.

【Paper Link】【Pages】:560-569

【Authors】: Jaideep Vaidya ; Ibrahim Yakut ; Anirban Basu

【Abstract】: Collaborative filtering (CF) over large datasets requires significant computing power. Due to this data owning organizations often outsource the computation of CF (including some abstraction of the data itself) to a public cloud infrastructure. However, this leads to the question of how to verify the integrity of the outsourced computation. In this paper, we develop verification mechanisms for two popular item based collaborative filtering techniques. We further analyze the cheating behavior of the cloud from the game-theoretic perspective. Coupled with the right incentives, we can ensure that the computation is incentive compatible thus ensuring that a rational adversary will not cheat. Leveraging this, we can develop efficient and effective mechanisms to address the problem of integrity in outsourcing.

【Keywords】: cloud computing; collaborative filtering; formal verification; game theory; outsourcing; cheating behavior; data owning organization; game-theoretic perspective; integrity verification; item based collaborative filtering technique; outsourced collaborative filtering; outsourced computation; public cloud infrastructure; verification mechanism; Collaboration; Educational institutions; Electronic mail; Measurement; Outsourcing; Prediction algorithms; Servers; Collaborative Filtering; Integrity Verification; Outsourcing

58. PGT: Measuring Mobility Relationship Using Personal, Global and Temporal Factors.

【Paper Link】【Pages】:570-579

【Authors】: Hongjian Wang ; Zhenhui Li ; Wang-Chien Lee

【Abstract】: Rich location data of mobile users collected from smart phones and location-based social networking services enable us to measure the mobility relationship strength based on their interactions in the physical world. A commonly-used measure for such relationship is the frequency of meeting events (i.e., Co-locate at the same time). That is, the more frequently two persons meet, the stronger their mobility relationship is. However, we argue that not all the meeting events are equally important in measuring the mobility relationship and propose to consider personal and global factors to differentiate meeting events. Personal factor models the probability for an individual user to visit a certain location, whereas the global factor models the popularity of a location based on the behavior of general public. In addition, we introduce the temporal factor to further consider the time gaps between meeting events. Accordingly, we propose a unified framework, called PGT, that considers personal, global, and temporal factors to measure the strength of the relationship between two given mobile users. Extensive experiments on real datasets validate our ideas and show that our method significantly outperforms the state-of-the-art methods.

【Keywords】: mobile computing; probability; smart phones; social networking (online); PGT; general public behavior; location-based social networking services; meeting event frequency; mobile user location data collection; mobility relationship strength measurement; personal-global-and-temporal factors; physical world; probability; smart phones; time gaps; unified framework; Conferences; Data mining; mobility; relationship strength; social computing; spatiotemporal

59. MASCOT: Fast and Highly Scalable SVM Cross-Validation Using GPUs and SSDs.

【Paper Link】【Pages】:580-589

【Authors】: Zeyi Wen ; Rui Zhang ; Kotagiri Ramamohanarao ; Jianzhong Qi ; Kerry L. Taylor

【Abstract】: Cross-validation is a commonly used method for evaluating the effectiveness of Support Vector Machines (SVMs). However, existing SVM cross-validation algorithms are not scalable to large datasets because they have to (i) hold the whole dataset in memory and/or (ii) perform a very large number of kernel value computation. In this paper, we propose a scheme to dramatically improve the scalability and efficiency of SVM cross-validation through the following key ideas. (i) To avoid holding the whole dataset in the memory and avoid performing repeated kernel value computation, we precompute the kernel values and reuse them. (ii) We store the precomputed kernel values to a high-speed storage framework, consisting of CPU memory extended by solid state drives (SSDs) and GPU memory as a cache, so that reusing (i.e., Reading) kernel values takes much lesser time than computing them on-the-fly. (iii) To further improve the efficiency of the SVM training, we apply a number of techniques for the extreme example search algorithm, design a parallel kernel value read algorithm, propose a caching strategy well-suited to the characteristics of the storage framework, and parallelize the tasks on the GPU and the CPU. For datasets of sizes that existing algorithms can handle, our scheme achieves several orders of magnitude of speedup. More importantly, our scheme enables SVM cross-validation on datasets of very large scale that existing algorithms are unable to handle.

【Keywords】: cache storage; disc drives; graphics processing units; storage management; support vector machines; CPU memory; GPU memory; MASCOT; SSD; caching strategy; high-speed storage framework; parallel kernel value read algorithm; scalable SVM cross-validation; solid state drive; support vector machine; Algorithm design and analysis; Equations; Graphics processing units; Instruction sets; Kernel; Support vector machines; Training; Cross-validation; GPUs; SSDs; SVM; Training

60. Multi-graph-view Learning for Graph Classification.

【Paper Link】【Pages】:590-599

【Authors】: Jia Wu ; Zhibin Hong ; Shirui Pan ; Xingquan Zhu ; Zhihua Cai ; Chengqi Zhang

【Abstract】: Graph classification has traditionally focused on graphs generated from a single feature view. In many applications, it is common to have useful information from different channels/views to describe objects, which naturally results in a new representation with multiple graphs generated from different feature views being used to describe one object. In this paper, we formulate a new Multi-Graph-View learning task for graph classification, where each object to be classified contains graphs from multiple graph-views. This problem setting is essentially different from traditional single-graph-view graph classification, where graphs are from one single feature view. To solve the problem, we propose a Cross Graph-View Sub graph Feature based Learning (gCGVFL) algorithm that explores an optimal set of sub graphs, across multiple graph-views, as features to represent graphs. Specifically, we derive an evaluation criterion to estimate the discriminative power and the redundancy of sub graph features across all views, and assign proper weight values to each view to indicate its importance for graph classification. The iterative cross graph-view sub graph scoring and graph-view weight updating form a closed loop to find optimal sub graphs to represent graphs for multi-graph-view learning. Experiments and comparisons on real-world tasks demonstrate the algorithm's performance.

【Keywords】: graph theory; iterative methods; learning (artificial intelligence); cross graph-view subgraph feature based learning; graph classification; graph-view weight updating; iterative cross graph-view subgraph scoring; multigraph-view learning; Educational institutions; Histograms; Image color analysis; Optimization; Redundancy; Training; Vectors; Feature Selection; Graph Classification; Multi-Graph-View; Subgraph Mining

61. RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection.

【Paper Link】【Pages】:600-609

【Authors】: Ke Wu ; Kun Zhang ; Wei Fan ; Andrea Edwards ; Philip S. Yu

【Abstract】: Anomaly detection in streaming data is of high interest in numerous application domains. In this paper, we propose a novel one-class semi-supervised algorithm to detect anomalies in streaming data. Underlying the algorithm is a fast and accurate density estimator implemented by multiple fully randomized space trees (RS-Trees), named RS-Forest. The piecewise constant density estimate of each RS-tree is defined on the tree node into which an instance falls. Each incoming instance in a data stream is scored by the density estimates averaged over all trees in the forest. Two strategies, statistical attribute range estimation of high probability guarantee and dual node profiles for rapid model update, are seamlessly integrated into RS Forestto systematically address the ever-evolving nature of data streams. We derive the theoretical upper bound for the proposed algorithm and analyze its asymptotic properties via bias-variance decomposition. Empirical comparisons to the state-of-the-art methods on multiple benchmark datasets demonstrate that the proposed method features high detection rate, fast response, and insensitivity to most of the parameter settings. Algorithm implementations and datasets are available upon request.

【Keywords】: data mining; learning (artificial intelligence); security of data; statistical analysis; trees (mathematics); RS-forest; RS-trees; anomaly detection; asymptotic property; bias-variance decomposition; dual node profile; one-class semisupervised algorithm; piecewise constant density; randomized space trees; rapid density estimator; statistical attribute range estimation; streaming data; Benchmark testing; Data models; Detectors; Estimation; Predictive models; Upper bound; Vegetation; Anomaly detection; data streams; density estimation; ensembles; streaming data

62. Heterogeneous Metric Learning with Content-Based Regularization for Software Artifact Retrieval.

【Paper Link】【Pages】:610-619

【Authors】: Liang Wu ; Liang Du ; Bo Liu ; Guandong Xu ; Yong Ge ; Yanjie Fu ; Jianhui Li ; Yuanchun Zhou ; Hui Xiong

【Abstract】: The problem of software artifact retrieval has the goal to effectively locate software artifacts, such as a piece of source code, in a large code repository. This problem has been traditionally addressed through the textual query. In other words, information retrieval techniques will be exploited based on the textual similarity between queries and textual representation of software artifacts, which is generated by collecting words from comments, identifiers, and descriptions of programs. However, in addition to these semantic information, there are rich information embedded in source codes themselves. These source codes, if analyzed properly, can be a rich source for enhancing the efforts of software artifact retrieval. To this end, in this paper, we develop a feature extraction method on source codes. Specifically, this method can capture both the inherent information in the source codes and the semantic information hidden in the comments, descriptions, and identifiers of the source codes. Moreover, we design a heterogeneous metric learning approach, which allows to integrate code features and text features into the same latent semantic space. This, in turn, can help to measure the artifact similarity by exploiting the joint power of both code and text features. Finally, extensive experiments on real-world data show that the proposed method can help to improve the performances of software artifact retrieval with a significant margin.

【Keywords】: feature extraction; learning (artificial intelligence); query processing; source code (software); text analysis; artifact similarity; code feature; code repository; content-based regularization; feature extraction method; heterogeneous metric learning approach; information retrieval technique; latent semantic space; semantic information; software artifact retrieval; source code; text feature; textual query; textual representation; textual similarity; Electronic mail; Feature extraction; Information retrieval; Measurement; Optimization; Semantics; Software

63. A Fast Inference Algorithm for Stochastic Blockmodel.

【Paper Link】【Pages】:620-629

【Authors】: Zhiqiang Xu ; Yiping Ke ; Yi Wang

【Abstract】: Stochastic block model is a widely used statistical tool for modeling graphs and networks. Despite its popularity, the development on efficient inference algorithms for this model is surprisingly inadequate. The existing solutions are either too slow to handle large networks, or suffer from convergence issues. In this paper, we propose a fast and principled inference algorithm for stochastic block model. The algorithm is based on the variational Bayesian framework, and deploys the natural conjugate gradient method to accelerate the optimization of the variational bound. Leveraging upon the power of both conjugate and natural gradients, it converges super linearly and produces high quality solutions in practice. In particular, we apply our algorithm to the community detection task and compare it with the state-of-the-art variational Bayesian algorithms. We show that it can achieve up to two orders of magnitude speedup without significantly compromising the quality of solutions.

【Keywords】: Bayes methods; gradient methods; graphs; inference mechanisms; stochastic processes; community detection task; fast inference algorithm; modeling graphs; natural conjugate gradient method; natural gradients; principled inference algorithm; statistical tool; stochastic block model; variational Bayesian algorithms; variational Bayesian framework; variational bound; Bayes methods; Convergence; Equations; Gradient methods; Inference algorithms; Manifolds

【Paper Link】【Pages】:630-639

【Authors】: Bo Yang ; Hongbin Pei ; Hechang Chen ; Jiming Liu ; Shang Xia

【Abstract】: During an epidemic, the spatial, temporal and demographical patterns of disease transmission are determined by multiple factors. Besides the physiological properties of pathogenes and hosts, the social contacts of host population, which characterize individuals' reciprocal exposures of infection in view of demographical structures and various social activities, are also pivotal to understand and further predict the prevalence of infectious diseases. The means of measuring social contacts will dominate the extent how precisely we can forecast the dynamics of infections in the real world. Most current works focus their efforts on modeling the spatial patterns of static social contacts. In this work, we address the problem on how to characterize and measure dynamical social contacts during an epidemic from a novel perspective. We propose an epidemic-model-based tensor deconvolution framework to address this issue, in which the spatiotemporal patterns of social contacts are represented by the factors of tensors, which can be discovered by a tensor deconvolution procedure with an integration of epidemic models from rich types of data, mainly including heterogeneous outbreak surveillance, social-demographic census and physiological data from medical reports. Taking SIR model as a case study, the efficacy of the proposed method is theoretically analyzed and empirically validated through a set of rigorous experiments on both synthetic and real-world data.

【Keywords】: data mining; diseases; medical computing; pattern recognition; SIR model; demographical pattern; demographical structures; disease transmission; epidemic; epidemic-model-based tensor deconvolution framework; heterogeneous data; heterogeneous outbreak surveillance; infectious disease; metapopulation; physiological data; social activities; social-demographic census; spatial pattern; spatiotemporal social contact mining; temporal pattern; Data models; Diseases; Educational institutions; Sociology; Statistics; Surveillance; Tensile stress; epidemic modeling; healthcare; multiple source data mining; spatiotemporal social contact; tensor deconvolution

65. Continuous KNN Join Processing for Real-Time Recommendation.

【Paper Link】【Pages】:640-649

【Authors】: Chong Yang ; Xiaohui Yu ; Yang Liu

【Abstract】: The explosive growth of user-generated contents in social networking websites necessitates the recommendation functionality that can push to the user the content that he/she is most likely to be interested in. Such recommendation should happen in real-time as new contents become available, because "freshness" is an important consideration in people's content-consumption behavior. Representing users and contents as feature vectors in a high-dimensional space, we can essentially cast the problem of real-time recommendations as the problem of computing the list of k nearest neighbors of each user, which we call kNN join. Given the vast volume of contents and users, the biggest challenge is how to continuously update the kNN join results as new contents arrive. Existing methods for incremental kNN join on data streams suffer from the "curse of dimensionality" and high in-memory search cost. In this paper, we present a solution that first identifies the users whose kNN's might be affected by the newly arrived content, and then update their kNN's respectively. We propose a new index structure named HDR-tree in order to support the efficient search of affected users. HDR-tree performs dimensionality reduction through clustering and principle component analysis (PCA) in order to improve the search effectiveness. To further reduce response time, we propose a variant of HDR-tree, called HDR-tree, that supports more efficient but approximate solutions. The results of extensive experiments show that our methods significantly outperform baseline methods.

【Keywords】: data reduction; pattern clustering; principal component analysis; recommender systems; social networking (online); tree data structures; HDR-tree index structure; PCA; Web sites; clustering analysis; continuous KNN join processing; curse of dimensionality reduction; data streams; high in-memory search cost; k nearest neighbors; people content-consumption behavior; principle component analysis; real-time recommendation; social networking; user-generated contents; Clustering algorithms; Collaboration; Educational institutions; Indexes; Principal component analysis; Real-time systems; Vegetation; high-dimensional data; k nearest neighbor join; real-time recommendation

66. Scalable SVM-Based Classification in Dynamic Graphs.

【Paper Link】【Pages】:650-659

【Authors】: Yibo Yao ; Lawrence B. Holder

【Abstract】: With the emergence of networked data, graph classification has received considerable interest during the past years. Most approaches to graph classification focus on designing effective kernels to compute similarities for static graphs. However, they become computationally intractable in terms of time and space when a graph is presented in a incremental fashion with continuous updates, i.e., Insertions of nodes and edges. In this paper, we examine the problem of classification in large-scale and incrementally changing graphs. To this end, a framework combining an incremental Support Vector Machine (SVM) with the Weisfeiler-Lehman (W-L) graph kernel has been proposed to study this problem. By retaining the support vectors from each learning step, the classification model is incrementally updated whenever new changes are made to the subject graph. Furthermore, we design an entropy-based sub graph extraction strategy to select informative neighbor nodes and discard those with less discriminative power, to facilitate an effective classification process. We demonstrate the advantages of our learning techniques by conducting an empirical evaluation on two large-scale real-world graph datasets. The experimental results also validate the benefits of our sub graph extraction method when combined with the incremental learning techniques.

【Keywords】: entropy; graph theory; learning (artificial intelligence); mathematics computing; pattern classification; support vector machines; SVM; W-L graph kernel; Weisfeiler-Lehman graph kernel; entropy-based subgraph extraction; graph classification; learning technique; support vector machine; Data mining; Entropy; Kernel; Prediction algorithms; Predictive models; Support vector machines; Training

67. Towards Scalable and Accurate Online Feature Selection for Big Data.

【Paper Link】【Pages】:660-669

【Authors】: Kui Yu ; Xindong Wu ; Wei Ding ; Jian Pei

【Abstract】: Feature selection is important in many big data applications. There are at least two critical challenges. Firstly, in many applications, the dimensionality is extremely high, in millions, and keeps growing. Secondly, feature selection has to be highly scalable, preferably in an online manner such that each feature can be processed in a sequential scan. In this paper, we develop SAOLA, a Scalable and Accurate On Line Approach for feature selection. With a theoretical analysis on a low bound on the pair wise correlations between features in the currently selected feature subset, SAOLA employs novel online pair wise comparison techniques to address the two challenges and maintain a parsimonious model over time in an online manner. An empirical study using a series of benchmark real data sets shows that SAOLA is scalable on data sets of extremely high dimensionality, and has superior performance over the state-of-the-art feature selection methods.

【Keywords】: Big Data; data mining; Big Data; SAOLA; feature selection; pairwise correlation; scalable and accurate online approach; Accuracy; Big data; Correlation; Markov processes; Redundancy; Search problems; Training; Extremely high dimensionality; Feature redundancy; Online feature selection

68. Learning Fine-Grained Spatial Models for Dynamic Sports Play Prediction.

【Paper Link】【Pages】:670-679

【Authors】: Yisong Yue ; Patrick Lucey ; G. Peter K. Carr ; Alina Bialkowski ; Iain A. Matthews

【Abstract】: We consider the problem of learning predictive models for in-game sports play prediction. Focusing on basketball, we develop models for anticipating near-future events given the current game state. We employ a latent factor modeling approach, which leads to a compact data representation that enables efficient prediction given raw spatiotemporal tracking data. We validate our approach using tracking data from the 2012-2013 NBA season, and show that our model can make accurate in-game predictions. We provide a detailed inspection of our learned factors, and show that our model is interpretable and corresponds to known intuitions of basketball game play.

【Keywords】: data handling; data structures; learning (artificial intelligence); prediction theory; sport; NBA season; basketball game play; compact data representation; dynamic sports play prediction; factor modeling; fine-grained spatial model learning; in-game predictions; in-game sports play prediction; near-future events; predictive models; raw spatiotemporal tracking data; Analytical models; Computational modeling; Data models; Games; Gaussian processes; Predictive models; Training; predictive modeling; representation learning; spatiotemporal reasoning; sports analytics

69. Output Feature Augmented Lasso.

【Paper Link】【Pages】:680-686

【Authors】: Changqing Zhang ; Yahong Han ; Xiaojie Guo ; Xiaochun Cao

【Abstract】: Lasso simultaneously conducts variable selection and supervised regression. In this paper, we extend Lasso to multiple output prediction, which belongs to the categories of structured learning. Though structured learning makes use of both input and output simultaneously, the joint feature mapping in current framework of structured learning is usually application-specific. As a result, ad hoc heuristics have to be employed to design different joint feature mapping functions for different applications, which results in the lackness of generalization ability for multiple output prediction. To address this limitation, in this paper, we propose to augment Lasso with output by decoupling the joint feature mapping function of traditional structured learning. The contribution of this paper is three-fold: 1) The augmented Lasso conducts regression and variable selection on both the input and output features, and thus the learned model could fit an output with both the selected input variables and the other correlated outputs. 2) To be more general, we set up nonlinear dependencies among output variables by generalized Lasso. 3) Moreover, the Augmented Lagrangian Method (ALM) with Alternating Direction Minimizing (ADM) strategy is used to find the optimal model parameters. The extensive experimental results demonstrate the effectiveness of the proposed method.

【Keywords】: feature selection; learning (artificial intelligence); regression analysis; ADM strategy; ALM; alternating direction minimizing strategy; augmented Lagrangian method; joint feature mapping functions; multiple output prediction; output feature augmented Lasso; structured learning; supervised regression; variable selection; Correlation; Input variables; Joints; Kernel; Linear programming; Training; Vectors; Lasso; alternating direction minimizing; multiple output prediction; structured learning

70. Multi-touch Attribution in Online Advertising with Survival Theory.

【Paper Link】【Pages】:687-696

【Authors】: Ya Zhang ; Yi Wei ; Jianbiao Ren

【Abstract】: Multi-touch attribution, which allows distributing the credit to all related advertisements based on their corresponding contributions, has recently become an important research topic in digital advertising. Traditionally, rule-based attribution models have been used in practice. The drawback of such rule-based models lies in the fact that the rules are not derived form the data but only based on simple intuition. With the ever enhanced capability to tracking advertisement and users' interaction with the advertisement, data-driven multi-touch attribution models, which attempt to infer the contribution from user interaction data, become an important research direction. We here propose a new data-driven attribution model based on survival theory. By adopting a probabilistic framework, one key advantage of the proposed model is that it is able to remove the presentation biases inherit to most of the other attribution models. In addition to model the attribution, the proposed model is also able to predict user's 'conversion' probability. We validate the proposed method with a real-world data set obtained from a operational commercial advertising monitoring company. Experiment results have shown that the proposed method is quite promising in both conversion prediction and attribution.

【Keywords】: Internet; advertising data processing; data handling; probability; commercial advertising monitoring company; data-driven multitouch attribution models; digital advertising; online advertising; probabilistic framework; rule-based attribution models; survival theory; user conversion probability prediction; user interaction data; Advertising; Data models; Gold; Hazards; Hidden Markov models; Kernel; Predictive models; Multi-touch attribution; Online Advertising; Survival theory

【Paper Link】【Pages】:697-706

【Authors】: Chen Zhu ; Hengshu Zhu ; Yong Ge ; Enhong Chen ; Qi Liu

【Abstract】: Many of today's online news websites have enabled users to specify different types of emotions (e.g., Angry and shocked) they have after reading news. Compared with traditional user feedbacks such as comments and ratings, these specific emotion annotations are more accurate for expressing users' personal emotions. In this paper, we propose to exploit these users' emotion annotations for online news in order to track the evolution of emotions, which plays an important role in various online services. A critical challenge is how to model emotions with respect to time spans. To this end, we propose a time-aware topic modeling perspective for solving this problem. Specifically, we first develop a model named emotion-Topic over Time (eToT), in which we represent the topics of news as a Beta distribution over time and a multinomial distribution over emotions. Whilee ToT can uncover the latent relationship among news, emotion and time directly, it cannot capture the dynamics of topics. Therefore, we further develop another model named emotion based Dynamic Topic Model (eDTM), where we explore the state space model for tracking the dynamics of topics. In addition, we demonstrate that both eToT and eDTM could enable several potential applications, such as emotion prediction, emotion-based news recommendations and emotion anomaly detections. Finally, we validate the proposed models with extensive experiments with a real-world data set.

【Keywords】: Internet; social networking (online); social sciences computing; statistical distributions; Beta distribution over time; eDTM; eToT model; emotion based dynamic topic model; emotion-topic over time model; multinomial distribution over emotion; online news Web sites; online services; social emotion evolution tracking; state space model; time-aware topic modeling; user emotion annotations; user feedbacks; user personal emotions; Analytical models; Data models; Joints; Predictive models; Semantics; Sentiment analysis; Vectors

Short Papers 72

72. Mp-Dissimilarity: A Data Dependent Dissimilarity Measure.

【Paper Link】【Pages】:707-712

【Authors】: Sunil Aryal ; Kai Ming Ting ; Gholamreza Haffari ; Takashi Washio

【Abstract】: Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures, that are based on a geometric model, such as lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as a probability mass in a region that encloses the two instances. It deems the two instances in a sparse region to be more similar than two instances in a dense region, though these two pairs of instances have the same geometric distance. Our empirical results show that the proposed dissimilarity measure indeed provides a reliable nearest neighbour search in high dimensional spaces, particularly in sparse data. Mp-dissimilarity produced better task specific performance than lp-norm and cosine distance in classification and information retrieval tasks.

【Keywords】: data mining; probability; query processing; search problems; cosine distance; data dependent dissimilarity measure; data distribution; data mining algorithms; geometric distance; geometric model; high dimensional space; information retrieval tasks; l_p-norm; m_p-dissimilarity measures; nearest neighbour search; probability mass; reliable nearest neighbour search; Accuracy; Approximation methods; Data mining; Educational institutions; Electronic mail; Information retrieval; Vectors; distance measure; lp-norm; mp-dissimilarity

73. Check-in Location Prediction Using Wavelets and Conditional Random Fields.

【Paper Link】【Pages】:713-718

【Authors】: Roland Assam ; Thomas Seidl

【Abstract】: The widespread adoption of ubiquitous devices does not only facilitate the connection of billions of people, but has also fuelled a culture of sharing rich, high resolution locations through check-ins. Despite the profusion of GPS and WiFi driven location prediction techniques, the sparse and random nature of check-in data generation have ushered diverse problems, which have prompted the prediction of future check-ins to be very challenging. In this paper, we propose a novel enhanced location predictor for check-in data that is crafted using Poisson distribution, Wavelets and Conditional Random Fields (CRF). Specifically, we show that check-in generation is governed by the Poisson distribution. In addition, among others, we utilize wavelets to rigorously analyze social influence and learn elusive underlying patterns, as well as human mobility behaviors embedded in check-in data. We utilize this knowledge to institute CRF features, which capture latent trends that govern users' mobility. These CRF features are employed to build a robust predictive model that predicts future locations with enhanced accuracy. We demonstrate the effectiveness of our predictive model on two real datasets. Furthermore, our experiments reveal that our approach outperforms a state-of-the-art work with an accuracy of 36%.

【Keywords】: Poisson distribution; information services; ubiquitous computing; wavelet transforms; CRF features; GPS driven location prediction techniques; Global Positioning System; Poisson distribution; WiFi driven location prediction techniques; Wireless Fidelity; check-in location prediction; conditional random field; ubiquitous devices; wavelet transform; Accuracy; Equations; Hidden Markov models; Mathematical model; Multiresolution analysis; Predictive models; Time series analysis; Data Mining; Location Based Services; Prediction

74. Janus - Analytics-Driven Transition Planner.

【Paper Link】【Pages】:719-724

【Authors】: Manasi Belhe ; Kriti Shrivastava ; Maitreya Natu ; Vaishali P. Sadaphal

【Abstract】: In this paper, we address the problem of transition of IT operations from one service provider to another. We present analytics-driven solutions to generate a transition plan while addressing various aspects such as coverage, risk, time, and cost. We model the IT operations through graphs and use the well defined problems in graph theory to build solutions for transition planner. We demonstrate the proof-of-concept of proposed ideas using a real-world case-study.

【Keywords】: business data processing; graph theory; IT operations; Janus; analytics-driven transition planner; graph theory; service provider; Business; Communities; Databases; Graph theory; Hardware; IEEE Potentials; Software; Data center management; IT Operations; Transition

75. Large-Scale Analysis of Soccer Matches Using Spatiotemporal Tracking Data.

【Paper Link】【Pages】:725-730

【Authors】: Alina Bialkowski ; Patrick Lucey ; Peter Carr ; Yisong Yue ; Sridha Sridharan ; Iain A. Matthews

【Abstract】: Although the collection of player and ball tracking data is fast becoming the norm in professional sports, large-scale mining of such spatiotemporal data has yet to surface. In this paper, given an entire season's worth of player and ball tracking data from a professional soccer league (≈400,000,000 data points), we present a method which can conduct both individual player and team analysis. Due to the dynamic, continuous and multi-player nature of team sports like soccer, a major issue is aligning player positions over time. We present a "role-based" representation that dynamically updates each player's relative role at each frame and demonstrate how this captures the short-term context to enable both individual player and team analysis. We discover role directly from data by utilizing a minimum entropy data partitioning method and show how this can be used to accurately detect and visualize formations, as well as analyze individual player behavior.

【Keywords】: data analysis; data mining; sport; ball tracking data; large-scale analysis; large-scale mining; minimum entropy data partitioning method; soccer match analysis; spatiotemporal tracking data; sports; Context; Data visualization; Entropy; Games; Probability density function; Spatiotemporal phenomena; Trajectory; Formation; Role; Spatiotemporal Tracking Data; Sports Analytics

76. Graded Multilabel Classification by Pairwise Comparisons.

【Paper Link】【Pages】:731-736

【Authors】: Christian Brinker ; Eneldo Loza Mencía ; Johannes Fürnkranz

【Abstract】: The task in multilabel classification is to predict for a given set of labels whether each individual label should be attached to an instance or not. Graded multilabel classification generalizes this setting by allowing to specify for each label a degree of membership on an ordinal scale. This setting can be frequently found in practice, for example when movies or books are assessed on a one-to-five star rating in multiple categories. In this paper, we propose to reformulate the problem in terms of preferences between the labels and their scales, which can then be tackled by learning from pair wise comparisons. We present three different approaches which make use of this decomposition and show on three datasets that we are able to outperform baseline approaches. In particular, we show that our solution, which is able to model pair wise preferences across multiple scales, outperforms a straight-forward approach which considers the problem as a set of independent ordinal regression tasks.

【Keywords】: learning (artificial intelligence); pattern classification; regression analysis; graded multilabel classification; independent ordinal regression tasks; learning; pairwise comparisons; Calibration; Loss measurement; Medical diagnostic imaging; Motion pictures; Robustness; TV; Training; graded multilabel classification; learning by pairwise comparisons; ordinal classification

77. TRIBAC: Discovering Interpretable Clusters and Latent Structures in Graphs.

【Paper Link】【Pages】:737-742

【Authors】: Jeffrey Chan ; Christopher Leckie ; James Bailey ; Kotagiri Ramamohanarao

【Abstract】: Graphs are a powerful representation of relational data, such as social and biological networks. Often, these entities form groups and are organised according to a latent structure. However, these groupings and structures are generally unknown and it can be difficult to identify them. Graph clustering is an important type of approach used to discover these vertex groups and the latent structure within graphs. One type of approach for graph clustering is non-negative matrix factorisation However, the formulations of existing factorisation approaches can be overly relaxed and their groupings and results consequently difficult to interpret, may fail to discover the true latent structure and groupings, and converge to extreme solutions. In this paper, we propose a new formulation of the graph clustering problem that results in clusterings that are easy to interpret. Combined with a novel algorithm, the clusterings are also more accurate than state-of-the-art algorithms for both synthetic and real datasets.

【Keywords】: graph theory; matrix decomposition; pattern clustering; TRIBAC; graph clustering problem; interpretable clusters discovery; latent structures; nonnegative matrix factorisation; Airports; Clustering algorithms; Communities; Equations; Image edge detection; Matrix decomposition; Optimization; blockmodelling; graph clustering; interpretability; non-negative matrix factorisation

78. Stream Mining Using Statistical Relational Learning.

【Paper Link】【Pages】:743-748

【Authors】: Swarup Chandra ; Justin Sahs ; Latifur Khan ; Bhavani M. Thuraisingham ; Charu C. Aggarwal

【Abstract】: Stream mining has gained popularity in recent years due to the availability of numerous data streams from sources such as social media and sensor networks. Data mining on such continuous streams possess a variety of challenges including concept drift and unbounded stream length. Traditional data mining approaches to these problems have difficulty incorporating relational domain knowledge and feature relationships, which can be used to improve the accuracy of a classifier. In this work, we model large data streams using statistical relational learning techniques for classification, in particular, we use a Markov Logic Network to capture relational features in structured data and show that this approach performs better for supervised learning than current state-of-the-art approaches. Additionally, we evaluate our approach with semi-supervised learning scenarios, where class labels are only partially available during training.

【Keywords】: Markov processes; data mining; data models; learning (artificial intelligence); pattern classification; Markov logic network; class labels; classification; concept drift; data mining; feature relationships; large data streams model; relational domain knowledge; relational features; sensor networks; social media; statistical relational learning; stream mining; structured data; supervised learning; unbounded stream length; Accuracy; Data mining; Data models; Grounding; Markov random fields; Training; Classification; Statistical Relational Learning; Stream Mining

79. Ups and Downs in Buzzes: Life Cycle Modeling for Temporal Pattern Discovery.

【Paper Link】【Pages】:749-754

【Authors】: Yi Chang ; Makoto Yamada ; Antonio Ortega ; Yan Liu

【Abstract】: In social media analysis, one critical task is detecting burst of topics or buzz, which is reflected by extremely frequent mentions of certain key words in a short time interval. Detecting buzz not only provides useful insights into the information propagation mechanism, but also plays an essential role in preventing malicious rumors. However, buzz modeling is a challenging task because a buzz time-series usually exhibits sudden spikes and heavy tails, which fails most existing time-series models. To deal with buzz time-series sequences, we propose a novel time-series modeling approach which captures the rise and fade temporal patterns via Product Life Cycle (PLC) models, a classical concept in economics. More specifically, we propose a mixture of PLC models to capture the multiple peaks in buzz time-series and furthermore develop a probabilistic graphical model (K-MPLC) to automatically discover inherent life cycle patterns within a collection of buzzes. Our experiment results show that our proposed method significantly outperforms existing state-of-the-art approaches on buzzes clustering.

【Keywords】: pattern clustering; product life cycle management; social networking (online); text analysis; time series; K-MPLC; PLC models; buzz detection; buzz modeling; buzz time-series sequences; buzzes clustering; information propagation mechanism; key words frequent mentions; life cycle patterns; malicious rumors prevention; probabilistic graphical model; product life cycle modeling; social media analysis; temporal pattern discovery; time series clustering; time-series modeling; Biological system modeling; Clustering algorithms; Graphical models; Measurement; Media; Optimization; Probabilistic logic; Time-Series Clustering; Time-Series Modeling

80. Flu Gone Viral: Syndromic Surveillance of Flu on Twitter Using Temporal Topic Models.

【Paper Link】【Pages】:755-760

【Authors】: Liangzhe Chen ; K. S. M. Tozammel Hossain ; Patrick Butler ; Naren Ramakrishnan ; B. Aditya Prakash

【Abstract】: Surveillance of epidemic outbreaks and spread from social media is an important tool for governments and public health authorities. Machine learning techniques for now casting the flu have made significant inroads into correlating social media trends to case counts and prevalence of epidemics in a population. There is a disconnect between data-driven methods for forecasting flu incidence and epidemiological models that adopt a state based understanding of transitions, that can lead to sub-optimal predictions. Furthermore, models for epidemiological activity and social activity like on Twitter predict different shapes and have important differences. We propose a temporal topic model to capture hidden states of a user from his tweets and aggregate states in a geographical region for better estimation of trends. We show that our approach helps fill the gap between phenomenological methods for disease surveillance and epidemiological models. We validate this approach by modeling the flu using Twitter in multiple countries of South America. We demonstrate that our model can consistently outperform plain vocabulary assessment in flu case-count predictions, and at the same time get better flu-peak predictions than competitors. We also show that our fine-grained modeling can reconcile some contrasting behaviors between epidemiological and social models.

【Keywords】: epidemics; health care; learning (artificial intelligence); medical computing; social networking (online); Twitter; epidemic outbreak surveillance; machine learning technique; public health authority; social media; temporal topic model; Biological system modeling; Data models; Google; Market research; Predictive models; Switches; Vocabulary; Data Mining; Epidemiology; Social Media; Syndromic Surveillance; Topic Model; Twitter

81. Mining Personal Health Index from Annual Geriatric Medical Examinations.

【Paper Link】【Pages】:761-766

【Authors】: Ling Chen ; Xue Li ; Sen Wang ; Hsiao-Yun Hu ; Nicole Huang ; Quan Z. Sheng ; Mohamed A. Sharaf

【Abstract】: People take regular medical examinations mostly not for discovering diseases but for having a peace of mind regarding their health status. Therefore, it is important to give them an overall feedback with respect to all the health indicators that have been ranked against the whole population. In this paper, we propose a framework of mining Personal Health Index (PHI) from a large and comprehensive geriatric medical examination (GME) dataset. We define PHI as an overall score of personal health status based on a complement probability of health risks. The health risks are calculated using the information from the cause of death (COD) dataset that is linked to the GME dataset. Especially, the highest health risk is revealed in the cases of people who had been taking GME for some years and then passed away for medical reasons. The proposed framework consists of methods in data pre-processing, feature extraction and selection, and model selection. The effectiveness of the proposed framework is validated by a set of comprehensive experiments based on the records of 102,258 participants. As the first of this kind, our work provides a baseline for further research.

【Keywords】: data mining; diseases; feature extraction; feature selection; health care; medical computing; probability; GME dataset; annual geriatric medical examinations; cause of death dataset; complement probability; comprehensive geriatric medical examination dataset; data preprocessing; feature extraction; feature selection; health indicators; health risks; model selection; personal health index mining; personal health status; Cities and towns; Data mining; Educational institutions; Feature extraction; Geriatrics; Indexes; Support vector machines; Personal Health Index; geriatric medical examination

【Paper Link】【Pages】:767-772

【Authors】: Yu Cheng ; Ankit Agrawal ; Alok N. Choudhary ; Huan Liu ; Tao Zhang

【Abstract】: In this paper, we study a challenging problem of inferring individuals' role and statuses in a professional social network, which is of central importance in workforce optimization and human capital management. Realizing the natural setting of social nodes associated with dual view information, i.e., The local node characteristics and the global network influence, we present a novel model that explores graph regularization techniques and integrates such information to achieve improved prediction performance. In particular, our prediction model is built upon the graph transductive learning framework that encodes an uncertainty regularization term in the conventional empirical risk minimization principle. Through taking advantage of the information from both the local profile and the global network characteristics, the final inference of the role or statues achieves minimum an empirical loss on the labeled set, as well as a minimum uncertainty on the unlabeled social nodes. We perform extensive empirical study using real-world data and compare with representative peer approaches. The experimental results on three real social network data sets show that the proposed model greatly outperforms a number of baseline models and is able to effectively infer in a wide range of scenarios.

【Keywords】: graph theory; learning (artificial intelligence); risk management; social networking (online); dual uncertainty minimization regularization; dual view information; empirical risk minimization principle; global network influence; graph regularization techniques; graph transductive learning framework; human capital management; local node characteristics; local profile; prediction model; professional social network; social nodes; social role identification; workforce optimization; Data models; Electronic mail; Feature extraction; LinkedIn; Minimization; Uncertainty; Dual Uncertainty Minimization; Graph Regularization; Social Role Identification

83. A Joint Model for Topic-Sentiment Evolution over Time.

【Paper Link】【Pages】:773-778

【Authors】: Mohamed Dermouche ; Julien Velcin ; Leila Khouas ; Sabine Loudcher

【Abstract】: Most existing topic models focus either on extracting static topic-sentiment conjunctions or topic-wise evolution over time leaving out topic-sentiment dynamics and missing the opportunity to provide a more in-depth analysis of textual data. In this paper, we propose an LDA-based topic model for analyzing topic-sentiment evolution over time by modeling time jointly with topics and sentiments. We derive inference algorithm based on Gibbs Sampling process. Finally, we present results on reviews and news datasets showing interpretable trends and strong correlation with ground truth in particular for topic-sentiment evolution over time.

【Keywords】: Markov processes; Monte Carlo methods; inference mechanisms; information resources; text analysis; Gibbs sampling process; LDA-based topic model; inference algorithm; joint model; news datasets; static topic-sentiment conjunction extraction; textual data in-depth analysis; topic-sentiment dynamics; topic-sentiment evolution analysis; topic-wise evolution; Accuracy; Analytical models; Correlation; Data mining; Data models; Joints; Mathematical model; joint topic sentiment models; opinion mining; sentiment analysis; time series; topic models; trend analysis

84. Hidden Conditional Random Fields with Deep User Embeddings for Ad Targeting.

【Paper Link】【Pages】:779-784

【Authors】: Nemanja Djuric ; Vladan Radosavljevic ; Mihajlo Grbovic ; Narayan Bhamidipati

【Abstract】: Estimating a user's propensity to click on a display ad or purchase a particular item is a critical task in targeted advertising, a burgeoning online industry worth billions of dollars. Better and more accurate estimation methods result in improved online user experience, as only relevant and interesting ads are shown, and may also lead to large benefits for advertisers, as targeted users are more likely to click or make a purchase. In this paper we address this important problem, and propose an approach for improved estimation of ad click or conversion probability based on a sequence of user's online actions, modeled using Hidden Conditional Random Fields (HCRF) model. In addition, in order to address the sparsity issue at the input side of the HCRF model, we propose to learn distributed, low-dimensional representations of user actions through a directed skip-gram, a neural architecture suitable for sequential data. Experimental results on a real-world data set comprising thousands of user sessions collected at Yahoo servers clearly indicate the benefits and the potential of the proposed approach, which outperformed competing state-of-the-art algorithms and obtained significant improvements in terms of retrieval measures.

【Keywords】: advertising data processing; information retrieval; neural nets; probability; HCRF model; ad click estimation; conversion probability; deep user embeddings; directed skip-gram architecture; hidden conditional random fields; neural architecture; retrieval measure; targeted advertising; user propensity estimation; Advertising; Browsers; Context; Context modeling; Hidden Markov models; Predictive models; Support vector machines; ad targeting; click modeling; hidden CRF; purchase prediction

85. Learning to Grade Student Programs in a Massive Open Online Course.

【Paper Link】【Pages】:785-790

【Authors】: Anna Drummond ; Yanxin Lu ; Swarat Chaudhuri ; Christopher M. Jermaine ; Joe Warren ; Scott Rixner

【Abstract】: We study the problem of automatically evaluating the quality of computer programs produced by students in a very large, online, interactive programming course (or "MOOC"). Automatically evaluating interactive programs (such as computer games) is not easy because such programs lack any sort of well-defined logical specification. As an alternative, we devise some simple statistical approaches to assigning a score to a student-produced code.

【Keywords】: computer aided instruction; computer games; computer science education; educational administrative data processing; educational courses; formal specification; programming; statistical analysis; computer games; computer programs; interactive programming course; massive open online course; quality evaluation; statistical approaches; student program grading; student-produced code; well-defined logical specification; Computational modeling; Games; Libraries; Measurement; Peer-to-peer computing; Programming; Prototypes

86. Senders, Receivers and Authors in Document Classification.

【Paper Link】【Pages】:791-796

【Authors】: Anna Drummond ; Christopher M. Jermaine

【Abstract】: In many document classification problems, sets of people will be associated with the document. These sets might include document authors, or people who have read the document, or the sender of an electronic message, or the recipients of the message, or those carbon copied, or those blind carbon copied. It is obvious that these sets of people can constitute important information that can help to classify the document. In this paper, we propose a simple method for mapping the set of people in a sender or receiver category to a single, low dimensional vector in a latent space. There are many ways that this vector can be used to help with the document classification task, and in the paper we consider three distinct possibilities in detail. We find that mapping a set of senders or receivers to a latent space in this way and incorporating this mapping into a classifier can greatly boost classification accuracy on several real electronic discovery tasks.

【Keywords】: document handling; electronic messaging; pattern classification; classification accuracy; document author; document classification; electronic discovery task; electronic message sender; message recipient; Bayes methods; Carbon; Electronic mail; Encoding; Receivers; Support vector machines; Vectors

【Paper Link】【Pages】:797-802

【Authors】: Jianguang Du ; Jing Jiang ; Liu Yang ; Dandan Song ; Lejian Liao

【Abstract】: Threaded debate forums have become one of the major social media platforms. Usually people argue with one another using not only claims and evidences about the topic under discussion but also language used to organize them, which we refer to as shell. In this paper, we study how to separate shell from topical contents using unsupervised methods. Along this line, we develop a latent variable model named Shell Topic Model (STM) to jointly model both topics and shell. Experiments on real online debate data show that our model can find both meaningful shell and topics. The results also show the effectiveness of our model by comparing it with several baselines in shell phrases extraction and document modeling.

【Keywords】: data mining; learning (artificial intelligence); social networking (online); text analysis; STM; argumentative text; document modeling; latent variable model; organizational phrase mining; shell miner; shell phrase extraction; shell topic model; social media; topical content; Data mining; Data models; Educational institutions; Hidden Markov models; Media; Noise measurement; Training; argumentative text; latent variable model; organizational phrases; topic modeling

88. Topic Models with Topic Ordering Regularities for Topic Segmentation.

【Paper Link】【Pages】:803-808

【Authors】: Lan Du ; John K. Pate ; Mark Johnson

【Abstract】: Documents from the same domain usually discuss similar topics in a similar order. In this paper we present new ordering-based topic models that use generalised Mallows models to capture this regularity to constrain topic assignments. Specifically, these new models assume that there is a canonical topic ordering shared amongst documents from the same domain, and each document-specific topic ordering is allowed to vary from the canonical topic ordering. Instead of full orderings over a set of all possible topics covered by a domain, we make use of top-t orderings via a multistage ranking process. We show how to reformulate the new models so that a point-wise sampling algorithm from the Bayesian word segmentation literature can be used for posterior inference. Experimental results on several document collections with different properties show that our model performs much better than the other topic ordering-based models, and competitively with other state-of-the-art topic segmentation models.

【Keywords】: belief networks; document handling; pattern classification; sampling methods; Bayesian word segmentation literature; canonical topic ordering; document-specific topic ordering; generalised Mallows models; multistage ranking process; ordering-based topic models; point-wise sampling algorithm; posterior inference; top-t orderings; topic assignments; topic ordering regularities; topic segmentation; Adaptation models; Biological system modeling; Electronic publishing; Encyclopedias; Hidden Markov models; Internet; GMM; Topic model; permutation; top-t ordering; topic segmentation

89. Understanding Where Your Classifier Does (Not) Work - The SCaPE Model Class for EMM.

【Paper Link】【Pages】:809-814

【Authors】: Wouter Duivesteijn ; Julia Thaele

【Abstract】: FACT, the First G-APD Cherenkov Telescope, detects air showers induced by high-energetic cosmic particles. It is desirable to classify a shower as being induced by a gamma ray or a background particle. Generally, it is nontrivial to get any feedback on the real-life training task, but we can attempt to understand how our classifier works by investigating its performance on Monte Carlo simulated data. To this end, in this paper we develop the SCaPE (Soft Classifier Performance Evaluation) model class for Exceptional Model Mining, which is a Local Pattern Mining framework devoted to highlighting unusual interplay between multiple targets. In our Monte Carlo simulated data, we take as targets the computed classifier probabilities and the binary column containing the ground truth: which kind of particle induced the corresponding shower. Using a newly developed quality measure based on ranking loss, the SCaPE model class highlights subspaces of the search space where the classifier performs particularly well or poorly. These subspaces arrive in terms of conditions on attributes of the data, hence they come in a language a domain expert understands, which should aid him in understanding where his/her classifier does (not) work. Found subgroups highlight subspaces whose difficulty for classification is corroborated by astrophysical interpretation, as well as subspaces that warrant further investigation.

【Keywords】: Monte Carlo methods; astronomical telescopes; astronomy computing; data mining; pattern classification; probability; search problems; EMM; FACT telescope; G-APD Cherenkov telescope; Monte Carlo simulated data; SCaPE; air shower detection; astrophysical interpretation; background particle; binary column; classifier probabilities; exceptional model mining; gamma ray; high-energetic cosmic particles; local pattern mining framework; ranking loss; search space; shower classification; soft classifier performance evaluation model class; Atmospheric modeling; Data mining; Loss measurement; Protons; Radio frequency; Telescopes; Terrestrial atmosphere; Astrophysics; Cherenkov radiation; Exceptional Model Mining; soft classifier

90. Ring-Shaped Hotspot Detection: A Summary of Results.

【Paper Link】【Pages】:815-820

【Authors】: Emre Eftelioglu ; Shashi Shekhar ; Dev Oliver ; Xun Zhou ; Michael R. Evans ; Yiqun Xie ; James M. Kang ; Renee Laubscher ; Christopher Farah

【Abstract】: Given a collection of geo-located activities (e.g., Crime reports), ring-shaped hotspot detection (RHD) finds rings, where concentration of activities inside the ring is much higher than outside. RHD is important for the applications such as crime analysis, where it may focus the search for crime source's location, e.g. The home of a serial criminal. RHD is challenging because of the large number of candidate rings and the high computational cost of the statistical significance test. Previous statistically significant hotspot detection techniques (e.g., Sat Scan) identify circular/rectangular areas, but can not discover rings. This paper proposes a dual grid based pruning (DGP) approach to detect ring-shaped hotspots. A case study on real crime data confirms that DGP detects novel ring-shaped regions, regions that go undetected by Sat Scan. Experiments show that DGP improves the computational cost of a naive approach substantially.

【Keywords】: geographic information systems; grid computing; statistical analysis; DGP approach; RHD; dual grid based pruning; geo-located activities; ring-shaped hotspot detection; statistical significance test; Biology; Computational efficiency; Equations; Geology; Mathematical model; Monte Carlo methods; Upper bound

91. Tensor Regression Based on Linked Multiway Parameter Analysis.

【Paper Link】【Pages】:821-826

【Authors】: Yifan Fu ; Junbin Gao ; Xia Hong ; David Tien

【Abstract】: Classical regression methods take vectors as covariates and estimate the corresponding vectors of regression parameters. When addressing regression problems on covariates of more complex form such as multi-dimensional arrays (i.e. Tensors), traditional computational models can be severely compromised by ultrahigh dimensionality as well as complex structure. By exploiting the special structure of tensor covariates, the tensor regression model provides a promising solution to reduce the model's dimensionality to a manageable level, thus leading to efficient estimation. Most of the existing tensor-based methods independently estimate each individual regression problem based on tensor decomposition which allows the simultaneous projections of an input tensor to more than one direction along each mode. As a matter of fact, multi-dimensional data are collected under the same or very similar conditions, so that data share some common latent components but can also have their own independent parameters for each regression task. Therefore, it is beneficial to analyse regression parameters among all the regressions in a linked way. In this paper, we propose a tensor regression model based on Tucker Decomposition, which identifies not only the common components of parameters across all the regression tasks, but also independent factors contributing to each particular regression task simultaneously. Under this paradigm, the number of independent parameters along each mode is constrained by a sparsity-preserving regulariser. Linked multiway parameter analysis and sparsity modeling further reduce the total number of parameters, with lower memory cost than their tensor-based counterparts. The effectiveness of the new method is demonstrated on real data sets.

【Keywords】: data acquisition; parameter estimation; regression analysis; tensors; vectors; Tucker decomposition; linked multiway parameter analysis; memory cost; multidimensional arrays; multidimensional data; sparsity modeling; sparsity-preserving regulariser; tensor decomposition; tensor regression model; Data models; Educational institutions; Nickel; Tensile stress; Training; Vectors; linked multiway parameter analysis; sparse coding; tensor regression

92. Heavyweight Pattern Mining in Attributed Flow Graphs.

【Paper Link】【Pages】:827-832

【Authors】: Carolina Simoes Gomes ; José Nelson Amaral ; Jörg Sander ; Joran Siu ; Li Ding

【Abstract】: This paper defines a new problem - heavyweight pattern mining in attributed flow graphs. The problem can be described as the discovery of patterns in flow graphs that have sets of attributes associated with their nodes. A connection between nodes is represented as a directed edge. The amount of load that goes through a path between nodes, or the frequency of transmission of such load between nodes, is represented as edge weights. A heavyweight pattern is a sub-set of attributes, found in a dataset of attributed flow graphs, that are connected by edges and have a computed weight higher than an user-defined threshold. A new algorithm called AFG Miner is introduced, the first one to our knowledge that finds heavyweight patterns in a dataset of attributed flow graphs and associates each pattern with its occurrences. The paper also describes a new tool for compiler engineers, HEP Miner, that applies the AFG Miner algorithm to Profile-based Program Analysis modeled as a heavyweight pattern mining problem.

【Keywords】: data flow graphs; data mining; directed graphs; AFG Miner algorithm; HEP Miner; attributed flow graphs; compiler engineers; directed edge; heavyweight pattern mining; profile-based program analysis; user-defined threshold; Algorithm design and analysis; Complexity theory; Data mining; Electronic mail; Flow graphs; Labeling; Servers; data mining; flow graphs; graph mining; pattern mining; program analysis; program profiling; software analysis; sub-graph mining

93. Online Spectral Learning on a Graph with Bandit Feedback.

【Paper Link】【Pages】:833-838

【Authors】: Quanquan Gu ; Jiawei Han

【Abstract】: Online learning on a graph is appealing due to its efficiency. However, existing online learning algorithms on a graph are limited to binary classification. Moreover, they require accessing the full label information, where the label oracle needs to return the true class label after the learner makes classification of each node. In many application scenarios, we only have access to partial label information, where the label oracle will return a single bit indicating whether the prediction is correct or not, instead of the true class label. This is also known as bandit feedback. In this paper, to overcome the above limitations of existing online learning algorithms on a graph, we study online learning on a graph for multi-class node classification, in both the full information setting and the partial information setting. First, we present an online multi-class classification algorithm in the full information setting. It is based on function learning on a graph using the spectral information of the graph Laplacian. We show that it attains O (cd log T) regret bound, where T is the number of rounds in online learning, c is the number of classes, and d is the number of eigenvectors of the graph Laplacian used for learning. Second, we present an online multi-class classification algorithm with bandit feedback. We use upper-confidence bound technique to trade off the exploration and exploitation of label information. We show that it attains O (cd √T log T) regret bound, which is only a √T factor worse than the proposed algorithm in the full information setting. Experiments on several benchmark graph datasets show that the proposed online multi-class classification algorithm beats the state-of-art baseline, and the proposed bandit algorithm is also much better than the bandit version of the baseline.

【Keywords】: computational complexity; eigenvalues and eigenfunctions; feedback; graph theory; learning (artificial intelligence); pattern classification; bandit feedback; eigenvectors; function learning on a graph; graph Laplacian; multiclass node classification; online learning on a graph; online multiclass classification algorithm; online spectral learning; upper-confidence bound technique; Algorithm design and analysis; Data mining; Eigenvalues and eigenfunctions; Error analysis; Laplace equations; Prediction algorithms; Vectors; Bandit Feedback; Graph Cut Size; Online Learning on a Graph; Partial Information Setting; Regret Bound; Spectral Learning

94. Low-Density Cut Based Tree Decomposition for Large-Scale SVM Problems.

【Paper Link】【Pages】:839-844

【Authors】: Lifang He ; Hong-Han Shuai ; Xiangnan Kong ; Zhifeng Hao ; Xiaowei Yang ; Philip S. Yu

【Abstract】: The current trend of growth of information reveals that it is inevitable that large-scale learning problems become the norm. In this paper, we propose and analyze a novel Low-density Cut based tree Decomposition method for large-scale SVM problems, called LCD-SVM. The basic idea here is divide and conquer: use a decision tree to decompose the data space and train SVMs on the decomposed regions. Specifically, we demonstrate the application of low density separation principle to devise a splitting criterion for rapidly generating a high-quality tree, thus maximizing the benefits of SVMs training. Extensive experiments on 14 real-world datasets show that our approach can provide a significant improvement in training time over state-of-the-art methods while keeps comparable test accuracy with other methods, especially for very large-scale datasets.

【Keywords】: decision trees; learning (artificial intelligence); pattern classification; support vector machines; LCD-SVM; SVM problems; SVM training; data space decomposition; decision tree; large-scale learning problems; low density separation principle; low-density cut based tree decomposition; splitting criterion; Accuracy; Computational complexity; Decision trees; Educational institutions; Histograms; Support vector machines; Training; Support vector machines; decision tree; large scale; splitting criterion

【Paper Link】【Pages】:845-850

【Authors】: Bo Hu ; Martin Ester

【Abstract】: In this paper, we address the problem of recommending Point-of-Interests (POIs) to users in a location-based social network. To the best of our knowledge, we are the first to propose the ST (Social Topic) model capturing both the social and topic aspects of user check-ins. We conduct experiments on real life data sets from Foursquare and Yelp. We evaluate the effectiveness of ST by evaluating the accuracy of top-k POI recommendation. The experimental results show that ST achieves better performance than the state-of-the-art models in the areas of social network-based recommender systems, and exploits the power of the location-based social network that has never been utilized before.

【Keywords】: recommender systems; social networking (online); Foursquare; ST model; Yelp; location-based social network; point-of-interest recommendation; real life data sets; social aspects; social network-based recommender systems; social topic modeling; top-k POI recommendation; topic aspects; user check-ins; Accuracy; Cities and towns; Data models; Indexes; Motion pictures; Recommender systems; Social network services

96. Bayesian Heteroskedastic Choice Modeling on Non-identically Distributed Linkages.

【Paper Link】【Pages】:851-856

【Authors】: Liang Hu ; Wei Cao ; Jian Cao ; Guandong Xu ; Longbing Cao ; Zhiping Gu

【Abstract】: Choice modeling (CM) aims to describe and predict choices according to attributes of subjects and options. If we presume each choice making as the formation of link between subjects and options, immediately CM can be bridged to link analysis and prediction (LAP) problem. However, such a mapping is often not trivial and straightforward. In LAP problems, the only available observations are links among objects but their attributes are often inaccessible. Therefore, we extend CM into a latent feature space to avoid the need of explicit attributes. Moreover, LAP is usually based on binary linkage assumption that models observed links as positive instances and unobserved links as negative instances. Instead, we use a weaker assumption that treats unobserved links as pseudo negative instances. Furthermore, most subjects or options may be quite heterogeneous due to the long-tail distribution, which is failed to capture by conventional LAP approaches. To address above challenges, we propose a Bayesian heteroskedastic choice model to represent the non-identically distributed linkages in the LAP problems. Finally, the empirical evaluation on real-world datasets proves the superiority of our approach.

【Keywords】: belief networks; data analysis; statistical distributions; Bayesian heteroskedastic choice modeling; CM; LAP problems; link analysis and prediction; long-tail distribution; nonidentically distributed linkages; pseudo negative instances; Adaptation models; Bayes methods; Biological system modeling; Couplings; Data models; Predictive models; Vectors; heteroskedastic choice model; link analysis and prediction; non-IID Bayesian analysis; parallel Gibbs sampling

97. Robust Dynamic Trajectory Regression on Road Networks: A Multi-task Learning Framework.

【Paper Link】【Pages】:857-862

【Authors】: Aiqing Huang ; Linli Xu ; Yitan Li ; Enhong Chen

【Abstract】: Trajectory regression, which aims to predict the travel time of arbitrary trajectories on road networks, attracts significant attention in various applications of traffic systems these years. In this paper, we tackle this problem with a multitask learning (MTL) framework. To take the temporal nature of the problem into consideration, we divide the regression problem into a set of sub-tasks of distinct time periods, then the problem can be treated in a multi-task learning framework. Further, we propose a novel regularization term in which we exploit the block sparse structure to augment the robustness of the model. In addition, we incorporate the spatial smoothness over road links and thus achieve a spatial-temporal framework. An accelerated proximal algorithm is adopted to solve the convex but non-smooth problem, which will converge to the global optimum. Experiments on both synthetic and real data sets demonstrate the effectiveness of the proposed method.

【Keywords】: learning (artificial intelligence); regression analysis; traffic engineering computing; trajectory control; MTL framework; accelerated proximal algorithm; arbitrary trajectories; block sparse structure; convex nonsmooth problem; multitask learning framework; regression problem; road networks; robust dynamic trajectory regression; spatial-temporal framework; traffic systems; travel time prediction; Acceleration; Data models; Optimization; Roads; Robustness; Training; Trajectory; dynamic; multi-task learning; structured sparsity; trajectory regression

98. Detecting Volatility Shift in Data Streams.

【Paper Link】【Pages】:863-868

【Authors】: David Tse Jung Huang ; Yun Sing Koh ; Gillian Dobbie ; Russel Pears

【Abstract】: Current drift detection techniques detect a change in distribution within a stream. However, there are no current techniques that analyze the change in the rate of these detected changes. We coin the term stream volatility, to describe the rate of changes in a stream. A stream has a high volatility if changes are detected frequently and has a low volatility if changes are detected infrequently. We are particularly interested in a volatility shift which is a change in the rate of change (e.g. From high volatility to low volatility). We introduce and define the concept of stream volatility, and propose a novel technique to detect volatility on data streams in the presence of concept drifts. In the experiments we show our algorithm to be both fast and efficient. We also propose a new algorithm for drift detection called SEED that is faster and more memory efficient than the existing state-of-the-art drift detection approach. A faster drift detection algorithm has a flow-on benefit to the subsequent volatility detection stage because both algorithms run concurrently on the data stream.

【Keywords】: data mining; SEED; data streams; drift detection techniques; stream change rate; stream volatility; volatility shift detection; Algorithm design and analysis; Delays; Detection algorithms; Detectors; Memory management; Reservoirs; Testing; Data Stream; Drift Detection; Volatility Detection

99. Latent Ranking Analysis Using Pairwise Comparisons.

【Paper Link】【Pages】:869-874

【Authors】: Younghoon Kim ; Wooyeol Kim ; Kyuseok Shim

【Abstract】: Ranking objects is an essential problem in recommendation systems. Since comparing two objects is the simplest type of queries in order to measure the relevance of objects, the problem of aggregating pair wise comparisons to obtain a global ranking has been widely studied. In order to learn a ranking model, a training set of queries as well as their correct labels are supplied and a machine learning algorithm is used to find the appropriate parameters of the ranking model with respect to the labels. In this paper, we propose a probabilistic model for learning multiple latent rankings using pair wise comparisons. Our novel model can capture multiple hidden rankings underlying the pair wise comparisons. Based on the model, we develop an efficient inference algorithm to learn multiple latent rankings. The performance study with synthetic and real-life data sets confirms the effectiveness of our model and inference algorithm.

【Keywords】: inference mechanisms; learning (artificial intelligence); inference algorithm; latent ranking analysis; machine learning algorithm; multiple hidden rankings; multiple latent rankings; pairwise comparisons; probabilistic model; ranking model; ranking objects; real-life data sets; recommendation systems; Accuracy; Data models; Educational institutions; Equations; Probabilistic logic; Standards; Vectors; Learning to rank; multiple latent rankings; supervised learning

100. Bus Travel Time Predictions Using Additive Models.

【Paper Link】【Pages】:875-880

【Authors】: Matthias Kormaksson ; Luciano Barbosa ; Marcos R. Vieira ; Bianca Zadrozny

【Abstract】: Many factors can affect the predictability of public bus services such as traffic, weather, day of week, and hour of day. However, the exact nature of such relationships between travel times and predictor variables is, in most situations, not known. In this paper we develop a framework that allows for flexible modeling of bus travel times through the use of Additive Models. The proposed class of models provides a principled statistical framework that is highly flexible in terms of model building. The experimental results demonstrate uniformly superior performance of our best model as compared to previous prediction methods when applied to a very large GPS data set obtained from buses operating in the city of Rio de Janeiro.

【Keywords】: statistical analysis; transportation; Brazil; GPS data set; Global Positioning System; Rio de Janeiro; additive models; bus travel time prediction; model building; predictor variables; principled statistical framework; public bus service predictability; Additives; Data models; Global Positioning System; Kernel; Predictive models; Support vector machines; Trajectory; Arrival Time Prediction; Basis Function; Mixed Models; Tensor Product Basis; Traffic Modeling; Trajectory Data

101. Explicit Versus Implicit Graph Feature Maps: A Computational Phase Transition for Walk Kernels.

【Paper Link】【Pages】:881-886

【Authors】: Nils Kriege ; Marion Neumann ; Kristian Kersting ; Petra Mutzel

【Abstract】: As many real-world data can elegantly be represented as graphs, various graph kernels and methods for computing them have been proposed. Surprisingly, many of the recent graph kernels do not employ the kernel trick anymore but rather compute an explicit feature map and report higher efficiency. So, is there really no benefit of the kernel trick when it comes to graphs? Triggered by this question, we investigate under which conditions it is possible to compute a graph kernel explicitly and for which graph properties this computation is actually more efficient. We give a sufficient condition for R-convolution kernels that enables kernel computation by explicit mapping. We theoretically and experimentally analyze efficiency and flexibility of implicit kernel functions and dot products of explicitly computed feature maps for widely used graph kernels such as random walk kernels, sub graph matching kernels, and shortest-path kernels. For walk kernels we observe a phase transition when comparing runtime with respect to label diversity and walk lengths leading to the conclusion that explicit computations are only favourable for smaller label sets and walk lengths whereas implicit computation is superior for longer walk lengths and data sets with larger label diversity.

【Keywords】: computational complexity; data handling; graph theory; random processes; R-convolution kernels; computational phase transition; dot products; efficiency analysis; explicit graph feature maps; flexibility analysis; graph kernels; graph properties; implicit graph feature maps; implicit kernel functions; kernel computation; label diversity; label sets; phase transition; random walk kernels; real-world data; runtime analysis; shortest-path kernels; subgraph matching kernels; sufficient condition; time complexity; walk lengths; Convolution; Data mining; Java; Kernel; Runtime; Symmetric matrices; Vectors

102. Spectral Clustering for Medical Imaging.

【Paper Link】【Pages】:887-892

【Authors】: Chia-Tung Kuo ; Peter B. Walker ; Owen T. Carmichael ; Ian Davidson

【Abstract】: Spectral clustering is often reported in the literature as successfully being applied to applications from image segmentation to community detection. However, what is not reported is that great time and effort are required to construct a graph Laplacian to achieve these successes. This problem which we call Laplacian construction is critical for the success of spectral clustering but is not well studied by the community. Instead the best Laplacian is typically learnt for each domain from trial and error. This is problematic for areas such as medical imaging since: (i) the same images can be segmented in multiple ways depending on the application focus and (ii) we don't wish to construct one Laplacian, rather we wish to create a method to construct a Laplacian for each patient's scan. In this paper we attempt to automate the process of Laplacian creation with the help of guidance towards the application focus. In most domains creating a basic Laplacian is plausible, so we propose adjusting this given Laplacian by discovering important nodes. We formulate this problem as an integer linear program with a precise geometric interpretation which is globally minimized using large scale solvers such as Gurobi. We show the usefulness on a real world problem in the area of fMRI scan segmentation where methods using standard Laplacians perform poorly.

【Keywords】: biomedical MRI; graph theory; image segmentation; integer programming; linear programming; medical image processing; minimisation; pattern clustering; Gurobi; automatic Laplacian creation process; community detection; fMRI scan segmentation; geometric interpretation; global minimization; graph Laplacian construction; image segmentation; integer linear program; large scale solvers; medical imaging; node discovery; patient scan; real world problem; spectral clustering; Biomedical imaging; Image edge detection; Laplace equations; Senior citizens; Sociology; Statistics; Vectors

103. Fast Algorithms for Frequent Itemset Mining from Uncertain Data.

【Paper Link】【Pages】:893-898

【Authors】: Carson Kai-Sang Leung ; Richard Kyle MacKinnon

【Abstract】: The majority of existing data mining algorithms mine frequent item sets from precise databases. A well-known algorithm is FP-growth, which builds a compact FP-tree structure to capture important contents of the database and mines frequent item sets from the FP-tree. However, there are situations in which data are uncertain. In recent years, researchers have paid attention to frequent item set mining from uncertain databases. UFP-growth is one of the frequently cited algorithms for mining uncertain data. However, the corresponding UFP-tree structure can be large. Other tree structures for handling uncertain data may achieve compactness at the expense of looser upper bounds on expected supports. To solve this problem, we propose two compact tree structures which capture uncertain data with tighter upper bounds than existing tree structures. We also designed two algorithms that mine frequent item sets from our proposed trees. Our experimental results show the tightness of bounds to expected supports provided by these algorithms.

【Keywords】: data handling; data mining; tree data structures; FP-growth; UF-growth algorithm; UF-tree structure; compact FP-tree structure; data mining algorithms; fast algorithms; frequent itemset mining; looser upper bounds; tightened upper bounds; uncertain data handling; Algorithm design and analysis; Data mining; Electron tubes; Integrated circuits; Itemsets; Upper bound; Association analysis; data mining algorithms; frequent patterns; tree structures; uncertain data

104. Spotting Fake Reviews via Collective Positive-Unlabeled Learning.

【Paper Link】【Pages】:899-904

【Authors】: Huayi Li ; Zhiyuan Chen ; Bing Liu ; Xiaokai Wei ; Jidong Shao

【Abstract】: Online reviews have become an increasingly important resource for decision making and product designing. But reviews systems are often targeted by opinion spamming. Although fake review detection has been studied by researchers for years using supervised learning, ground truth of large scale datasets is still unavailable and most of existing approaches of supervised learning are based on pseudo fake reviews rather than real fake reviews. Working with Dianping, the largest Chinese review hosting site, we present the first reported work on fake review detection in Chinese with filtered reviews from Dianping's fake review detection system. Dianping's algorithm has a very high precision, but the recall is hard to know. This means that all fake reviews detected by the system are almost certainly fake but the remaining reviews (unknown set) may not be all genuine. Since the unknown set may contain many fake reviews, it is more appropriate to treat it as an unlabeled set. This calls for the model of learning from positive and unlabeled examples (PU learning). By leveraging the intricate dependencies among reviews, users and IP addresses, we first propose a collective classification algorithm called Multi-typed Heterogeneous Collective Classification (MHCC) and then extend it to Collective Positive and Unlabeled learning (CPU). Our experiments are conducted on real-life reviews of 500 restaurants in Shanghai, China. Results show that our proposed models can markedly improve the F1 scores of strong baselines in both PU and non-PU learning settings. Since our models only use language independent features, they can be easily generalized to other languages.

【Keywords】: Internet; data mining; learning (artificial intelligence); pattern classification; CPU learning; Dianping fake review detection system; MHCC; collective positive and unlabeled learning; multityped heterogeneous collective classification; online review; opinion spamming; supervised learning; Data models; IP networks; Nickel; Reliability; Testing; Training; Training data; Collective PU Learning; Spam Detection

【Paper Link】【Pages】:905-910

【Authors】: Guannan Liu ; Yanjie Fu ; Tong Xu ; Hui Xiong ; Guoqing Chen

【Abstract】: Social media has become one of the most popular marketing channels for many companies, which aims at maximizing their influence by various marketing campaigns conducted from their official accounts on social networks. However, most of these marketing accounts merely focus on the contents of their tweets. Less effort has been made on understanding tweeting time, which is a major contributing factor in terms of attracting customers' attention and maximizing the influence of a social marketing campaign. To that end, in this paper, we provide a focused study of temporal retweeting patterns and their influence on social media marketing campaigns. Specifically, we investigate the users' retweeting patterns by modeling their retweeting behaviors as a generative process, which considers temporal, social, and topical factors. Moreover, we validate the predictive power of the model on the dataset collected from Sina Weibo, the most popular micro blog platform in China. By discovering the temporal retweeting patterns, we analyze the temporal popular topics and recommend tweets to users in a time-aware manner. Finally, experimental results show that the proposed algorithm outperforms other baseline methods. This model is applicable for companies to conduct their marketing campaigns at the right time on social media.

【Keywords】: marketing data processing; social networking (online); China; Sina Weibo; generative process; microblog platform; retweeting behaviors; social factor; social media marketing campaigns; temporal factor; temporal retweeting patterns; topical factor; Companies; Context; Context modeling; Educational institutions; History; Media; Predictive models

【Paper Link】【Pages】:911-916

【Authors】: Rui Liu ; Kiyana Zolfaghar ; Si-Chi Chin ; Senjuti Basu Roy ; Ankur Teredesai

【Abstract】: In this paper, we describe a novel framework to recommend personalized intervention strategies to minimize 30-day readmission risk for heart failure (HF) patients, as they move through the provider's cardiac care protocol. We design principled solutions by learning the structure and parameters of a multi-layer hierarchical Bayesian network from underlying high-dimensional patient data. Next, we generate and summarize the rules leading to personalized interventions which can be applied to individual patients as they progress from admit to discharge. We present comprehensive experimental results as well as interesting case studies to demonstrate the effectiveness of our proposed framework using large real-world patient datasets on Microsoft Azure for Research platform.

【Keywords】: belief networks; cardiology; learning (artificial intelligence); medical computing; recommender systems; risk management; HF patients; Microsoft Azure; heart failure readmission risk; high-dimensional patient data; multilayer hierarchical Bayesian network; parameter learning; personalized intervention strategy recommendation; provider cardiac care protocol; research platform; structure learning; time 30 day; Conferences; Data mining; bayesian network; heart failure; intervention recommendation; risk of readmission

【Paper Link】【Pages】:917-922

【Authors】: Chen Luo ; Wei Pang ; Zhe Wang ; Chenghua Lin

【Abstract】: In this paper, we investigate the social-based recommendation algorithms on heterogeneous social networks and proposed Hete-CF, a social collaborative filtering algorithm using heterogeneous relations. Distinct from the exiting methods, Hete-CF can effectively utilise multiple types of relations in a heterogeneous social network. More importantly, Hete-CF is a general approach and can be used in arbitrary social networks, including event based social networks, location based social networks, and any other types of heterogeneous information networks associated with social information. The experimental results on a real-world dataset DBLP (a typical heterogeneous information network)demonstrate the effectiveness of our algorithm.

【Keywords】: collaborative filtering; social networking (online); Hete-CF; event based social networks; heterogeneous information networks; heterogeneous relations; heterogeneous social networks; location based social networks; real-world dataset DBLP; social collaborative filtering algorithm; social information; social-based collaborative filtering recommendation; social-based recommendation algorithms; Collaboration; Equations; Mathematical model; Prediction algorithms; Social network services; Sparse matrices; Vectors

108. Hierarchical Incident Ticket Classification with Minimal Supervision.

【Paper Link】【Pages】:923-928

【Authors】: Andrii Maksai ; Jasmina Bogojeska ; Dorothea Wiesmann

【Abstract】: In this paper, we introduce a novel approach for incident ticket classification that aims at minimizing the manual labelling effort while achieving good-quality predictions. To accomplish this, we devise a two-stage technique that employs hierarchical clustering using a combination of graph clustering (community finding) and topic modelling as first stage, followed by either another round of hierarchical clustering or an active learning approach as second stage. We evaluate the performance of our method in terms of manual labelling effort, prediction quality and efficiency on three real-world datasets and demonstrate that classical approaches to text classification are not well suited for incident ticket texts.

【Keywords】: pattern classification; pattern clustering; text analysis; active learning approach; community finding; good-quality predictions; graph clustering; hierarchical clustering; hierarchical incident ticket text classification; manual labelling; minimal supervision; performance evaluation; prediction efficiency; prediction quality; real-world datasets; topic modelling; two-stage technique; Clustering algorithms; Communities; Feature extraction; Labeling; Manuals; Monitoring; Servers; multi-class classification; text mining

109. A Gaussian Process Model for Knowledge Propagation in Web Ontologies.

【Paper Link】【Pages】:929-934

【Authors】: Pasquale Minervini ; Claudia d'Amato ; Nicola Fanizzi ; Floriana Esposito

【Abstract】: We consider the problem of predicting missing class-memberships and property values of individual resources in Web ontologies. We first identify which relations tend to link similar individuals by means of a finite-set Gaussian Process regression model, and then efficiently propagate knowledge about individuals across their relations. Our experimental evaluation demonstrates the effectiveness of the proposed method.

【Keywords】: Gaussian processes; Internet; ontologies (artificial intelligence); regression analysis; Web ontologies; class-memberships; finite-set Gaussian process regression model; knowledge propagation; property values; Gaussian processes; Kernel; Labeling; Ontologies; Portals; Symmetric matrices; Training; gaussian process; semantic web; semi-supervised; transductive

110. Mutual Information Based Output Dimensionality Reduction.

【Paper Link】【Pages】:935-940

【Authors】: Shishir Pandey ; Rahul Vaze

【Abstract】: Given a large dimensional input and output space, even simple regression is prohibitively costly. Dimensionality reduction in the output space is important for efficient learning and prediction as modern paradigms, e.g. Topic modelling, image classification, etc., have extremely large output spaces. In contrast to input dimensionality reduction, dimension reduction in output side is complicated. We propose, mutual information based output dimensionality reduction, that takes into account the relationship between the input and the output which is essential for regression and classification problems. Our method selects those labels to form the compressed label space that typically have the maximum mutual information with the input. Selecting the best subset is computationally hard, but we provide a polynomial time algorithm with provable approximation guarantee. We conduct experiments on seven multi-label classification datasets. Results show our method performs better than existing methods on some datasets.

【Keywords】: approximation theory; computational complexity; learning (artificial intelligence); pattern classification; regression analysis; approximation guarantee; classification problems; compressed label space; dimensional input space; dimensional output space; learning; multilabel classification datasets; mutual information based output dimensionality reduction; polynomial time algorithm; prediction; regression problems; subset selection; Approximation methods; Compressed sensing; Decoding; Educational institutions; Greedy algorithms; Mutual information; Vectors; dimension reduction; multi-label; mutual information; output dimension reduction; submodular function

111. Multi-label Classification with Meta-Labels.

【Paper Link】【Pages】:941-946

【Authors】: Jesse Read ; Antti Puurula ; Albert Bifet

【Abstract】: The area of multi-label classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach can easily be outperformed by methods which learn labels together. A number of methods have grown around the label power set approach, which models label combinations together as class values in a multi-class problem. We describe the label-power set-based solutions under a general framework of meta-labels and provide some theoretical justification for this framework which has been lacking, explaining how meta-labels essentially allow a random projection into a space where non-linearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multi-label problems. We present a novel model in the framework and evaluate it empirically against several high-performing methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. This deployment obtains competitive accuracy for a fraction of the computation required by the current meta-label methods for multi-label classification.

【Keywords】: learning (artificial intelligence); pattern classification; baseline binary relevance approach; evaluation metrics; label combinations; label powerset approach; label-powerset-based solutions; linear learning algorithms; meta-labels; multilabel classification; Accuracy; Indexes; Neural networks; Predictive models; Scalability; Training; Vectors; classification; multi-label

112. Graph Summarization with Quality Guarantees.

【Paper Link】【Pages】:947-952

【Authors】: Matteo Riondato ; David García-Soriano ; Francesco Bonchi

【Abstract】: We study the problem of graph summarization. Given a large graph we aim at producing a concise lossy representation that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation. In this paper we study a very natural type of summary: the original set of vertices is partitioned into a small number of super nodes connected by super edges to form a complete weighted graph. The super edge weights are the edge densities between vertices in the corresponding super nodes. The goal is to produce a summary that minimizes the reconstruction error w.r.t. The original graph. By exposing a connection between graph summarization and geometric clustering problems (i.e., k-means and k-median), we develop the first polynomial-time approximation algorithm to compute the best possible summary of a given size.

【Keywords】: approximation theory; computational complexity; geometry; graph theory; pattern clustering; complete weighted graph; geometric clustering problems; graph summarization; polynomial-time approximation algorithm; Algorithm design and analysis; Approximation algorithms; Approximation methods; Clustering algorithms; Silicon; Smoothing methods; Symmetric matrices

113. Data Fusion Using Restricted Boltzmann Machines.

【Paper Link】【Pages】:953-958

【Authors】: Yoshiki Sakai ; Kenji Yamanishi

【Abstract】: We address the issue of data fusion. Suppose that we are given two datasets, where some variables are different each other and others are the same. The goal of data fusion is to complement the missing unique variables in each dataset using the common variables. Data fusion facilitates inference over multiple independent and different datasets, which is an important data mining issues that affect many applications, such as recommendation, image reconstruction, or market analysis. In this paper, we propose a novel approach to data fusion using restricted Boltzmann machines (RBMs). In applying to data fusion, RBMs are able to model hidden patterns lying behind the two datasets with bipartite graph structure between hidden variables and observable ones. There is a bottleneck in the application of RBMs to data fusion: It is computationally expensive to learn RBMs from data with missing values. Therefore, we propose a new efficient algorithm for learning RBMs from missing data. This algorithm maximizes the lower bound on the observation likelihood, which can efficiently be computed. With benchmark datasets, we empirically demonstrate that our RBM-based data fusion method significantly outperforms existing methods in terms of complement accuracy. These results demonstrate an advantage of data fusion based on latent-variable modeling.

【Keywords】: Boltzmann machines; data mining; maximum likelihood estimation; sensor fusion; RBM; data fusion; data mining; latent-variable modelling; observation likelihood; restricted Boltzmann machine; Computational modeling; Data integration; Data models; Joints; Maximum likelihood estimation; Time complexity; Vectors

114. Spotting Suspicious Link Behavior with fBox: An Adversarial Perspective.

【Paper Link】【Pages】:959-964

【Authors】: Neil Shah ; Alex Beutel ; Brian Gallagher ; Christos Faloutsos

【Abstract】: How can we detect suspicious users in large online networks? Online popularity of a user or product (via follows, page-likes, etc.) can be monetized on the premise of higher ad click-through rates or increased sales. Web services and social networks which incentivize popularity thus suffer from a major problem of fake connections from link fraudsters looking to make a quick buck. Typical methods of catching this suspicious behavior use spectral techniques to spot large groups of often blatantly fraudulent (but sometimes honest) users. However, small-scale, stealthy attacks may go unnoticed due to the nature of low-rank Eigen analysis used in practice. In this work, we take an adversarial approach to find and prove claims about the weaknesses of modern, state-of-the-art spectral methods and propose fBox, an algorithm designed to catch small-scale, stealth attacks that slip below the radar. Our algorithm has the following desirable properties: (a) it has theoretical underpinnings, (b) it is shown to be highly effective on real data and (c) it is scalable (linear on the input size). We evaluate fBox on a large, public 41.7 million node, 1.5 billion edge who-follows-whom social graph from Twitter in 2010 and with high precision identify many suspicious accounts which have persisted without suspension even to this day.

【Keywords】: security of data; social networking (online); Twitter; Web services; ad click-through rate; adversarial perspective; fBox algorithm; social graph; social networks; spectral techniques; suspicious link behavior spotting; Algorithm design and analysis; Encyclopedias; Fans; Matrix decomposition; Twitter; Vectors; link fraud; social networks

115. Recovering Low-Rank and Sparse Matrices via Robust Bilateral Factorization.

【Paper Link】【Pages】:965-970

【Authors】: Fanhua Shang ; Yuanyuan Liu ; James Cheng ; Hong Cheng

【Abstract】: Recovering low-rank and sparse matrices from partial, incomplete or corrupted observations is an important problem in many areas of science and engineering. In this paper, we propose a scalable robust bilateral factorization (RBF) method to recover both structured matrices from missing and grossly corrupted data such as robust matrix completion (RMC), or incomplete and grossly corrupted measurements such as compressive principal component pursuit (CPCP). With the unified framework, we first present two robust trace norm regularized bilateral factorization models for RMC and CPCP problems, which can achieve an orthogonal dictionary and a robust data representation, simultaneously. Then, we apply the alternating direction method of multipliers to efficiently solve the RMC problems. Finally, we provide the convergence analysis of our algorithm, and extend it to address general CPCP problems. Experimental results verified both the efficiency and effectiveness of our RBF method compared with the state-of-the-art methods.

【Keywords】: matrix decomposition; principal component analysis; sparse matrices; CPCP problems; RBF method; RMC; alternating direction method of multipliers; compressive principal component pursuit; convergence analysis; low-rank matrices recovery; orthogonal dictionary; robust data representation; robust matrix completion; robust trace norm regularized bilateral factorization models; sparse matrices recovery; structured matrices; Algorithm design and analysis; Convergence; Face; Image reconstruction; Matrix decomposition; Robustness; Sparse matrices; RPCA; compressive principal component pursuit; low-rank; robust matrix completion

116. A Parallel and Efficient Algorithm for Learning to Match.

【Paper Link】【Pages】:971-976

【Authors】: Jingbo Shang ; Tianqi Chen ; Hang Li ; Zhengdong Lu ; Yong Yu

【Abstract】: Many tasks in data mining and related fields can be formalized as matching between objects in two heterogeneous domains, including collaborative filtering, link prediction, image tagging, and web search. Machine learning techniques, referred to as learning-to-match in this paper, have been successfully applied to the problems. Among them, a class of state-of-the-art methods, named feature-based matrix factorization, formalize the task as an extension to matrix factorization by incorporating auxiliary features into the model. Unfortunately, making those algorithms scale to real world problems is challenging, and simple parallelization strategies fail due to the complex cross talking patterns between sub-tasks. In this paper, we tackle this challenge with a novel parallel and efficient algorithm. Our algorithm, based on coordinate descent, can easily handle hundreds of millions of instances and features on a single machine. The key recipe of this algorithm is an iterative relaxation of the objective to facilitate parallel updates of parameters, with guaranteed convergence on minimizing the original objective function. Experimental results demonstrate that the proposed method is effective on a wide range of matching problems, with efficiency significantly improved upon the baselines while accuracy retained unchanged.

【Keywords】: convergence; iterative methods; learning (artificial intelligence); matrix decomposition; parallel algorithms; pattern matching; Web search; auxiliary feature; collaborative filtering; complex cross talking patterns; convergence; coordinate descent; data mining; efficient algorithm; feature-based matrix factorization; heterogeneous domain; image tagging; iterative relaxation; learning-to-match; link prediction; machine learning techniques; matching problems; parallel algorithm; parallelization strategy; state-of-the-art method; Algorithm design and analysis; Collaboration; Convergence; Parallel algorithms; Prediction algorithms; Time complexity; Training; collaborative filtering; learning to match; parallel matrix factorization

117. Robust Spectral Learning for Unsupervised Feature Selection.

【Paper Link】【Pages】:977-982

【Authors】: Lei Shi ; Liang Du ; Yi-Dong Shen

【Abstract】: In this paper, we consider the problem of unsupervised feature selection. Recently, spectral feature selection algorithms, which leverage both graph Laplacian and spectral regression, have received increasing attention. However, existing spectral feature selection algorithms suffer from two major problems: 1) since the graph Laplacian is constructed from the original feature space, noisy and irrelevant features may have adverse effect on the estimated graph Laplacian and hence degenerate the quality of the induced graph embedding, 2) since the cluster labels are discrete in natural, relaxing and approximating these labels into a continuous embedding can inevitably introduce noise into the estimated cluster labels. Without considering the noise in the cluster labels, the feature selection process may be misguided. In this paper, we propose a Robust Spectral learning framework for unsupervised Feature Selection (RSFS), which jointly improves the robustness of graph embedding and sparse spectral regression. Compared with existing methods which are sensitive to noisy features, our proposed method utilizes a robust local learning method to construct the graph Laplacian and a robust spectral regression method to handle the noise on the learned cluster labels. In order to solve the proposed optimization problem, an efficient iterative algorithm is proposed. We also show the close connection between the proposed robust spectral regression and robust Huber M-estimator. Experimental results on different datasets show the superiority of RSFS.

【Keywords】: feature selection; graph theory; iterative methods; pattern clustering; regression analysis; unsupervised learning; RSFS; estimated cluster labels; graph Laplacian; induced graph embedding; iterative algorithm; optimization problem; robust Huber M-estimator; robust local learning method; robust spectral learning framework for unsupervised feature selection; robust spectral regression method; sparse spectral regression; spectral feature selection algorithms; Accuracy; Clustering algorithms; Laplace equations; Mutual information; Noise; Optimization; Robustness

118. Flow-Based Influence Graph Visual Summarization.

【Paper Link】【Pages】:983-988

【Authors】: Lei Shi ; Hanghang Tong ; Jie Tang ; Chuang Lin

【Abstract】: Visually mining a large influence graph is appealing yet challenging. Existing summarization methods enhance the visualization with blocked views, but have adverse effect on the latent influence structure. How can we visually summarize a large graph to maximize influence flows? In particular, how can we illustrate the impact of an individual node through the summarization? Can we maintain the appealing graph metaphor while preserving both the overall influence pattern and fine readability? To answer these questions, we first formally define the influence graph summarization problem. Second, we propose an end-to-end framework to solve the new problem. Last, we report our experiment results. Evidences demonstrate that our framework can effectively approximate the proposed influence graph summarization objective while outperforming previous methods in a typical scenario of visually mining academic citation networks.

【Keywords】: data visualisation; flow visualisation; graphs; academic citation networks; appealing graph metaphor; flow-based influence graph visual summarization; influence graph summarization objective; influence graph summarization problem; large influence graph; summarization methods; visualization; Clustering algorithms; Data mining; Linear programming; Matrix decomposition; Pipelines; Topology; Visualization; influence flow; influence graph; visualization

119. Distributed Methods for High-Dimensional and Large-Scale Tensor Factorization.

【Paper Link】【Pages】:989-994

【Authors】: Kijung Shin ; U. Kang

【Abstract】: Given a high-dimensional and large-scale tensor, how can we decompose it into latent factors? Can we process it on commodity computers with limited memory? These questions are closely related to recommendation systems exploiting context information such as time and location. They require tensor factorization methods scalable with both the dimension and size of a tensor. In this paper, we propose two distributed tensor factorization methods, SALS and CDTF. Both methods are scalable with all aspects of data, and they show an interesting trade-off between convergence speed and memory requirements. SALS updates a subset of the columns of a factor matrix at a time, and CDTF, a special case of SALS, updates one column at a time. On our experiment, only our methods factorize a 5-dimensional tensor with 1B observable entries, 10M mode length, and 1K rank, while all other state-of-the-art methods fail. Moreover, our methods require several orders of magnitude less memory than the competitors. We implement our methods on MapReduce with two widely applicable optimization techniques: local disk caching and greedy row assignment.

【Keywords】: data handling; least squares approximations; matrix decomposition; parallel processing; CDTF; MapReduce; SALS; coordinate descent for tensor factorization; distributed tensor factorization methods; factor matrix; greedy row assignment; high-dimensional tensor factorization; large-scale tensor factorization; local disk caching; optimization techniques; subset alternating least square; Distributed databases; Matrix decomposition; Memory management; Optimization; Scalability; Tensile stress; Tin; Distributed computing; MapReduce; Recommender system; Tensor factorization

120. Parallel Corpus Approach for Name Matching in Record Linkage.

【Paper Link】【Pages】:995-1000

【Authors】: Jeffrey Sukharev ; Leonid Zhukov ; Alexandrin Popescul

【Abstract】: Record linkage, or entity resolution, is an important area of data mining. Name matching is a key component of systems for record linkage. Alternative spellings of the same name are a common occurrence in many applications. We use the largest collection of genealogy person records in the world together with user search query logs to build name-matching models. The procedure for building a crowd-sourced training set is outlined together with the presentation of our method. We cast the problem of learning alternative spellings as a machine translation problem at the character level. We use information retrieval evaluation methodology to show that this method substantially outperforms on our data a number of standard well known phonetic and string similarity methods in terms of precision and recall. Our result can lead to a significant practical impact in entity resolution applications.

【Keywords】: data mining; language translation; parallel processing; query processing; string matching; character level; crowd-sourced training set; data mining; entity resolution; genealogy person records; information retrieval evaluation methodology; machine translation problem; name matching; parallel corpus approach; phonetic method; record linkage; string similarity methods; user search query logs; Buildings; Computational modeling; Couplings; Data mining; Databases; Probability; Training; Crowd Sourcing; Machine Translation; Record Linkage

121. Metric Ranking of Invariant Networks with Belief Propagation.

【Paper Link】【Pages】:1001-1006

【Authors】: Changxia Tao ; Yong Ge ; Qinbao Song ; Yuan Ge ; Olufemi A. Omitaomu

【Abstract】: The management of large-scale distributed information systems relies on the effective use and modeling of monitoring data collected at various points in the distributed information systems. A promising approach is to discover invariant relationships among the monitoring data and generate invariant networks, where a node is a monitoring data source (metric) and a link indicates an invariant relationship between two monitoring data. Such an invariant network representation can help system experts to localize and diagnose the system faults by examining those broken invariant relationships and their related metrics, because system faults usually propagate among the monitoring data and eventually lead to some broken invariant relationships. However, at one time, there are usually a lot of broken links (invariant relationships) within an invariant network. Without proper guidance, it is difficult for system experts to manually inspect this large number of broken links. Thus, a critical challenge is how to effectively and efficiently rank metrics (nodes) of invariant networks according to the anomaly levels of metrics. The ranked list of metrics will provide system experts with useful guidance for them to localize and diagnose the system faults. To this end, we propose to model the nodes and the broken links as a Markov Random Field (MRF), and develop an iteration algorithm to infer the anomaly of each node based on belief propagation (BP). Finally, we validate the proposed algorithm on both real-world and synthetic data sets to illustrate its effectiveness.

【Keywords】: Markov processes; belief networks; distributed processing; information systems; MRF; Markov random field; belief propagation; data source; invariant networks; large-scale distributed information systems; metric ranking; monitoring data; Belief propagation; Benchmark testing; Data models; Information systems; Measurement; Monitoring; Time series analysis; ARX Model; Belief Propogation; Invariant; Invariant Networks

122. High-Dimensional Data Stream Classification via Sparse Online Learning.

【Paper Link】【Pages】:1007-1012

【Authors】: Dayong Wang ; Pengcheng Wu ; Peilin Zhao ; Yue Wu ; Chunyan Miao ; Steven C. H. Hoi

【Abstract】: The amount of data in our society has been exploding in the era of big data today. In this paper, we address several open challenges of big data stream classification, including high volume, high velocity, high dimensionality, and high sparsity. Many existing studies in data mining literature solve data stream classification tasks in a batch learning setting, which suffers from poor efficiency and scalability when dealing with big data. To overcome the limitations, this paper investigates an online learning framework for big data stream classification tasks. Unlike some existing online data stream classification techniques that are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to derive a new effective second-order online learning algorithm for data stream classification. We conduct an extensive set of experiments, in which encouraging results validate the efficacy of the proposed algorithms in comparison to a family of state-of-the-art techniques on a variety of data stream classification tasks.

【Keywords】: Big Data; data analysis; learning (artificial intelligence); pattern classification; Big Data stream classification; SOC; data stream classification tasks; first-order sparse online learning algorithms; high-dimensional data stream classification; online data stream classification techniques; second-order online learning algorithm; sparse online classification; Algorithm design and analysis; Big data; Electronic mail; Error analysis; Machine learning algorithms; Prediction algorithms; Training; data stream classification; online learning; sparse

123. On Sparse Feature Attacks in Adversarial Learning.

【Paper Link】【Pages】:1013-1018

【Authors】: Fei Wang ; Wei Liu ; Sanjay Chawla

【Abstract】: Adversarial learning is the study of machine learning techniques deployed in non-benign environments. Example applications include classifications for detecting spam email, network intrusion detection and credit card scoring. In fact as the gamut of application domains of machine learning grows, the possibility and opportunity for adversarial behavior will only increase. Till now, the standard assumption about modeling adversarial behavior has been to empower an adversary to change all features of the classifiers at will. The adversary pays a cost proportional to the size of "attack". We refer to this form of adversarial behavior as a dense feature attack. However, the aim of an adversary is not just to subvert a classifier but carry out data transformation in a way such that spam continues to appear like spam to the user as much as possible. We demonstrate that an adversary achieves this objective by carrying out a sparse feature attack. We design an algorithm to show how a classifier should be designed to be robust against sparse adversarial attacks. Our main insight is that sparse feature attacks are best defended by designing classifiers which use ℓ₁ regularizers.

【Keywords】: computer crime; learning (artificial intelligence); pattern classification; unsolicited e-mail; ℓ₁ regularizers; adversarial behavior modeling; adversarial learning; attack size; classifier; data transformation; dense feature attack; machine learning techniques; nonbenign environments; spam; sparse adversarial attacks; sparse feature attack; Data models; Electronic mail; Game theory; Games; Logistics; Robustness; Vectors; Adversarial learning; Sparse modelling; l1 regularizer

124. Stability-Based Stopping Criterion for Active Learning.

【Paper Link】【Pages】:1019-1024

【Authors】: Wenquan Wang ; Wenbin Cai ; Ya Zhang

【Abstract】: While active learning has drawn broad attention in recent years, there are relatively few studies on stopping criterion for active learning. We here propose a novel model stability based stopping criterion, which considers the potential of each unlabeled examples to change the model once added to the training set. The underlying motivation is that active learning should terminate when the model does not change much by adding remaining examples. Inspired by the widely used stochastic gradient update rule, we use the gradient of the loss at each candidate example to measure its capability to change the classifier. Under the model change rule, we stop active learning when the changing ability of all remaining unlabeled examples is less than a given threshold. We apply the stability-based stopping criterion to two popular classifiers: logistic regression and support vector machines (SVMs). It can be generalized to a wide spectrum of learning models. Substantial experimental results on various UCI benchmark data sets have demonstrated that the proposed approach outperforms state-of-art methods in most cases.

【Keywords】: gradient methods; learning (artificial intelligence); regression analysis; stability; stochastic processes; support vector machines; SVM; active learning; logistic regression; stability-based stopping criterion; stochastic gradient update rule; support vector machine; Benchmark testing; Data models; Logistics; Stability criteria; Support vector machines; Training; Active learning; Stability; Stopping criterion

125. Hashtag Graph Based Topic Model for Tweet Mining.

【Paper Link】【Pages】:1025-1030

【Authors】: Yuan Wang ; Jie Liu ; Jishi Qu ; Yalou Huang ; Jimeng Chen ; Xia Feng

【Abstract】: Mining topics in Twitter is increasingly attracting more attention. However, the shortness and informality of tweets leads to extreme sparse vector representation with a large vocabulary, which makes the conventional topic models (e.g., Latent Dirichlet Allocation) often fail to achieve high quality underlying topics. Luckily, tweets always show up with rich user-generated hash tags as keywords. In this paper, we propose a novel topic model to handle such semi-structured tweets, denoted as Hash tag Graph based Topic Model (HGTM). By utilizing relation information between hash tags in our hash tag graph, HGTM establishes word semantic relations, even if they haven't co-occurred within a specific tweet. In addition, we enhance the dependencies of both multiple words and hash tags via latent variables (topics) modeled by HGTM. We illustrate that the user-contributed hash tags could serve as weakly-supervised information for topic modeling, and hash tag relation could reveal the semantic relation between tweets. Experiments on a real-world twitter data set show that our model provides an effective solution to discover more distinct and coherent topics than the state-of-the-art baselines and has a strong ability to control sparseness and noise in tweets.

【Keywords】: data mining; social networking (online); HGTM; extreme sparse vector representation; hash tag graph based topic model; hash tag relation; keywords; latent variables; semistructured tweets; topic modeling; topics mining; tweet mining; twitter data set; user-generated hash tags; weakly supervised information; word semantic relations; Analytical models; Data models; Educational institutions; Mathematical model; Semantics; Twitter; Vectors

126. Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data.

【Paper Link】【Pages】:1031-1036

【Authors】: Geoffrey I. Webb

【Abstract】: Discretization of streaming data has received surprisingly little attention. This might be because streaming data require incremental discretization with cut points that may vary over time and this is perceived as undesirable. We argue, to the contrary, that it can be desirable for a discretization to evolve in synchronization with an evolving data stream, even when the learner assumes that attribute values' meanings remain invariant over time. We examine the issues associated with discretization in the context of distribution drift and develop computationally efficient incremental discretization algorithms. We show that discretization can reduce the error of a classical incremental learner and that allowing a discretization to drift in synchronization with distribution drift can further reduce error.

【Keywords】: data handling; synchronisation; data streaming; distribution drift; incremental discretization; synchronization; Approximation algorithms; Context; Electricity; Histograms; Synchronization; Time-frequency analysis; Vectors

127. Scalable Multi-instance Learning.

【Paper Link】【Pages】:1037-1042

【Authors】: Xiu-Shen Wei ; Jianxin Wu ; Zhi-Hua Zhou

【Abstract】: Multi-instance learning (MIL) has been widely applied to diverse applications involving complicated data objects such as images and genes. However, most existing MIL algorithms can only handle small-or moderate-sized data. In order to deal with the large scale problems in MIL, we propose an efficient and scalable MIL algorithm named miFV. Our algorithm maps the original MIL bags into a new feature vector representation, which can obtain bag-level information, and meanwhile lead to excellent performances even with linear classifiers. In consequence, thanks to the low computational cost in the mapping step and the scalability of linear classifiers, miFV can handle large scale MIL data efficiently and effectively. Experiments show that miFV not only achieves comparable accuracy rates with state-of-the-art MIL algorithms, but has hundreds of times faster speed than other MIL algorithms.

【Keywords】: learning (artificial intelligence); pattern classification; bag-level information; data objects; feature vector representation; linear classifiers; low computational cost; miFV MIL algorithm; scalable multiinstance learning; Accuracy; Clustering algorithms; Kernel; Principal component analysis; Scalability; Training; Vectors; efficiency; large scale data; multi-instance learning; scalability

【Paper Link】【Pages】:1043-1048

【Authors】: Yu Ting Wen ; Po-Ruey Lei ; Wen-Chih Peng ; Xiaofang Zhou

【Abstract】: Recently, with the advent of location-based social networking services (LBSNs), travel planning and location-aware information recommendation based on LBSNs have attracted much research attention. In this paper, we study the impact of social relations hidden in LBSNs, i.e., The social influence of friends. We propose a new social influence-based user recommender framework (SIR) to discover the potential value from reliable users (i.e., Close friends and travel experts). Explicitly, our SIR framework is able to infer influential users from an LBSN. We claim to capture the interactions among virtual communities, physical mobility activities and time effects to infer the social influence between user pairs. Furthermore, we intend to model the propagation of influence using diffusion-based mechanism. Moreover, we have designed a dynamic fusion framework to integrate the features mined into a united follow probability score. Finally, our SIR framework provides personalized top-k user recommendations for individuals. To evaluate the recommendation results, we have conducted extensive experiments on real datasets (i.e., The Go Walla dataset). The experimental results show that the performance of our SIR framework is better than the state-of the-art user recommendation mechanisms in terms of accuracy and reliability.

【Keywords】: recommender systems; social networking (online); user interfaces; LBSN; SIR framework; diffusion-based mechanism; dynamic fusion framework; influential users; location-aware information recommendation; location-based social networking services; location-based social networks; mobility activities; personalized top-k user recommendations; probability score; reliability; social influence-based user recommender framework; social relations; travel planning; user recommendation mechanisms; virtual communities; Cities and towns; Educational institutions; Equations; Heating; Mathematical model; Social network services; Tuning

129. On Spectral Analysis of Signed and Dispute Graphs.

【Paper Link】【Pages】:1049-1054

【Authors】: Leting Wu ; Xintao Wu ; Aidong Lu ; Yuemeng Li

【Abstract】: This paper presents a study of signed networks from both theoretical and practical aspects. On the theoretical aspect, we conduct theoretical study based on matrix perturbation theorem for analyzing community structures of complex signed networks and show how the negative edges affect distributions and patterns of node spectral coordinates in the spectral space. We prove and demonstrate cluster orthogonality for two types of signed networks: graph with dense inter-community mixed sign edges and k-dispute graph. We show why the line orthogonality pattern does not hold in the spectral space for these two types of networks. On the practical aspect, we have developed a clustering method to study signed networks and k-dispute networks. Empirical evaluations on both synthetic and real networks show our algorithm outperforms existing clustering methods on signed networks in terms of accuracy.

【Keywords】: complex networks; matrix algebra; network theory (graphs); cluster orthogonality; complex signed network; dense intercommunity mixed sign edges; k-dispute graph; line orthogonality pattern; matrix perturbation theorem; node spectral coordinate; signed graph; spectral analysis; Clustering algorithms; Communities; Eigenvalues and eigenfunctions; Equations; Erbium; Partitioning algorithms; Spectral analysis; Spectral graph analysis; matrix perturbation; signed graph; social network analysis

130. Document-Specific Keyphrase Extraction Using Sequential Patterns with Wildcards.

【Paper Link】【Pages】:1055-1060

【Authors】: Fei Xie ; Xindong Wu ; Xingquan Zhu

【Abstract】: Finding good keyphrases for a document is beneficial for many applications, such as text summarization, browsing, and indexing. In this paper, we propose a sequential pattern mining based document-specific keyphrase extraction method. Our key innovation is to use wildcards (or gap constraints) to help extract sequential patterns, where the flexible wildcard constraints within a pattern can capture semantic relationships between words. To achieve this goal, we regard each single document as a sequential dataset, and propose an efficient algorithm to mine sequential patterns with wildcard and one-off conditions that allows important keyphrases to be captured during the mining process. For each extracted keyphrase candidate, we use some statistical pattern features to characterize it. A supervised learning classifier is trained to identify keyphrases from a test document. Comparisons on keyphrase benchmark datasets confirm that our document-specific keyphrase extraction method is effective in improving the quality of extracted keyphrases.

【Keywords】: Data mining; Databases; Educational institutions; Feature extraction; Microprogramming; Semantics; Time complexity; classification; keyphrase extraction; sequential pattern mining; wildcards

131. ORION: Online Regularized Multi-task Regression and Its Application to Ensemble Forecasting.

【Paper Link】【Pages】:1061-1066

【Authors】: Jianpeng Xu ; Pang-Ning Tan ; Lifeng Luo

【Abstract】: Ensemble forecasting is a well-known numerical prediction technique for modeling the evolution of nonlinear dynamic systems. The ensemble member forecasts are generated from multiple runs of a computer model, where each run is obtained by perturbing the starting condition or using a different model representation of the dynamic system. The ensemble mean or median is typically chosen as the consensus point estimate of the aggregated forecasts for decision making purposes. These approaches are limited in that they assume each ensemble member is equally skill ful and do not consider their inherent correlations. In this paper, we cast the ensemble forecasting task as an online, multi-task regression problem and present a framework called ORION to estimate the optimal weights for combining the ensemble members. The weights are updated using a novel online learning with restart strategy as new observation data become available. Experimental results on seasonal soil moisture predictions from 12 major river basins in North America demonstrate the superiority of the proposed approach compared to the ensemble median and other baseline methods.

【Keywords】: geophysics computing; learning (artificial intelligence); regression analysis; weather forecasting; ORION; ensemble forecasting task; nonlinear dynamic system; online learning; online regularized multitask regression; Computational modeling; Data models; Equations; Forecasting; Manganese; Prediction algorithms; Predictive models; Ensemble Forecasting; Online Multi-task Learning

132. Learning Low-Rank Label Correlations for Multi-label Classification with Missing Labels.

【Paper Link】【Pages】:1067-1072

【Authors】: Linli Xu ; Zhen Wang ; Zefan Shen ; Yubo Wang ; Enhong Chen

【Abstract】: Multi-label learning deals with the problem where each training example is associated with a set of labels simultaneously, with the set of labels corresponding to multiple concepts or semantic meanings. Intuitively, the multiple labels are usually correlated in some semantic space while sharing the same input space. As a consequence, the multi-label learning process can be augmented significantly by exploiting the label correlations effectively. Most of the existing approaches share the limitations in that the label correlations are typically taken as prior knowledge, which may not depict the true dependencies among labels correctly, or they do not adequately address the issue of missing labels. In this paper, we propose an integrated framework that learns the correlations among labels while training the multi-label model simultaneously. Specifically, a low rank structure is adopted to capture the complex correlations among labels. In addition, we incorporate a supplementary label matrix which augments the possibly incomplete label matrix by exploiting the label correlations. An alternating algorithm is then developed to solve the optimization problem. Extensive experiments are conducted on a number of image and text data sets to demonstrate the effectiveness of the proposed approach.

【Keywords】: image classification; learning (artificial intelligence); matrix algebra; optimisation; text analysis; alternating algorithm; image data sets; low-rank label correlation learning; missing labels; multilabel classification; multilabel learning; optimization problem; semantic space; supplementary label matrix; text data sets; Adaptation models; Birds; Correlation; Oceans; Semantics; Training; Vectors; label correlation; low rank; missing labels; multi-label learning

133. Learning Sparse Gaussian Bayesian Network Structure by Variable Grouping.

【Paper Link】【Pages】:1073-1078

【Authors】: Jie Yang ; Henry C. M. Leung ; Siu-Ming Yiu ; Yunpeng Cai ; Francis Y. L. Chin

【Abstract】: Bayesian networks (BNs) are popular for modeling conditional distributions of variables and causal relationships, especially in biological settings such as protein interactions, gene regulatory networks and microbial interactions. Previous BN structure learning algorithms treat variables with similar tendency separately. In this paper, we propose a grouped sparse Gaussian BN (GSGBN) structure learning algorithm which creates BN based on three assumptions: (i) variables follow a multivariate Gaussian distribution, (ii) the network only contains a few edges (sparse), (iii) similar variables have less-divergent sets of parents, while not-so-similar ones should have divergent sets of parents (variable grouping). We use L₁ regularization to make the learned network sparse, and another term to incorporate shared information among variables. For similar variables, GSGBN tends to penalize the differences of similar variables' parent sets more, compared to those not-so-similar variables' parent sets. The similarity of variables is learned from the data by alternating optimization, without prior domain knowledge. Based on this new definition of the optimal BN, a coordinate descent algorithm and a projected gradient descent algorithm are developed to obtain edges of the network and also similarity of variables. Experimental results on both simulated and real datasets show that GSGBN has substantially superior prediction performance for structure learning when compared to several existing algorithms.

【Keywords】: Gaussian distribution; belief networks; biology computing; gradient methods; learning (artificial intelligence); optimisation; GSGBN structure learning algorithm; L₁ regularization; alternating optimization; biological settings; causal relationships; coordinate descent algorithm; grouped sparse Gaussian BN; multivariate Gaussian distribution; not-so-similar variable parent sets; projected gradient descent algorithm; similar variable parent sets; sparse Gaussian Bayesian network structure learning; variable conditional distributions; variable grouping; Benchmark testing; Bismuth; Gaussian distribution; Linear regression; Optimization; Probability distribution; Sensitivity; Bayesian network; microbial interactions; sparsity; variable grouping

134. Learning from Label and Feature Heterogeneity.

【Paper Link】【Pages】:1079-1084

【Authors】: Pei Yang ; Jingrui He ; Hongxia Yang ; Haoda Fu

【Abstract】: Multiple types of heterogeneity, such as label heterogeneity and feature heterogeneity, often co-exist in many real-world data mining applications, such as news article categorization, gene functionality prediction. To effectively leverage such heterogeneity, in this paper, we propose a novel graph-based framework for Learning with both Label and Feature heterogeneities, namely L2F. It models the label correlation by requiring that any two label-specific classifiers behave similarly on the same views if the associated labels are similar, and imposes the view consistency by requiring that view-based classifiers generate similar predictions on the same examples. To solve the resulting optimization problem, we propose an iterative algorithm, which is guaranteed to converge to the global optimum. Furthermore, we analyze its generalization performance based on Rademacher complexity, which sheds light on the benefits of jointly modeling the label and feature heterogeneity. Experimental results on various data sets show the effectiveness of the proposed approach.

【Keywords】: data mining; graph theory; iterative methods; learning (artificial intelligence); optimisation; pattern classification; Rademacher complexity; feature heterogeneity; iterative algorithm; label heterogeneity; label-specific classifiers; multilabel learning; optimization; real-world data mining applications; Complexity theory; Correlation; Diabetes; Linear programming; Loss measurement; Optimization; Vectors; Rademacher complexity; heterogeneity; multi-label learning; multi-view learning

135. Learning from Imbalanced Data in Relational Domains: A Soft Margin Approach.

【Paper Link】【Pages】:1085-1090

【Authors】: Shuo Yang ; Tushar Khot ; Kristian Kersting ; Gautam Kunapuli ; Kris Hauser ; Sriraam Natarajan

【Abstract】: We consider the problem of learning probabilistic models from relational data. One of the key issues with relational data is class imbalance where the number of negative examples far outnumbers the number of positive examples. The common approach for dealing with this problem is the use of sub-sampling of negative examples. We, on the other hand, consider a soft margin approach that explicitly trades off between the false positives and false negatives. We apply this approach to the recently successful formalism of relational functional gradient boosting. Specifically, we modify the objective function of the learning problem to explicitly include the trade-off between false positives and negatives. We show empirically that this approach is more successful in handling the class imbalance problem than the original framework that weighed all the examples equally.

【Keywords】: data mining; gradient methods; probability; sampling methods; class imbalance; imbalanced data; probabilistic model; relational data; relational functional gradient boosting; soft margin approach; subsampling method; Boosting; Computational modeling; Cost function; Electronic mail; Measurement; Probabilistic logic; Standards

136. K-MEAP: Generating Specified K Clusters with Multiple Exemplars by Efficient Affinity Propagation.

【Paper Link】【Pages】:1091-1096

【Authors】: Yangtao Wang ; Lihui Chen

【Abstract】: Recently, an attractive clustering approach named multi-exemplar affinity propagation (MEAP) has been proposed as an extension to the single exemplar based Affinity Propagation (AP). MEAP is able to automatically identify multiple exemplars for each cluster associated with a super exemplar. However, if the cluster number is a prior knowledge and can be specified by the user, MEAP is unable to make use of such knowledge directly in its learning process. Instead it has to rely on re-running the process as many times as it takes by tuning parameters until it generates the desired number of clusters. The process of MEAP re-running may be very time consuming. In this paper, we propose a new clustering algorithm called KMEAP which is able to generate specified K clusters directly while retaining the advantages of MEAP. Two kinds of new additional messages are introduced in MEAP in order to control the number of clusters in the process of message passing. The detailed problem formulation, the derived updating rules for passing messages, and the in-depth analysis of the proposed K-MEAP are provided. Experimental studies demonstrated that K-MEAP not only generates K clusters directly and efficiently without tuning parameters, but also outperforms related approaches in terms of clustering accuracy.

【Keywords】: message passing; pattern clustering; K-MEAP; attractive clustering approach; message passing; multiexemplar affinity propagation; specified K clusters; Accuracy; Clustering algorithms; Couplings; Linear programming; Message passing; Time complexity; Tuning; affinity propagation; clustering; multiple exemplars

137. Naive-Bayes Inspired Effective Pre-Conditioner for Speeding-Up Logistic Regression.

【Paper Link】【Pages】:1097-1102

【Authors】: Nayyar A. Zaidi ; Mark James Carman ; Jesús Cerquides ; Geoffrey I. Webb

【Abstract】: We propose an alternative parameterization of Logistic Regression (LR) for the categorical data, multi-class setting. LR optimizes the conditional log-likelihood over the training data and is based on an iterative optimization procedure to tune this objective function. The optimization procedure employed may be sensitive to scale and hence an effective pre-conditioning method is recommended. Many problems in machine learning involve arbitrary scales or categorical data (where simple standardization of features is not applicable). The problem can be alleviated by using optimization routines that are invariant to scale such as (second-order) Newton methods. However, computing and inverting the Hessian is a costly procedure and not feasible for big data. Thus one must often rely on first-order methods such as gradient descent (GD), stochastic gradient descent (SGD) or approximate second-order such as quasi-Newton (QN) routines, which are not invariant to scale. This paper proposes a simple yet effective pre-conditioner for speeding-up LR based on naive Bayes conditional probability estimates. The idea is to scale each attribute by the log of the conditional probability of that attribute given the class. This formulation substantially speeds-up LR's convergence. It also provides a weighted naive Bayes formulation which yields an effective framework for hybrid generative-discriminative classification.

【Keywords】: Bayes methods; Newton method; convergence; learning (artificial intelligence); optimisation; pattern classification; regression analysis; LR convergence; categorical data; conditional log-likelihood; hybrid generative-discriminative classification; iterative optimization; logistic regression; machine learning; multiclass setting; naive Bayes conditional probability estimates; naive-Bayes inspired effective preconditioner; optimization routines; parameterisation; preconditioning method; second-order Newton methods; weighted naive Bayes formulation; Convergence; Equations; Logistics; Mathematical model; Niobium; Optimization; Training; classification; discriminative-generative learning; logistic regression; pre-conditioning; stochastic gradient descent; weighted naive Bayes

138. Multi-view Clustering via Multi-manifold Regularized Nonnegative Matrix Factorization.

【Paper Link】【Pages】:1103-1108

【Authors】: Xianchao Zhang ; Long Zhao ; Linlin Zong ; Xinyue Liu ; Hong Yu

【Abstract】: Multi-view clustering integrates complementary information from multiple views to gain better clustering performance rather than relying on a single view. NMF based multi-view clustering algorithms have shown their competitiveness among different multi-view clustering algorithms. However, NMF fails to preserve the locally geometrical structure of the data space. In this paper, we propose a multi-manifold regularized nonnegative matrix factorization framework (MMNMF) which can preserve the locally geometrical structure of the manifolds for multi-view clustering. MMNMF regards that the intrinsic manifold of the dataset is embedded in a convex hull of all the views' manifolds, and incorporates such an intrinsic manifold and an intrinsic (consistent) coefficient matrix with a multi-manifold regularizer to preserve the locally geometrical structure of the multi-view data space. We use linear combination to construct the intrinsic manifold, and propose two strategies to find the intrinsic coefficient matrix, which lead to two instances of the framework. Experimental results show that the proposed algorithms outperform existing NMF based algorithms for multi-view clustering.

【Keywords】: matrix decomposition; pattern clustering; MMNMF; geometrical structure; intrinsic coefficient matrix; linear combination; multimanifold regularized nonnegative matrix factorization; multiview clustering; Approximation methods; Clustering algorithms; Convergence; Educational institutions; Linear programming; Manifolds; Matrix decomposition

139. Investment Recommendation in P2P Lending: A Portfolio Perspective with Risk Management.

【Paper Link】【Pages】:1109-1114

【Authors】: Hongke Zhao ; Le Wu ; Qi Liu ; Yong Ge ; Enhong Chen

【Abstract】: P2P lending is an online platform to make borrowing and investment transactions. A central question on these platforms is how to align the right products with the right investors, thus helping investors to make better decisions. Along this line, tremendous efforts have been devoted to modeling the credits of products and borrowers from an economic perspective. However, these global models are only exploratory in nature and are not practical. In this paper, we focus on the personalized investment recommendation by reconstructing the two steps for investment decision making: what to buy and how much money to pay. Specifically, we first generate a candidate investment recommendation list for each investor that tackles "what to buy" problem. In this process, we consider various unique properties of investment recommendation. Furthermore, according to the portfolio theory, we optimize the shares of each recommended candidate by incorporating the investments an investor currently holds, thus solving the "how much money to pay" problem. Finally, extensive experimental results on a large-scale real world dataset show the effectiveness of our model under various evaluation metrics.

【Keywords】: decision making; financial data processing; investment; peer-to-peer computing; recommender systems; risk management; P2P lending; borrowing transaction; economic perspective; evaluation metrics; global models; how much money to pay problem; investment decision making; investment transaction; personalized investment recommendation; portfolio perspective; risk management; Biological system modeling; Context; Equations; Investment; Mathematical model; Measurement; Portfolios; Investment Recommendation; P2P Lending; Portfolio Perspective

140. Predicting the Geographical Origin of Music.

【Paper Link】【Pages】:1115-1120

【Authors】: Fang Zhou ; Claire Q ; Ross D. King

【Abstract】: Traditional research into the arts has almost always been based around the subjective judgment of human critics. The use of data mining tools to understand art has great promise as it is objective and operational. We investigate the distribution of music from around the world: geographical ethnomusicology. We cast the problem as training a machine learning program to predict the geographical origin of pieces of music. This is a technically interesting problem as it has features of both classification and regression, and because of the spherical geometry of the surface of the Earth. Because of these characteristics of the representation of geographical positions, most standard classification/regression methods cannot be directly used. Two applicable methods are K-Nearest Neighbors and Random forest regression, which are robust to the non-standard structure of data. We also investigated improving performance through use of bagging. We collected 1,142 pieces of music from 73 countries/areas, and described them using 2 different sets of standard audio descriptors using MARSYAS. 10-fold cross validation was used in all experiments. The experimental results indicate that Random forest regression produces significantly better results than KNN, and the use of bagging improves the performance of KNN. The best performing algorithm achieved a mean great circle distance error of 3,113 km.

【Keywords】: data mining; geography; learning (artificial intelligence); music; pattern classification; regression analysis; data mining; geographical ethnomusicology; k-nearest neighbor method; kNN; machine learning program; music geographical origin; random forest regression method; Art; Earth; Educational institutions; Feature extraction; Standards; Training; Training data; geographical ethnomusicology; random forest regression; regression

141. Co-Clustering Structural Temporal Data with Applications to Semiconductor Manufacturing.

【Paper Link】【Pages】:1121-1126

【Authors】: Yada Zhu ; Jingrui He

【Abstract】: Recent years have witnessed data explosion in semiconductor manufacturing due to advances in instrumentation and storage techniques. In particular, following the same recipe for a certain IC device, multiple tools and chambers can be deployed for the production of this device, during which multiple time series can be collected, such as temperature, impedance, gas flow, electric bias, etc. These time series naturally fit into a two-dimensional array (matrix), i.e., Each element in this array corresponds to a time series for one process variable from one chamber. To leverage the rich structural information in such temporal data, in this paper, we propose a novel framework named C-Struts to simultaneously cluster on the two dimensions of this array. In this framework, we interpret the structural information as a set of constraints on the cluster membership, introduce an auxiliary probability distribution accordingly, and design an iterative algorithm to assign each time series to a certain cluster on each dimension. To the best of our knowledge, we are the first to address this problem. Extensive experiments on benchmark and manufacturing data sets demonstrate the effectiveness of the proposed method.

【Keywords】: instrumentation; manufacturing data processing; pattern clustering; semiconductor industry; semiconductor technology; statistical distributions; time series; 2D array; IC device; auxiliary probability distribution; cluster membership; coclustering structural temporal data; data explosion; electric bias; instrumentation; iterative algorithm; manufacturing data sets; semiconductor manufacturing; storage techniques; structural information; time series; Arrays; Clustering algorithms; Manufacturing; Probability distribution; Process control; Prototypes; Time series analysis; co-clustering; structural; temporal

142. Mining Query-Based Subnetwork Outliers in Heterogeneous Information Networks.

【Paper Link】【Pages】:1127-1132

【Authors】: Honglei Zhuang ; Jing Zhang ; George Brova ; Jie Tang ; Hasan Çam ; Xifeng Yan ; Jiawei Han

【Abstract】: Mining outliers in a heterogeneous information network is a challenging problem: It is even unclear what should be outliers in a large heterogeneous network (e.g., Outliers in the entire bibliographic network consisting of authors, titles, papers and venues). In this study, we propose an interesting class of outliers, query-based sub network outliers: Given a heterogeneous network, a user raises a query to retrieve a set of task-relevant sub networks, among which, sub network outliers are those that significantly deviate from others (e.g., Outliers of author groups among those studying "topic modeling"). We formalize this problem and propose a general framework, where one can query for finding sub network outliers with respect to different semantics. We introduce the notion of sub network similarity that captures the proximity between two sub networks by their membership distributions. We propose an outlier detection algorithm to rank all the sub networks according to their outlierness without tuning parameters. Our quantitative and qualitative experiments on both synthetic and real data sets show that the proposed method outperforms other baselines.

【Keywords】: bibliographic systems; data mining; information networks; query processing; bibliographic network; heterogeneous information networks; membership distributions; outlier detection algorithm; query-based subnetwork outlier mining; subnetwork similarity; task-relevant subnetworks; topic modeling; tuning parameters; Computer science; Data mining; Linear programming; Mathematical model; Patents; Silicon; Vectors; heterogeneous information network; outlier detection; query-based

143. Automated Essay Evaluation Augmented with Semantic Coherence Measures.

【Paper Link】【Pages】:1133-1138

【Authors】: Kaja Zupanc ; Zoran Bosnic

【Abstract】: Manual grading of students' essays is a time-consuming, labor-intensive and expensive activity for educational institutions. It is nevertheless necessary since essays are considered to be the most useful tool to assess learning outcomes. Automated essay evaluation represents a practical solution to this task, however, its main weakness is predominant focus on vocabulary and text syntax, and limited consideration of text semantics. In this work, we propose an extension to existing automated essay evaluation systems that incorporates additional semantic attributes. We design the novel attributes by transforming sequential parts of an essay into the semantic space and measuring changes between them to estimate coherence of the text. The resulting system (called SAGE - Semantic Automated Grader for Essays) achieves significantly higher grading accuracy compared with 8 other state-of-the-art automated essay evaluation systems.

【Keywords】: computer aided instruction; educational institutions; natural language processing; text analysis; SAGE; automated essay evaluation; educational institution; manual grading; semantic automated grader for essay; semantic coherence measures; student essay; text semantics; text syntax; vocabulary; Coherence; Correlation; Dispersion; Extraterrestrial measurements; Pragmatics; Semantics; Weight measurement; Automated Scoring; Essay Evaluation; Natural Language Processing; Semantic Attributes