11. ICDM 2011:Vancouver, BC, Canada

11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011. IEEE Computer Society 【DBLP Link】

Paper Num: 148 || Session Num: 2

Regular Papers 101
Short Papers 47

Regular Papers 101

1. Algorithms for Mining the Evolution of Conserved Relational States in Dynamic Networks.

【Paper Link】【Pages】:1-10

【Authors】: Rezwan Ahmed ; George Karypis

【Abstract】: Dynamic networks have recently being recognized as a powerful abstraction to model and represent the temporal changes and dynamic aspects of the data underlying many complex systems. Significant insights regarding the stable relational patterns among the entities can be gained by analyzing temporal evolution of the complex entity relations. This can help identify the transitions from one conserved state to the next and may provide evidence to the existence of external factors that are responsible for changing the stable relational patterns in these networks. This paper presents a new data mining method that analyzes the time-persistent relations or states between the entities of the dynamic networks and captures all maximal non-redundant evolution paths of the stable relational states. Experimental results based on multiple datasets from real world applications show that the method is efficient and scalable.

【Keywords】: data mining; complex entity relations; complex systems; conserved relational states; data mining method; dynamic networks; maximal nonredundant evolution paths; relational patterns; temporal evolution mining; time-persistent relations; Algorithm design and analysis; Data mining; Electronic mail; Equations; Heuristic algorithms; Patents; Silicon; Dynamic network; evolution; relational state

2. Infrastructure Pattern Discovery in Configuration Management Databases via Large Sparse Graph Mining.

【Paper Link】【Pages】:11-20

【Authors】: Pranay Anchuri ; Mohammed J. Zaki ; Omer Barkol ; Ruth Bergman ; Yifat Felder ; Shahar Golan ; Arik Sityon

【Abstract】: A configuration management database (CMDB) can be considered to be a large graph representing the IT infrastructure entities and their inter-relationships. Mining such graphs is challenging because they are large, complex, and multi-attributed, and have many repeated labels. These characteristics pose challenges for graph mining algorithms, due to the increased cost of sub graph isomorphism (for support counting), and graph isomorphism (for eliminating duplicate patterns). The notion of pattern frequency or support is also more challenging in a single graph, since it has to be defined in terms of the number of its (potentially, exponentially many) embeddings. We present CMDB-Miner, a novel two-step method for mining infrastructure patterns from CMDB graphs. It first samples the set of maximal frequent patterns, and then clusters them to extract the representative infrastructure patterns. We demonstrate the effectiveness of CMDB-Miner on real-world CMDB graphs.

【Keywords】: business data processing; data mining; database management systems; graph theory; organisational aspects; CMDB-Miner; IT infrastructure; configuration management databases; infrastructure pattern discovery; large sparse graph mining; maximal frequent patterns; representative infrastructure patterns; Data mining; Databases; Entropy; Image edge detection; Organizations; Servers; Upper bound; configuration management databases; frequent subgraphs; single graph mining

3. Role-Behavior Analysis from Trajectory Data by Cross-Domain Learning.

【Paper Link】【Pages】:21-30

【Authors】: Shin Ando ; Einoshin Suzuki

【Abstract】: Behavior analysis using trajectory data presents a practical and interesting challenge for KDD. Conventional analyses address discriminative tasks of behaviors, e.g., classification and clustering typically using the subsequences extracted from the trajectory of an object as a numerical feature representation. In this paper, we explore further to identify the difference in the high-level semantics of behaviors such as roles and address the task in a cross-domain learning approach. The trajectory, from which the features are sampled, is intuitively viewed as a domain, and we assume that its intrinsic structure is characterized by the underlying role associated with the tracked object. We propose a novel hybrid method of spectral clustering and density approximation for comparing clustering structures of two independently sampled trajectory data and identifying patterns of behaviors unique to a role. We present empirical evaluations of the proposed method in two practical settings using real-world robotic trajectories.

【Keywords】: learning (artificial intelligence); pattern clustering; robots; KDD; behavior discriminative tasks; behavior high-level semantics; behavior pattern identification; cross-domain learning approach; density approximation; robotic trajectories; role-behavior analysis; spectral clustering; trajectory data; Conferences; Data mining; density-based outlier detection; time-series subsequence clustering; trajectory data mining; transfer learning

4. Semi-supervised Feature Importance Evaluation with Ensemble Learning.

【Paper Link】【Pages】:31-40

【Authors】: Hasna Barkia ; Haytham Elghazel ; Alex Aussem

【Abstract】: We consider the problem of using a large amount of unlabeled data to improve the efficiency of feature selection in high dimensional datasets, when only a small set of labeled examples is available. We propose a new semi-supervised feature importance evaluation method (SSFI for short), that combines ideas from co-training and random forests with a new permutation-based out-of-bag feature importance measure. We provide empirical results on several benchmark datasets indicating that SSFI can lead to significant improvement over state-of-the-art semi-supervised and supervised algorithms.

【Keywords】: learning (artificial intelligence); pattern classification; cotraining; ensemble learning; feature selection; high dimensional datasets; permutation-based out-of-bag feature importance measure; random forests; semisupervised feature importance evaluation; unlabeled data; Bagging; Labeling; Manifolds; Prediction algorithms; Radio frequency; Training; Vectors; Bagging; Co-training; Ensemble Method; Feature Selection; Random Subspaces Method; Semi-Supervised Learning

5. COMET: A Recipe for Learning and Using Large Ensembles on Massive Data.

【Paper Link】【Pages】:41-50

【Authors】: Justin D. Basilico ; M. Arthur Munson ; Tamara G. Kolda ; Kevin R. Dixon ; W. Philip Kegelmeyer

【Abstract】: COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably (in both accuracy and training time) to learning on a sub sample of data using a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation which dynamically decides how many ensemble members to evaluate per data point, this can reduce evaluation cost by 100X or more.

【Keywords】: Gaussian processes; cost reduction; data handling; decision trees; distributed processing; learning (artificial intelligence); COMET; Gaussian approach; IVoting; data distributed blocks; decision tree; evaluation cost reduction; importance-sampled voting; lazy ensemble evaluation; massive data learning; multiple random forest; serial algorithm; single-pass MapReduce algorithm; training subset generation; Accuracy; Bagging; Computational modeling; Decision trees; Distributed databases; Training; Vegetation; Decision Tree Ensembles; Lazy Ensemble Evaluation; MapReduce; Massive Data

6. Overlapping Correlation Clustering.

【Paper Link】【Pages】:51-60

【Authors】: Francesco Bonchi ; Aristides Gionis ; Antti Ukkonen

【Abstract】: We introduce a new approach to the problem of overlapping clustering. The main idea is to formulate overlapping clustering as an optimization problem in which each data point is mapped to a small set of labels, representing membership to different clusters. The objective is to find a mapping so that the distances between data points agree as much as possible with distances taken over their label sets. To define distances between label sets, we consider two measures: a set-intersection indicator function and the Jaccard coefficient. To solve the main optimization problem we propose a local-search algorithm. The iterative step of our algorithm requires solving non-trivial optimization sub problems, which, for the measures of set-intersection and Jaccard, we solve using a greedy method and non-negative least squares, respectively. Since our frameworks uses pair wise similarities of objects as the input, it lends itself naturally to the task of clustering structured objects for which feature vectors can be difficult to obtain. As a proof of concept we show how easily our framework can be applied in two different complex application domains. Firstly, we develop overlapping clustering of animal trajectories, obtaining zoologically meaningful results. Secondly, we apply our framework for overlapping clustering of proteins based on pair wise similarities of amino acid sequences, outperforming the of state-of-the-art method in matching a ground truth taxonomy.

【Keywords】: greedy algorithms; least squares approximations; optimisation; pattern clustering; search problems; set theory; Jaccard coefficient; data points; greedy method; label sets; local-search algorithm; membership representation; nonnegative least squares; nontrivial optimization sub problems; optimization problem; overlapping correlation clustering; set-intersection indicator function; Algorithm design and analysis; Clustering algorithms; Correlation; Equations; Labeling; Optimization; Proteins

7. Learning with Minimum Supervision: A General Framework for Transductive Transfer Learning.

【Paper Link】【Pages】:61-70

【Authors】: Mohammad Taha Bahadori ; Yan Liu ; Dan Zhang

【Abstract】: Transductive transfer learning is one special type of transfer learning problem, in which abundant labeled examples are available in the source domain and only unlabeled examples are available in the target domain. It easily finds applications in spam filtering, microblogging mining and so on. In this paper, we propose a general framework to solve the problem by mapping the input features in both the source domain and target domain into a shared latent space and simultaneously minimizing the feature reconstruction loss and prediction loss. We develop one specific example of the framework, namely latent large-margin transductive transfer learning (LATTL) algorithm, and analyze its theoretic bound of classification loss via Rademacher complexity. We also provide a unified view of several popular transfer learning algorithms under our framework. Experiment results on one synthetic dataset and three application datasets demonstrate the advantages of the proposed algorithm over the other state-of-the-art ones.

【Keywords】: computational complexity; information networks; learning (artificial intelligence); LATTL; Rademacher complexity; large-margin transductive transfer learning algorithm; microblogging mining; source domain; spam filtering; target domain; Conferences; Data mining

8. Confidence in Predictions from Random Tree Ensembles.

【Paper Link】【Pages】:71-80

【Authors】: Siddhartha Bhattacharyya

【Abstract】: Obtaining an indication of confidence of predictions is desirable for many data mining applications. Such confidence levels, together with the predicted value, can inform on the certainty or extent of reliability that may be associated with the prediction. This can be useful, for example, where model outputs are used in making potentially costly decisions, and one may then focus on the higher confidence predictions, and in general across risk sensitive applications. The conformal prediction framework presents a novel approach for complementing predictions from machine learning algorithms with valid confidence measures. Confidence levels are obtained from the underlying algorithm, using a non-conformity measure which indicates how 'atypical' a given example set is. The non-conformity measure is key to determining the usefulness and efficiency of the approach. This paper considers inductive conformal prediction in the context of random tree ensembles like random forests, which have been noted to perform favorably across problems. Focusing on classification tasks, and considering realistic data contexts including class imbalance, we develop non-conformity measures for assessing the confidence of predicted class labels from random forests. We examine the performance of these measures on multiple datasets. Results demonstrate the usefulness and validity of the measures, their relative differences, and highlight the effectiveness of conformal prediction random forests for obtaining predictions with associated confidence.

【Keywords】: data mining; pattern classification; classification tasks; conformal prediction framework; data mining applications; prediction confidence; random forests; random tree ensembles; Calibration; Data mining; Extraterrestrial measurements; Predictive models; Training; Training data; Vegetation; Confidence; classification; conformal prediction; data mining; random forests

9. Mining Heavy Subgraphs in Time-Evolving Networks.

【Paper Link】【Pages】:81-90

【Authors】: Petko Bogdanov ; Misael Mongiovì ; Ambuj K. Singh

【Abstract】: Networks from different genres are not static entities, but exhibit dynamic behavior. The congestion level of links in transportation networks varies in time depending on the traffic. Similarly, social and communication links are employed at varying rates as information cascades unfold. In recent years there has been an increase of interest in modeling and mining dynamic networks. However, limited attention has been placed in high-scoring sub graph discovery in time-evolving networks. We define the problem of finding the highest-scoring temporal sub graph in a dynamic network, termed Heaviest Dynamic Sub graph (HDS). We show that HDS is NP-hard even with edge weights in {-1,1} and devise an efficient approach for large graph instances that evolve over long time periods. While a naive approach would enumerate all O(t²) sub-intervals, our solution performs an effective pruning of the sub-interval space by considering O(t·log(t)) groups of sub-intervals and computing an aggregate of each group in logarithmic time. We also define a fast heuristic and a tight upper bound for approximating the static version of HDS, and use them for further pruning the sub-interval space and quickly verifying candidate sub-intervals. We perform an extensive experimental evaluation of our algorithm on transportation, communication and social media networks for discovering sub graphs that correspond to traffic congestions, communication overflow and localized social discussions. Our method is two orders of magnitude faster than a naive approach and scales well with network size and time length.

【Keywords】: graph theory; optimisation; transportation; HDS; NP-hard problem; dynamic behavior; dynamic networks; graph discovery; heaviest dynamic sub graph; mining heavy subgraphs; social media networks; time evolving networks; traffic congestions; transportation networks; Approximation algorithms; Biology; Complexity theory; Data mining; Steiner trees; Transportation; Upper bound; dynamic networks; heavy subgraph; pattern mining

10. Multi-Class L2, 1-Norm Support Vector Machine.

【Paper Link】【Pages】:91-100

【Authors】: Xiao Cai ; Feiping Nie ; Heng Huang ; Chris H. Q. Ding

【Abstract】: Feature selection is an essential component of data mining. In many data analysis tasks where the number of data point is much less than the number of features, efficient feature selection approaches are desired to extract meaningful features and to eliminate redundant ones. In the previous study, many data mining techniques have been applied to tackle the above challenging problem. In this paper, we propose a new ℓ_2,1-norm SVM, that is, multi-class hinge loss with a structured regularization term for all the classes to naturally select features for multi-class without bothering further heuristic strategy. Rather than directly solving the multi-class hinge loss with ℓ_2,1-norm regularization minimization, which has not been solved before due to its optimization difficulty, we are the first to give an efficient algorithm bridging the new problem with a previous solvable optimization problem to do multi-class feature selection. A global convergence proof for our method is also presented. Via the proposed efficient algorithm, we select features across multiple classes with jointly sparsity, i.e., each feature has either small or large score over all classes. Comprehensive experiments have been performed on six bioinformatics data sets to show that our method can obtain better or competitive performance compared with exiting state-of-art multi-class feature selection approaches.

【Keywords】: bioinformatics; data analysis; data mining; optimisation; support vector machines; bioinformatics data sets; data analysis tasks; data mining; multiclass L2,1-norm support vector machine; multiclass feature selection; multiclass hinge loss; optimization difficulty; solvable optimization problem; structured regularization term; Convergence; Data mining; Fasteners; Feature extraction; Lungs; Optimization; Support vector machines; Support Vector Machine; feature selection; multi-class feature selection

11. SolarMap: Multifaceted Visual Analytics for Topic Exploration.

【Paper Link】【Pages】:101-110

【Authors】: Nan Cao ; David Gotz ; Jimeng Sun ; Yu-Ru Lin ; Huamin Qu

【Abstract】: Documents in rich text corpora often contain multiple facets of information. For example, an article from a medical document collection might consist of multifaceted information about symptoms, treatments, causes, diagnoses, prognoses, and preventions. Thus, documents in the collection may have different relations across each of these various facets. Topic analysis and exploration for such multi-relational corpora is a challenging visual analytic task. This paper presents Solar Map, a multifaceted visual analytic technique for visually exploring topics in multi-relational data. Solar Map simultaneously visualizes the topic distribution of the underlying entities from one facet together with keyword distributions that convey the semantic definition of each cluster along a secondary facet. Solar Map combines several visual techniques including 1) topic contour clusters and interactive multifaceted keyword topic rings, 2) a global layout optimization algorithm that aligns each topic cluster with its corresponding keywords, and 3) 2) an optimal temporal network segmentation and layout method that renders temporal evolution of clusters. Finally, the paper concludes with two case studies and quantitative user evaluation which show the power of the Solar Map technique.

【Keywords】: data visualisation; text analysis; SolarMap; keyword distributions; medical document collection; multifaceted visual analytic technique; optimal temporal network segmentation; rich text corpora; topic analysis; topic exploration; Data models; Data visualization; Diseases; Kernel; Layout; Tag clouds; Visualization; Multifaceted Information Visualization; Temporal topic visualization; Visual Analytics

12. Efficiently Mining Unordered Trees.

【Paper Link】【Pages】:111-120

【Authors】: Mostafa Haghir Chehreghani

【Abstract】: Frequent tree patterns have many applications in different domains such as XML document mining, user web log analysis, network routing and bioinformatics. In this paper, we first introduce three new tree encodings and accordingly present an efficient algorithm for finding frequent patterns from rooted unordered trees with the assumption that children of every node in database trees are identically labeled. Then, we generalize the method and propose the UITree algorithm to find frequent patterns from rooted unordered trees without any restriction. Compared to other algorithms in the literature, UItree manages occurrences of a candidate tree in database trees more efficiently. Our extensive experiments on both real and synthetic datasets show that UITree significantly outperforms the most efficient existing works on mining unordered trees.

【Keywords】: XML; data mining; tree data structures; UITree algorithm; XML document mining; bioinformatics; database trees; efficiently mining unordered trees; frequent tree patterns; network routing; user web log analysis; Bioinformatics; Clustering algorithms; Data mining; Databases; Encoding; Routing; XML; Frequent tree patterns; candidate generation; frequency counting; rooted unordered trees; tree encoding

13. CEMiner - An Efficient Algorithm for Mining Closed Patterns from Time Interval-Based Data.

【Paper Link】【Pages】:121-130

【Authors】: Yi-Cheng Chen ; Wen-Chih Peng ; Suh-Yin Lee

【Abstract】: The mining of closed sequential patterns has attracted researchers for its capability of using compact results to preserve the same expressive power as conventional mining. However, existing studies only focus on time point-based data. Few research efforts have elaborated on discovering closed sequential patterns from time interval-based data, where each data persists for a period of time. Mining closed time interval-based patterns, also called closed temporal patterns, is an arduous problem since the pair wise relationships between two interval-based events are intrinsically complex. In this paper, an efficient algorithm, CEMiner is developed to discover closed temporal patterns from interval-based data. Algorithm CEMiner employs some optimization techniques to effectively reduce the search space. The experimental results on both synthetic and real datasets indicate that CEMiner not only significantly outperforms the prior interval-based mining algorithms in terms of execution time but also possesses graceful scalability. The experiment conducted on real dataset shows the practicability of time interval-based closed pattern mining.

【Keywords】: data mining; optimisation; CEMiner; closed sequential patterns mining; interval based events; optimization techniques; temporal patterns; time interval based data; time point based data; Algorithm design and analysis; Data mining; Diseases; Finishing; Itemsets; Scalability; closed temporal pattern; endpoint representation; sequential pattern mining; time interval-based data

14. LinkBoost: A Novel Cost-Sensitive Boosting Framework for Community-Level Network Link Prediction.

【Paper Link】【Pages】:131-140

【Authors】: Prakash Mandayam Comar ; Pang-Ning Tan ; Anil K. Jain

【Abstract】: Link prediction is a challenging task due to the inherent skew ness of network data. Typical link prediction methods can be categorized as either local or global. Local methods consider the link structure in the immediate neighborhood of a node pair to determine the presence or absence of a link, whereas global methods utilize information from the whole network. This paper presents a community (cluster) level link prediction method without the need to explicitly identify the communities in a network. Specifically, a variable-cost loss function is defined to address the data skew ness problem. We provide theoretical proof that shows the equivalence between maximizing the well-known modularity measure used in community detection and minimizing a special case of the proposed loss function. As a result, any link prediction method designed to optimize the loss function would result in more links being predicted within a community than between communities. We design a boosting algorithm to minimize the loss function and present an approach to scale-up the algorithm by decomposing the network into smaller partitions and aggregating the weak learners constructed from each partition. Experimental results show that our proposed Link Boost algorithm consistently performs as good as or better than many existing methods when evaluated on 4 real-world network datasets.

【Keywords】: data handling; network theory (graphs); LinkBoost; boosting algorithm; community level network link prediction; novel cost sensitive boosting framework; Algorithm design and analysis; Boosting; Communities; Loss measurement; Partitioning algorithms; Prediction algorithms; Predictive models; Boosting; Community Detection; Link Prediction; Modularity; Social Networks

15. Learning Dirichlet Processes from Partially Observed Groups.

【Paper Link】【Pages】:141-150

【Authors】: Avinava Dubey ; Indrajit Bhattacharya ; Mrinal Kanti Das ; Tanveer A. Faruquie ; Chiranjib Bhattacharyya

【Abstract】: Motivated by the task of vernacular news analysis using known news topics from national news-papers, we study the task of topic analysis, where given source datasets with observed topics, data items from a target dataset need to be assigned either to observed source topics or to new ones. Using Hierarchical Dirichlet Processes for addressing this task imposes unnecessary and often inappropriate generative assumptions on the observed source topics. In this paper, we explore Dirichlet Processes with partially observed groups (POG-DP). POG-DP avoids modeling the given source topics. Instead, it directly models the conditional distribution of the target data as a mixture of a Dirichlet Process and the posterior distribution of a Hierarchical Dirichlet Process with known groups and topics. This introduces coupling between selection probabilities of all topics within a source, leading to effective identification of source topics. We further improve on this with a Combinatorial Dirichlet Process with partially observed groups (POG-CDP) that captures finer grained coupling between related topics by choosing intersections between sources. We evaluate our models in three different real-world applications. Using extensive experimentation, we compare against several baselines to show that our model performs significantly better in all three applications.

【Keywords】: information resources; probability; combinatorial dirichlet process; combinatorial dirichlet process with partially observed groups; national news-papers; news topics; selection probabilities; topic analysis; vernacular news analysis; Companies; Couplings; Data models; Equations; Hidden Markov models; Inference algorithms; Mathematical model; Dirichlet Process; grouped data; partial observations; topic analysis

16. Exploiting False Discoveries - Statistical Validation of Patterns and Quality Measures in Subgroup Discovery.

【Paper Link】【Pages】:151-160

【Authors】: Wouter Duivesteijn ; Arno J. Knobbe

【Abstract】: Subgroup discovery suffers from the multiple comparisons problem: we search through a large space, hence whenever we report a set of discoveries, this set will generally contain false discoveries. We propose a method to compare subgroups found through subgroup discovery with a statistical model we build for these false discoveries. We determine how much the subgroups we find deviate from the model, and hence statistically validate the found subgroups. Furthermore we propose to use this subgroup validation to objectively compare quality measures used in subgroup discovery, by determining how much the top subgroups we find with each measure deviate from the statistical model generated with that measure. We thus aim to determine how good individual measures are in selecting significant findings. We invoke our method to experimentally compare popular quality measures in several subgroup discovery settings.

【Keywords】: data mining; statistical analysis; false discoveries; quality measures; statistical model; statistical pattern validation; subgroup discovery; Association rules; Complexity theory; Histograms; Search problems; Silicon; Size measurement; Statistical validation; subgroup discovery

17. An Efficient Greedy Method for Unsupervised Feature Selection.

【Paper Link】【Pages】:161-170

【Authors】: Ahmed K. Farahat ; Ali Ghodsi ; Mohamed S. Kamel

【Abstract】: In data mining applications, data instances are typically described by a huge number of features. Most of these features are irrelevant or redundant, which negatively affects the efficiency and effectiveness of different learning algorithms. The selection of relevant features is a crucial task which can be used to allow a better understanding of data or improve the performance of other learning tasks. Although the selection of relevant features has been extensively studied in supervised learning, feature selection with the absence of class labels is still a challenging task. This paper proposes a novel method for unsupervised feature selection, which efficiently selects features in a greedy manner. The paper first defines an effective criterion for unsupervised feature selection which measures the reconstruction error of the data matrix based on the selected subset of features. The paper then presents a novel algorithm for greedily minimizing the reconstruction error based on the features selected so far. The greedy algorithm is based on an efficient recursive formula for calculating the reconstruction error. Experiments on real data sets demonstrate the effectiveness of the proposed algorithm in comparison to the state-of-the-art methods for unsupervised feature selection.

【Keywords】: data mining; feature extraction; greedy algorithms; matrix algebra; unsupervised learning; data instances; data matrix reconstruction error; data mining; greedy method; learning algorithms; learning tasks; performance improvement; supervised learning; unsupervised feature selection; Clustering algorithms; Feature extraction; Greedy algorithms; Laplace equations; Matrices; Principal component analysis; Vectors; Feature Selection; Greedy Algorithms; Unsupervised Learning

18. Structured Feature Selection and Task Relationship Inference for Multi-task Learning.

【Paper Link】【Pages】:171-180

【Authors】: Hongliang Fei ; Jun Huan

【Abstract】: Multi-task Learning (MTL) aims to enhance the generalization performance of supervised regression or classification by learning multiple related tasks simultaneously. In this paper, we aim to extend the current MTL techniques to high dimensional data sets with structured input and structured output (SISO), where the SI means the input features are structured and the SO means the tasks are structured. We investigate a completely ignored problem in MTL with SISO data: the interaction of structured feature selection and task relationship modeling. We hypothesize that combining the structure information of features and task relationship inference enables us to build more accurate MTL models. Based on the hypothesis, we have designed an efficient learning algorithm, in which we utilize a task covariance matrix related to the model parameters to capture the task relationship. In addition, we design a regularization formulation for incorporating the structure of features in MTL. We have developed an efficient iterative optimization algorithm to solve the corresponding optimization problem. Our algorithm is based on the accelerated first order gradient method in conjunction with the projected gradient scheme. Using two real-world data sets, the experimental results demonstrate the utility of the proposed learning methods.

【Keywords】: covariance matrices; data handling; feature extraction; gradient methods; inference mechanisms; learning (artificial intelligence); optimisation; regression analysis; MTL techniques; SISO data; covariance matrix; data sets; first order gradient method; iterative optimization algorithm; multitask learning; projected gradient scheme; regularization formulation; structured feature selection; supervised classification; supervised regression; task relationship inference; task relationship modeling; Acceleration; Cancer; Convergence; Covariance matrix; Data models; Optimization; Silicon; Multi-task Learning; Structural Sparsity; Structured Input and Structured Output; Task Relationship Inference

19. A Taxi Driving Fraud Detection System.

【Paper Link】【Pages】:181-190

【Authors】: Yong Ge ; Hui Xiong ; Chuanren Liu ; Zhi-Hua Zhou

【Abstract】: Advances in GPS tracking technology have enabled us to install GPS tracking devices in city taxis to collect a large amount of GPS traces under operational time constraints. These GPS traces provide unparallel opportunities for us to uncover taxi driving fraud activities. In this paper, we develop a taxi driving fraud detection system, which is able to systematically investigate taxi driving fraud. In this system, we first provide functions to find two aspects of evidences: travel route evidence and driving distance evidence. Furthermore, a third function is designed to combine the two aspects of evidences based on Dempster-Shafer theory. To implement the system, we first identify interesting sites from a large amount of taxi GPS logs. Then, we propose a parameter-free method to mine the travel route evidences. Also, we introduce route mark to represent a typical driving path from an interesting site to another one. Based on route mark, we exploit a generative statistical model to characterize the distribution of driving distance and identify the driving distance evidences. Finally, we evaluate the taxi driving fraud detection system with large scale real-world taxi GPS logs. In the experiments, we uncover some regularity of driving fraud activities and investigate the motivation of drivers to commit a driving fraud by analyzing the produced taxi fraud data.

【Keywords】: Global Positioning System; driver information systems; fraud; statistical analysis; Dempster-Shafer theory; driving distance evidence; generative statistical model; parameter-free method; taxi driving fraud detection system; travel route evidence; Encoding; Estimation; Global Positioning System; Independent component analysis; Trajectory; Vectors; Vehicles; Dempster-Shafer Theory; Location Traces; Taxi Driving Fraud

20. Isograph: Neighbourhood Graph Construction Based on Geodesic Distance for Semi-supervised Learning.

【Paper Link】【Pages】:191-200

【Authors】: Marjan Ghazvininejad ; Mostafa Mahdieh ; Hamid R. Rabiee ; Parisa Khanipour Roshan ; Mohammad Hossein Rohban

【Abstract】: Semi-supervised learning based on manifolds has been the focus of extensive research in recent years. Convenient neighbourhood graph construction is a key component of a successful semi-supervised classification method. Previous graph construction methods fail when there are pairs of data points that have small Euclidean distance, but are far apart over the manifold. To overcome this problem, we start with an arbitrary neighbourhood graph and iteratively update the edge weights by using the estimates of the geodesic distances between points. Moreover, we provide theoretical bounds on the values of estimated geodesic distances. Experimental results on real-world data show significant improvement compared to the previous graph construction methods.

【Keywords】: graph theory; learning (artificial intelligence); Euclidean distance; Isograph; data points; geodesic distance; neighbourhood graph construction; semisupervised learning; Data mining; Estimation; Euclidean distance; Image edge detection; Joining processes; Labeling; Manifolds; Geodesic distance; Graph Construction; Manifold; Semi-supervised Learning

21. D-cores: Measuring Collaboration of Directed Graphs Based on Degeneracy.

【Paper Link】【Pages】:201-210

【Authors】: Christos Giatsidis ; Dimitrios M. Thilikos ; Michalis Vazirgiannis

【Abstract】: Community detection and evaluation is an important task in graph mining. In many cases, a community is defined as a sub graph characterized by dense connections or interactions among its nodes. A large variety of measures have been proposed to evaluate the quality of such communities - in most cases ignoring the directed nature of edges. In this paper, we introduce novel metrics for evaluating the collaborative nature of directed graphs - a property not captured by the single node metrics or by other established community evaluation metrics. In order to accomplish this objective, we capitalize on the concept of graph degeneracy and define a novel D-core framework, extending the classic graph-theoretic notion of k-cores for undirected graphs to directed ones. Based on the D-core, which essentially can be seen as a measure of the robustness of a community under degeneracy, we devise a wealth of novel metrics used to evaluate graph collaboration features of directed graphs. We applied the D-core approach on large real-world graphs such as Wikipedia and DBLP and report interesting results at the graph as well at node level.

【Keywords】: graph theory; D-cores; collaboration measurement; community detection; community evaluation; directed graphs; graph degeneracy; graph mining; k-cores; Collaboration; Communities; Encyclopedias; Indexes; Internet; Measurement; Community evaluation metrics; Cores; Degeneracy; Graph Mining

22. SIMPATH: An Efficient Algorithm for Influence Maximization under the Linear Threshold Model.

【Paper Link】【Pages】:211-220

【Authors】: Amit Goyal ; Wei Lu ; Laks V. S. Lakshmanan

【Abstract】: There is significant current interest in the problem of influence maximization: given a directed social network with influence weights on edges and a number k, find k seed nodes such that activating them leads to the maximum expected number of activated nodes, according to a propagation model. Kempe et al. showed, among other things, that under the Linear Threshold Model, the problem is NP-hard, and that a simple greedy algorithm guarantees the best possible approximation factor in PTIME. However, this algorithm suffers from various major performance drawbacks. In this paper, we propose SIMPATH, an efficient and effective algorithm for influence maximization under the linear threshold model that addresses these drawbacks by incorporating several clever optimizations. Through a comprehensive performance study on four real data sets, we show that SIMPATH consistently outperforms the state of the art w.r.t. running time, memory consumption and the quality of the seed set chosen, measured in terms of expected influence spread achieved.

【Keywords】: approximation theory; computational complexity; greedy algorithms; optimisation; social networking (online); NP-hard problem; PTIME; SlMPATH; approximation factor; directed social network; greedy algorithm; influence maximization; linear threshold model; memory consumption; optimizations; propagation model; seed set quality; Algorithms; Computational modeling; Estimation; Greedy algorithms; Integrated circuit modeling; Mathematical model; Optimization; Influence Spread; Linear Threshold Model; Simple Path Enumeration; Social Networks; Viral Marketing

23. Cross Domain Random Walk for Query Intent Pattern Mining from Search Engine Log.

【Paper Link】【Pages】:221-230

【Authors】: Siyu Gu ; Jun Yan ; Lei Ji ; Shuicheng Yan ; Junshi Huang ; Ning Liu ; Ying Chen ; Zheng Chen

【Abstract】: Understanding search intents of users through their condensed short queries has attracted much attention both in academia and industry. The search intents of users are generally assumed to be associated with various query patterns, such as "MobileName price", where "MobileName" could be any named entity of mobile phone model and this pattern indicates that the user intends to buy a mobile phone. However, discovering the query intent patterns for general search is challenging mainly due to the difficulty in collecting sufficient training data for learning query patterns across a large number of searchable domains. In this work, we propose Cross Domain Random Walk (CDRW) algorithm, which is semi-supervised, to discover the query intent patterns across different domains from search engine click-through log data. Starting with some manually tagged seed queries in one or more independent domains, CDRW takes the query patterns as bridge and propagates the transition probability across domains to collect the query intent patterns among different domains based on the assumption that "users who have similar intent in different but similar domains will have high probability to share similar query patterns across domains". Different from classical random walk algorithms, CDRW walks across different domains to disseminate the shared knowledge in a transfer learning manner. Extensive experiment results on real log data of a commercial search engine well validate the effectiveness and efficiency of the proposed algorithm.

【Keywords】: data mining; learning (artificial intelligence); probability; query processing; random processes; search engines; MobileName price; cross domain random walk algorithm; mobile phone model; query intent pattern mining; search engine click-through log data; semisupervised learning; transfer learning; transition probability; Algorithm design and analysis; Bridges; Data mining; Manuals; Mobile handsets; Search engines; Training data; query intent pattern; random walk; semi-supervised learning; transfer learning

24. Flexible Fault Tolerant Subspace Clustering for Data with Missing Values.

【Paper Link】【Pages】:231-240

【Authors】: Stephan Günnemann ; Emmanuel Müller ; Sebastian Raubach ; Thomas Seidl

【Abstract】: In today's applications, data analysis tasks are hindered by many attributes per object as well as by faulty data with missing values. Subspace clustering tackles the challenge of many attributes by cluster detection in any subspace projection of the data. However, it poses novel challenges for handling missing values of objects, which are part of multiple subspace clusters in different projections of the data. In this work, we propose a general fault tolerance definition enhancing subspace clustering models to handle missing values. We introduce a flexible notion of fault tolerance that adapts to the individual characteristics of subspace clusters and ensures a robust parameterization. Allowing missing values in our model increases the computational complexity of subspace clustering. Thus, we prove novel monotonicity properties for an efficient computation of fault tolerant subspace clusters. Experiments on real and synthetic data show that our fault tolerance model yields high quality results even in the presence of many missing values. For repeatability, we provide all datasets and executables on our website.

【Keywords】: Web sites; computational complexity; data analysis; fault tolerant computing; pattern clustering; Website; cluster detection; computational complexity; data analysis tasks; fault tolerant subspace data clustering; missing values; monotonicity properties; robust parameterization; Adaptation models; Approximation methods; Computational modeling; Data mining; Databases; Fault tolerance; Fault tolerant systems; fault tolerance; incomplete data; missing values; subspace clustering

25. Heuristic Updatable Weighted Random Subspaces for Non-stationary Environments.

【Paper Link】【Pages】:241-250

【Authors】: T. Ryan Hoens ; Nitesh V. Chawla ; Robi Polikar

【Abstract】: Learning in non-stationary environments is an increasingly important problem in a wide variety of real-world applications. In non-stationary environments data arrives incrementally, however the underlying generating function may change over time. While there is a variety of research into such environments, the research mainly consists of detecting concept drift (and then relearning the model), or developing classifiers which adapt to drift incrementally. We introduce Heuristic Up datable Weighted Random Subspaces (HUWRS), a new technique based on the Random Subspace Method that detects drift in individual features via the use of Hellinger distance, a distributional divergence metric. Through the use of subspaces, HUWRS allows for a more fine-grained approach to dealing with concept drift which is robust to feature drift even without class labels. We then compare our approach to two state of the art algorithms, concluding that for a wide range of datasets and window sizes HUWRS outperforms the other methods.

【Keywords】: data mining; learning (artificial intelligence); random processes; Hellinger distance; distributional divergence matrix; fine-grained approach; heuristic updatable weighted random subspaces; nonstationary learning environments; Accuracy; Classification algorithms; Context; Data mining; Feature extraction; Testing; Training; Concept Drift; Hellinger Distance; Non-stationary learning; Random Subspaces

26. Learning Tags from Unsegmented Videos of Multiple Human Actions.

【Paper Link】【Pages】:251-259

【Authors】: Timothy M. Hospedales ; Shaogang Gong ; Tao Xiang

【Abstract】: Providing methods to support semantic interaction with growing volumes of video data is an increasingly important challenge for data mining. To this end, there has been some success in recognition of simple objects and actions in video, however most of this work requires strongly supervised training data. The supervision cost of these approaches therefore renders them economically non-scalable for real world applications. In this paper we address the problem of learning to annotate and retrieve semantic tags of human actions in realistic video data with sparsely provided tags of semantically salient activities. This is challenging because of (1) the multi-label nature of the learning problem and (2) realistic videos are often dominated by (semantically uninteresting) background activity un-supported by any tags of interest, leading to a strong irrelevant data problem. To address these challenges, we introduce a new topic model based approach to video tag annotation. Our model simultaneously learns a low dimensional representation of the video data, which dimensions are semantically relevant (supported by tags), and how to annotate videos with tags. Experimental evaluation on three different video action/activity datasets demonstrate the challenge of this problem, and value of our contribution.

【Keywords】: data mining; image segmentation; learning (artificial intelligence); object recognition; video retrieval; video signal processing; action recognition; activity dataset; background activity; human action; learning problem; object recognition; realistic video data mining; semantic tag retrieval; unsegmented video; video data mining; video data representation; video tag annotation; Data models; Humans; Predictive models; Training; Vectors; Videos; Visualization; action recognition; annotation; tag learning; topic model

27. Generating Breakpoint-based Timeline Overview for News Topic Retrospection.

【Paper Link】【Pages】:260-269

【Authors】: Po Hu ; Minlie Huang ; Peng Xu ; Weichang Li ; Adam K. Usadi ; Xiaoyan Zhu

【Abstract】: Though news readers can easily access a large number of news articles from the Internet, they can be overwhelmed by the quantity of information available, making it hard to get a concise, global picture of a news topic. In this paper we propose a novel method to address this problem. Given a set of articles for a given news topic, the proposed method models theme variation through time and identifies the breakpoints, which are time points when decisive changes occur. For each breakpoint, a brief summary is automatically constructed based on articles associated with the particular time point. Summaries are then ordered chronologically to form a timeline overview of the news topic. In this fashion, readers can easily track various news topics efficiently. We have conducted experiments on 15 popular topics in 2010. Empirical experiments show the effectiveness of our approach and its advantages over other approaches.

【Keywords】: Internet; information resources; Internet; breakpoint based timeline overview; news readers; news topic retrospection; Analytical models; Dispersion; Electronic mail; Google; Hidden Markov models; Markov processes; Search engines; Breakpoint; Data Mining; News Topic Retrospection; Text Mining

28. A Robust Clustering Algorithm Based on Aggregated Heat Kernel Mapping.

【Paper Link】【Pages】:270-279

【Authors】: Hao Huang ; Shinjae Yoo ; Hong Qin ; Dantong Yu

【Abstract】: Current spectral clustering algorithms suffer from both sensitivity to scaling parameter selection in similarity matrix construction, and data perturbation. This paper aims to improve robustness in clustering algorithms and combat these two limitations based on heat kernel theory. Heat kernel can statistically depict traces of random walk, so it has an intrinsic connection with diffusion distance, with which we can ensure robustness during any clustering process. By integrating heat distributed along time scale, we propose a novel method called Aggregated Heat Kernel (AHK) to measure the distance between each point pair in their eigen space. Using AHK and Laplace-Beltrami Normalization (LBN) we are able to apply an advanced noise-resisting robust spectral mapping to original dataset. Moreover it offers stability on scaling parameter tuning. Experimental results show that, compared to other popular spectral clustering methods, our algorithm can achieve robust clustering results on both synthetic and UCI real datasets.

【Keywords】: data mining; matrix algebra; pattern clustering; unsupervised learning; Laplace-Beltrami normalization; aggregated heat kernel mapping; data mining; data perturbation; diffusion distance; distance measurement; intrinsic connection; knowledge discovery; noise-resisting robust spectral mapping; robust clustering algorithm; scaling parameter selection; sensitivity parameter selection; similarity matrix construction; spectral clustering algorithms; unsupervised knowledge exploration; Clustering algorithms; Heating; Kernel; Laplace equations; Noise; Robustness; Sensitivity; Diffusion processes; Green's function methods; Spectral analysis

29. Patent Maintenance Recommendation with Patent Information Network Model.

【Paper Link】【Pages】:280-289

【Authors】: Xin Jin ; W. Scott Spangler ; Ying Chen ; Keke Cai ; Rui Ma ; Li Zhang ; Xian Wu ; Jiawei Han

【Abstract】: Patents are of crucial importance for businesses, because they provide legal protection for the invented techniques, processes or products. A patent can be held for up to 20 years. However, large maintenance fees need to be paid to keep it enforceable. If the patent is deemed not valuable, the owner may decide to abandon it by stopping paying the maintenance fees to reduce the cost. For large companies or organizations, making such decisions is difficult because too many patents need to be investigated. In this paper, we introduce the new patent mining problem of automatic patent maintenance prediction, and propose a systematic solution to analyze patents for recommending patent maintenance decision. We model the patents as a heterogeneous time-evolving information network and propose new patent features to build model for a ranked prediction on whether to maintain or abandon a patent. In addition, a network-based refinement approach is proposed to further improve the performance. We have conducted experiments on the large scale United States Patent and Trademark Office (USPTO) database which contains over four million granted patents. The results show that our technique can achieve high performance.

【Keywords】: data mining; decision support systems; organisational aspects; patents; recommender systems; United States patent and trademark office database; automatic patent maintenance prediction; businesses; invented techniques; legal protection; maintenance fees; organizations; patent information network model; patent maintenance decision; patent maintenance recommendation; patent mining problem; Companies; Feature extraction; Maintenance engineering; Patents; Predictive models; Writing; patent information network; patent maintenance; patent mining; prediction; ranking

30. S-preconditioner for Multi-fold Data Reduction with Guaranteed User-Controlled Accuracy.

【Paper Link】【Pages】:290-299

【Authors】: Ye Jin ; Sriram Lakshminarasimhan ; Neil Shah ; Zhenhuan Gong ; Choong-Seock Chang ; Jackie Chen ; Stéphane Ethier ; Hemanth Kolla ; Seung-Hoe Ku ; Scott Klasky ; Robert Latham ; Robert B. Ross ; Karen Schuchardt ; Nagiza F. Samatova

【Abstract】: The growing gap between the massive amounts of data generated by petascale scientific simulation codes and the capability of system hardware and software to effectively analyze this data necessitates data reduction. Yet, the increasing data complexity challenges most, if not all, of the existing data compression methods. In fact, lossless compression techniques offer no more than 10% reduction on scientific data that we have experience with, which is widely regarded as effectively incompressible. To bridge this gap, in this paper, we advocate a transformative strategy that enables fast, accurate, and multi-fold reduction of double-precision floating-point scientific data. The intuition behind our method is inspired by an effective use of preconditioners for linear algebra solvers optimized for a particular class of computational "dwarfs" (e.g., dense or sparse matrices). Focusing on a commonly used multi-resolution wavelet compression technique as the underlying "solver" for data reduction we propose the S-preconditioner, which transforms scientific data into a form with high global regularity to ensure a significant decrease in the number of wavelet coefficients stored for a segment of data. Combined with the subsequent EQ-calibrator, our resultant method (called S-Preconditioned EQ-Calibrated Wavelets (SPEQC-Wavelets)), robustly achieved a 4- to 5-fold data reduction-while guaranteeing user-defined accuracy of reconstructed data to be within 1% point-by-point relative error, lower than 0.01 Normalized RMSE, and higher than 0.99 Pearson Correlation. In this paper, we show the results we obtained by testing our method on six petascale simulation codes including fusion, combustion, climate, astrophysics, and subsurface groundwater in addition to 13 publicly available scientific datasets. We also demonstrate that application-driven data mining tasks performed on decompressed variables or their derived quantities produce results of comparable quality with the ones for- the original data.

【Keywords】: data analysis; data mining; data reduction; mean square error methods; wavelet transforms; Pearson correlation; S-preconditioner; application driven data mining; data analysis; data complexity; data reconstruction; double-precision floating-point scientific data; linear algebra solvers; lossless data compression techniques; multifold data reduction; multiresolution wavelet compression technique; petascale scientific simulation codes; system hardware capability; system software capability; transformative strategy; user-controlled accuracy; wavelet coefficients; Accuracy; Correlation; Data models; Indexes; Sparse matrices; Vectors; Wavelet transforms; data mining over decompressed data; data reduction; extreme-scale data analytics; in situ data analytics; preconditioners for data mining

31. Beyond 'Caveman Communities': Hubs and Spokes for Graph Compression and Mining.

【Paper Link】【Pages】:300-309

【Authors】: U. Kang ; Christos Faloutsos

【Abstract】: Given a real world graph, how should we lay-out its edges? How can we compress it? These questions are closely related, and the typical approach so far is to find clique-like communities, like the cavemen graph', and compress them. We show that the block-diagonal mental image of thecavemen graph' is the wrong paradigm, in full agreement with earlier results that real world graphs have no good cuts. Instead, we propose to envision graphs as a collection of hubs connecting spokes, with super-hubs connecting the hubs, and so on, recursively. Based on the idea, we propose the Slash Burn method (burn the hubs, and slash the remaining graph into smaller connected components). Our view point has several advantages: (a) it avoids the `no good cuts' problem, (b) it gives better compression, and (c) it leads to faster execution times for matrix-vector operations, which are the back-bone of most graph processing tools. Experimental results show that our Slash Burn method consistently outperforms other methods on all datasets, giving good compression and faster running time.

【Keywords】: data compression; data mining; graph theory; block-diagonal mental image; caveman community; cavemen graph; graph compression; graph mining; slashburn method; Algorithm design and analysis; Communities; Complexity theory; Cost function; Equations; Matrix decomposition; Vectors; Graph Compression; Graph Mining; Hubs and Spokes

32. Improving Product Classification Using Images.

【Paper Link】【Pages】:310-319

【Authors】: Anitha Kannan ; Partha Pratim Talukdar ; Nikhil Rasiwasia ; Qifa Ke

【Abstract】: Product classification in Commerce search (e.g., Google Product Search, Bing Shopping) involves associating categories to offers of products from a large number of merchants. The categorized offers are used in many tasks including product taxonomy browsing and matching merchant offers to products in the catalog. Hence, learning a product classifier with high precision and recall is of fundamental importance in order to provide high quality shopping experience. A product offer typically consists of a short textual description and an image depicting the product. Traditional approaches to this classification task is to learn a classifier using only the textual descriptions of the products. In this paper, we show that the use of images, a weaker signal in our setting, in conjunction with the textual descriptions, a more discriminative signal, can considerably improve the precision of the classification task, irrespective of the type of classifier being used. We present a novel classification approach, Confusion Driven Probabilistic Fusion++ (CDPF++), that is cognizant of the disparity in the discriminative power of different types of signals and hence makes use of the confusion matrix of dominant signal (text in our setting) to prudently leverage the weaker signal (image), for an improved performance. Our evaluation performed on data from a major Commerce search engine's catalog shows a 12% (absolute) improvement in precision at 100% coverage, and a 16% (absolute) improvement in recall at 90% precision compared to classifiers that only use textual description of products. In addition, CDPF++ also provides a more accurate classifier based only on the dominant signal (text) that can be used in situations in which only the dominant signal is available during application time.

【Keywords】: Internet; cataloguing; cognition; electronic commerce; image classification; learning (artificial intelligence); probability; product quality; retail data processing; search engines; text analysis; CDPF++; commerce search engine catalog; confusion driven probabilistic fusion++; confusion matrix; discriminative signal; disparity cognizant; dominant signal; product classification task; product classifier; product taxonomy browsing; shopping experience; textual description; Business; Catalogs; Computers; Probabilistic logic; Taxonomy; Training; Vocabulary; e-commerce; image; product classification; text

33. Learning Markov Logic Networks via Functional Gradient Boosting.

【Paper Link】【Pages】:320-329

【Authors】: Tushar Khot ; Sriraam Natarajan ; Kristian Kersting ; Jude W. Shavlik

【Abstract】: Recent years have seen a surge of interest in Statistical Relational Learning (SRL) models that combine logic with probabilities. One prominent example is Markov Logic Networks (MLNs). While MLNs are indeed highly expressive, this expressiveness comes at a cost. Learning MLNs is a hard problem and therefore has attracted much interest in the SRL community. Current methods for learning MLNs follow a two-step approach: first, perform a search through the space of possible clauses and then learn appropriate weights for these clauses. We propose to take a different approach, namely to learn both the weights and the structure of the MLN simultaneously. Our approach is based on functional gradient boosting where the problem of learning MLNs is turned into a series of relational functional approximation problems. We use two kinds of representations for the gradients: clause-based and tree-based. Our experimental evaluation on several benchmark data sets demonstrates that our new approach can learn MLNs as good or better than those found with state-of-the-art methods, but often in a fraction of the time.

【Keywords】: Markov processes; approximation theory; data handling; gradient methods; learning (artificial intelligence); statistical analysis; Markov logic networks; benchmark data sets; clause based gradients; functional gradient boosting; relational functional approximation problems; space search; statistical relational learning models; tree based gradients; Boosting; Grounding; Joints; Markov random fields; Regression tree analysis; Training; Ensemble learning; Probabilistic models; Relational Learning; Statistical Relational Learning

34. Signature Pattern Covering via Local Greedy Algorithm and Pattern Shrink.

【Paper Link】【Pages】:330-339

【Authors】: Hyungsul Kim ; Sungjin Im ; Tarek F. Abdelzaher ; Jiawei Han ; David Sheridan ; Shobha Vasudevan

【Abstract】: Pattern mining is a fundamental problem that has a wide range of applications. In this paper, we study the problem of finding a minimum set of signature patterns that explain all data. In the problem, we are given objects where each object has an item set and a label. A pattern is called a signature pattern if all objects with the pattern have the same label. This problem has many interesting applications such as assertion mining in hardware design and identifying failure causes from various log data. We show that the previous pattern mining methods are not suitable for mining signature patterns and identify the problems. Then we propose a novel pattern enumeration method which we call Pattern Shrink. Our method is strongly coupled with another novel method that is very similar to finding a local optimum with a negligible loss in performance. Our proposed methods show a speedup of more than ten times over the previous methods. Our methods are flexible enough to be extended to mining high confidence patterns, instead of signature patterns.

【Keywords】: data mining; greedy algorithms; set theory; assertion mining; failure cause identification; hardware design; local greedy algorithm; pattern mining methods; pattern shrink; set covering; signature pattern covering; Algorithm design and analysis; Approximation algorithms; Approximation methods; Data mining; Greedy algorithms; Hardware; Search problems; Local Greedy Algorithm; Pattern Covering; Set Covering; Signature Mining

35. TWITOBI: A Recommendation System for Twitter Using Probabilistic Modeling.

【Paper Link】【Pages】:340-349

【Authors】: Younghoon Kim ; Kyuseok Shim

【Abstract】: Twitter provides search services to help people find new users to follow by recommending popular users or their friends' friends. However, these services do not offer the most relevant users to follow for a user. Furthermore, Twitter does not provide yet the search services to find the most interesting tweet messages for a user either. In this paper, we propose TWITOBI, a recommendation system for Twitter using probabilistic modeling for collaborative filtering which can recommend top-K users to follow and top-K tweets to read for a user. Our novel probabilistic model utilizes not only tweet messages but also the relationships between users. We develop an estimation algorithm for learning our model parameters and present its parallelized algorithm using MapReduce to handle large data. Our performance study with real-life data sets confirms the effectiveness and scalability of our algorithms.

【Keywords】: collaborative filtering; learning (artificial intelligence); probability; recommender systems; social networking (online); MapReduce; TWITOBI; Twitter; collaborative filtering; estimation algorithm; learning; probabilistic modeling; real life data sets; recommendation system; search services; top-k users; Computational modeling; Data models; Equations; Mathematical model; Probabilistic logic; Probability distribution; Twitter; MapReduce; Twitter; collaborative filtering; probabilistic model; recommendation system

36. Maximum Entropy Modelling for Assessing Results on Real-Valued Data.

【Paper Link】【Pages】:350-359

【Authors】: Kleanthis-Nikolaos Kontonasios ; Jilles Vreeken ; Tijl De Bie

【Abstract】: Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation.

【Keywords】: data mining; maximum entropy methods; optimisation; probability; statistical analysis; MDL based histogram estimator; convex optimization technique; data mining; knowledge discovery process; marginal distribution; maximum entropy modelling; real-valued rectangular database; statistical assessment; Data mining; Data models; Databases; Entropy; Histograms; Probabilistic logic; Testing; Maximum Entropy modelling; background knowledge; hypothesis testing; swap randomizations

37. Local Models for Expectation-Driven Subgroup Discovery.

【Paper Link】【Pages】:360-369

【Authors】: Florian Lemmerich ; Frank Puppe

【Abstract】: Subgroup discovery (also known as Pattern Mining or Supervised Descriptive Rule Discovery) searches for descriptions of subsets in a dataset that differ from the total population with respect to a given target concept. In this paper we argue that in the traditional approach potentially interesting complex patterns with an unexpected relative increase of the target share remain undiscovered while on the other hand less surprising patterns are returned. Therefore, we present a generalized approach on subgroup discovery, in which the target share in the subgroup is not compared to the target share in the total population, but to the expectations a user has given the knowledge of more general (simpler) patterns. We claim that the resulting complex patterns are more interesting for the user and are less biased towards simpler patterns with a positive influence on the target concept. In order to estimate these expectations we utilize local models, i.e., fragments of Bayesian Networks. The proposed approach is evaluated using data from the UCI repository as well as on two totally different real world applications that investigate university student drop-out rates and identify spammers in a social book marking system.

【Keywords】: belief networks; data mining; learning (artificial intelligence); Bayesian Networks; complex patterns; expectation driven subgroup discovery; pattern mining; simpler patterns; social book marking system; supervised descriptive rule discovery; Additives; Bayesian methods; Data mining; Estimation; Itemsets; Knowledge engineering; Mathematical model; interestingness measures; pattern mining; subgroup discovery

38. Context-Aware Multi-instance Learning Based on Hierarchical Sparse Representation.

【Paper Link】【Pages】:370-377

【Authors】: Bing Li ; Weihua Xiong ; Weiming Hu

【Abstract】: Multi-instance learning (MIL), a variant of supervised learning framework, has been applied in many applications. More recently, researchers focus on two important issues for MIL: Instances' contextual structures representation in the same bag and online MIL schemes. In this paper, we present an effective context-aware multi-instance learning technique using a hierarchical sparse representation (HSR-MIL) that addresses the two challenges simultaneously. We firstly construct the inner contextual structure among instances in the same bag based on a novel sparse ε-graph. We then propose a graph kernel based sparse bag classifier through a modified kernel sparse coding in higher-dimension feature space. At last, the HSR-MIL approach is extended to achieve online learning manner with an incremental kernel matrix update scheme. The experiments on several data sets demonstrate that our method has better performances and online learning ability.

【Keywords】: graph theory; learning (artificial intelligence); ubiquitous computing; MIL; context aware multiinstance learning; contextual structures representation; graph kernel; hierarchical sparse representation; kernel matrix; online learning; sparse ε-graph; sparse bag classifier; supervised learning framework; Encoding; Frequency selective surfaces; Kernel; Sparse matrices; Support vector machines; Training; Vectors; Context-aware; Hierarchical Sparse Representation; Multi-Instance Learning

【Paper Link】【Pages】:378-387

【Authors】: Cindy Xide Lin ; Qiaozhu Mei ; Jiawei Han ; Yunliang Jiang ; Marina Danilevsky

【Abstract】: The prevalence of Web 2.0 techniques has led to the boom of various online communities, where topics spread ubiquitously among user-generated documents. Working together with this diffusion process is the evolution of topic content, where novel contents are introduced by documents which adopt the topic. Unlike explicit user behavior (e.g., buying a DVD), both the diffusion paths and the evolutionary process of a topic are implicit, making their discovery challenging. In this paper, we track the evolution of an arbitrary topic and reveal the latent diffusion paths of that topic in a social community. A novel and principled probabilistic model is proposed which casts our task as an joint inference problem, which considers textual documents, social influences, and topic evolution in a unified way. Specifically, a mixture model is introduced to model the generation of text according to the diffusion and the evolution of the topic, while the whole diffusion process is regularized with user-level social influences through a Gaussian Markov Random Field. Experiments on both synthetic data and real world data show that the discovery of topic diffusion and evolution benefits from this joint inference, and the probabilistic model we propose performs significantly better than existing methods.

【Keywords】: Gaussian processes; Internet; Markov processes; inference mechanisms; probability; random processes; social networking (online); text analysis; Gaussian Markov random field; Web 2.0 techniques; joint inference problem; latent diffusion paths; mixture model; online communities; probabilistic model; real world data; social community; social influence; synthetic data; text generation; textual document; topic content; topic diffusion process; topic evolution; user generated document; user level social influence; Communities; Computational modeling; Diffusion processes; Joints; Social network services; Tides; Vectors

40. Towards Optimal Discriminating Order for Multiclass Classification.

【Paper Link】【Pages】:388-397

【Authors】: Dong Liu ; Shuicheng Yan ; Yadong Mu ; Xian-Sheng Hua ; Shih-Fu Chang ; Hong-Jiang Zhang

【Abstract】: In this paper, we investigate how to design an optimized discriminating order for boosting multiclass classification. The main idea is to optimize a binary tree architecture, referred to as Sequential Discriminating Tree (SDT), that performs the multiclass classification through a hierarchical sequence of coarse-to-fine binary classifiers. To infer such a tree architecture, we employ the constrained large margin clustering procedure which enforces samples belonging to the same class to locate at the same side of the hyper plane while maximizing the margin between these two partitioned class subsets. The proposed SDT algorithm has a theoretic error bound which is shown experimentally to effectively guarantee the generalization performance. Experiment results indicate that SDT clearly beats the state-of-the-art multiclass classification algorithms.

【Keywords】: pattern classification; trees (mathematics); SDT; binary tree architecture; multiclass classification; optimal discriminating order; sequential discriminating tree; Algorithm design and analysis; Clustering algorithms; Optimization; Partitioning algorithms; Support vector machines; Testing; Training; Discriminating Order; Multiclass; Sequential Discriminating Tree

41. A Hypergraph-based Method for Discovering Semantically Associated Itemsets.

【Paper Link】【Pages】:398-406

【Authors】: Haishan Liu ; Paea LePendu ; Ruoming Jin ; Dejing Dou

【Abstract】: In this paper, we address an interesting data mining problem of finding semantically associated itemsets, i.e., items connected via indirect links. We propose a novel method for discovering semantically associated itemsets based on a hypergraph representation of the database. We describe two similarity measures to compute the strength of associations between items. Specifically, we introduce the average commute time similarity, s_CT, based on the random walk model on hypergraph, and the inner-product similarity, s_L+, based on the Moore-Penrose pseudoinverse of the hypergraph Laplacian matrix. Given semantically associated 2-itemsets generated by these measures, we design a hypergraph expansion method with two search strategies, namely, the clique and connected component search, to generate k-itemsets (k >; 2). We show the proposed method is indeed capable of capturing semantically associated itemsets through experiments performed on three datasets ranging from low to high dimensionality. The semantically associated itemsets discovered in our experiment is promising to provide valuable insights on interrelationship between medical concepts and other domain specific concepts.

【Keywords】: data mining; database management systems; graph theory; search problems; Moore-Penrose pseudoinverse; average commute time similarity; clique search; connected component search; data mining problem; domain specific concepts; hypergraph Laplacian matrix; hypergraph based method; hypergraph database representation; hypergraph expansion method; inner product similarity; medical concepts; search strategies; semantically associated itemset discovery; Blood; Equations; Itemsets; Joining processes; Laplace equations; Mathematical model; Semantics; Semantically associated itemset; hypergraph; random walk

42. Personalized Travel Package Recommendation.

【Paper Link】【Pages】:407-416

【Authors】: Qi Liu ; Yong Ge ; Zhongmou Li ; Enhong Chen ; Hui Xiong

【Abstract】: As the worlds of commerce, entertainment, travel, and Internet technology become more inextricably linked, new types of business data become available for creative use and formal analysis. Indeed, this paper provides a study of exploiting online travel information for personalized travel package recommendation. A critical challenge along this line is to address the unique characteristics of travel data, which distinguish travel packages from traditional items for recommendation. To this end, we first analyze the characteristics of the travel packages and develop a Tourist-Area-Season Topic (TAST) model, which can extract the topics conditioned on both the tourists and the intrinsic features (i.e. locations, travel seasons) of the landscapes. Based on this TAST model, we propose a cocktail approach on personalized travel package recommendation. Finally, we evaluate the TAST model and the cocktail approach on real-world travel package data. The experimental results show that the TAST model can effectively capture the unique characteristics of the travel data and the cocktail approach is thus much more effective than traditional recommendation methods for travel package recommendation.

【Keywords】: Internet; data analysis; electronic commerce; entertainment; recommender systems; travel industry; Internet technology; TAST model; business data; cocktail approach; commerce; entertainment; formal analysis; landscapes; online travel information; personalized travel package recommendation; real-world travel package data; tourist-area-season topic model; tourists; travel data; Collaboration; Companies; Mathematical model; Motion pictures; Recommender systems; Vectors

【Paper Link】【Pages】:417-426

【Authors】: Yang Liu ; Fei Wu ; Yin Zhang ; Jian Shao ; Yueting Zhuang

【Abstract】: Recently, there has been extensive research towards the user-provided tags on photo sharing websites which can greatly facilitate image retrieval and management. However, due to the arbitrariness of the tagging activities, these tags are often imprecise and incomplete. As a result, quite a few technologies has been proposed to improve the user experience on these photo sharing systems, including tag clustering and refinement, etc. In this work, we propose a novel framework to model the relationships among tags and images which can be applied to many tag based applications. Different from previous approaches which model images and tags as heterogeneous objects, images and their tags are uniformly viewed as compositions of Semantic Unities in our framework. Then Semantic Unity Graph (SUG) is introduced to represent the complex and high-order relationships among these Semantic Unities. Based on the representation of Semantic Unity Graph, the relevance of images and tags can be naturally measured in terms of the similarity of their Semantic Unities. Then Tag clustering and refinement can then be performed on SUG and the polysemy of images and tags is explicitly considered in this framework. The experiment results conducted on NUS-WIDE and MIR-Flickr datasets demonstrate the effectiveness and efficiency of the proposed approach.

【Keywords】: Web sites; graph theory; image retrieval; pattern clustering; MIR-Flickr datasets; NUS-WIDE; SUG; Web sites; image retrieval; photo sharing systems; semantic unities; semantic unity graph; tag clustering; tag refinement; Bipartite graph; Image edge detection; Laplace equations; Media; Semantics; Tagging; Visualization; Clustering; Hypergraph; Tag refinement

44. Minimizing Seed Set for Viral Marketing.

【Paper Link】【Pages】:427-436

【Authors】: Cheng Long ; Raymond Chi-Wing Wong

【Abstract】: Viral marketing has attracted considerable concerns in recent years due to its novel idea of leveraging the social network to propagate the awareness of products. Specifically, viral marketing is to first target a limited number of users (seeds) in the social network by providing incentives, and these targeted users would then initiate the process of awareness spread by propagating the information to their friends via their social relationships. Extensive studies have been conducted for maximizing the awareness spread given the number of seeds. However, all of them fail to consider the common scenario of viral marketing where companies hope to use as few seeds as possible yet influencing at least a certain number of users. In this paper, we propose a new problem, called J-MIN-Seed, whose objective is to minimize the number of seeds while at least J users are influenced. J-MIN-Seed, unfortunately, is proved to be NP-hard in this work. In such case, we develop a greedy algorithm that can provide error guarantees for J-MIN-Seed. Furthermore, for the problem setting where J is equal to the number of all users in the social network, denoted by Full-Coverage, we design other efficient algorithms. Extensive experiments were conducted on real datasets to verify our algorithm.

【Keywords】: Internet; marketing data processing; social networking (online); J-MIN-Seed; NP-hard problem; least J users; minimizing seed set; social network; social relationships; viral marketing; Algorithm design and analysis; Approximation algorithms; Greedy algorithms; Integrated circuit modeling; Probabilistic logic; Social network services; Seeds; Viral Marketing

【Paper Link】【Pages】:437-446

【Authors】: Nigel Medforth ; Ke Wang

【Abstract】: To understand how social networks evolve over time, graphs representing the networks need to be published periodically or on-demand. The identity of the participants (nodes) must be anonymized to protect the privacy of the individuals and their relationships (edges) to the other members in the social network. We identify a new form of privacy attack, which we name the degree-trail attack. This attack re-identifies the nodes belonging to a target participant from a sequence of published graphs by comparing the degree of the nodes in the published graphs with the degree evolution of a target. The power of this attack is that the adversary can actively influence the degree of the target individual by interacting with the social network. We show that the adversary can succeed with a high probability even if published graphs are anonymized by strongest known privacy preserving techniques in the literature. Moreover, this success does not depend on the distinctiveness of the target nodes nor require the adversary to behave differently from a normal participant. One of our contributions is a formal method to assess the privacy risk of this type of attacks and empirically study the severity on real social network data.

【Keywords】: data privacy; graph theory; publishing; risk management; social networking (online); degree-trail attack; graph representation; graph stream publishing; privacy attack; privacy preserving techniques; privacy protection; privacy risk; social network data; Data privacy; Educational institutions; Handheld computers; Pins; Privacy; Social network services; anonymity; data publishing; privacy; social network

46. Boolean Tensor Factorizations.

【Paper Link】【Pages】:447-456

【Authors】: Pauli Miettinen

【Abstract】: Tensors are multi-way generalizations of matrices, and similarly to matrices, they can also be factorized, that is, represented (approximately) as a product of factors. These factors are typically either all matrices or a mixture of matrices and tensors. With the widespread adoption of matrix factorization techniques in data mining, also tensor factorizations have started to gain attention. In this paper we study the Boolean tensor factorizations. We assume that the data is binary multi-way data, and we want to factorize it to binary factors using Boolean arithmetic (i.e. defining that 1+1=1). Boolean tensor factorizations are, therefore, natural generalization of the Boolean matrix factorizations. We will study the theory of Boolean tensor factorizations and show that at least some of the benefits Boolean matrix factorizations have over normal matrix factorizations carry over to the tensor data. We will also present algorithms for Boolean variations of CP and Tucker decompositions, the two most-common types of tensor factorizations. With experimentation done with synthetic and real-world data, we show that Boolean tensor factorizations are a viable alternative when the data is naturally binary.

【Keywords】: Boolean algebra; data mining; matrix decomposition; tensors; Boolean CP decomposition; Boolean Tucker decomposition; Boolean matrix factorization; Boolean tensor factorization; binary multiway data; data mining; matrices; tensor data; Approximation algorithms; Approximation methods; Argon; Data mining; Matrix decomposition; Tensile stress; Vectors; Boolean matrix factorization; Boolean tensor factorization; CP factorization; Tensor factorization; Tucker factorization

47. Sparse Domain Adaptation in Projection Spaces Based on Good Similarity Functions.

【Paper Link】【Pages】:457-466

【Authors】: Emilie Morvant ; Amaury Habrard ; Stéphane Ayache

【Abstract】: We address the problem of domain adaptation for binary classification which arises when the distributions generating the source learning data and target test data are somewhat different. We consider the challenging case where no target labeled data is available. From a theoretical standpoint, a classifier has better generalization guarantees when the two domain marginal distributions are close. We study a new direction based on a recent framework of Balcan et al. allowing to learn linear classifiers in an explicit projection space based on similarity functions that may be not symmetric and not positive semi-definite. We propose a general method for learning a good classifier on target data with generalization guarantees and we improve its efficiency thanks to an iterative procedure by reweighting the similarity function - compatible with Balcan et al. framework - to move closer the two distributions in a new projection space. Hyper parameters and reweighting quality are controlled by a reverse validation procedure. Our approach is based on a linear programming formulation and shows good adaptation performances with very sparse models. We evaluate it on a synthetic problem and on real image annotation task.

【Keywords】: data handling; iterative methods; learning (artificial intelligence); pattern classification; good similarity functions; hyper parameters; iterative procedure; linear programming; marginal distributions; projection spaces; source learning data; sparse domain adaptation; target test data; Adaptation models; Aerospace electronics; Joints; Kernel; Machine learning; Robustness; Training; Binary Classification; Domain Adaptation; Machine Learning; Similarity Functions; Transfer Learning

48. Incremental Elliptical Boundary Estimation for Anomaly Detection in Wireless Sensor Networks.

【Paper Link】【Pages】:467-476

【Authors】: Masud Moshtaghi ; Christopher Leckie ; Shanika Karunasekera ; James C. Bezdek ; Sutharshan Rajasegarar ; Marimuthu Palaniswami

【Abstract】: Wireless Sensor Networks (WSNs) provide a low cost option for gathering spatially dense data from different environments. However, WSNs have limited energy resources that hinder the dissemination of the raw data over the network to a central location. This has stimulated research into efficient data mining approaches, which can exploit the restricted computational capabilities of the sensors to model their normal behavior. Having a normal model of the network, sensors can then forward anomalous measurements to the base station. Most of the current data modeling approaches proposed for WSNs require a fixed offline training period and use batch training in contrast to the real streaming nature of data in these networks. In addition they usually work in stationary environments. In this paper we present an efficient online model construction algorithm that captures the normal behavior of the system. Our model is capable of tracking changes in the data distribution in the monitored environment. We illustrate the proposed algorithm with numerical results on both real-life and simulated data sets, which demonstrate the efficiency and accuracy of our approach compared to existing methods.

【Keywords】: data mining; security of data; wireless sensor networks; anomaly detection; batch training; data mining approaches; data modeling approaches; fixed offline training period; incremental elliptical boundary estimation; spatially dense data; wireless sensor networks; Adaptation models; Computational modeling; Covariance matrix; Data models; Sensors; Training; Wireless sensor networks; Anomaly Detection; IDCAD; Incremental Elliptical Boundary Estimation; Streaming Data Analysis

49. Learning Classification with Auxiliary Probabilistic Information.

【Paper Link】【Pages】:477-486

【Authors】: Quang Nguyen ; Hamed Valizadegan ; Milos Hauskrecht

【Abstract】: Finding ways of incorporating auxiliary information or auxiliary data into the learning process has been the topic of active data mining and machine learning research in recent years. In this work we study and develop a new framework for classification learning problem in which, in addition to class labels, the learner is provided with an auxiliary (probabilistic) information that reflects how strong the expert feels about the class label. This approach can be extremely useful for many practical classification tasks that rely on subjective label assessment and where the cost of acquiring additional auxiliary information is negligible when compared to the cost of the example analysis and labelling. We develop classification algorithms capable of using the auxiliary information to make the learning process more efficient in terms of the sample complexity. We demonstrate the benefit of the approach on a number of synthetic and real world data sets by comparing it to the learning with class labels only.

【Keywords】: data mining; learning (artificial intelligence); pattern classification; probability; active data mining; auxiliary data; auxiliary probabilistic information; classification algorithm; classification learning problem; learning process; machine learning research; Concrete; Data models; Humans; Logistics; Machine learning; Noise; Probabilistic logic; classification learning; learning with auxiliary label information; sample complexity

50. Word Cloud Model for Text Categorization.

【Paper Link】【Pages】:487-496

【Authors】: Tam T. Nguyen ; Kuiyu Chang ; Siu Cheung Hui

【Abstract】: In centroid-based classification, each class is represented by a prototype or centroid document vector that is formed by averaging all member vectors during the training phase. In the prediction phase, the label of a test document vector is assigned to that of its nearest class prototype. Recently there has been revived interest in reformulating the prototype/centroid to improve classification performance. In this paper, we study the theoretical properties of the recently proposed Class Feature Centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). The implication of our theoretical finding is that CFC is inherently biased towards large (dominant majority) classes, which means it is destined to perform poorly for highly class-imbalanced data. Another practical concern about CFC lies in its overly-aggressive design in weeding out terms that appear in all classes. To overcome these CFC limitations while retaining its intrinsic and worthy design goals, we propose an improved and robust centroid-based classifier that uses precise term-class distribution properties instead of simple presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback-Leibler divergence measure between pairs of class-conditional term probabilities, we call this the CFC-KL centroid classifier. We then generalized CFC-KL to handle multi-class data by summing pair wise class-conditioned word probability ratios. Our proposed approach has been evaluated on 5 datasets, on which it consistently outperforms CFC and the baseline Support Vector Machine classifier. We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our CFC-KL, and visually compare it with other popular term weigthing approaches. Our encouraging results show that the centroid based generalized CFC-KL classifier is both robust and efficient to deal with real-world text c- assification.

【Keywords】: classification; data mining; data visualisation; probability; text analysis; CFC classifier; CFC-KL centroid classifier; Kullback-Leibler divergence measure; centroid-based classification; class feature centroid; class-conditional term probability; pair wise class-conditioned word probability ratio; term-class distribution; test document vector; text categorization; text classification; word cloud model; word cloud visualization approach; Motion pictures; Prototypes; Radio frequency; Support vector machines; Tag clouds; Text categorization; Vectors; Centroid-based Classification; Text Categorization

51. SLIM: Sparse Linear Methods for Top-N Recommender Systems.

【Paper Link】【Pages】:497-506

【Authors】: Xia Ning ; George Karypis

【Abstract】: This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an ℓ₁-norm and ℓ₂-norm regularized optimization problem. W is demonstrated to produce high quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.

【Keywords】: matrix algebra; optimisation; recommender systems; ℓ₁-norm regularized optimization problem; ℓ₂-norm regularized optimization problem; SLIM; purchase profiles; rating profiles; recommendation quality; run time performance; sparse aggregation coefficient matrix; sparse linear methods; top-N recommender systems; Equations; Mathematical model; Measurement; Optimization; Recommender systems; Sparse matrices; Vectors; Sparse Linear Methods; Top-N Recommender Systems; l1-norm Regularization

52. Novel Recommendation Based on Personal Popularity Tendency.

【Paper Link】【Pages】:507-516

【Authors】: Jinoh Oh ; Sun Park ; Hwanjo Yu ; Min Song ; Seung-Taek Park

【Abstract】: Recently, novel recommender systems have attracted considerable attention in the research community. Recommending popular items may not always satisfy users. For example, although most users likely prefer popular items, such items are often not very surprising or novel because users may already know about the items. Also, such recommender systems hardly satisfy a group of users who prefer relatively obscure items. Existing novel recommender systems, however, still recommend mainly popular items or degrade the quality of recommendation. They do so because they do not consider the balance between novelty and preference-based recommendation. This paper proposes an efficient novel-recommendation method called Personal Popularity Tendency Matching (PPTM) which recommends novel items by considering an individual's Personal Popularity Tendency (or PPT). Considering PPT helps to diversify recommendations by reasonably penalizing popular items while improving the recommendation accuracy. We experimentally show that the proposed method, PPTM, is better than other methods in terms of both novelty and accuracy.

【Keywords】: recommender systems; personal popularity tendency matching; popular items; recommendation accuracy; recommendation quality; recommender systems; Atmospheric measurements; Collaboration; Communities; Degradation; Motion pictures; Particle measurements; Recommender systems; EMD; Novel recommendation; Personal Popularity Tendency

53. An Analysis of Performance Measures for Binary Classifiers.

【Paper Link】【Pages】:517-526

【Authors】: Charles Parker

【Abstract】: If one is given two binary classifiers and a set of test data, it should be straightforward to determine which of the two classifiers is the superior. Recent work, however, has called into question many of the methods heretofore accepted as standard for this task. In this paper, we analyze seven ways of determining if one classifier is better than another, given the same test data. Five of these are long established and two are relative newcomers. We review and extend work showing that one of these methods is clearly inappropriate, and then conduct an empirical analysis with a large number of datasets to evaluate the real-world implications of our theoretical analysis. Both our empirical and theoretical results converge strongly towards one of the newer methods.

【Keywords】: data handling; pattern classification; set theory; binary classifiers; empirical analysis; performance measurement analysis; real-world implications; test data set; Accuracy; Communities; Equations; Loss measurement; Vectors; Weight measurement; Classifier Evaluation; Performance Metrics; Supervised Learning

54. Detection of Cross-Channel Anomalies from Multiple Data Channels.

【Paper Link】【Pages】:527-536

【Authors】: Duc-Son Pham ; Budhaditya Saha ; Dinh Q. Phung ; Svetha Venkatesh

【Abstract】: We identify and formulate a novel problem: cross channel anomaly detection from multiple data channels. Cross channel anomalies are common amongst the individual channel anomalies, and are often portent of significant events. Using spectral approaches, we propose a two-stage detection method: anomaly detection at a single-channel level, followed by the detection of cross-channel anomalies from the amalgamation of single channel anomalies. Our mathematical analysis shows that our method is likely to reduce the false alarm rate. We demonstrate our method in two applications: document understanding with multiple text corpora, and detection of repeated anomalies in video surveillance. The experimental results consistently demonstrate the superior performance of our method compared with related state-of-art methods, including the one-class SVM and principal component pursuit. In addition, our framework can be deployed in a decentralized manner, lending itself for large scale data stream analysis.

【Keywords】: data analysis; mathematical analysis; principal component analysis; security of data; support vector machines; text analysis; video surveillance; amalgamation; cross-channel anomaly detection; document understanding; false alarm rate; large scale data stream analysis; mathematical analysis; multiple data channels; multiple text corpora; one-class SVM; principal component pursuit; single channel anomaly; single-channel level; spectral approaches; state-of-art methods; two-stage detection method; video surveillance; Covariance matrix; Data mining; Dictionaries; Nickel; Noise; Peer to peer computing; Vectors; Anomaly detection; Spectral methods; topic detection

55. Threshold Conditions for Arbitrary Cascade Models on Arbitrary Networks.

【Paper Link】【Pages】:537-546

【Authors】: B. Aditya Prakash ; Deepayan Chakrabarti ; Michalis Faloutsos ; Nicholas Valler ; Christos Faloutsos

【Abstract】: Given a network of who-contacts-whom or who links-to-whom, will a contagious virus/product/meme spread and 'take-over' (cause an epidemic) or die-out quickly? What will change if nodes have partial, temporary or permanent immunity? The epidemic threshold is the minimum level of virulence to prevent a viral contagion from dying out quickly and determining it is a fundamental question in epidemiology and related areas. Most earlier work focuses either on special types of graphs or on specific epidemiological/cascade models. We are the first to show the G2-threshold (twice generalized) theorem, which nicely de-couples the effect of the topology and the virus model. Our result unifies and includes as special case older results and shows that the threshold depends on the first eigenvalue of the connectivity matrix, (a) for any graph and (b) for all propagation models in standard literature (more than 25, including H.I.V.) [20], [12]. Our discovery has broad implications for the vulnerability of real, complex networks, and numerous applications, including viral marketing, blog dynamics, influence propagation, easy answers to 'what-if' questions, and simplified design and evaluation of immunization policies. We also demonstrate our result using extensive simulations on one of the biggest available social contact graphs containing more than 31 million interactions among more than 1 million people representing the city of Portland, Oregon, USA.

【Keywords】: complex networks; computer network security; computer viruses; eigenvalues and eigenfunctions; graph theory; network theory (graphs); social networking (online); G2-threshold theorem; arbitrary cascade model; arbitrary network; blog dynamics; complex network vulnerability; connectivity matrix; eigenvalue; epidemic threshold; influence propagation; social contact graph; topology; twice generalized theorem; viral contagion; viral marketing; virus propagation model; Computational modeling; Eigenvalues and eigenfunctions; Immune system; Mathematical model; Terminology; Topology; Vectors; cascade models; epidemic threshold; virus propagation

56. Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data.

【Paper Link】【Pages】:547-556

【Authors】: Thanawin Rakthanmanon ; Eamonn J. Keogh ; Stefano Lonardi ; Scott Evans

【Abstract】: Given the pervasiveness of time series data in all human endeavors, and the ubiquity of clustering as a data mining application, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series, e.g., gene expression profiles, individual heartbeats or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases shown to work only on the most contrived datasets by carefully adjusting a large set of parameters. In this work, we make two fundamental contributions. First, we show that the problem definition for time series clustering from streams currently used is inherently flawed, and a new definition is necessary. Second, we show that the Minimum Description Length (MDL) framework offers an efficient, effective and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, zoology and industrial process analyses.

【Keywords】: data mining; pattern clustering; time series; data mining application; gene expression profiles; human endeavors; industrial process analyses; medicine; minimum description length framework; parameter free method; time series epenthesis; time series streams clustering; zoology; Clustering algorithms; Data mining; Encoding; Entropy; Euclidean distance; Handicapped aids; Time series analysis; MDL; clustering; time series

57. Mining Historical Documents for Near-Duplicate Figures.

【Paper Link】【Pages】:557-566

【Authors】: Thanawin Rakthanmanon ; Qiang Zhu ; Eamonn J. Keogh

【Abstract】: The increasing interest in archiving all of humankind's cultural artifacts has resulted in the digitization of millions of books, and soon a significant fraction of the world's books will be online. Most of the data in historical manuscripts is text, but there is also a significant fraction devoted to images. This fact has driven much of the recent increase in interest in query-by-content systems for images. While querying/indexing systems can undoubtedly be useful, we believe that the historical manuscript domain is finally ripe for true unsupervised discovery of patterns and regularities. To this end, we introduce an efficient and scalable system which can detect approximately repeated occurrences of shape patterns both within and between historical texts. We show that this ability to find repeated shapes allows automatic annotation of manuscripts, and allows users to trace the evolution of ideas. We demonstrate our ideas on datasets of scientific and cultural manuscripts dating back to the fourteenth century.

【Keywords】: data mining; document handling; history; cultural artifacts; cultural manuscripts; historical manuscripts; historical texts; indexing systems; mining historical documents; nearduplicate figures; querying systems; scientific manuscripts; shape patterns; text analysis; Approximation algorithms; Bioinformatics; Data mining; Force; Pathology; Scalability; Shape; cultural artifacts; duplication detection; repeated patterns

58. Analysis of Textual Variation by Latent Tree Structures.

【Paper Link】【Pages】:567-576

【Authors】: Teemu Roos ; Yuan Zou

【Abstract】: We introduce Semstem, a new method for the reconstruction of so called stemmatic trees, i.e., trees encoding the copying relationships among a set of textual variants. Our method is based on a structural expectation-maximization (structural EM) algorithm. It is the first computer-based method able to estimate general latent tree structures, unlike earlier methods that are usually restricted to bifurcating trees where all the extant texts are placed in the leaf nodes. We present experiments on two well known benchmark data sets, showing that the new method outperforms current state-of-the-art both in terms of a numerical score as well as interpretability.

【Keywords】: expectation-maximisation algorithm; text analysis; tree data structures; Semstem; computer-based method; general latent tree structure estimation; numerical score; stemmatic trees; structural expectation-maximization algorithm; textual variation analysis; Analytical models; Bioinformatics; Biological system modeling; Inference algorithms; Phylogeny; Vegetation; EM algorithm; graphical models; latent trees; stemmatology; textual criticism

59. Healing Sample Selection Bias by Source Classifier Selection.

【Paper Link】【Pages】:577-586

【Authors】: Chun-Wei Seah ; Ivor Wai-Hung Tsang ; Yew-Soon Ong

【Abstract】: Domain Adaptation (DA) methods are usually carried out by means of simply reducing the marginal distribution differences between the source and target domains, and subsequently using the resultant trained classifier, namely source classifier, for use in the target domain. However, in many cases, the true predictive distributions of the source and target domains can be vastly different especially when their class distributions are skewed, causing the issues of sample selection bias in DA. Hence, DA methods which leverage the source labeled data may suffer from poor generalization in the target domain, resulting in negative transfer. In addition, we observed that many DA methods use either a source classifier or a linear combination of source classifiers with a fixed weighting for predicting the target unlabeled data. Essentially, the labels of the target unlabeled data are spanned by the prediction of these source classifiers. Motivated by these observations, in this paper, we propose to construct many source classifiers of diverse biases and learn the weight for each source classifier by directly minimizing the structural risk defined on the target unlabeled data so as to heal the possible sample selection bias. Since the weights are learned by maximizing the margin of separation between opposite classes on the target unlabeled data, the proposed method is established here as Maximal Margin Target Label Learning (MMTLL), which is in a form of Multiple Kernel Learning problem with many label kernels. Extensive experimental studies of MMTLL against several state-of-the-art methods on the Sentiment and Newsgroups datasets with various imbalanced class settings showed that MMTLL exhibited robust accuracies on all the settings considered and was resilient to negative transfer, in contrast to other counterpart methods which suffered significantly in prediction accuracy.

【Keywords】: learning (artificial intelligence); pattern classification; domain adaptation methods; marginal distribution differences; maximal margin target label learning; negative transfer; newsgroups datasets; predictive distributions; sample selection bias; sentiment datasets; source classifier selection; structural risk minimization; target unlabeled data; trained classifier; Accuracy; Complexity theory; Joints; Kernel; Machine learning; Support vector machines; Vectors; Classifier Selection; Domain Adaptation; Maximum Margin Separation; Multiple Kernel Learning; Negative Transfer; Sample Selection Bias

60. An In-depth Study of Stochastic Kronecker Graphs.

【Paper Link】【Pages】:587-596

【Authors】: C. Seshadhri ; Ali Pinar ; Tamara G. Kolda

【Abstract】: Graph analysis is playing an increasingly important role in science and industry. Due to numerous limitations in sharing real-world graphs, models for generating massive graphs are critical for developing better algorithms. In this paper, we analyze the stochastic Kronecker graph model (SKG), which is the foundation of the Graph500 supercomputer benchmark due to its many favorable properties and easy parallelization. Our goal is to provide a deeper understanding of the parameters and properties of this model so that its functionality as a benchmark is increased. We develop a rigorous mathematical analysis that shows this model cannot generate a power-law distribution or even a lognormal distribution. However, we formalize an enhanced version of the SKG model that uses random noise for smoothing. We prove both in theory and in practice that this enhancement leads to a lognormal distribution. Additionally, we provide a precise analysis of isolated vertices, showing that graphs that are produced by SKG might be quite different than intended. For example, between 50% and 75% of the vertices in the Graph500 benchmarks will be isolated. Finally, we show that this model tends to produce extremely small core numbers (compared to most social networks and other real graphs) for common parameter choices.

【Keywords】: graph theory; parallel machines; stochastic processes; Graph500 supercomputer benchmark; SKG; graph analysis; lognormal distribution; mathematical analysis; power law distribution; social networks; stochastic Kronecker graphs; Algorithm design and analysis; Analytical models; Approximation methods; Benchmark testing; Mathematical model; Noise; Oscillators; Graph Mining; Graph500; Lognormal Degree Distributions; Random Graph Generation; Social Networks; Stochastic Kronecker Graphs

61. Learning Spectral Embedding for Semi-supervised Clustering.

【Paper Link】【Pages】:597-606

【Authors】: Fanhua Shang ; Yuanyuan Liu ; Fei Wang

【Abstract】: In recent years, semi-supervised clustering (SSC) has aroused considerable interests from the machine learning and data mining communities. In this paper, we propose a novel semi-supervised clustering approach with enhanced spectral embedding (ESE) which not only considers structure information contained in data sets but also makes use of prior side information such as pair wise constraints. Specially, we first construct a symmetry-favored k-NN graph which is highly robust to noisy objects and can reflect the underlying manifold structure of data. Then we learn the enhanced spectral embedding towards an ideal representation as consistent with the pair wise constraints as possible. Finally, through taking advantage of Laplacian regularization, we formulate learning spectral representation as semi definite-quadratic-linear programs (SQLPs) under the squared loss function or small semi definitive programs (SDPs) under the hinge loss function, which both can be efficiently solved. Experimental results on a variety of synthetic and real-world data sets show that our approach outperforms the state-of-the-art SSC algorithms on both vector-based and graph-based clustering.

【Keywords】: data mining; graph theory; learning (artificial intelligence); linear programming; pattern clustering; visual databases; Laplacian regularization; data mining; enhanced spectral embedding; graph-based clustering; machine learning; pairwise constraint; semidefinite-quadratic-linear program; semisupervised clustering; symmetry-favored k-NN graph; vector-based clustering; Algorithm design and analysis; Clustering algorithms; Complexity theory; Fasteners; Kernel; Laplace equations; Optimization; Laplacian regularization; pairwise constraint; semisupervised clustering (SSC); spectral embedding

62. Detection of Arbitrarily Oriented Synchronized Clusters in High-Dimensional Data.

【Paper Link】【Pages】:607-616

【Authors】: Junming Shao ; Claudia Plant ; Qinli Yang ; Christian Böhm

【Abstract】: How to address the challenges of the "curse of dimensionality" in clustering? Clustering is a powerful data mining technique for structuring and organizing vast amounts of data. However, the high-dimensional data space is usually very sparse and meaningful clusters can only be found in lower dimensional subspaces. In many applications the subspaces hosting the clusters provide valuable information for interpreting the major patterns in the data. Detection of subspace clusters is challenging since usually many of the attributes are noisy, some attributes may exhibit correlations among each other and only few of the attributes truly contribute to the cluster structure. In this paper, we propose ORSC (Arbitrarily ORiented Synchronized Clusters), a novel effective and efficient method to subspace clustering inspired by synchronization. Synchronization is a basic phenomenon prevalent in nature, capable of controlling even highly complex processes such as opinion formation in a group. Control of complex processes is achieved by simple operations based on interactions between objects. Relying on the interaction model for synchronization, our approach ORSC (1) naturally detects correlation clusters in arbitrarily oriented subspaces, including (2) arbitrarily shaped non-linear correlation clusters. Our approach is (3) robust against noise points and outliers. In contrast to previous methods, ORSC is (4) easy to parameterize, since there is no need to specify the subspace dimensionality and all interesting subspace clusters can be detected. Finally, (5) ORSC outperforms most comparison methods in terms of runtime efficiency and is highly scalable to large and high-dimensional data sets.

【Keywords】: data mining; pattern clustering; ORSC; arbitrarily oriented subspaces; arbitrarily oriented synchronized cluster detection; complex processes; correlation clusters; data mining technique; high dimensional data; noise points; outliers; subspace dimensionality; Clustering algorithms; Correlation; Covariance matrix; Eigenvalues and eigenfunctions; Oscillators; Principal component analysis; Synchronization; high-dimensional data; interaction model; subspace clustering; synchronization

63. A Generalized Fast Subset Sums Framework for Bayesian Event Detection.

【Paper Link】【Pages】:617-625

【Authors】: Kan Shao ; Yandong Liu ; Daniel B. Neill

【Abstract】: We present Generalized Fast Subset Sums (GFSS), a new Bayesian framework for scalable and accurate detection of irregularly shaped spatial clusters using multiple data streams. GFSS extends the previously proposed Multivariate Bayesian Scan Statistic (MBSS) and Fast Subset Sums (FSS) approaches for detection of emerging events. The detection power of MBSS is primarily limited by computational considerations, which limit it to searching over circular spatial regions. GFSS enables more accurate and timely detection by defining a hierarchical prior over all subsets of the N locations, first selecting a local neighborhood consisting of a center location and its neighbors, and introducing a sparsity parameter p to describe how likely each location in the neighborhood is to be affected. This approach allows us to consider all possible subsets of locations (including irregularly-shaped regions) but also puts higher weight on more compact regions. We demonstrate that MBSS and FSS are both special cases of this general framework (assuming p = 1 and p = 0.5 respectively), but substantially higher detection power can be achieved by choosing an appropriate value of p. Thus we show that the distribution of the sparsity parameter p can be accurately learned from a small number of labeled events. Our evaluation results (on synthetic disease outbreaks injected into real-world hospital data) show that the GFSS method with learned sparsity parameter has higher detection power and spatial accuracy than MBSS and FSS, particularly when the affected region is irregular or elongated. We also show that the learned models can be used for event characterization, accurately distinguishing between two otherwise identical event types based on the sparsity of the affected spatial region.

【Keywords】: Bayes methods; data handling; learning (artificial intelligence); set theory; Bayesian event detection; FSS; GFSS; GFSS method; MBSS; data streams; fast subset sum; generalized fast subset sum; irregularly shaped spatial clusters; learned sparsity parameter; multivariate Bayesian scan statistic; real world hospital data; sparsity parameter distribution; synthetic disease outbreaks; Accuracy; Bayesian methods; Diseases; Event detection; Frequency selective surfaces; Training; biosurveillance; event detection; scan statistics

64. Learning to Rank for Query-Focused Multi-document Summarization.

【Paper Link】【Pages】:626-634

【Authors】: Chao Shen ; Tao Li

【Abstract】: In this paper, we explore how to use ranking SVM to train the feature weights for query-focused multi-document summarization. To apply a supervised learning method to sentence extraction in multi-document summarization, we need to derive the sentence labels for training corpus from the existing human labeling data in form of. However, this process is not trivial, because the human summaries are abstractive, and do not necessarily well match the sentences in the documents. In this paper, we try to address the above problem from the following two aspects. First, we make use of sentence-to-sentence relationships to better estimate the probability of a sentence in the document set to be a summary sentence. Second, to make the derived training data less sensitive, we adopt a cost sensitive loss in the ranking SVM's objective function. The experimental results demonstrate the effectiveness of our proposed method.

【Keywords】: document handling; learning (artificial intelligence); probability; query processing; support vector machines; SVM ranking; feature weights; query-focused multidocument summarization; rank learning; sentence probability estimation; sentence-to-sentence relationships; support vector machine; Feature extraction; Humans; Redundancy; Supervised learning; Support vector machines; Training; Training data; learning to rank; query-based multi-document summarization

65. Simple Multiple Noisy Label Utilization Strategies.

【Paper Link】【Pages】:635-644

【Authors】: Victor S. Sheng

【Abstract】: With the outsourcing of small tasks becoming easier, it is possible to obtain non-expert/imperfect labels at low cost. With low-cost imperfect labeling, it is straightforward to collect multiple labels for the same data items. This paper addresses the strategies of utilizing these multiple labels for improving the performance of supervised learning, based on two basic ideas: majority voting and pair wise solutions. We show several interesting results based on our experiments. The soft majority voting strategies can reduce the bias and roughness, and improve the performance of the directed hard majority voting strategy. Pair wise strategies can completely avoid the bias by having both sides (potential correct and incorrect/noisy information) considered (for binary classification). They have very good performance whenever there are a few or many labels available. However, it could also keep the noise. The improved variation that reduces the impact of the noisy information is recommended. All five strategies investigated are labeling quality agnostic strategies, and can be applied to real world applications directly. The experimental results show some of them perform better than or at least very close to the gnostic strategies.

【Keywords】: data handling; learning (artificial intelligence); hard majority voting strategy; multiple noisy label utilization strategies; outsourcing; pairwise solutions; quality agnostic strategies; soft majority voting strategies; supervised learning; Accuracy; Estimation; Labeling; Noise measurement; Training; Training data; Uncertainty; classification; crowdsourcing; multiple noisy labels; outsourcing

66. Partitionable Kernels for Mapping Kernels.

【Paper Link】【Pages】:645-654

【Authors】: Kilho Shin

【Abstract】: Many of tree kernels in the literature are designed tanking advantage of the mapping kernel framework. The most important advantage of using this framework is that we have a strong theorem to examine positive definiteness of the resulting tree kernels. In the mapping kernel framework, each data object is viewed as a collection of components, and a mapping kernel for a pair of data objects is determined as a sum of kernel values of component pairs over a certain range determined according to the purpose of use of the resulting mapping kernel. For those tree kernels known to belong to the mapping kernel category, the string kernel of the product type is commonly used to compute the kernel values of component pairs. This is because it is known that use of the product-type string kernel together with the mapping kernel framework allows us to have recursive formulas to calculate the resulting tree kernels efficiently. We significantly generalizes this result. In fact, we show that we can use partition able kernels, a new class of string kernels instead of the product-type string kernel to enjoy the same advantage, that is, efficient computation based on recursive formulas. The class of partition able kernels is abundant, and contains the product-type string kernels just as an instance. Also, this result, not limited to tree kernels, can be applied to general mapping kernels after we formalize the decomposition properties of trees as the new notion of pretty decomposability.

【Keywords】: support vector machines; trees (mathematics); Mapping Kernels; SVM; data object; partitionable kernels; positive definiteness; support vector machines; tree kernels; Convolution; Educational institutions; Heuristic algorithms; Kernel; Support vector machines; Vectors; Vegetation

67. Mining Dominant Patterns in the Sky.

【Paper Link】【Pages】:655-664

【Authors】: Arnaud Soulet ; Chedy Raïssi ; Marc Plantevit ; Bruno Crémilleux

【Abstract】: Pattern discovery is at the core of numerous data mining tasks. Although many methods focus on efficiency in pattern mining, they still suffer from the problem of choosing a threshold that influences the final extraction result. The goal of our study is to make the results of pattern mining useful from a user-preference point of view. To this end, we integrate into the pattern discovery process the idea of skyline queries in order to mine skyline patterns in a threshold-free manner. Because the skyline patterns satisfy a formal property of dominations, they not only have a global interest but also have semantics that are easily understood by the user. In this work, we first establish theoretical relationships between pattern condensed representations and skyline pattern mining. We also show that it is possible to compute automatically a subset of measures involved in the user query which allows the patterns to be condensed and thus facilitates the computation of the skyline patterns. This forms the basis for a novel approach to mining skyline patterns. We illustrate the efficiency of our approach over several data sets including a use case from chemo informatics and show that small sets of dominant patterns are produced under various measures.

【Keywords】: chemistry computing; data mining; pattern classification; query processing; chemo informatics; data mining tasks; dominant pattern mining; formal property; pattern condensed representations; pattern discovery process; skyline pattern mining; skyline patterns; skyline query; user query; user-preference; Area measurement; Data mining; Databases; Frequency measurement; Length measurement; Redundancy; Semantics; Pattern mining; Skyline analysis; user-preferences

68. Ranking Web-Based Partial Orders by Significance Using a Markov Reference Model.

【Paper Link】【Pages】:665-674

【Authors】: Michel Speiser ; Gianluca Antonini ; Abderrahim Labbi

【Abstract】: Mining web traffic data has been addressed in literature mostly using sequential pattern mining techniques. Recently, a more powerful pattern called partial order was introduced, with the hope of providing a more compact result set. A further approach towards this goal, valid for both sequential patterns and partial orders, consists in building a statistical significance test for frequent patterns. Our method is based on probabilistic generative models and provides a direct way to rank the extracted patterns. It leaves open the number of patterns of interest, which depends on the application, but provides an alternative criterion to frequency of occurrence: statistical significance. In this paper, we focus on the construction of an algorithm which calculates the probability of partial orders under a first-order Markov reference model, and we show how to use those probabilities to assess the statistical significance of a set of mined partial orders.

【Keywords】: Internet; Markov processes; data mining; statistical testing; Markov reference model; Web based partial orders ranking; Web traffic data mining; sequential pattern mining techniques; sequential patterns; statistical significance test; Absorption; Computational modeling; Data mining; Databases; Markov processes; Probability; Transient analysis; Markov; partial order; pattern; poset; probability; ranking; significance; statisstatistical; test; web

69. Interesting Multi-relational Patterns.

【Paper Link】【Pages】:675-684

【Authors】: Eirini Spyropoulou ; Tijl De Bie

【Abstract】: Mining patterns from multi-relational data is a problem attracting increasing interest within the data mining community. Traditional data mining approaches are typically developed for highly simplified types of data, such as an attribute-value table or a binary database, such that those methods are not directly applicable to multi-relational data. Nevertheless, multi-relational data is a more truthful and therefore often also a more powerful representation of reality. Mining patterns of a suitably expressive syntax directly from this representation, is thus a research problem of great importance. In this paper we introduce a novel approach to mining patterns in multi-relational data. We propose a new syntax for multi-relational patterns as complete connected sub graphs in a representation of the database as a k-partite graph. We show how this pattern syntax is generally applicable to multirelational data, while it reduces to well-known tiles [7] when the data is a simple binary or attribute-value table. We propose RMiner, an efficient algorithm to mine such patterns, and we introduce a method for quantifying their interestingness when contrasted with prior information of the data miner. Finally, we illustrate the usefulness of our approach by discussing results on real-world and synthetic databases.

【Keywords】: data mining; graph theory; pattern classification; K-partite graph; RMiner; attribute-value table; binary database; connected subgraphs; multirelational data mining patterns; Bipartite graph; Data mining; Entropy; Itemsets; Motion pictures; Syntactics

70. On Generating All Optimal Monotone Classifications.

【Paper Link】【Pages】:685-694

【Authors】: Luite Stegeman ; Ad Feelders

【Abstract】: In many applications of data mining one knows beforehand that the response variable should be monotone (either increasing or decreasing) in the attributes. In ordinal classification, changing the class labels of a data set (relabeling) so that the data becomes monotone, is useful for at least two reasons. Firstly, models trained on relabeled data tend to have better predictive performance than models trained on the original data. Secondly, relabeling is an important building block for the construction of monotone classifiers. However, optimal monotone relabelings are rarely unique, and so far an efficient algorithm to generate them all has been lacking. The main result of this paper is an efficient algorithm to produce the structure of all optimal monotone relabelings. We also show that counting the solutions is #P-complete and give algorithms for efficiently enumerating all solutions, as well as sampling uniformly from the set of solutions. Experiments show that relabeling non-monotone data can improve the predictive performance of models trained on that data.

【Keywords】: data mining; pattern classification; data mining; data set; optimal monotone classifications; optimal monotone relabelings; ordinal classification; Bismuth; Data mining; Data models; Partitioning algorithms; Prediction algorithms; Predictive models; Vectors; isotonic regression; monotone classification

71. Recursive Multi-step Time Series Forecasting by Perturbing Data.

【Paper Link】【Pages】:695-704

【Authors】: Souhaib Ben Taieb ; Gianluca Bontempi

【Abstract】: The Recursive strategy is the oldest and most intuitive strategy to forecast a time series multiple steps ahead. At the same time, it is well-known that this strategy suffers from the accumulation of errors as long as the forecasting horizon increases. We propose a variant of the Recursive strategy, called RECNOISY, which perturbs the initial dataset at each step of the forecasting process in order to i) handle more properly the estimated values at each forecasting step and ii) decrease the accumulation of errors induced by the Recursive strategy. In addition to the RECNOISY strategy, we propose another strategy, called HYBRID, which for each horizon selects the most accurate approach among the REC and the RECNOISY strategies according to the estimated accuracy. In order to assess the effectiveness of the proposed strategies, we carry out an experimental session based on the 111 times series of the NN5 forecasting competition. Accuracy results are presented together with a paired comparison over the horizons and the time series. The preliminary results show that our proposed approaches are promising in terms of forecasting performance.

【Keywords】: data handling; forecasting theory; recursive estimation; time series; HYBRID; NN5 forecasting competition; RECNOISY; data perturbation; recursive multistep time series forecasting; recursive strategy; Accuracy; Data mining; Forecasting; Machine learning; Predictive models; Time series analysis; Training; Forecasting strategies; Machine Learning; Multi-step forecasting; NN5 forecasting competition; Recursive forecasting; Time series

72. Finding Robust Itemsets under Subsampling.

【Paper Link】【Pages】:705-714

【Authors】: Nikolaj Tatti ; Fabian Moerchen

【Abstract】: Mining frequent patterns is plagued by the problem of pattern explosion making pattern reduction techniques a key challenge in pattern mining. In this paper we propose a novel theoretical framework for pattern reduction. We do this by measuring the robustness of a property of an item set such as closed ness or non-derivability. The robustness of a property is the probability that this property holds on random subsets of the original data. We study four properties: closed, free, non-derivable and totally shattered item sets, demonstrating how we can compute the robustness analytically without actually sampling the data. Our concept of robustness has many advantages: Unlike statistical approaches for reducing patterns, we do not assume a null hypothesis or any noise model and the patterns reported are simply a subset of all patterns with this property as opposed to approximate patterns for which the property does not really hold. If the underlying property is monotonic, then the measure is also monotonic, allowing us to efficiently mine robust item sets. We further derive a parameter-free technique for ranking item sets that can be used for top-k approaches. Our experiments demonstrate that we can successfully use the robustness measure to reduce the number of patterns and that ranking yields interesting itemsets.

【Keywords】: data mining; frequent pattern mining; noise model; null hypothesis; parameter free technique; pattern explosion; pattern reduction techniques; robust itemsets; subsampling; top-k approaches; Data mining; Itemsets; Polynomials; Random variables; Robustness; TV; Vectors; closed itemsets; free itemsets; non-derivable itemsets; pattern reduction; robust itemsets; totally shattered itemsets

73. Density Estimation Based on Mass.

【Paper Link】【Pages】:715-724

【Authors】: Kai Ming Ting ; Takashi Washio ; Jonathan R. Wells ; Fei Tony Liu

【Abstract】: Density estimation is the ubiquitous base modelling mechanism employed for many tasks such as clustering, classification, anomaly detection and information retrieval. Commonly used density estimation methods such as kernel density estimator and k-nearest neighbour density estimator have high time and space complexities which render them inapplicable in problems with large data size and even a moderate number of dimensions. This weakness sets the fundamental limit in existing algorithms for all these tasks. We propose the first density estimation method which stretches this fundamental limit to an extent that dealing with millions of data can now be done easily and quickly. We analyze the error of the new estimation (from the true density) using a bias-variance analysis. We then perform an empirical evaluation of the proposed method by replacing existing density estimators with the new one in two current density-based algorithms, namely, DBSCAN and LOF. The results show that the new density estimation method significantly improves the runtime of DBSCAN and LOF, while maintaining or improving their task-specific performances in clustering and anomaly detection, respectively. The new method empowers these algorithms, currently limited to small data size only, to process very large databases - setting a new benchmark for what density-based algorithms can achieve.

【Keywords】: error analysis; pattern clustering; ubiquitous computing; DBSCAN; LOF; anomaly detection; bias variance analysis; data size; density based algorithm; density estimation method; empirical evaluation; error analysis; task specific performance; ubiquitous base modelling mechanism; Approximation methods; Clustering algorithms; Complexity theory; Databases; Estimation; Kernel; Noise; density estimation; density-based algorithms

74. Diverse Dimension Decomposition of an Itemset Space.

【Paper Link】【Pages】:725-734

【Authors】: Mikalai Tsytsarau ; Francesco Bonchi ; Aristides Gionis ; Themis Palpanas

【Abstract】: We introduce the problem of diverse dimension decomposition in transactional databases. A dimension is a set of mutually-exclusive item sets, and our problem is to find a decomposition of the item set space into dimensions, which are orthogonal to each other, and that provide high coverage of the input database. The mining framework we propose effectively represents a dimensionality-reducing transformation from the space of all items to the space of orthogonal dimensions. Our approach relies on information-theoretic concepts, and we are able to formulate the dimension-finding problem with a single objective function that simultaneously captures constraints on coverage, exclusivity and orthogonality. We describe an efficient greedy method for finding diverse dimensions from transactional databases. The experimental evaluation of the proposed approach using two real datasets, flickr and delicious, demonstrates the effectiveness of our solution. Although we are motivated by the applications in the collaborative tagging domain, we believe that the mining task we introduce in this paper is general enough to be useful in other application domains.

【Keywords】: database management systems; transaction processing; collaborative tagging; diverse dimension decomposition; information theoretic concepts; itemset space; orthogonal dimensions; transactional databases; Art; Data mining; Entropy; Itemsets; Joints; Mutual information; dimensionality reduction; itemset mining

75. Conditional Anomaly Detection with Soft Harmonic Functions.

【Paper Link】【Pages】:735-743

【Authors】: Michal Valko ; Branislav Kveton ; Hamed Valizadegan ; Gregory F. Cooper ; Milos Hauskrecht

【Abstract】: In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response or a class label. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support. We demonstrate the efficacy of the proposed method on several synthetic and UCI ML datasets in detecting unusual labels when compared to several baseline approaches. We also evaluate the performance of our method on a real-world electronic health record dataset where we seek to identify unusual patient-management decisions.

【Keywords】: data handling; graph theory; UCI ML datasets; conditional anomaly detection; data instances; distribution support; graph theory; patient management decisions; soft harmonic functions; soft harmonic solution; Design automation; Educational institutions; Electronic mail; Harmonic analysis; Laplace equations; Manifolds; Solid modeling; backbone graph; conditional anomaly detection; graph methods; harmonic solution; health care informatics; outlier and anomaly detection; random walks

76. Random Forest Based Feature Induction.

【Paper Link】【Pages】:744-753

【Authors】: Celine Vens ; Fabrizio Costa

【Abstract】: We propose a simple yet effective strategy to induce a task dependent feature representation using ensembles of random decision trees. The new feature mapping is efficient in space and time, and provides a metric transformation that is non parametric and not implicit in nature (i.e. not expressed via a kernel matrix), nor limited to the transductive setup. The main advantage of the proposed mapping lies in its flexibility to adapt to several types of learning tasks ranging from regression to multi-label classification, and to deal in a natural way with missing values. Finally, we provide an extensive empirical study of the properties of the learned feature representation over real and artificial datasets.

【Keywords】: data mining; decision trees; feature extraction; matrix algebra; pattern classification; random processes; regression analysis; feature mapping; feature representation; metric transformation; multilabel classification; random decision trees; random forest based feature induction; regression analysis; task dependent feature representation; Complexity theory; Decision trees; Encoding; Kernel; Training; Vectors; Vegetation; feature induction; random forests

77. Class Imbalance, Redux.

【Paper Link】【Pages】:754-763

【Authors】: Byron C. Wallace ; Kevin Small ; Carla E. Brodley ; Thomas A. Trikalinos

【Abstract】: Class imbalance (i.e., scenarios in which classes are unequally represented in the training data) occurs in many real-world learning tasks. Yet despite its practical importance, there is no established theory of class imbalance, and existing methods for handling it are therefore not well motivated. In this work, we approach the problem of imbalance from a probabilistic perspective, and from this vantage identify dataset characteristics (such as dimensionality, sparsity, etc.) that exacerbate the problem. Motivated by this theory, we advocate the approach of bagging an ensemble of classifiers induced over balanced bootstrap training samples, arguing that this strategy will often succeed where others fail. Thus in addition to providing a theoretical understanding of class imbalance, corroborated by our experiments on both simulated and real datasets, we provide practical guidance for the data mining practitioner working with imbalanced data.

【Keywords】: data mining; pattern classification; probability; statistical analysis; balanced bootstrap training sample; class imbalance handling; classifier ensemble; data mining; dataset characteristics; probability; training data; Bagging; Equations; Mathematical model; Measurement; Particle separators; Predictive models; Training; Classification; class imbalance

78. Combining Feature Context and Spatial Context for Image Pattern Discovery.

【Paper Link】【Pages】:764-773

【Authors】: Hongxing Wang ; Junsong Yuan ; Yap-Peng Tan

【Abstract】: Once an image is decomposed into a number of visual primitives, e.g., local interest points or salient image regions, it is of great interests to discover meaningful visual patterns from them. Conventional clustering (e.g., k-means) of visual primitives, however, usually ignores the spatial dependency among them, thus cannot discover the high-level visual patterns of complex spatial structure. To overcome this problem, we propose to consider both spatial and feature contexts among visual primitives for pattern discovery. By discovering both spatial co-occurrence patterns among visual primitives and feature co-occurrence patterns among different types of features, our method can better handle the ambiguities of visual primitives, by leveraging these co-occurrences. We formulate the problem as a regularized k-means clustering, and propose an iterative bottom-up/top-down self-learning procedure to gradually refine the result until it converges. The experiments of image text on discovery and image region clustering convince that combining spatial and feature contexts can significantly improve the pattern discovery results.

【Keywords】: feature extraction; learning (artificial intelligence); pattern clustering; feature co-occurrence patterns; feature context; image pattern discovery; image region clustering; iterative bottom-up self-learning procedure; iterative top-down self-learning procedure; local interest points; regularized k-means clustering; salient image regions; spatial co-occurrence patterns; spatial context; spatial dependency; visual primitives; Clustering algorithms; Context; Shape; Spatial resolution; TV; Uncertainty; Visualization; clustering; feature context; image pattern discovery; spatial context

79. Nonnegative Matrix Tri-factorization Based High-Order Co-clustering and Its Fast Implementation.

【Paper Link】【Pages】:774-783

【Authors】: Hua Wang ; Feiping Nie ; Heng Huang ; Chris H. Q. Ding

【Abstract】: The fast growth of Internet and modern technologies has brought data involving objects of multiple types that are related to each other, called as Multi-Type Relational data. Traditional clustering methods for single-type data rarely work well on them, which calls for new clustering techniques, called as high-order co-clustering (HOCC), to deal with the multiple types of data at the same time. A major challenge in developing HOCC methods is how to effectively make use of all available information contained in a multi-type relational data set, including both inter-type and intra-type relationships. Meanwhile, because many real world data sets are often of large sizes, clustering methods with computationally efficient solution algorithms are of great practical interest. In this paper, we first present a general HOCC framework, named as Orthogonal Nonnegative Matrix Tri-factorization (O-NMTF), for simultaneous clustering of multi-type relational data. The proposed O-NMTF approach employs Nonnegative Matrix Tri-Factorization (NMTF) to simultaneously cluster different types of data using the inter-type relationships, and incorporate intra-type information through manifold regularization, where, different from existing works, we emphasize the importance of the orthogonal ties of the factor matrices of NMTF. Based on O-NMTF, we further develop a novel Fast Nonnegative Matrix Tri-Factorization (F-NMTF) approach to deal with large-scale data. Instead of constraining the factor matrices of NMTF to be nonnegative as in existing methods, F-NMTF constrains them to be cluster indicator matrices, a special type of nonnegative matrices. As a result, the optimization problem of the proposed method can be decoupled, which results in sub problems of much smaller sizes requiring much less matrix multiplications, such that our new algorithm scales well to real world data of large sizes. Extensive experimental evaluations have demonstrated the effectiveness of our new approaches.

【Keywords】: Internet; matrix decomposition; matrix multiplication; pattern clustering; Internet; fast nonnegative matrix trifactorization approach; high order coclustering; intertype relationships; intratype relationships; manifold regularization; matrix multiplications; multitype relational data; orthogonal nonnegative matrix trifactorization; single type data; Clustering algorithms; Convergence; Linear matrix inequalities; Manifolds; Optimization; Symmetric matrices; Web pages; Cluster Indicator Matrix; High-Order Co-Clustering; Multi-Type Relational Data; Nonnegative Matrix Tri-Factorization

【Paper Link】【Pages】:784-793

【Authors】: Liaoruo Wang ; Tiancheng Lou ; Jie Tang ; John E. Hopcroft

【Abstract】: In many social networks, there exist two types of users that exhibit different influence and different behavior. For instance, statistics have shown that less than 1% of the Twitter users (e.g. entertainers, politicians, writers) produce 50% of its content, while the others (e.g. fans, followers, readers) have much less influence and completely different social behavior. In this paper, we define and explore a novel problem called community kernel detection in order to uncover the hidden community structure in large social networks. We discover that influential users pay closer attention to those who are more similar to them, which leads to a natural partition into different community kernels. We propose Greedy and We BA, two efficient algorithms for finding community kernels in large social networks. Greedy is based on maximum cardinality search, while We BA formalizes the problem in an optimization framework. We conduct experiments on three large social networks: Twitter, Wikipedia, and Coauthor, which show that We BA achieves an average 15%-50% performance improvement over the other state-of-the-art algorithms, and We BA is on average 6-2,000 times faster in detecting community kernels.

【Keywords】: Internet; greedy algorithms; social networking (online); Coauthor; Greedy algorithm; Twitter users; Wikipedia; community kernel detection; hidden community structure; social behavior; social networks; Communities; Electronic publishing; Encyclopedias; Internet; Kernel; Twitter; auxiliary communities; community kernel detection; community kernels; social networks

81. ADANA: Active Name Disambiguation.

【Paper Link】【Pages】:794-803

【Authors】: Xuezhi Wang ; Jie Tang ; Hong Cheng ; Philip S. Yu

【Abstract】: Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search a person name in these systems, many documents (e.g., papers, web pages) containing that person's name may be returned. It is hard to determine which documents are about the person we care about. Although much research has been conducted, the problem remains largely unsolved, especially with the rapid growth of the people information available on the Web. In this paper, we try to study this problem from a new perspective and propose an ADANA method for disambiguating person names via active user interactions. In ADANA, we first introduce a pairwise factor graph (PFG) model for person name disambiguation. The model is flexible and can be easily extended by incorporating various features. Based on the PFG model, we propose an active name disambiguation algorithm, aiming to improve the disambiguation performance by maximizing the utility of the user's correction. Experimental results on three different genres of data sets show that with only a few user corrections, the error rate of name disambiguation can be reduced to 3.1%. A real system has been developed based on the proposed method and is available online.

【Keywords】: Internet; graph theory; information retrieval; user interfaces; ADANA method; PFG model; active name disambiguation algorithm; active person name disambiguation; active user interactions; name ambiguity; pairwise factor graph model; user correction; utility maximizing; Accuracy; Approximation algorithms; Clustering algorithms; Correlation; Educational institutions; Inference algorithms; Web pages; Active Learning; Digital Library; Name Disambiguation; Social Network Analysis

82. Document Clustering via Matrix Representation.

【Paper Link】【Pages】:804-813

【Authors】: Xufei Wang ; Jiliang Tang ; Huan Liu

【Abstract】: Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and easy to deal computationally, but it also oversimplifies a document into a vector, susceptible to noise, and cannot explicitly represent underlying topics of a document. A matrix representation of document is proposed in this paper: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics which can be mapped to clustering structures. The latent topic extraction based on the matrix representation of documents is formulated as a constraint optimization problem in which each matrix (i.e., a document) A_i is factorized into a common base determined by non-negative matrices L and R^T, and a non-negative weight matrix M_i such that the sum of reconstruction error on all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with vector representation, using matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement up to 66% on the studied datasets, and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements the state-of-the-art methods like LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at a segment level instead of at a document level, which enables the return of more relevant documents in information retrieval tasks.

【Keywords】: Web sites; constraint handling; document handling; information retrieval; matrix algebra; optimisation; pattern clustering; probability; LDA; NMF; PLSI; Web pages; clustering structures; cohesive segments; constraint optimization problem; document clustering; information retrieval tasks; k-means; matrix model; matrix representation; probability distribution; vector space model; Approximation methods; Bismuth; Clustering algorithms; Data mining; Matrix decomposition; Probability distribution; Vectors; Document Clustering; Document Representation; Matrix Representation; Non-Negative Matrix Approximation

83. Using Bayesian Network Learning Algorithm to Discover Causal Relations in Multivariate Time Series.

【Paper Link】【Pages】:814-823

【Authors】: Zhenxing Wang ; Laiwan Chan

【Abstract】: Many applications naturally involve time series data, and the vector auto regression (VAR) and the structural VAR (SVAR) are dominant tools to investigate relations between variables in time series. In the first part of this work, we show that the SVAR method is incapable of identifying contemporaneous causal relations when data follow Gaussian distributions. In addition, least squares estimators become unreliable when the scales of the problems are large and observations are limited. In the remaining part, we propose an approach to apply Bayesian network learning algorithms to identify SVARs from time series data in order to capture both temporal and contemporaneous causal relations and avoid high-order statistical tests. The difficulty of applying Bayesian network learning algorithms to time series is that the sizes of the networks corresponding to time series tend to be large and high-order statistical tests are required by Bayesian network learning algorithms in this case. To overcome the difficulty, we show that the search space of conditioning sets d-separating two vertices should be subsets of Markov blankets. Based on this fact, we propose an algorithm learning Bayesian networks locally and making the largest order of statistical tests independent of the scales of the problems. Empirical results show that our algorithm outperforms existing methods in terms of both efficiency and accuracy.

【Keywords】: Gaussian distribution; Markov processes; belief networks; data mining; learning (artificial intelligence); regression analysis; time series; Bayesian network learning algorithm; Gaussian distributions; Markov blankets; causal relation discovery; high-order statistical tests; least squares estimators; multivariate time series; search space; structural VAR; vector autoregression; Bayesian methods; Covariance matrix; Equations; Markov processes; Mathematical model; Reactive power; Time series analysis; Bayesian networks; Causality; SVAR; VAR

84. Efficient Mining of a Concise and Lossless Representation of High Utility Itemsets.

【Paper Link】【Pages】:824-833

【Authors】: Cheng-Wei Wu ; Philippe Fournier-Viger ; Philip S. Yu ; Vincent S. Tseng

【Abstract】: Mining high utility item sets from transactional databases is an important data mining task, which refers to the discovery of item sets with high utilities (e.g. high profits). Although several studies have been carried out, current methods may present too many high utility item sets for users, which degrades the performance of the mining task in terms of execution and memory efficiency. To achieve high efficiency for the mining task and provide a concise mining result to users, we propose a novel framework in this paper for mining closed+ high utility item sets, which serves as a compact and loss less representation of high utility item sets. We present an efficient algorithm called CHUD (Closed+ High Utility item set Discovery) for mining closed+ high utility item sets. Further, a method called DAHU (Derive All High Utility item sets) is proposed to recover all high utility item sets from the set of closed+ high utility item sets without accessing the original database. Results of experiments on real and synthetic datasets show that CHUD and DAHU are very efficient with a massive reduction (up to 800 times in our experiments) in the number of high utility item sets. In addition, when all high utility item sets are recovered by DAHU, the approach combining CHUD and DAHU also outperforms the state-of-the-art algorithms in mining high utility item sets.

【Keywords】: data mining; database management systems; transaction processing; CHUD; DAHU; closed+ high utility item set discovery; closed+ high utility item set mining; data mining task; derive all high utility item sets; high utility itemsets concise representation mining; high utility itemsets lossless representation mining; transactional databases; Algorithm design and analysis; Arrays; Data mining; Educational institutions; Itemsets; Memory management; closed+ high utility itemset; frequent itemset; lossless and concise representation; utility mining

85. Understanding Propagation Error and Its Effect on Collective Classification.

【Paper Link】【Pages】:834-843

【Authors】: Rongjing Xiang ; Jennifer Neville

【Abstract】: Recent empirical evaluation has shown that the performance of collective classification models can vary based on the amount of class label information available for use during inference. In this paper, we further demonstrate that the relative performance of statistical relational models learned with different estimation methods changes as the availability of test set labels increases. We reason about the cause of this phenomenon from an information-theoretic perspective and this points to a previously unidentified consideration in the development of relational learning algorithms. In particular, we characterize the high propagation error of collective inference models that are estimated with maximum pseudolikelihood estimation (MPLE), and show how this affects performance across the spectrum of label availability when compared to MLE, which has low propagation error. Our formal study leads to a quantitative characterization that can be used to predict the confidence of local propagation for MPLE models. We use this to propose a mixture model that can learn the best trade-off between high and low propagation models. Empirical evaluation on synthetic and real-world data show that our proposed method achieves comparable, or superior, results to both MPLE and low propagation models across the full spectrum of label availability.

【Keywords】: inference mechanisms; information theory; learning (artificial intelligence); maximum likelihood estimation; pattern classification; class label information; collective classification models; empirical evaluation; inference; information theoretic perspective; label availability; maximum pseudolikelihood estimation; mixture model; propagation error; relational learning algorithms; test set labels; Availability; Computational modeling; Data models; Maximum likelihood estimation; Microscopy; Probabilistic logic; Training; collective classification; probabilistic relational models; statistical relational learning

86. Direct Robust Matrix Factorizatoin for Anomaly Detection.

【Paper Link】【Pages】:844-853

【Authors】: Liang Xiong ; Xi Chen ; Jeff G. Schneider

【Abstract】: Matrix factorization methods are extremely useful in many data mining tasks, yet their performances are often degraded by outliers. In this paper, we propose a novel robust matrix factorization algorithm that is insensitive to outliers. We directly formulate robust factorization as a matrix approximation problem with constraints on the rank of the matrix and the cardinality of the outlier set. Then, unlike existing methods that resort to convex relaxations, we solve this problem directly and efficiently. In addition, structural knowledge about the outliers can be incorporated to find outliers more effectively. We applied this method in anomaly detection tasks on various data sets. Empirical results show that this new algorithm is effective in robust modeling and anomaly detection, and our direct solution achieves superior performance over the state-of-the-art methods based on the L1-norm and the nuclear norm of matrices.

【Keywords】: convex programming; data mining; matrix algebra; anomaly detection; convex relaxation; data mining; direct robust matrix factorizatoin; matrix approximation; structural knowledge; Approximation algorithms; Approximation methods; Estimation; Matrix decomposition; Measurement uncertainty; Robustness; Vectors; anomaly detection; matrix factorization; robust

87. A New Markov Model for Clustering Categorical Sequences.

【Paper Link】【Pages】:854-863

【Authors】: Tengke Xiong ; Shengrui Wang ; Qingshan Jiang ; Joshua Zhexue Huang

【Abstract】: Clustering categorical sequences remains an open and challenging task due to the lack of an inherently meaningful measure of pair wise similarity between sequences. Model initialization is an unsolved problem in model-based clustering algorithms for categorical sequences. In this paper, we propose a simple and effective Markov model to approximate the conditional probability distribution (CPD) model, and use it to design a novel two-tier Markov model to represent a sequence cluster. Furthermore, we design a novel divisive hierarchical algorithm for clustering categorical sequences based on the two-tier Markov model. The experimental results on the data sets from three different domains demonstrate the promising performance of our models and clustering algorithm.

【Keywords】: Markov processes; pattern clustering; sequences; statistical distributions; categorical sequence clustering; conditional probability distribution model; divisive hierarchical algorithm; model based clustering algorithm; pairwise similarity; sequence cluster; two-tier Markov model; Algorithm design and analysis; Clustering algorithms; Data models; Hidden Markov models; Markov processes; Numerical models; Vectors; Markov model; categorical sequence; clustering

88. BibClus: A Clustering Algorithm of Bibliographic Networks by Message Passing on Center Linkage Structure.

【Paper Link】【Pages】:864-873

【Authors】: Xiaoran Xu ; Zhi-Hong Deng

【Abstract】: Multi-type objects with multi-type relations are ubiquitous in real-world networks, e.g. bibliographic networks. Such networks are also called heterogeneous information networks. However, the research on clustering for heterogeneous information networks is little. A new algorithm, called NetClus, has been proposed in recent two years. Although NetClus is applied on a heterogeneous information network with a star network schema, considering the relations between center objects and all attribute objects linking to them, it ignores the relations between center objects such as citation relations, which also contain rich information. Hence, we think the star network schema cannot be used to characterize all possible relations without integrating the linkage structure among center objects, which we call the Center Linkage Structure, and there has been no practical way good enough to solve it. In this paper, we present a novel algorithm, BibClus, for clustering heterogeneous objects with center linkage structure by taking a bibliographic information network as an example. In BibClus, we build a probabilistic model of pair wise hidden Markov random field (P-HMRF) to characterize the center linkage structure, and convert it to a factor graph. We further combine EM algorithm with factor graph theory, and design an efficient way based on message passing algorithm to inference marginal probabilities and estimate parameters at each iteration of EM. We also study how factor functions affect clustering performance with different function forms and constraints. For evaluating our proposed method, we have conducted thorough experiments on a real dataset that we had crawled from ACM Digital Library. The experimental results show that BibClus is effective and has a much higher quantity than the recently proposed algorithm, NetClus, in both recall and precision.

【Keywords】: Markov processes; bibliographic systems; digital libraries; graph theory; inference mechanisms; iterative methods; message passing; pattern clustering; ACM digital library; BibClus; EM iteration; NetClus; bibliographic networks; center linkage structure; clustering algorithm; factor graph theory; heterogeneous information networks; inference marginal probabilities; message passing algorithm; multitype objects; multitype relations; pairwise hidden Markov random field; star network schema; Approximation algorithms; Clustering algorithms; Couplings; Hidden Markov models; Joints; Message passing; Probability; clustering; factor graph; heterogeneous information network; message passing

89. Multi-instance Metric Learning.

【Paper Link】【Pages】:874-883

【Authors】: Ye Xu ; Wei Ping ; Andrew T. Campbell

【Abstract】: Multi-instance learning, like other machine learning and data mining tasks, requires distance metrics. Although metric learning methods have been studied for many years, metric learners for multi-instance learning remain almost untouched. In this paper, we propose a framework called Multi-Instance MEtric Learning (MIMEL) to learn an appropriate distance under the multi-instance setting. The distance metric between two bags is defined using the Mahalanobis distance function. The problem is formulated by minimizing the KL divergence between two multivariate Gaussians under the constraints of maximizing the between-class bag distance and minimizing the within-class bag distance. To exploit the mechanism of how instances determine bag labels in multi-instance learning, we design a nonparametric density-estimation-based weighting scheme to assign higher "weights" to the instances that are more likely to be positive in positive bags. The weighting scheme itself has a small workload, which adds little extra computing costs to the proposed framework. Moreover, to further boost the classification accuracy, a kernel version of MIMEL is presented. We evaluate MIMEL, using not only several typical multi-instance tasks, but also two activity recognition datasets. The experimental results demonstrate that MIMEL achieves better classification accuracy than many state-of-the-art distance based algorithms or kernel methods for multi-instance learning.

【Keywords】: Gaussian processes; data mining; learning (artificial intelligence); KL divergence minimization; MIMEL; Mahalanobis distance function; between-class bag distance maximization; classification accuracy; data mining task; distance metrics; machine learning task; multiinstance metric learning; multivariate Gaussians; nonparametric density-estimation-based weighting scheme; within-class bag distance minimization; Data mining; Kernel; Learning systems; Machine learning; Mathematical model; Measurement; Training; Metric learning; Multi-instance learning; Weighting scheme

90. Multi-task Learning with Task Relations.

【Paper Link】【Pages】:884-893

【Authors】: Zhao Xu ; Kristian Kersting

【Abstract】: Multi-task and relational learning with Gaussian processes are two active but also orthogonal areas of research. So far, there has been few attempt at exploring relational information within multi-task Gaussian processes. While existing relational Gaussian process methods have focused on relations among entities and in turn could be employed within an individual task, we develop a class of Gaussian process models which incorporates relational information across multiple tasks. As we will show, inference and learning within the resulting class of models, called relational multi-task Gaussian processes, can be realized via a variational EM algorithm. Experimental results on synthetic and real-world datasets verify the usefulness of this approach: The observed relational knowledge at the level of tasks can indeed reveal additional pair wise correlations between tasks of interest and, in turn, improve prediction performance.

【Keywords】: Gaussian processes; learning (artificial intelligence); Gaussian process; knowledge relation; multitask learning; relational information; task relations; Covariance matrix; Equations; Gaussian processes; Kernel; Mathematical model; Probabilistic logic; Vectors; Link-based Analysis; Multi-task Learning; Nonparametric Bayesian Models; Relational Learning

91. Secure Clustering in Private Networks.

【Paper Link】【Pages】:894-903

【Authors】: Bin Yang ; Issei Sato ; Hiroshi Nakagawa

【Abstract】: Many clustering methods have been proposed for analyzing the relations inside networks with complex structures. Some of them can detect a mixture of assortative and disassortative structures in networks. All these methods are based on the fact that the entire network is observable. However, in the real world, the entities in networks, for example a social network, may be private, and thus, cannot be observed. We focus on private peer-to-peer networks in which all vertices are independent and private, and each vertex only knows about itself and its neighbors. We propose a privacy-preserving Gibbs sampling for clustering these types of private networks and detecting their mixed structures without revealing any private information about any individual entity. Moreover, the running cost of our method is related only to the number of clusters and the maximum degree, but is nearly independent of the number of vertices in the entire network.

【Keywords】: computer network security; pattern clustering; peer-to-peer computing; sampling methods; assortative network structure; disassortative network structure; peer-to-peer networks; privacy-preserving Gibbs sampling; private networks; secure clustering; vertices; Data mining; Peer to peer computing; Privacy; Protocols; Servers; Social network services; Vectors; Gibbs sampling; clustering; mixed network; peer-to-peer; privacy

92. LPTA: A Probabilistic Model for Latent Periodic Topic Analysis.

【Paper Link】【Pages】:904-913

【Authors】: Zhijun Yin ; Liangliang Cao ; Jiawei Han ; Chengxiang Zhai ; Thomas S. Huang

【Abstract】: This paper studies the problem of latent periodic topic analysis from time stamped documents. The examples of time stamped documents include news articles, sales records, financial reports, TV programs, and more recently, posts from social media websites such as Flickr, Twitter, and Face book. Different from detecting periodic patterns in traditional time series database, we discover the topics of coherent semantics and periodic characteristics where a topic is represented by a distribution of words. We propose a model called LPTA (Latent Periodic Topic Analysis) that exploits the periodicity of the terms as well as term co-occurrences. To show the effectiveness of our model, we collect several representative datasets including Seminar, DBLP and Flickr. The results show that our model can discover the latent periodic topics effectively and leverage the information from both text and time well.

【Keywords】: database management systems; document handling; time series; Face book; Flickr; LPTA; TV programs; Twitter; financial reports; latent periodic topic analysis; probabilistic model; sales records; social media websites; time series database; time stamped documents; Analytical models; Complexity theory; Computational modeling; Databases; Equations; Mathematical model; Seminars; periodic topics; topic modeling

93. Causal Associative Classification.

【Paper Link】【Pages】:914-923

【Authors】: Kui Yu ; Xindong Wu ; Wei Ding ; Hao Wang ; Hongliang Yao

【Abstract】: Associative classifiers have received considerable attention due to their easy to understand models and promising performance. However, with a high dimensional dataset, associative classifiers inevitably face two challenges: (1) how to extract a minimal set of strong predictive rules from an explosive number of generated association rules, and (2) how to deal with the highly sensitive choice of the minimal support threshold. In order to address these two challenges, we introduce causality into associative classification, and propose a new framework of causal associative classification. In this framework, we use causal Bayesian networks to bridge irrelevant and redundant features with irrelevant and redundant rules in associative classification. Without loss of prediction power, the feature space involved with the antecedent of a classification rule is reduced to the space of the direct causes, direct effects, and direct causes of the direct effects, a.k.a. the Markov blanket, of the consequent of the rule in causal Bayesian networks. The proposed framework is instantiated via baseline classifiers using emerging patterns. Experimental results show that our framework significantly reduces the model complexity while outperforming the other state-of-the-art algorithms.

【Keywords】: Markov processes; belief networks; computational complexity; pattern classification; Markov blanket; associative classifiers; baseline classifiers; causal Bayesian networks; causal associative classification; classification rule; feature space; minimal support threshold; model complexity; strong predictive rules; Association rules; Bayesian methods; Cancer; Classification algorithms; Feature extraction; Lungs; Markov processes; associative classification; causal bayesian networks; emerging patterns

94. Multi-task Learning for Bayesian Matrix Factorization.

【Paper Link】【Pages】:924-931

【Authors】: Chao Yuan

【Abstract】: Data sparsity is a big challenge for collaborative filtering. This problem becomes more serious if the dataset is newly created and has even fewer ratings. By sharing knowledge among different datasets, multi-task learning is a promising technique to address this issue. Most prior work methods directly share objects (users or items) across different datasets. However, object identities and correspondences may not be known in many cases. We extend the previous work of Bayesian matrix factorization with Dirichlet process mixture into a multi-task learning approach by sharing latent parameters among different tasks. Our method does not require object identities and thus is more widely applicable. The proposed model is fully non-parametric in that the dimension of latent feature vectors is automatically determined. Inference is performed using the variational Bayesian algorithm, which is much faster than Gibbs sampling used by most other related Bayesian methods.

【Keywords】: belief networks; collaborative filtering; knowledge representation; learning (artificial intelligence); matrix decomposition; sampling methods; Bayesian matrix factorization; Dirichlet process; Gibbs sampling; collaborative filtering; data sparsity; knowledge sharing; latent feature vectors; multitask learning; multitask learning approach; variational Bayesian algorithm; Bayesian methods; Covariance matrix; Matrix decomposition; Motion pictures; Principal component analysis; Training; Vectors; co-clustering; collaborative filtering; matrix factorization

95. Enabling Fast Lazy Learning for Data Streams.

【Paper Link】【Pages】:932-941

【Authors】: Peng Zhang ; Byron J. Gao ; Xingquan Zhu ; Li Guo

【Abstract】: Lazy learning, such as k-nearest neighbor learning, has been widely applied to many applications. Known for well capturing data locality, lazy learning can be advantageous for highly dynamic and complex learning environments such as data streams. Yet its high memory consumption and low prediction efficiency have made it less favorable for stream oriented applications. Specifically, traditional lazy learning stores all the training data and the inductive process is deferred until a query appears, whereas in stream applications, data records flow continuously in large volumes and the prediction of class labels needs to be made in a timely manner. In this paper, we provide a systematic solution that overcomes the memory and efficiency limitations and enables fast lazy learning for concept drifting data streams. In particular, we propose a novel Lazy-tree (Ltree for short) indexing structure that dynamically maintains compact high-level summaries of historical stream records. L-trees are M-Tree [5] like, height-balanced, and can help achieve great memory consumption reduction and sub-linear time complexity for prediction. Moreover, L-trees continuously absorb new stream records and discard outdated ones, so they can naturally adapt to the dynamically changing concepts in data streams for accurate prediction. Extensive experiments on real-world and synthetic data streams demonstrate the performance of our approach.

【Keywords】: computational complexity; data mining; learning (artificial intelligence); pattern classification; tree data structures; Ltree; M-tree; complex learning environments; data locality; data mining community; data stream classification; dynamic learning environments; fast lazy learning; k-nearest neighbor learning; lazy-tree; memory consumption reduction; sublinear time complexity; Complexity theory; Indexing; Learning systems; Memory management; Routing; Training data; Vectors; Spatial indexing; concept drifting; data stream classification; data stream mining; lazy learning

96. Clusterability Analysis and Incremental Sampling for Nyström Extension Based Spectral Clustering.

【Paper Link】【Pages】:942-951

【Authors】: Xianchao Zhang ; Quanzeng You

【Abstract】: To alleviate the memory and computational burdens of spectral clustering for large scale problems, some kind of low-rank matrix approximation is usually employed. Nyström method is an efficient technique to generate low rank matrix approximation and its most important aspect is sampling. The matrix approximation errors of several sampling schemes have been theoretically analyzed for a number of learning tasks. However, the impact of matrix approximation error on the clustering performance of spectral clustering has not been studied. In this paper, we firstly analyze the performance of Nyström method in terms of cluster ability, thus answer the impact of matrix approximation error on the clustering performance of spectral clustering. Our analysis immediately suggests an incremental sampling scheme for the Nyström method based spectral clustering. Experimental results show that the proposed incremental sampling scheme outperforms existing sampling schemes on various clustering tasks and image segmentation applications, and its efficiency is comparable with existing sampling schemes.

【Keywords】: approximation theory; image segmentation; learning (artificial intelligence); matrix algebra; pattern clustering; sampling methods; Nyström extension based spectral clustering performance; clusterability analysis; image segmentation; incremental sampling; incremental sampling scheme; learning task; low rank matrix approximation error; Approximation error; Clustering algorithms; Image segmentation; Matrices; Sampling methods; Vectors; Nyström extension; incremental sampling; spectral clustering

97. Fast and Robust Graph-based Transductive Learning via Minimum Tree Cut.

【Paper Link】【Pages】:952-961

【Authors】: Yan-Ming Zhang ; Kaizhu Huang ; Cheng-Lin Liu

【Abstract】: In this paper, we propose an efficient and robust algorithm for graph-based transductive classification. After approximating a graph with a spanning tree, we develop a linear-time algorithm to label the tree such that the cut size of the tree is minimized. This significantly improves typical graph-based methods, which either have a cubic time complexity (for a dense graph) or O(kn²) (for a sparse graph with k denoting the node degree). Furthermore, our method shows great robustness to the graph construction both theoretically and empirically; this overcomes another big problem of traditional graph-based methods. In addition to its good scalability and robustness, the proposed algorithm demonstrates high accuracy. In particular, on a graph with 400,000 nodes (in which 10,000 nodes are labeled) and 10,455,545 edges, our algorithm achieves the highest accuracy of 99.6% but takes less than 10 seconds to label all the unlabeled data.

【Keywords】: computational complexity; graph theory; learning (artificial intelligence); cubic time complexity; fast graph based transductive learning; graph based transductive classification; graph construction; linear time algorithm; minimum tree cut; robust graph based transductive learning; spanning tree; Algorithm design and analysis; Approximation algorithms; Complexity theory; Labeling; Laplace equations; Prediction algorithms; Robustness; graph mining; graph-based semi-supervised learning; transductive learning

98. Positive and Unlabeled Learning for Graph Classification.

【Paper Link】【Pages】:962-971

【Authors】: Yuchen Zhao ; Xiangnan Kong ; Philip S. Yu

【Abstract】: The problem of graph classification has drawn much attention in the last decade. Conventional approaches on graph classification focus on mining discriminative sub graph features under supervised settings. The feature selection strategies strictly follow the assumption that both positive and negative graphs exist. However, in many real-world applications, the negative graph examples are not available. In this paper we study the problem of how to select useful sub graph features and perform graph classification based upon only positive and unlabeled graphs. This problem is challenging and different from previous works on PU learning, because there are no predefined features in graph data. Moreover, the sub graph enumeration problem is NP-hard. We need to identify a subset of unlabeled graphs that are most likely to be negative graphs. However, the negative graph selection problem and the sub graph feature selection problem are correlated. Before the reliable negative graphs can be resolved, we need to have a set of useful sub graph features. In order to address this problem, we first derive an evaluation criterion to estimate the dependency between sub graph features and class labels based on a set of estimated negative graphs. In order to build accurate models for the PU learning problem on graph data, we propose an integrated approach to concurrently select the discriminative features and the negative graphs in an iterative manner. Experimental results illustrate the effectiveness and efficiency of the proposed method.

【Keywords】: computational complexity; data mining; graph theory; learning (artificial intelligence); pattern classification; NP-hard problem; PU learning; discriminative subgraph features mining; graph classification; negative graph selection problem; positive learning; subgraph enumeration problem; unlabeled graphs; unlabeled learning; Data mining; Data models; Kernel; Optimization; Reliability; Support vector machines; Vectors; feature selection; graph classification; positive and unlabeled data

99. Finding Novel Diagnostic Gene Patterns Based on Interesting Non-redundant Contrast Sequence Rules.

【Paper Link】【Pages】:972-981

【Authors】: Yuhai Zhao ; Guoren Wang ; Yuan Li ; Zhanghui Wang

【Abstract】: Diagnostic genes refer to the genes closely related to a specific disease phenotype, the powers of which to distinguish between different classes are often high. Most methods to discovering the powerful diagnostic genes are either singleton discriminability-based or combination discriminability-based. However, both ignore the abundant interactions among genes, which widely exist in the real world. In this paper, we tackle the problem from a new point of view and make the following contributions: (1) we propose an EWave model, which profitably exploits the ordered expressions among genes based on the defined equivalent dimension group sequences taking into account the "noise" universal in the real data, (2) we devise a novel sequence rule, namely interesting non-redundant contrast sequence rule, which is able to capture the difference between different phenotypes in a high accuracy using as few as possible genes, (3) we present an efficient algorithm called NRMINER to find such rules. Unlike the conventional column enumeration and the more recent row enumeration, it performs a novel template-driven enumeration by making use of the special characteristic of micro array data modeled by EWave. Extensive experiments conducted on various synthetic and real datasets show that: (1) NRMINER is significantly faster than the competing algorithm by up to about one order of magnitude, (2) it provides a higher accuracy using fewer genes. Many diagnostic genes discovered by NRMINER are proved biologically related to some disease.

【Keywords】: biology; pattern recognition; EWave model; NRMINER; column enumeration; disease phenotype; finding novel diagnostic gene patterns; interesting nonredundant contrast sequence rules; micro array data; powerful diagnostic genes; Accuracy; Data models; Diseases; Gene expression; Generators; Noise; data mining; diagnostic gene; sequence rule

100. Semi-supervised Hierarchical Clustering.

【Paper Link】【Pages】:982-991

【Authors】: Li Zheng ; Tao Li

【Abstract】: Semi-supervised clustering (i.e., clustering with knowledge-based constraints) has emerged as an important variant of the traditional clustering paradigms. However, most existing semi-supervised clustering algorithms are designed for partitional clustering methods and few research efforts have been reported on semi-supervised hierarchical clustering methods. In addition, current semi-supervised clustering methods have been focused on the use of background information in the form of instance level must-link and cannot-link constraints, which are not suitable for hierarchical clustering where data objects are linked over different hierarchy levels. In this paper, we propose a novel semi-supervised hierarchical clustering framework based on ultra-metric dendrogram distance. The proposed framework is able to incorporate triple-wise relative constraints. We establish the connection between hierarchical clustering and ultra-metric transformation of dissimilarity matrix and propose two techniques (the constrained optimization technique and the transitive dissimilarity based technique) for semi-supervised hierarchical clustering. Experimental results demonstrate the effectiveness and the efficiency of our proposed methods.

【Keywords】: constraint handling; matrix algebra; pattern clustering; cannot link constraints; dissimilarity matrix; must link constraints; partitional clustering methods; semisupervised hierarchical clustering; triple wise relative constraints; ultra metric transformation; Clustering algorithms; Clustering methods; Educational institutions; Measurement; Optimization; USA Councils; Vectors; Hierarchical clustering; semi-supervised clustering; triple-wise relative constraints

101. Handling Conditional Discrimination.

【Paper Link】【Pages】:992-1001

【Authors】: Indre Zliobaite ; Faisal Kamiran ; Toon Calders

【Abstract】: Historical data used for supervised learning may contain discrimination. We study how to train classifiers on such data, so that they are discrimination free with respect to a given sensitive attribute, e.g., gender. Existing techniques that deal with this problem aim at removing all discrimination and do not take into account that part of the discrimination may be explainable by other attributes, such as, e.g., education level. In this context, we introduce and analyze the issue of conditional non-discrimination in classifier design. We show that some of the differences in decisions across the sensitive groups can be explainable and hence tolerable. We observe that in such cases, the existing discrimination aware techniques will introduce a reverse discrimination, which is undesirable as well. Therefore, we develop local techniques for handling conditional discrimination when one of the attributes is considered to be explanatory. Experimental evaluation demonstrates that the new local techniques remove exactly the bad discrimination, allowing differences in decisions as long as they are explainable.

【Keywords】: data handling; learning (artificial intelligence); education level; handling conditional discrimination; historical data; supervised learning; Computer science; Correlation; Data mining; Data models; Decision making; Educational institutions; Remuneration; classification; discrimination; independence

Short Papers 47

102. On the Hardness of Graph Anonymization.

【Paper Link】【Pages】:1002-1007

【Authors】: Charu C. Aggarwal ; Yao Li ; Philip S. Yu

【Abstract】: In this paper, we examine the problem of node re-identification from anonymized graphs. Typical graphs encountered in real applications are massive and sparse. In this paper, we will show that massive and sparse graphs have certain theoretical properties which make them susceptible to re-identification attacks. We design a systematic way to exploit these theoretical properties in order to construct re- identification signatures, which are also known as characteristic vectors. These signatures have the property that they are extremely robust to perturbations, especially for massive and sparse graphs. Our results show that even low levels of anonymization require perturbation levels which are significant enough to result in a massive loss of utility. Our experimental results also show that the true anonymization level of graphs is much lower than is implied by measures such as k-anonymity. Thus, the results of this paper establish that the problem of massive graph anonymization has fundamental theoretical barriers which prevent a fully effective solution.

【Keywords】: data analysis; data mining; digital signatures; graph theory; social networking (online); characteristic vectors; data analysis; data mining; graph anonymization; massive graphs; node re-identification; re-identification attacks; re-identification signatures; social network; sparse graphs; Approximation methods; Couplings; Privacy; Random variables; Robustness; Social network services; Vectors; anonymization; graphs; privacy

103. SPO: Structure Preserving Oversampling for Imbalanced Time Series Classification.

【Paper Link】【Pages】:1008-1013

【Authors】: Hong Cao ; Xiaoli Li ; David Yew-Kwong Woon ; See-Kiong Ng

【Abstract】: This paper presents a novel structure preserving over sampling (SPO) technique for classifying imbalanced time series data. SPO generates synthetic minority samples based on multivariate Gaussian distribution by estimating the covariance structure of the minority class and regularizing the unreliable eigen spectrum. By preserving the main covariance structure and intelligently creating protective variances in the trivial eigen feature dimensions, the synthetic samples expand effectively into the void area in the data space without being too closely tied with existing minority-class samples. Extensive experiments based on several public time series datasets demonstrate that our proposed SPO in conjunction with support vector machines can achieve better performances than existing over sampling methods and state-of-the-art methods in time series classification.

【Keywords】: Gaussian distribution; support vector machines; time series; covariance structure; eigen feature dimensions; multivariate Gaussian distribution; structure preserving oversampling; support vector machines; synthetic minority samples; time series classification; time series datasets; Classification algorithms; Eigenvalues and eigenfunctions; Reliability; Support vector machines; Time series analysis; Training; Vectors; Oversampling; SVM; eigen regularization; imbalance; learning; structure preserving; time series

104. Manifold Learning and Missing Data Recovery through Unsupervised Regression.

【Paper Link】【Pages】:1014-1019

【Authors】: Miguel Á. Carreira-Perpiñán ; Zhengdong Lu

【Abstract】: We propose an algorithm that, given a high-dimensional dataset with missing values, achieves the distinct goals of learning a nonlinear low-dimensional representation of the data (the dimensionality reduction problem) and reconstructing the missing high-dimensional data (the matrix completion, or imputation, problem). The algorithm follows the Dimensionality Reduction by Unsupervised Regression approach, where one alternately optimizes over the latent coordinates given the reconstruction and projection mappings, and vice versa, but here we also optimize over the missing data, using an efficient, globally convergent Gauss-Newton scheme. We also show how to project or reconstruct test data with missing values. We achieve impressive reconstructions while learning good latent representations in image restoration with 50% missing pixels.

【Keywords】: Gaussian processes; data reduction; image restoration; matrix algebra; regression analysis; unsupervised learning; convergent Gauss-Newton scheme; dimensionality reduction; high-dimensional datasets; image restoration; manifold learning; missing data recovery; missing high-dimensional data reconstruction; nonlinear low-dimensional representation; projection mappings; unsupervised regression approach; Image reconstruction; Jacobian matrices; Linear systems; Manifolds; Optimization; Training; Vectors; dimensionality reduction; manifold learning; matrix completion; missing data

105. Characterizing Inverse Time Dependency in Multi-class Learning.

【Paper Link】【Pages】:1020-1025

【Authors】: Danqi Chen ; Weizhu Chen ; Qiang Yang

【Abstract】: The training time of most learning algorithms increases as the size of training data increases. Yet, recent advances in linear binary SVM and LR challenge this commonsense by proposing an inverse dependency property, where the training time decreases as the size of training data increases. In this paper, we study the inverse dependency property of multi-class classification problem. We describe a general framework for multi-class classification problem with a single objective to achieve inverse dependency and extend it to three popular multi-class algorithms. We present theoretical results demonstrating its convergence and inverse dependency guarantee. We conduct experiments to empirically verify the inverse dependency of all the three algorithms on large-scale datasets as well as to ensure the accuracy.

【Keywords】: learning (artificial intelligence); pattern classification; support vector machines; LR challenge; inverse dependency property; inverse time dependency; linear binary SVM; multiclass classification problem; multiclass learning; Accuracy; Algorithm design and analysis; Convergence; Logistics; Support vector machines; Training; Training data; inverse dependency; large-scale classification; multi-class learning; supervised learning

106. Supervised Lazy Random Walk for Topic-Focused Multi-document Summarization.

【Paper Link】【Pages】:1026-1031

【Authors】: Pan Du ; Jiafeng Guo ; Xueqi Cheng

【Abstract】: Topic-focused multi-document summarization aims to produce a summary given a specific topic description and a set of related documents. It has become a crucial text processing task in many real applications that can help users consume the massive information. This paper presents a novel extractive approach based on supervised lazy random walk (Super Lazy). This approach naturally combines the rich features of sentences with the intrinsic sentence graph structure in a principled way, and thus enjoys the advantages of both the existing supervised and unsupervised approaches. Moreover, our approach can achieve the three major goals of topic-focused multi-document summarization (i.e. relevance, salience and diversity) simultaneously with a unified ranking process. Experiments on the benchmark dataset TAC2008 and TAC2009 are performed and the ROUGE evaluation results demonstrate that our approach can significantly outperform both the state-of-the-art supervised and unsupervised methods.

【Keywords】: information retrieval; learning (artificial intelligence); random processes; text analysis; ROUGE evaluation; SuperLazy; benchmark dataset TAC2008; benchmark dataset TAC2009; intrinsic sentence graph structure; supervised lazy random walk; text processing task; topic focused multidocument summarization; unified ranking process; Feature extraction; Humans; Lead; Measurement; Support vector machines; Training; Vectors; diversity; relevance; salience; supervised lazy random walk; topic-focused multi-document summarization

107. Unsupervised Anomaly Intrusion Detection via Localized Bayesian Feature Selection.

【Paper Link】【Pages】:1032-1037

【Authors】: Wentao Fan ; Nizar Bouguila ; Djemel Ziou

【Abstract】: In recent years, an increasing number of security threats have brought a serious risk to the internet and computer networks. Intrusion Detection System (IDS) plays a vital role in detecting various kinds of attacks. Developing adaptive and flexible oriented IDSs remains a challenging and demanding task due to the incessantly appearance of new types of attacks and sabotaging approaches. In this paper, we propose a novel unsupervised statistical approach for detecting network based attacks. In our approach, patterns of normal and intrusive activities are learned through finite generalized Dirichlet mixture models, in the context of Bayesian variational inference. Under the proposed variational framework, the parameters, the complexity of the mixture model, and the features saliency can be estimated simultaneously, in a closed-form. We evaluate the proposed approach using the popular KDD CUP 1999 data set. Experimental results show that this approach is able to detect many different types of intrusions accurately with a low false positive rate.

【Keywords】: Bayes methods; Internet; computer network security; inference mechanisms; statistical analysis; unsupervised learning; Bayesian variational inference; Internet; computer networks; finite generalized Dirichlet mixture models; intrusion detection system; localized Bayesian feature selection; network based attack detection; security threats; unsupervised anomaly intrusion detection; unsupervised statistical approach; Accuracy; Computational modeling; Entropy; Feature extraction; Hidden Markov models; Intrusion detection; Training; anomaly detection; feature selection; generalized Dirichlet; mixture models; model selection; variational inference

108. Identifying Differentially Expressed Genes via Weighted Rank Aggregation.

【Paper Link】【Pages】:1038-1043

【Authors】: Qiong Fang ; Jianlin Feng ; Wilfred Ng

【Abstract】: Identifying differentially expressed genes is an important problem in gene expression analysis, since these genes, exhibiting sufficiently different expression levels under distinct experiment conditions, could be critical for tracing the progression of a disease. In a micro array study, genes are usually sorted in terms of their differentiation abilities with the more differentially expressed genes being ranked higher in the list. As more micro array studies are conducted, rank aggregation becomes an important means to combine such ranked gene lists in order to discover more reliable differentially expressed genes. In this paper, we study a novel weighted gene rank aggregation problem whose complexity is at least NP-hard. To tackle the problem, we develop a new Markov-chain based rank aggregation method called Weighted MC (WMC). The WMC algorithm makes use of rank-based weight information to generate the transition matrix. Extensive experiments on the real biological datasets show that our approach is more efficient in aggregating long gene lists. Importantly, the WMC method is much more robust for identifying biologically significant genes compared with the state-of-the-art methods.

【Keywords】: Markov processes; computational complexity; diseases; genetics; lab-on-a-chip; Markov-chain based rank aggregation method; NP-hard problem; differentially expressed gene identification; disease progression tracing; gene expression analysis; transition matrix; weighted gene rank aggregation problem; Complexity theory; Diseases; Gene expression; Itemsets; Markov processes; Reliability; Markov chain; differential expression; existence disagreement; ordering disagreement; rank aggregation

109. Efficient Mining of Closed Sequential Patterns on Stream Sliding Window.

【Paper Link】【Pages】:1044-1049

【Authors】: Chuancong Gao ; Jianyong Wang ; Qingyan Yang

【Abstract】: As a typical data mining research topic, sequential pattern mining has been studied extensively for the past decade. Recently, mining various sequential patterns incrementally over stream data has raised great interest. Due to the challenges of mining stream data, many difficulties not so obvious in static data mining have to be reconsidered carefully. In this paper, we propose a novel algorithm which stores only frequent closed prefixes in its enumeration tree structure, used for mining and maintaining patterns in the current sliding window, to solve the frequent closed sequential pattern mining problem efficiently over stream data. Some effective search space pruning and pattern closure checking strategies have been also devised to accelerate the algorithm. Experimental results show that our algorithm outperforms other state-of-the-art algorithm significantly in both running time and memory use.

【Keywords】: data mining; closed sequential patterns; data mining; pattern closure checking; pattern mining; search space pruning; stream data; stream sliding window; Acceleration; Algorithm design and analysis; Data mining; Itemsets; Runtime; Unsolicited electronic mail; Closed Sequential Pattern; Sliding Window

110. A Spectral Framework for Detecting Inconsistency across Multi-source Object Relationships.

【Paper Link】【Pages】:1050-1055

【Authors】: Jing Gao ; Wei Fan ; Deepak S. Turaga ; Srinivasan Parthasarathy ; Jiawei Han

【Abstract】: In this paper, we propose to conduct anomaly detection across multiple sources to identify objects that have inconsistent behavior across these sources. We assume that a set of objects can be described from various perspectives (multiple information sources). The underlying clustering structure of normal objects is usually shared by multiple sources. However, anomalous objects belong to different clusters when considering different aspects. For example, there exist movies that are expected to be liked by kids by genre, but are liked by grown-ups based on user viewing history. To identify such objects, we propose to compute the distance between different eigen decomposition results of the same object with respect to different sources as its anomalous score. We also give interpretations from the perspectives of constrained spectral clustering and random walks over graph. Experimental results on several UCI as well as DBLP and Movie Lens datasets demonstrate the effectiveness of the proposed approach.

【Keywords】: object detection; DBLP; UCI; clustering structure; eigen decomposition; movielens datasets; multi source object relationships; Clustering algorithms; History; Joints; Laplace equations; Motion pictures; Object recognition; Vectors; anomaly detection; multiple information sources; spectral methods

111. Tracking and Connecting Topics via Incremental Hierarchical Dirichlet Processes.

【Paper Link】【Pages】:1056-1061

【Authors】: Zekai Gao ; Yangqiu Song ; Shixia Liu ; Haixun Wang ; Hao Wei ; Yang Chen ; Weiwei Cui

【Abstract】: Much research has been devoted to topic detection from text, but one major challenge has not been addressed: revealing the rich relationships that exist among the detected topics. Finding such relationships is important since many applications are interested in how topics come into being, how they develop, grow, disintegrate, and finally disappear. In this paper, we present a novel method that reveals the connections between topics discovered from the text data. Specifically, our method focuses on how one topic splits into multiple topics, and how multiple topics merge into one topic. We adopt the hierarchical Dirichlet process (HDP) model, and propose an incremental Gibbs sampling algorithm to incrementally derive and refine the labels of clusters. We then characterize the splitting and merging patterns among clusters based on how labels change. We propose a global analysis process that focuses on cluster splitting and merging, and a finer granularity analysis process that helps users to better understand the content of the clusters and the evolution patterns. We also develop a visualization process to present the results.

【Keywords】: data visualisation; merging; pattern clustering; sampling methods; text analysis; evolution pattern clustering; global analysis process; granularity analysis process; incremental Gibbs sampling algorithm; incremental hierarchical Dirichlet process; merging pattern; splitting pattern; text data; topic detection; Business; Clustering algorithms; Data models; Merging; Predictive models; Semantics; Time measurement; Clustering; Hierarchical Dirichlet processes; Incremental Gibbs Sampling; Mixture models

112. Learning Protein Folding Energy Functions.

【Paper Link】【Pages】:1062-1067

【Authors】: Wei Guan ; Arkadas Ozakin ; Alexander G. Gray ; Jose Borreguero ; Shashi Bhushan Pandit ; Anna Jagielska ; Liliana Wroblewska ; Jeffrey Skolnick

【Abstract】: A critical open problem in emph{ab initio} protein folding is protein energy function design, which pertains to defining the energy of protein conformations in a way that makes folding most efficient and reliable. In this paper, we address this issue as a weight optimization problem and utilize a machine learning approach, learning-to-rank, to solve this problem. We investigate the ranking-via-classification approach, especially the Ranking SVM method and compare it with the state-of-the-art approach to the problem using the MINUIT optimization package. To maintain the physicality of the results, we impose non-negativity constraints on the weights. For this we develop two efficient non-negative support vector machine (NNSVM) methods, derived from L2-norm SVM and L1-norm SVMs, respectively. We demonstrate an energy function which maintains the correct ordering with respect to structure dissimilarity to the native state more often, is more efficient and reliable for learning on large protein sets, and is qualitatively superior to the current state-of-the-art energy function.

【Keywords】: biology computing; optimisation; support vector machines; MINUIT optimization package; NNSVM; learning protein folding energy functions; machine learning; nonnegative support vector machine; optimization problem; protein conformations; ranking SVM method; Correlation; Optimization; Proteins; Reliability; Support vector machines; Vectors; emph{ab initio} protein folding; energy function; learning-to-rank; non-negativity constrained SVM optimization; support vector machine

113. How Does Research Evolve? Pattern Mining for Research Meme Cycles.

【Paper Link】【Pages】:1068-1073

【Authors】: Dan He ; Xingquan Zhu ; Douglas Stott Parker Jr.

【Abstract】: Recent years have witnessed a great deal of attention in tracking news memes over the web, modeling shifts in the ebb and flow of their popularity. One of the most important features of news memes is that they seldom occur repeatedly, instead, they tend to shift to different but similar memes. In this work, we consider patterns in research memes, which differ significantly from news memes and have received very little attention. One significant difference between research memes and news memes lies in that research memes have cyclic development, motivating the need for models of cycles of research memes. Furthermore, these cycles may reveal important patterns of evolving research, shedding lights on how research progresses. In this paper, we formulate the modeling of the cycles of research memes, and propose solutions to the problem of identifying cycles and discovering patterns among these cycles. Experiments on two different domain applications indicate that our model does find meaningful patterns and our algorithms for pattern discovery are efficient for large scale data analysis.

【Keywords】: Internet; data analysis; data mining; research and development; Web; cyclic development; large scale data analysis; news memes; pattern discovery; pattern mining; research meme cycles; Complexity theory; Computational modeling; Computer science; Data mining; Heuristic algorithms; Ontologies; Social network services; MeSH hierarchy; Research memes; frequent patterns; shortest paths; topic evolution; topic mining

114. Twin Gaussian Processes for Binary Classification.

【Paper Link】【Pages】:1074-1079

【Authors】: Jianjun He ; Hong Gu ; Shaorui Jiang

【Abstract】: Gaussian process classifiers (GPCs) have recently attracted more and more attention from the machine learning community. However, because the posterior needs to be approximated by using a tractable Gaussian distribution, they usually suffer from high computational cost which is prohibitive for practical applications. In this paper, we present a new Gaussian process model termed as twin Gaussian processes for binary classification. The basic idea is to make predictions based on two latent functions with Gaussian process prior, each of which is close to one of the two classes and is as far as possible from the other. Being compared with the published GPCs, the proposed algorithm allows for an explicit inference based on analytical methods, thereby avoiding the high computational cost caused by approximating the posterior with Gaussian distribution. Experimental results on several benchmark data sets show that the proposed algorithm is valid and can achieve superior performance to the published algorithms.

【Keywords】: Gaussian distribution; Gaussian processes; learning (artificial intelligence); pattern classification; Gaussian process classifier; binary classification; latent function; machine learning community; tractable Gaussian distribution; twin Gaussian process; Approximation algorithms; Approximation methods; Computational efficiency; Computational modeling; Covariance matrix; Gaussian processes; Training; Bayesian methods; Binary classification; Kernel machine; Twin Gaussian processes

115. Constraint Selection-Based Semi-supervised Feature Selection.

【Paper Link】【Pages】:1080-1085

【Authors】: Mohammed Hindawi ; Kais Allab ; Khalid Benabdeslem

【Abstract】: In this paper, we present a novel feature selection approach based on an efficient selection of pair wise constraints. This aims at selecting the most coherent constraints extracted from labeled part of data. The relevance of features is then evaluated according to their efficient locality preserving and chosen constraint preserving ability. Finally, experimental results are provided for validating our proposal with respect to other known feature selection methods.

【Keywords】: constraint handling; data handling; constraint preservation; constraint selection based semisupervised feature selection; locality preservation; pairwise constraints; Accuracy; Clustering algorithms; Coherence; Data mining; Feature extraction; Laplace equations; Vectors; Dimensionality reduction; constraint selection; feature selection

116. Discovering the Intrinsic Cardinality and Dimensionality of Time Series Using MDL.

【Paper Link】【Pages】:1086-1091

【Authors】: Bing Hu ; Thanawin Rakthanmanon ; Yuan Hao ; Scott Evans ; Stefano Lonardi ; Eamonn J. Keogh

【Abstract】: Most algorithms for mining or indexing time series data do not operate directly on the original data, but instead they consider alternative representations that include transforms, quantization, approximation, and multi-resolution abstractions. Choosing the best representation and abstraction level for a given task/dataset is arguably the most critical step in time series data mining. In this paper, we investigate techniques to discover the natural intrinsic representation model, dimensionality and alphabet cardinality of a time series. The ability to discover these intrinsic features has implications beyond selecting the best parameters for particular algorithms, as characterizing data in such a manner is useful in its own right and an important sub-routine in algorithms for classification, clustering and outlier discovery. We will frame the discovery of these intrinsic features in the Minimal Description Length (MDL) framework. Extensive empirical tests show that our method is simpler, more general and significantly more accurate than previous methods, and has the important advantage of being essentially parameter-free.

【Keywords】: data mining; time series; MDL; alphabet cardinality; intrinsic cardinality discovering; intrinsic representation model; minimal description length; time series data mining; Approximation methods; Data mining; Data models; Discrete Fourier transforms; Muscles; Piecewise linear approximation; Time series analysis; Dimensionality Reduction; MDL; Time Series

117. Discovery of Versatile Temporal Subspace Patterns in 3-D Datasets.

【Paper Link】【Pages】:1092-1097

【Authors】: Zhen Hu ; Raj Bhatnagar

【Abstract】: Most existing methods for clustering temporal data are based on either a strict similarity metric or a precisely defined temporal profile such as a sine, exponential wave etc. Also, these methods compute similarity metric across the entire time-span of the objects. However these types of temporal patterns are more useful in many biological analysis, where it is important to observe gene expression pattens across arbitrary subintervals. These types of temporal patterns are very useful in bioinformatics, where it is important to observe gene expression pattens across arbitrary subintervals. In this paper we present an algorithm for searching for multiple contiguous temporal subintervals within which the selected objects demonstrate existence of clear patterns. We demonstrate the power and advantages of our algorithm by using a synthetic dataset and a pharmacokinetics dataset for which other researchers have recently published their results. We compare and contrast our results with these results to show superiority of our approach.

【Keywords】: bioinformatics; pattern clustering; 3D datasets; bioinformatics; biological analysis; gene expression pattens; pharmacokinetics dataset; synthetic dataset; temporal data clustering; versatile temporal subspace patterns; Algorithm design and analysis; Bioinformatics; Coherence; Context; Correlation; Gene expression; Search problems; Data Mining; Temporal patterns; Tricluster

118. A Fixed Parameter Tractable Integer Program for Finding the Maximum Order Preserving Submatrix.

【Paper Link】【Pages】:1098-1103

【Authors】: Jens Humrich ; Thomas Gärtner ; Gemma C. Garriga

【Abstract】: Order-preserving sub matrices are an important tool for the analysis of gene expression data. As finding large order-preserving sub matrices is a computationally hard problem, previous work has investigated both exact but exponential-time as well as polynomial-time but inexact algorithms for finding large order-preserving sub matrices. In this paper, we propose a novel exact algorithm to find maximum order preserving sub matrices which is fixed parameter tractable with respect to the number of columns of the provided gene expression data. In particular, our algorithm is based on solving a sequence of mixed integer linear programs and it exhibits better guarantees as well as better runtime performance as compared to the state-of-the-art exact algorithms. Our empirical study in benchmark datasets shows large improvement in terms of computational speed.

【Keywords】: biology; integer programming; linear programming; matrix algebra; benchmark datasets; exponential time; fixed parameter tractable integer program; gene expression data; integer linear programs; maximum order preserving submatrix; polynomial time; Gene expression; Linear programming; Noise; Optimization; Programming; Radio access networks; Upper bound

119. ASAP: A Self-Adaptive Prediction System for Instant Cloud Resource Demand Provisioning.

【Paper Link】【Pages】:1104-1109

【Authors】: Yexi Jiang ; Chang-Shing Perng ; Tao Li ; Rong N. Chang

【Abstract】: The promise of cloud computing is to provide computing resources instantly whenever they are needed. The state-of-art virtual machine (VM) provisioning technology can provision a VM in tens of minutes. This latency is unacceptable for jobs that need to scale out during computation. To truly enable on-the-fly scaling, new VM needs to be ready in seconds upon request. In this paper, We present an online temporal data mining system called ASAP, to model and predict the cloud VM demands. ASAP aims to extract high level characteristics from VM provisioning request stream and notify the provisioning system to prepare VMs in advance. For quantification issue, we propose Cloud Prediction Cost to encodes the cost and constraints of the cloud and guide the training of prediction algorithms. Moreover, we utilize a two-level ensemble method to capture the characteristics of the high transient demands time series. Experimental results using historical data from an IBM cloud in operation demonstrate that ASAP significantly improves the cloud service quality and provides possibility for on-the-fly provisioning.

【Keywords】: cloud computing; data mining; software quality; virtual machines; ASAP; cloud computing; cloud prediction cost; cloud service quality; instant cloud resource demand provisioning; online temporal data mining system; self-adaptive prediction system; two-level ensemble method; virtual machine provisioning technology; Artificial neural networks; Data models; Heuristic algorithms; Prediction algorithms; Predictive models; Time series analysis; Training; cloud service; service quality improvement; time series prediction

120. Learning from Negative Examples in Set-Expansion.

【Paper Link】【Pages】:1110-1115

【Authors】: Prateek Jindal ; Dan Roth

【Abstract】: This paper addresses the task of set-expansion on free text. Set-expansion has been viewed as a problem of generating an extensive list of instances of a concept of interest, given a few examples of the concept as input. Our key contribution is that we show that the concept definition can be significantly improved by specifying some negative examples in the input, along with the positive examples. The state-of-the art centroid-based approach to set-expansion doesn't readily admit the negative examples. We develop an inference-based approach to set-expansion which naturally allows for negative examples and show that it performs significantly better than a strong baseline.

【Keywords】: learning (artificial intelligence); set theory; text analysis; free text; learning; negative example learning; set expansion; state-of-the art centroid; Equations; IP networks; Mathematical model; Semantics; USA Councils; Vectors; Vocabulary; Information Extraction; Negative Examples; Set-Expansion

【Paper Link】【Pages】:1116-1121

【Authors】: Edward Casey Kenley ; Young-Rae Cho

【Abstract】: Complex systems have been widely studied to characterize their structural behaviors from a topological perspective. High modularity is one of the recurrent features of real-world complex systems. Various graph clustering algorithms have been applied to identifying communities in social networks or modules in biological networks. However, their applicability to real-world systems has been limited because of the massive scale and complex connectivity of the networks. In this study, we exploit a novel information-theoretic model for graph clustering. The entropy-based clustering approach finds locally optimal clusters by growing a random seed in a manner that minimizes graph entropy. We design and analyze modifications that further improve its performance. Assigning priority in seed-selection and seed-growth is well applicable to the scale-free networks characterized by the hub-oriented structure. Computing seed-growth in parallel streams also decomposes an extremely large network efficiently. The experimental results with real biological and social networks show that the entropy-based approach has better performance than competing methods in terms of accuracy and efficiency.

【Keywords】: complex networks; graph theory; large-scale systems; network theory (graphs); pattern clustering; social networking (online); biological networks; entropy based graph clustering; hub oriented structure; information theoretic model; real world complex systems; scale free networks; seed growth; seed selection; social networks; Accuracy; Clustering algorithms; Entropy; MySpace; Proteins; biological networks; complex systems; graph clustering; graph entropy; graph mining; social networks

122. Semi-supervised Discriminant Hashing.

【Paper Link】【Pages】:1122-1127

【Authors】: Saehoon Kim ; Seungjin Choi

【Abstract】: Hashing refers to methods for embedding high dimensional data into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes whose Hamming distances are small. Learning hash functions from data has recently been recognized as a promising approach to approximate nearest neighbor search for high dimensional data. Most of 'learning to hash' methods resort to either unsupervised or supervised learning to determine hash functions. Recently semi-supervised learning approach was introduced in hashing where pair wise constraints (must link and cannot-link) using labeled data are leveraged while unlabeled data are used for regularization to avoid over-fitting. In this paper we base our semi-supervised hashing on linear discriminant analysis, where hash functions are learned such that labeled data are used to maximize the separability between binary codes associated with different classes while unlabeled data are used for regularization as well as for balancing condition and pair wise decor relation of bits. The resulting method is referred to as semi-supervised discriminant hashing (SSDH). Numerical experiments on MNIST and CIFAR-10 datasets demonstrate that our method outperforms existing methods, especially in the case of short binary codes.

【Keywords】: approximation theory; binary codes; cryptography; information retrieval; learning (artificial intelligence); statistical analysis; CIFAR-10 dataset; Hamming distances; MNIST dataset; binary codes; hash function learning; high dimensional data; labeled data; linear discriminant analysis; nearest neighbor search approximation; semi-supervised discriminant hashing; similarity search; similarity-preserving low-dimensional Hamming space; unlabeled data; Binary codes; Decorrelation; Eigenvalues and eigenfunctions; Hamming distance; Training; Training data; Vectors; Hashing; regularized discriminant analysis; semisupervised

123. Using Frequent Closed Itemsets for Data Dimensionality Reduction.

【Paper Link】【Pages】:1128-1133

【Authors】: Petr Krajca ; Jan Outrata ; Vilém Vychodil

【Abstract】: We address important issues of dimensionality reduction of transactional data sets where the input data consists of lists of transactions, each of them being a finite set of items. The reduction consists in finding a small set of new items, so-called factor-items, which is considerably smaller than the original set of items while comprising full or nearly full information about the original items. Using this type of reduction, the original data set can be represented by a smaller transactional data set using factor-items instead of the original items, thus reducing its dimensionality. The procedure utilized in this paper is based on approximate Boolean matrix decomposition. In this paper, we focus on the role of frequent closed item sets that can be used to determine factor-items. We present the factorization problem, its reduction to Boolean matrix decompositions, experiments with publicly available data sets, and an algorithm for computing decompositions.

【Keywords】: Boolean functions; matrix decomposition; boolean matrix decomposition; data dimensionality reduction; factorization problem; frequent closed itemsets; transactional data sets; Approximation algorithms; Approximation methods; Arrays; Data mining; Itemsets; Matrix decomposition; Boolean matrices; dimensionality reduction; frequent closed itemsets; set covering

124. Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software.

【Paper Link】【Pages】:1134-1139

【Authors】: Taeho Kwon ; Zhendong Su

【Abstract】: The analysis of software similarity has many applications such as detecting code clones, software plagiarism, code theft, and polymorphic malware. Because often source code is unavailable and code obfuscation is used to avoid detection, there has been much research on developing effective models to capture runtime behavior to aid detection. Existing models focus on low-level information such as dependency or purely occurrence of function calls, and suffer from poor precision, poor scalability, or both. To overcome limitations of existing models, this paper introduces a precise and succinct behavior representation that characterizes high-level object-accessing patterns as regular expressions. We first distill a set of high-level patterns (the alphabet S of the regular language) based on two pieces of information: function call patterns to access objects and type state information of the objects. Then we abstract a runtime trace of a program P into a regular expression e over the pattern alphabet S to produce P's behavior signature. We show that software instances derived from the same code exhibit similar behavior signatures and develop effective algorithms to cluster and match behavior signatures. To evaluate the effectiveness of our behavior model, we have applied it to the similarity analysis of polymorphic malware. Our results on a large malware collection demonstrate that our model is both precise and succinct for effective and scalable matching and detection of polymorphic malware.

【Keywords】: industrial property; invasive software; pattern matching; program diagnostics; behavior signature; code clone detection; code obfuscation; function call pattern; high-level object-accessing pattern modeling; polymorphic malware; poor precision; poor scalability; precise behavior representation; program runtime tracing; runtime behavior; software plagiarism; software precise similarity analysis; source code theft; succinct behavior representation; Algorithm design and analysis; Analytical models; Clustering algorithms; Malware; Measurement; Software; Software algorithms; malware analysis and clustering; sequence clustering; software behavior model

125. Co-clustering for Binary and Categorical Data with Maximum Modularity.

【Paper Link】【Pages】:1140-1145

【Authors】: Lazhar Labiod ; Mohamed Nadif

【Abstract】: To tackle the co-clustering problem for binary and categorical data, we propose a generalized modularity measure and a spectral approximation of the modularity matrix. A spectral algorithm maximizing the modularity measure is then presented. Experimental results are performed on a variety of simulated and real-world data sets confirming the interest of the use of the modularity in co-clustering and assessing the number of clusters contexts.

【Keywords】: matrix algebra; pattern clustering; binary data; categorical data; coclustering; maximum modularity; modularity matrix; spectral approximation; Accuracy; Approximation methods; Clustering algorithms; Clustering methods; Educational institutions; Eigenvalues and eigenfunctions; Partitioning algorithms; co-clustering; modularity; spectral decomposition

126. Calculating Feature Weights in Naive Bayes with Kullback-Leibler Measure.

【Paper Link】【Pages】:1146-1151

【Authors】: Chang-Hwan Lee ; Fernando Gutierrez ; Dejing Dou

【Abstract】: Naive Bayesian learning has been popular in data mining applications. However, the performance of naive Bayesian learning is sometimes poor due to the unrealistic assumption that all features are equally important and independent given the class value. Therefore, it is widely known that the performance of naive Bayesian learning can be improved by mitigating this assumption, and many enhancements to the basic naive Bayesian learning have been proposed to resolve this problem including feature selection and feature weighting. In this paper, we propose a new method for calculating the weights of features in naive Bayesian learning using Kullback-Leibler measure. Empirical results are presented comparing this new feature weighting method with some other methods for a number of datasets.

【Keywords】: Bayes methods; data mining; learning (artificial intelligence); Kullback-Leibler measurement; Naive Bayes; feature selection; feature weighting; feature weights calculation; naive Bayesian learning; Accuracy; Bayesian methods; Decision trees; Equations; Mathematical model; Training data; Weight measurement; Classification; Feature Weighting; Naive Bayes

127. Scalable Diversified Ranking on Large Graphs.

【Paper Link】【Pages】:1152-1157

【Authors】: Rong-Hua Li ; Jeffrey Xu Yu

【Abstract】: Enhancing diversity in ranking on graphs has been identified as an important retrieval and mining task. Nevertheless, many existing diversified ranking algorithms cannot be scalable to large graphs as they have high time or space complexity. In this paper, we propose a scalable algorithm to find the top-K diversified ranking list on graphs. The key idea of our algorithm is that we first compute the Pagerank of the nodes of the graph, and then perform a carefully designed vertex selection algorithm to find the top-K diversified ranking list. Specifically, we firstly present a new diversified ranking measure, which can capture both relevance and diversity. Secondly, we prove the submodularity of the proposed measure. And then we propose an efficient greedy algorithm with linear time and space complexity with respect to the size of the graph to achieve near-optimal diversified ranking. Finally, we evaluate the proposed method through extensive experiments on four real networks. The experimental results indicate that the proposed method outperforms existing diversified ranking algorithms both on improving diversity in ranking and the efficiency of the algorithms.

【Keywords】: computational complexity; data mining; graph theory; information retrieval; social networking (online); Pagerank; diversified ranking algorithms; diversity enhancement; large graphs; linear space complexity; linear time complexity; mining task; retrieval task; scalable diversified ranking; social network analysis; top-K diversified ranking list; vertex selection algorithm; Algorithm design and analysis; Collaboration; Complexity theory; Diversity reception; Greedy algorithms; Measurement; Social network services

128. Web Horror Image Recognition Based on Context-Aware Multi-instance Learning.

【Paper Link】【Pages】:1158-1163

【Authors】: Bing Li ; Weihua Xiong ; Weiming Hu

【Abstract】: Along with the ever-growing Web, horror contents sharing in the Internet has interfered with our daily life and affected our, especially children's, health. Therefore horror image recognition is becoming more important for web objectionable content filtering. This paper presents a novel context-aware multi-instance learning (CMIL) model for this task. This work is distinguished by three key contributions. Firstly, the traditional multi-instance learning is extended to context-aware multi-instance learning model through integrating an undirected graph in each bag that represents contextual relationships among instances. Secondly, by introducing a novel energy function, a heuristic optimization algorithm based on Fuzzy Support Vector Machine (FSVM) is given out to find the optimal classifier on CMIL. Finally, the CMIL is applied to recognize horror images. Experimental results on an image set collected from the Internet show that the proposed method is effective on horror image recognition.

【Keywords】: Internet; computer aided instruction; fuzzy set theory; graph theory; image recognition; support vector machines; ubiquitous computing; CMIL; FSVM; Internet; Web horror image recognition; Web objectionable content filtering; context aware multiinstance learning; fuzzy support vector machine; heuristic optimization algorithm; undirected graph; Context modeling; Equations; Image color analysis; Image recognition; Mathematical model; Optimization; Support vector machines; Context-aware Multi-Instance Learning; Horror Image Recognition; Image Emotion

129. Mixture of Softmax sLDA.

【Paper Link】【Pages】:1164-1169

【Authors】: Xiaoxu Li ; Junyu Zeng ; Xiaojie Wang ; Yixin Zhong

【Abstract】: In this paper, we propose a new variant of supervised Latent Dirichlet Allocation(sLDA): mixture of soft max sLDA, for image classification. Ensemble classification methods can combine multiple weak classifiers to construct a strong classifier. Inspired by the ensemble idea, we try to improve sLDA model using the idea. The mixture of soft max model is a probabilistic ensemble classification model, it can fit the training data and class label well. We embed the mixture of soft max model into LDA model under the framwork of sLDA, and construct an ensemble supervised topic model for image classification. Meanwhile, we derive an elegant parameters estimation algorithm based on variational EM method, and give a simple and efficient approximation method for classifying a new image. Finally, we demonstrate the effectiveness of our model by comparing with some existing approaches on two real world datasets. The results show that our model enhances classification accuracy by 7% on the 1600-image Label Me dataset and 9% on the 1791-image UIUC-Sport dataset.

【Keywords】: feature extraction; image classification; parameter estimation; probability; 1600-image LabelMe dataset; 179-image UlUC-Sport dataset; image classification; parameter estimation algorithm; probabilistic ensemble classification model; softmax model; softmax sLDA; supervised latent Dirichlet allocation; supervised topic model; Accuracy; Computational modeling; Data models; Feature extraction; Parameter estimation; Predictive models; Support vector machines; cation; ensemble classiﬁ probabilistic modeling; supervised topic model; the mixture of softmax model

130. Optimizing Performance Measures for Feature Selection.

【Paper Link】【Pages】:1170-1175

【Authors】: Qi Mao ; Ivor Wai-Hung Tsang

【Abstract】: Feature selection with specific multivariate performance measures is the key to the success of many applications, such as information retrieval and bioinformatics. The existing feature selection methods are usually designed for classification error. In this paper, we present a unified feature selection framework for general loss functions. In particular, we study the novel feature selection paradigm by optimizing multivariate performance measures. The resultant formulation is a challenging problem for high-dimensional data. Hence, a two-layer cutting plane algorithm is proposed to solve this problem, and the convergence is presented. Extensive experiments on large-scale and high-dimensional real world datasets show that the proposed method outperforms l₁-SVM and SVM-RFE when choosing a small subset of features, and achieves significantly improved performances over SVM^perf in terms of F₁-score.

【Keywords】: optimisation; bioinformatics; classification error; feature selection; information retrieval; multivariate performance measure optimisation; multivariate performance measures; Accuracy; Approximation algorithms; Atmospheric measurements; Loss measurement; Particle measurements; Testing; Vectors; feature selection; multiple kernel learning; multivariate performance measure; structural SVMs

131. Detecting Recurring and Novel Classes in Concept-Drifting Data Streams.

【Paper Link】【Pages】:1176-1181

【Authors】: Mohammad M. Masud ; Tahseen Al-Khateeb ; Latifur Khan ; Charu C. Aggarwal ; Jing Gao ; Jiawei Han ; Bhavani M. Thuraisingham

【Abstract】: Concept-evolution is one of the major challenges in data stream classification, which occurs when a new class evolves in the stream. This problem remains unaddressed by most state-of-the-art techniques. A recurring class is a special case of concept-evolution. This special case takes place when a class appears in the stream, then disappears for a long time, and again appears. Existing data stream classification techniques that address the concept-evolution problem, wrongly detect the recurring classes as novel class. This creates two main problems. First, much resource is wasted in detecting a recurring class as novel class, because novel class detection is much more computationally- and memory-intensive, as compared to simply recognizing an existing class. Second, when a novel class is identified, human experts are involved in collecting and labeling the instances of that class for future modeling. If a recurrent class is reported as novel class, it will be only a waste of human effort to find out whether it is really a novel class. In this paper, we address the recurring issue, and propose a more realistic novel class detection technique, which remembers a class and identifies it as "not novel" when it reappears after a long disappearance. Our approach has shown significant reduction in classification error over state-of-the-art stream classification techniques on several benchmark data streams.

【Keywords】: data handling; pattern classification; concept-drifting data streams; concept-evolution; data stream classification techniques; recurring detection; Analytical models; Copper; Data models; Error analysis; Humans; Training; Training data; novel class; recurring class; stream classification

132. Performances and Characteristics of DIGRank, Ranking in the Incomplete Networks.

【Paper Link】【Pages】:1182-1187

【Authors】: Xiang Niu ; Lusong Li ; Xiaobing Xiong ; Daniel S. Tkach ; He Li ; Ke Xu

【Abstract】: Page Rank has been widely used in ranking retrieval results on the web, finding the top influential papers in citation networks or detecting valuable users in online social networks. However, in practice, it is usually hard to obtain a complete structure of any above networks to rank nodes. Thus, some researchers have begun to explore how to get estimated ranks efficiently without acquiring the whole network. They have proposed some approximating methods, however, it is difficult to determine which method is the best one or which is suitable to a certain application. In this case, we set experiments in small-world and scale-free generated networks to certify the feasibility and characteristics of four approximating methods. We also use eleven real networks to mention different optimal conditions for these methods. We find the DIG Rank method performs better than other local estimation methods in almost every given sub graph. Besides, Mean field approach method tends to perform well in networks that have low average shortest path length, small amount of nodes with the same low in degree, or weak community structure. Finally, we apply the most versatile method DIG Rank to Sina micro-blog website to precisely classify users in a group as elites, grassroots or mummy users.

【Keywords】: Internet; approximation theory; information retrieval; social networking (online); DIGRank characteristics; approximating methods; incomplete networks; online social networks; ranking retrieval; Accuracy; Correlation; Electronic mail; Error analysis; Fans; Neodymium; Social network services; Online Social Network; PageRank; Scale-free; Small-world

133. Cross-Temporal Link Prediction.

【Paper Link】【Pages】:1188-1193

【Authors】: Satoshi Oyama ; Kohei Hayashi ; Hisashi Kashima

【Abstract】: The increasing interest in dynamically changing networks has led to growing interest in a more general link prediction problem called temporal link prediction in the data mining and machine learning communities. However, only links in identical time frames are considered in temporal link prediction. We propose a new link prediction problem called cross-temporal link prediction in which the links among nodes in different time frames are inferred. A typical example of cross-temporal link prediction is cross-temporal entity resolution to determine the identity of real entities represented by data objects observed in different time periods. In dynamic environments, the features of data change over time, making it difficult to identify cross-temporal links by directly comparing observed data. Other examples of cross-temporal links are asynchronous communications in social networks such as Face book and Twitter, where a message is posted in reply to a previous message. We adopt a dimension reduction approach to cross-temporal link prediction, that is, data objects in different time frames are mapped into a common low-dimensional latent feature space, and the links are identified on the basis of the distance between the data objects. The proposed method uses different low-dimensional feature projections in different time frames, enabling it to adapt to changes in the latent features over time. Using multi-task learning, it jointly learns a set of feature projection matrices from the training data, given the assumption of temporal smoothness of the projections. The optimal solutions are obtained by solving a single generalized eigenvalue problem. Experiments using a real-world set of bibliographic data for cross-temporal entity resolution showed that introducing time-dependent feature projections improves the accuracy of link prediction.

【Keywords】: data mining; feature extraction; learning (artificial intelligence); matrix algebra; social networking (online); asynchronous communications; cross-temporal entity resolution; cross-temporal link prediction; data mining; dimension reduction approach; feature projection matrices; generalized eigenvalue problem; low-dimensional feature projections; low-dimensional latent feature space; machine learning; social networks; time-dependent feature projections; Accuracy; Databases; Eigenvalues and eigenfunctions; Optimization; Social network services; Training; Training data; dimension reduction; entity resolution; link prediction; social network analysis; temporal data

134. Helix: Unsupervised Grammar Induction for Structured Activity Recognition.

【Paper Link】【Pages】:1194-1199

【Authors】: Huan-Kai Peng ; Pang Wu ; Jiang Zhu ; Joy Ying Zhang

【Abstract】: The omnipresence of mobile sensors has brought tremendous opportunities to ubiquitous computing systems. In many natural settings, however, their broader applications are hindered by three main challenges: rarity of labels, uncertainty of activity granularities, and the difficulty of multi-dimensional sensor fusion. In this paper, we propose building a grammar to address all these challenges using a language-based approach. The proposed algorithm, called Helix, first generates an initial vocabulary using unlabeled sensor readings, followed by iteratively combining statistically collocated sub-activities across sensor dimensions and grouping similar activities together to discover higher level activities. The experiments using a 20-minute ping-pong game demonstrate favorable results compared to a Hierarchical Hidden Markov Model (HHMM) baseline. Closer investigations to the learned grammar also shows that the learned grammar captures the natural structure of the underlying activities.

【Keywords】: hidden Markov models; mobile computing; natural language processing; sensor fusion; HHMM; Helix; Hidden Markov Model; activity granularities; mobile sensors; multidimensional sensor fusion; structured activity recognition; ubiquitous computing; unlabeled sensor readings; unsupervised grammar induction; Context; Grammar; Joints; Mutual information; Semantics; Sensors; Vocabulary; Heterogeneous Sensor Fusion; Ubiquitous Knowledge Discovery; Unsupervised Grammar Induction

135. Distance Preserving Graph Simplification.

【Paper Link】【Pages】:1200-1205

【Authors】: Ning Ruan ; Ruoming Jin ; Yan Huang

【Abstract】: Large graphs are difficult to represent, visualize, and understand. In this paper, we introduce "gate graph" a new approach to perform graph simplification. A gate graph provides a simplified topological view of the original graph. Specifically, we construct a gate graph from a large graph so that for any "non-local" vertex pair (distance greater than some threshold) in the original graph, their shortest-path distance can be recovered by consecutive "local" walks through the gate vertices in the gate graph. We perform a theoretical investigation on the gate-vertex set discovery problem. We characterize its computational complexity and reveal the upper bound of minimum gate- vertex set using VC-dimension theory. We propose an efficient mining algorithm to discover a gate-vertex set with guaranteed logarithmic bound. The detailed experimental results using both real and synthetic graphs demonstrate the effectiveness and efficiency of our approach.

【Keywords】: computational complexity; graph theory; set theory; VC-dimension theory; computational complexity; distance preserving graph simplification; gate graph; gate-vertex set discovery problem; local walks; nonlocal vertex pair; shortest-path distance; Approximation algorithms; Complexity theory; Educational institutions; Greedy algorithms; Logic gates; Road transportation; Sampling methods; Gate Graph; Gate Vertices; Graph Simplification; Set Cover Problem; VC-Dimension

136. Clustering with Attribute-Level Constraints.

【Paper Link】【Pages】:1206-1211

【Authors】: Jana Schmidt ; Elisabeth Maria Brändle ; Stefan Kramer

【Abstract】: In many clustering applications the incorporation of background knowledge in the form of constraints is desirable. In this paper, we introduce a new constraint type and the corresponding clustering problem: attribute constrained clustering. The goal is to induce clusters of binary instances that satisfy constraints on the attribute level. These constraints specify whether instances may or may not be grouped to a cluster, depending on specific attribute values. We show how the well-established instance-level constraints, must-link and cannot-link, can be adapted to the attribute level. A variant of the k-Medoids algorithm taking into account attribute level constraints is evaluated on synthetic and real-world data. Experimental results show that such constraints may provide better clustering results at lower specification costs if constraints can be expressed on the attribute level.

【Keywords】: constraint handling; constraint satisfaction problems; pattern clustering; set theory; attribute-level constraint clustering; background knowledge; constraint satisfaction; instance-level constraints; k-Medoids algorithm; Animals; Clustering algorithms; Data mining; Electronic mail; Equations; Runtime; attribute level; constrained clustering

137. A Fast and Flexible Clustering Algorithm Using Binary Discretization.

【Paper Link】【Pages】:1212-1217

【Authors】: Mahito Sugiyama ; Akihiro Yamamoto

【Abstract】: We present in this paper a new clustering algorithm for multivariate data. This algorithm, called BOOL (Binary coding Oriented clustering), can detect arbitrarily shaped clusters and is noise tolerant. BOOL handles data using a two-step procedure: data points are first discretized and represented as binary words, clusters are then iteratively constructed by agglomerating smaller clusters using this representation. This latter step is carried out with linear complexity by sorting such binary representations, which results in dramatic speedups when compared with other techniques. Experiments show that BOOL is faster than K-means, and about two to three orders of magnitude faster than two state-of-the-art algorithms that can detect non-convex clusters of arbitrary shapes. We also show that BOOL's results are robust to changes in parameters, whereas most algorithms for arbitrarily shaped clusters are known to be overly sensitive to such changes. The key to the robustness of BOOL is the hierarchical structure of clusters that is introduced automatically by increasing the accuracy of the discretization.

【Keywords】: computational complexity; data handling; data mining; learning (artificial intelligence); pattern clustering; BOOL; binary coding oriented clustering; binary discretization; binary representations; binary words; clustering algorithm; data mining; data points; knowledge discovery; linear complexity; machine learning; multivariate data; nonconvex cluster detection; Conferences; Data mining; Indexes; Noise; Binary encoding; Discretization; Hierarchical clustering; Shape-based clustering; Sorting

138. A New Multi-task Learning Method for Personalized Activity Recognition.

【Paper Link】【Pages】:1218-1223

【Authors】: Xu Sun ; Hisashi Kashima ; Ryota Tomioka ; Naonori Ueda ; Ping Li

【Abstract】: Personalized activity recognition usually faces the problem of data sparseness. We aim at improving accuracy of personalized activity recognition by incorporating the information from other persons. We propose a new online multi-task learning method for personalized activity recognition. The proposed online multi-task learning method automatically learns the ``transfer-factors" (similarities) among different tasks (i.e., among different persons in our case). Experiments demonstrate that the proposed method significantly outperforms existing methods. The novelty of this paper is twofold: (1) A new multi-task learning framework, which can naturally learn similarities among tasks, (2) To our knowledge, this is the first study of large-scale personalized activity recognition.

【Keywords】: gesture recognition; learning (artificial intelligence); data sparseness; large scale personalized activity recognition; online multitask learning method; transfer factor; Acceleration; Accuracy; Convergence; Kernel; Polynomials; Training; Vectors

【Paper Link】【Pages】:1224-1229

【Authors】: Chih-Hua Tai ; Peng-Jui Tseng ; Philip S. Yu ; Ming-Syan Chen

【Abstract】: Privacy in social network data publishing is always an important concern. Nowadays most prior privacy protection techniques focus on static social networks. However, there are additional privacy disclosures in dynamic social networks due to the sequential publications. In this paper, we first show that the risks of vertex and community re-identification exist in a dynamic social network, even if the release at each time instance is protected by a static anonymity scheme. To prevent vertex and community re-identification in a dynamic social network, we propose novel dynamic k^w-structural diversity anonymity, where w is the time that an adversary can monitor a victim. This scheme extends the k-structural diversity anonymity to a dynamic scenario. We also present a heuristic to anonymize the releases of networks to satisfy the proposed privacy scheme. The evaluations show that our approach can retain much of the characteristics of the networks while confirming the privacy protection.

【Keywords】: data privacy; social networking (online); community reidentification; dynamic k^w-structural diversity anonymity; dynamic social networks; identities anonymization; privacy protection techniques; sequential publications; social network data publishing; static anonymity scheme; static social networks; Communities; Cultural differences; Diseases; Heuristic algorithms; Privacy; Publishing; Social network services; anonymization; dynamic; privacy; social network

【Paper Link】【Pages】:1230-1235

【Authors】: Toshimitsu Takahashi ; Ryota Tomioka ; Kenji Yamanishi

【Abstract】: Detection of emerging topics are now receiving renewed interest motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged are not only texts but also images, URLs, and videos. We focus on the social aspects of theses networks. That is, the links between users that are generated dynamically intentionally or unintentionally through replies, mentions, and retweets. We propose a probability model of the mentioning behaviour of a social network user, and propose to detect the emergence of a new topic from the anomaly measured through the model. We combine the proposed mention anomaly score with a recently proposed change-point detection technique based on the Sequentially Discounting Normalized Maximum Likelihood (SDNML), or with Kleinberg's burst model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics only based on the reply/mention relationships in social network posts. We demonstrate our technique in a number of real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as the conventional term-frequency-based approach, and sometimes much earlier when the keyword is ill-defined.

【Keywords】: data mining; probability; security of data; social networking (online); Kleinberg burst model; Twitter; change-point detection technique; emerging topic discovery; link anomaly detection; probability model; sequentially discounting normalized maximum likelihood; social network user behaviour; social streams; term-frequency-based approach; Density functional theory; Encoding; Hidden Markov models; Maximum likelihood detection; Training; Twitter; Anomaly Detection; Burst detection; Sequentially Discounted Maximum Likelihood Coding; Social Networks; Topic Detection

【Paper Link】【Pages】:1236-1241

【Authors】: Chayant Tantipathananandh ; Tanya Y. Berger-Wolf

【Abstract】: Communities are natural structures observed in social networks and are usually characterized as "relatively dense" subsets of nodes. Social networks change over time and so do the underlying community structures. Thus, to truly uncover this structure we must take the temporal aspect of networks into consideration. Previously, we have represented framework for finding dynamic communities using the social cost model and formulated the corresponding optimization problem [33], assuming that partitions of individuals into groups are given in each time step. We have also presented heuristics and approximation algorithms for the problem, with the same assumption [32]. In general, however, dynamic social networks are represented as a sequence of graphs of snapshots of the social network and the assumption that we have partitions of individuals into groups does not hold. In this paper, we extend the social cost model and formulate an optimization problem of finding community structure from the sequence of arbitrary graphs. We propose a semi definite programming formulation and a heuristic rounding scheme. We show, using synthetic data sets, that this method is quite accurate on synthetic data sets and present its results on a real social network.

【Keywords】: approximation theory; data handling; graph theory; optimisation; sequences; social networking (online); approximation algorithm; community structure; dynamic social networks; graph sequence; heuristic rounding scheme; optimization problem; semideflnite programming formulation; social cost model; synthetic data sets; Approximation algorithms; Approximation methods; Communities; Heuristic algorithms; Partitioning algorithms; Social network services; Vectors; community structure; dynamic social networks; semidefinite programming

142. Review Graph Based Online Store Review Spammer Detection.

【Paper Link】【Pages】:1242-1247

【Authors】: Guan Wang ; Sihong Xie ; Bing Liu ; Philip S. Yu

【Abstract】: Online reviews provide valuable information about products and services to consumers. However, spammers are joining the community trying to mislead readers by writing fake reviews. Previous attempts for spammer detection used reviewers' behaviors, text similarity, linguistics features and rating patterns. Those studies are able to identify certain types of spammers, e.g., those who post many similar reviews about one target entity. However, in reality, there are other kinds of spammers who can manipulate their behaviors to act just like genuine reviewers, and thus cannot be detected by the available techniques. In this paper, we propose a novel concept of a heterogeneous review graph to capture the relationships among reviewers, reviews and stores that the reviewers have reviewed. We explore how interactions between nodes in this graph can reveal the cause of spam and propose an iterative model to identify suspicious reviewers. This is the first time such intricate relationships have been identified for review spam detection. We also develop an effective computation method to quantify the trustiness of reviewers, the honesty of reviews, and the reliability of stores. Different from existing approaches, we don't use review text information. Our model is thus complementary to existing approaches and able to find more difficult and subtle spamming activities, which are agreed upon by human judges after they evaluate our results.

【Keywords】: graphs; retail data processing; reviews; unsolicited e-mail; heterogeneous review graph; iterative model; online store review; spammer detection; text information review; Companies; Computational modeling; Data mining; Humans; Reliability; Writing; Spammer detection; review graph

143. Classifying Categorical Data by Rule-Based Neighbors.

【Paper Link】【Pages】:1248-1253

【Authors】: Jiabing Wang ; Pei Zhang ; Guihua Wen ; Jia Wei

【Abstract】: A new learning algorithm for categorical data, named CRN (Classification by Rule-based Neighbors) is proposed in this paper. CRN is a nonmetric and parameter-free classifier, and can be regarded as a hybrid of rule induction and instance-based learning. Based on a new measure of attributes quality and the separate-and-conquer strategy, CRN learns a collection of feature sets such that for each pair of instances belonging to different classes, there is a feature set on which the two instances disagree. For an unlabeled instance I and a labeled instance I', I' is a neighbor of I if and only if they agree on all attributes of a feature set. Then, CRN classifies an unlabeled instance I based on I's neighbors on those learned feature sets. To validate the performance of CRN, CRN is compared with six state-of-the-art classifiers on twenty-four datasets. Experimental results demonstrate that although the underlying idea of CRN is simple, the predictive accuracy of CRN is comparable to or better than that of the state-of-the-art classifiers on most datasets.

【Keywords】: data handling; knowledge based systems; learning (artificial intelligence); pattern classification; categorical data classification; feature sets; instance-based learning; learning algorithm; nonmetric classifier; parameter-free classifier; rule induction; rule-based neighbors; separate-and-conquer strategy; Accuracy; Classification algorithms; Entropy; Impurities; Measurement; Training; categorical data; classification; feature selection; rule-based neighbors

【Paper Link】【Pages】:1254-1259

【Authors】: Miao Zhang ; Chris H. Q. Ding ; Zhifang Liao

【Abstract】: Social tagging predictions involve the co occurrence of users, items and tags. The tremendous growth of users require the recommender system to produce tag recommendations for millions of users and items at any minute. The triplets of users, items and tags are most naturally described by a 3D tensor, and tensor decomposition-based algorithms can produce high quality recommendations. However, each day, thousands of new users are added to the system and the decompositions must be updated daily in a online fashion. In this paper, we provide analysis of the new user problem, and present fold-in algorithms for Tucker, Para Fac, and Low-order tensor decompositions. We show that these algorithm can very efficiently compute the needed decompositions. We evaluate the fold-in algorithms experimentally on several datasets and the results demonstrate the effectiveness of these algorithms.

【Keywords】: data mining; recommender systems; social networking (online); solid modelling; tensors; 3D tensor decomposition based algorithm; ParaFac decomposition; Tucker decomposition; low-order tensor decomposition; recommender system; social tagging prediction; tag recommendation; tensor fold-in algorithm; Accuracy; Algorithm design and analysis; Matrix decomposition; Prediction algorithms; Predictive models; Tagging; Tensile stress; Graph Mining; Recommender System; Social Network

145. Discovering Thematic Patterns in Videos via Cohesive Sub-graph Mining.

【Paper Link】【Pages】:1260-1265

【Authors】: Gangqiang Zhao ; Junsong Yuan

【Abstract】: One category of videos usually contains the same thematic pattern, e.g., the spin action in skating videos. The discovery of the thematic pattern is essential to understand and summarize the video contents. This paper addresses two critical issues in mining thematic video patterns: (1) automatic discovery of thematic patterns without any training or supervision information, and (2) accurate localization of the occurrences of all thematic patterns in videos. The major contributions are two-fold. First, we formulate the thematic video pattern discovery as a cohesive sub-graph selection problem by finding a sub-set of visual words that are spatio-temporally collocated. Then spatio-temporal branch-and-bound search can locate all instances accurately. Second, a novel method is proposed to efficiently find the cohesive sub-graph of maximum overall mutual information scores. Our experimental results on challenging commercial and action videos show that our approach can discover different types of thematic patterns despite variations in scale, view-point, color and lighting conditions, or partial occlusions. Our approach is also robust to the videos with cluttered and dynamic backgrounds.

【Keywords】: data mining; graph theory; hidden feature removal; search problems; video signal processing; automatic discovery; cohesive subgraph mining; cohesive subgraph selection problem; lighting condition; overall mutual information score; partial occlusion; spatiotemporal branch and bound search; thematic video pattern discovery; video contents; visual word subset; Data mining; Feature extraction; Pattern matching; Vectors; Video sequences; Videos; Visualization; cohesive subgraph mining; thematic pattern; unsupervised

146. Low Rank Metric Learning with Manifold Regularization.

【Paper Link】【Pages】:1266-1271

【Authors】: Guoqiang Zhong ; Kaizhu Huang ; Cheng-Lin Liu

【Abstract】: In this paper, we present a semi-supervised method to learn a low rank Mahalanobis distance function. Based on an approximation to the projection distance from a manifold, we propose a novel parametric manifold regularizer. In contrast to previous approaches that usually exploit side information only, our proposed method can further take advantages of the intrinsic manifold information from data. In addition, we focus on learning a metric of low rank directly, this is different from traditional approaches that often enforce the l₁ norm on the metric. The resulting configuration is convex with respect to the manifold structure and the distance function, respectively. We solve it with an alternating optimization algorithm, which proves effective to find a satisfactory solution. For efficient implementation, we even present a fast algorithm, in which the manifold structure and the distance function are learned independently without alternating minimization. Experimental results over 12 standard UCI data sets demonstrate the advantages of our method.

【Keywords】: approximation theory; iterative methods; learning (artificial intelligence); optimisation; intrinsic manifold information; low rank Mahalanobis distance function; low rank metric learning; manifold regularization; optimization algorithm; parametric manifold regularizer; semisupervised method; Eigenvalues and eigenfunctions; Euclidean distance; Iterative methods; Manifolds; Optimization; Training; Semi-supervised metric learning; fixed-point iterative algorithm; linearly constrained nuclear norm minimization; low rank; manifold regularization

147. A Study of Laplacian Spectra of Graph for Subgraph Queries.

【Paper Link】【Pages】:1272-1277

【Authors】: Lei Zhu ; Qinbao Song

【Abstract】: The spectrum of graph has been widely used in graph mining to extract graph topological information. It has also been employed as a characteristic of graph to check the sub graph isomorphism testing since it is an invariant of a graph. However, the spectrum cannot be directly applied to a graph and its sub graph, which is a bottleneck for sub graph isomorphism testing. In this paper, we study the Laplacian spectra between a graph and its sub graph, and propose a method by straightforward adoption of them for sub graph queries. In our proposed method, we first encode every vertex and graph by extracting their Laplacian spectra, and generate a novel two-step filtering conditions. Then, we follow the filtering-and verification framework to conduct sub graph queries. Extensive experiments show that, compared with existing counterpart method, as a graph feature, Laplacian spectra can be used to efficiently improves the efficiency of sub graph queries and thus indicate that it have considerable potential.

【Keywords】: data mining; graph theory; information filtering; query processing; Laplacian spectra; flltering and veriflcation framework; graph feature; graph mining; graph spectrum; graph topological information extraction; subgraph isomorphism testing; subgraph queries; two-step filtering condition; Data mining; Eigenvalues and eigenfunctions; Feature extraction; Filtering; Indexes; Laplace equations; Laplacian spectra of graph; graph mining; subgraph queries

148. Text Clustering via Constrained Nonnegative Matrix Factorization.

【Paper Link】【Pages】:1278-1283

【Authors】: Yan Zhu ; Liping Jing ; Jian Yu

【Abstract】: Semi-supervised nonnegative matrix factorization (NMF)receives more and more attention in text mining field. The semi-supervised NMF methods can be divided into two types, one is based on the explicit category labels, the other is based on the pair wise constraints including must-link and cannot-link. As it is hard to obtain the category labels in some tasks, the latter one is more widely used in real applications. To date, all the constrained NMF methods treat the must-link and cannot-link constraints in a same way. However, these two kinds of constraints play different roles in NMF clustering. Thus a novel constrained NMF method is proposed in this paper. In the new method, must-link constraints are used to control the distance of the data in the compressed form, and cannot-ink constraints are used to control the encoding factor. Experimental results on real-world text data sets have shown the good performance of the proposed method.

【Keywords】: matrix decomposition; text analysis; cannot link constraints; constrained nonnegative matrix factorization; must link constraints; text clustering; text mining; Clustering methods; Complexity theory; Data models; Encoding; Gradient methods; Matrix decomposition; Vectors; nonnegative matrix factorization; pairwise constraints; semi-supervised clustering; text clustering