ICDM 2010, The 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:5
【Authors】: Christos Faloutsos
【Abstract】: What do graphs look like? How do they evolve over time? How to handle a graph with a billion nodes? We present a comprehensive list of static and temporal laws, and some recent observations on real graphs (e.g., "eigenSpokes"). For generators, we describe some recent ones, which naturally match all of the known properties of real graphs. Finally, for tools, we present "oddball" for discovering anomalies and patterns, as well as an overview of the PEGASUS system which is designed for handling Billion-node graphs, running on top of the "hadoop" system.
【Keywords】: data mining; graph theory; PEGASUS system; billion-node graphs; data mining; hadoop system; oddball; Data mining; Generators; Graphics; Streaming media
【Paper Link】 【Pages】:6
【Authors】: Geoffrey J. McLachlan
【Abstract】:
【Keywords】: bioinformatics; data mining; pattern classification; sampling methods; bioinformatics literature; bootstrap sample; cluster analytic procedure; data mining; error rate estimation; factor analytic model; high dimensional data; labelled training data; null hypothesis; optimistic assessment; resampling approach; supervised classification
【Paper Link】 【Pages】:7
【Authors】: Xindong Wu
【Abstract】:
【Keywords】: Artificial intelligence; Committees; Computer science; Conferences; Data mining; IEEE Computer Society; Service awards
【Paper Link】 【Pages】:8-17
【Authors】: James Abello ; Tina Eliassi-Rad ; Nishchal Devanur
【Abstract】: We address the problem of detecting characteristic patterns in communication networks. We introduce a scalable approach based on set-system discrepancy. By implicitly labeling each network edge with the sequence of times in which its two endpoints communicate, we view an entire communication network as a set-system. This view allows us to use combinatorial discrepancy as a mechanism to "observe" system behavior at different time scales. We illustrate our approach, called Discrepancy-based Novelty Detector (DND), on networks obtained from emails, blue tooth connections, IP traffic, and tweets. DND has almost linear runtime complexity and linear storage complexity in the number of communications. Examples of novel discrepancies that it detects are (i) asynchronous communications and (ii) disagreements in the firing rates of nodes and edges relative to the communication network as a whole.
【Keywords】: computational complexity; network theory (graphs); set theory; telecommunication networks; communication networks; discrepancy based novelty detector; linear runtime complexity; linear storage complexity; network edge; set system discrepancy; Novelty detection; communication networks; set-system discrepancy
【Paper Link】 【Pages】:18-27
【Authors】: Morteza Alamgir ; Ulrike von Luxburg
【Abstract】: We consider the problem of local graph clustering where the aim is to discover the local cluster corresponding to a point of interest. The most popular algorithms to solve this problem start a random walk at the point of interest and let it run until some stopping criterion is met. The vertices visited are then considered the local cluster. We suggest a more powerful alternative, the multi-agent random walk. It consists of several ``agents'' connected by a fixed rope of length l. All agents move independently like a standard random walk on the graph, but they are constrained to have distance at most l from each other. The main insight is that for several agents it is harder to simultaneously travel over the bottleneck of a graph than for just one agent. Hence, the multi-agent random walk has less tendency to mistakenly merge two different clusters than the original random walk. In our paper we analyze the multi-agent random walk theoretically and compare it experimentally to the major local graph clustering algorithms from the literature. We find that our multi-agent random walk consistently outperforms these algorithms.
【Keywords】: data mining; graphs; multi-agent systems; pattern clustering; data mining; graph clustering; multiagent random walk; Graph Clustering; Local Clustering; Mixing Time; Random Walk
【Paper Link】 【Pages】:28-37
【Authors】: S. Tom Au ; Rong Duan ; Heeyoung Kim ; Guangqin Ma
【Abstract】: Learning and identifying events in network traffic is crucial for service providers to improve their mobility network performance. In fact, large special events attract cell phone users to relative small areas, which causes sudden surge in network traffic. To handle such increased load, it is necessary to measure the increased network traffic and quantify the impact of the events, so that relevant resources can be optimized to enhance the network capability. However, this problem is challenging due to several issues: (1) Multiple periodic temporal traffic patterns (i.e., nonhomogeneous process) even for normal traffic, (2) Irregularly distributed spatial neighbor information, (3) Different temporal patterns driven by different events even for spatial neighborhoods, (4) Large scale data set. This paper proposes a systematic event detection method that deals with the above problems. With the additivity property of Poisson process, we propose an algorithm to integrate spatial information by aggregating the behavior of temporal data under various areas. Markov Modulated Nonhomogeneous Poisson Process (MMNHPP) is employed to estimate the probability with which event happens, when and where the events take place, and assess the spatial and temporal impacts of the events. Localized events are then ranked globally for prioritizing more significant events. Synthetic data are generated to illustrate our procedure and validate the performance. An industrial example from a telecommunication company is also presented to show the effectiveness of the proposed method.
【Keywords】: Markov processes; mobility management (mobile radio); optimisation; telecommunication traffic; Markov modulated nonhomogeneous Poisson process; cell phone users; irregularly distributed spatial neighbor information; mobility network performance; multiple periodic temporal traffic patterns; network traffic; probability estimate; spatiotemporal event detection; telecommunication company; Event Detection; Markov Modulated Nonhomogeneous Poisson Process; Network Traffic; Spatiotemporal
【Paper Link】 【Pages】:38-47
【Authors】: Tengfei Bao ; Happia Cao ; Enhong Chen ; Jilei Tian ; Hui Xiong
【Abstract】: Mobile context modeling is a process of recognizing and reasoning about contexts and situations in a mobile environment, which is critical for the success of context-aware mobile services. While there are prior work on mobile context modeling, the use of unsupervised learning techniques for mobile context modeling is still under-explored. Indeed, unsupervised techniques have the ability to learn personalized contexts which are difficult to be predefined. To that end, in this paper, we propose an unsupervised approach to modeling personalized contexts of mobile users. Along this line, we first segment the raw context data sequences of mobile users into context sessions where a context session contains a group of adjacent context records which are mutually similar and usually reflect the similar contexts. Then, we exploit topic models to learn personalized contexts in the form of probabilistic distributions of raw context data from the context sessions. Finally, experimental results on real-world data show that the proposed approach is efficient and effective for mining personalized contexts of mobile users.
【Keywords】: data mining; mobile computing; personal information systems; unsupervised learning; context record; mobile environment; mobile user; personalized context mining; personalized context modeling; probabilistic distribution; raw context data sequence; unsupervised learning technique; mobile context modeling; unsupervised approach
【Paper Link】 【Pages】:48-57
【Authors】: Kanishka Bhaduri ; Qiang Zhu ; Nikunj C. Oza ; Ashok N. Srivastava
【Abstract】: Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases, that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem - (1) an R*-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several millions of observations. Both these tests show that our algorithms have very high prune rates (>;95%) thus needing actual disk access for only less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.
【Keywords】: data mining; query processing; time series; tree searching; ubiquitous computing; R-tree based search; flexible multivariate time series subsequence search; flexible subsequence search framework; list based search; massive MTS databases; minimum bounding rectangle; multivariate patterns; provably correct algorithm; multivariate analysis; similarity search
【Paper Link】 【Pages】:58-67
【Authors】: Alessandro Camerra ; Themis Palpanas ; Jin Shieh ; Eamonn J. Keogh
【Abstract】: There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of time series. Examples of such applications come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve numbers of time series in the order of hundreds of millions to billions. However, all relevant techniques that have been proposed in the literature so far have not considered any data collections much larger than one-million time series. In this paper, we describe iSAX 2.0, a data structure designed for indexing and mining truly massive collections of time series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of this kind specifically tailored to a time series index. We show how our method allows mining on datasets that would otherwise be completely untenable, including the first published experiments to index one billion time series, and experiments in mining massive data from domains as diverse as entomology, DNA and web-scale image collections.
【Keywords】: data mining; indexing; time series; data collection; data mining; data structure; iSAX 2.0; indexable symbolic aggregate approximation; indexing; time series; data mining; indexing; representations; time series
【Paper Link】 【Pages】:68-77
【Authors】: Cornelia Caragea ; Adrian Silvescu ; Doina Caragea ; Vasant Honavar
【Abstract】: High accuracy sequence classification often requires the use of higher order Markov models (MMs). However, the number of MM parameters increases exponentially with the range of direct dependencies between sequence elements, thereby increasing the risk of over fitting when the data set is limited in size. We present abstraction augmented Markov models (AAMMs) that effectively reduce the number of numeric parameters of kth order MMs by successively grouping strings of length k (i.e., k-grams) into abstraction hierarchies. We evaluate AAMMs on three protein sub cellular localization prediction tasks. The results of our experiments show that abstraction makes it possible to construct predictive models that use significantly smaller number of features (by one to three orders of magnitude) as compared to MMs. AAMMs are competitive with and, in some cases, significantly outperform MMs. Moreover, the results show that AAMMs often perform significantly better than variable order Markov models, such as decomposed context tree weighting, prediction by partial match, and probabilistic suffix trees.
【Keywords】: Markov processes; bioinformatics; cellular biophysics; feature extraction; pattern classification; prediction theory; proteins; MM parameter; abstraction augmented Markov model; abstraction hierarchy; decomposed context tree weighting; numeric parameter; predictive model; probabilistic suffix tree; protein subcellular localization prediction task; sequence classification; Markov models; abstraction; sequence classification
【Paper Link】 【Pages】:78-87
【Authors】: Sharma Chakravarthy ; Aravind Venkatachalam ; Aditya Telang
【Abstract】: This paper presents a novel framework for multi-folder email classification using graph mining as the underlying technique. Although several techniques exist (e.g., SVM, TF-IDF, n-gram) for addressing this problem in a delimited context, they heavily rely on extracting high-frequency keywords, thus ignoring the inherent structural aspects of an email (or document in general) which can play a critical role in classification. Some of the models (e.g., n-gram) consider only the words without taking into consideration where in the structure these words appear together. This paper presents a supervised learning model that leverages graph mining techniques for multi-folder email classification. A ranking formula is presented for ordering the representative - common and recurring - substructures generated from pre-classified emails. These ranked representative substructures are then used for categorizing incoming emails. This approach is based on a global ranking model that incorporates several relevant parameters for email classification and overcomes numerous problems faced by extant approaches used for multi-folder classification. A number of parameters which influence the generation of representative substructures are analyzed, reexamined, and adapted to multiple folders. The effect of graph representations has been analyzed. The effectiveness of the proposed approach has been validated experimentally.
【Keywords】: data mining; data structures; document handling; electronic mail; graphs; learning (artificial intelligence); pattern classification; graph based approach; graph mining; graph representation; keyword extraction; multifolder e-mail classification; ranking formula; representative data substructure; supervised learning
【Paper Link】 【Pages】:88-97
【Authors】: Wei Chen ; Yifei Yuan ; Li Zhang
【Abstract】: Influence maximization is the problem of finding a small set of most influential nodes in a social network so that their aggregated influence in the network is maximized. In this paper, we study influence maximization in the linear threshold model, one of the important models formalizing the behavior of influence propagation in social networks. We first show that computing exact influence in general networks in the linear threshold model is #P-hard, which closes an open problem left in the seminal work on influence maximization by Kempe, Kleinberg, and Tardos, 2003. As a contrast, we show that computing influence in directed a cyclic graphs (DAGs) can be done in time linear to the size of the graphs. Based on the fast computation in DAGs, we propose the first scalable influence maximization algorithm tailored for the linear threshold model. We conduct extensive simulations to show that our algorithm is scalable to networks with millions of nodes and edges, is orders of magnitude faster than the greedy approximation algorithm proposed by Kempe et al. and its optimized versions, and performs consistently among the best algorithms while other heuristic algorithms not design specifically for the linear threshold model have unstable performances on different real-world networks.
【Keywords】: approximation theory; directed graphs; greedy algorithms; optimisation; social networking (online); #P-hard; directed acyclic graphs; greedy approximation algorithm; linear threshold model; scalable influence maximization algorithm; social networks; influence maximization; linear threshold model; social networks
【Paper Link】 【Pages】:98-107
【Authors】: Alzennyr Da Silva ; Raja Chiky ; Georges Hébrail
【Abstract】: The growing usage of embedded devices and sensors in our daily lives has been profoundly reshaping the way we interact with our environment and our peers. As more and more sensors will pervade our future cities, increasingly efficient infrastructures to collect, process, and store massive amounts of data streams from a wide variety of sources will be required. Despite the different application-specific features and hardware platforms, sensor network applications share a common goal: periodically sample and store data collected from different sensors in a common persistent memory. In this article we present a clustering approach for rapidly and efficiently computing the best sampling rate which minimizes the SSE (Sum of Square Errors) for each particular sensor in a network. In order to evaluate the efficiency of the proposed approach, we carried out experiments on real electric power consumption data streams produced by a 1-thousand sensor network provided by the French energy group-EDF (Electricite de France).
【Keywords】: data acquisition; pattern clustering; sampling methods; sensor fusion; CLUSMASTER; data stream; sampling rate; sensor network; clustering; data streams; sampling; sensor network
【Paper Link】 【Pages】:108-117
【Authors】: Bo Dai ; Baogang Hu ; Gang Niu
【Abstract】: Most well-known discriminative clustering models, such as spectral clustering (SC) and maximum margin clustering (MMC), are non-Bayesian. Moreover, they merely considered to embed domain-dependent prior knowledge into data-specific kernels, while other forms of prior knowledge were seldom considered in these models. In this paper, we propose a Bayesian maximum margin clustering model (BMMC) based on the low-density separation assumption, which unifies the merits of both Bayesian and discriminative approaches. In addition to stating prior distribution on functions explicitly as traditional Gaussian processes, special prior knowledge can be embedded into BMMC implicitly via the Universum set easily. Furthermore, it is much easier to solve a BMMC than an MMC since the integer variables in the optimization are eliminated. Experimental results show that the BMMC achieves comparable or even better performance than state-of-the-art clustering methods and solving BMMC is more efficiently.
【Keywords】: Bayes methods; Gaussian processes; data mining; optimisation; pattern clustering; BMMC; Bayesian maximum margin clustering; Bayesian maximum margin clustering model; Gaussian process; Universum set; data specific kernel; discriminative approach; discriminative clustering model; domain dependent prior knowledge; integer variable; low density separation assumption; nonBayesian clustering; optimization; prior distribution; spectral clustering; Bayesian; Clustering; Maximum Margin Principle; Universum
【Paper Link】 【Pages】:118-127
【Authors】: Samik Datta ; Anirban Majumder ; Nisheeth Shrivastava
【Abstract】: Viral Marketing, the idea of exploiting social interactions of users to propagate awareness for products, has gained considerable focus in recent years. One of the key issues in this area is to select the best seeds that maximize the influence propagated in the social network. In this paper, we define the seed selection problem (called t-Influence Maximization, or t-IM) for multiple products. Specifically, given the social network and t products along with their seed requirements, we want to select seeds for each product that maximize the overall influence. As the seeds are typically sent promotional messages, to avoid spamming users, we put a hard constraint on the number of products for which any single user can be selected as a seed. In this paper, we design two efficient techniques for the t-IM problem, called Greedy and FairGreedy. The Greedy algorithm uses simple greedy hill climbing, but still results in a 1/3-approximation to the optimum. Our second technique, FairGreedy, allocates seeds with not only high overall influence (close to Greedy in practice), but also ensures fairness across the influence of different products. We also design efficient heuristics for estimating the influence of the selected seeds, that are crucial for running the seed selection on large social network graphs. Finally, using extensive simulations on real-life social graphs, we show the effectiveness and scalability of our techniques compared to existing and naive strategies.
【Keywords】: greedy algorithms; marketing; optimisation; social networking (online); greedy algorithm; greedy hill climbing; multiple product; seed selection problem; social network; t influence maximization; viral marketing; influence propagation; social networks; viral marketing
【Paper Link】 【Pages】:128-137
【Authors】: Timothy de Vries ; Sanjay Chawla ; Michael E. Houle
【Abstract】: Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of data is very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest neighbour sets in the a reduced dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection(RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300000 elements and 102600 dimensions.
【Keywords】: approximation theory; data analysis; data mining; random processes; set theory; anomaly detection; approximation; data analysis; dimensionality reduction; k-nearest neighbour; local outlier factor; projection indexed nearest-neighbour; random projection; subquadratic indexing; anomaly detection; dimensionality reduction
【Paper Link】 【Pages】:138-147
【Authors】: Trong Dinh Thac Do ; Anne Laurent ; Alexandre Termier
【Abstract】: Numerical data (e.g., DNA micro-array data, sensor data) pose a challenging problem to existing frequent pattern mining methods which hardly handle them. In this framework, gradual patterns have been recently proposed to extract covariations of attributes, such as: "When X increases, Y decreases". There exist some algorithms for mining frequent gradual patterns, but they cannot scale to real-world databases. We present in this paper GLCM, the first algorithm for mining closed frequent gradual patterns, which proposes strong complexity guarantees: the mining time is linear with the number of closed frequent gradual item sets. Our experimental study shows that GLCM is two orders of magnitude faster than the state of the art, with a constant low memory usage. We also present PGLCM, a parallelization of GLCM capable of exploiting multicore processors, with good scale-up properties on complex datasets. These algorithms are the first algorithms capable of mining large real world datasets to discover gradual patterns.
【Keywords】: data mining; multiprocessing systems; set theory; closed frequent gradual itemset; linear mining time; multicore processor; numerical data; pattern mining; Data mining; frequent pattern mining; gradual itemsets; parallelism
【Paper Link】 【Pages】:148-157
【Authors】: Lan Du ; Wray Lindsay Buntine ; Huidong Jin
【Abstract】: Understanding how topics within a document evolve over its structure is an interesting and important problem. In this paper, we address this problem by presenting a novel variant of Latent Dirichlet Allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e., a document consists of multiple segments (e.g., chapters, paragraphs), each of which is correlated to its previous and subsequent segments. In our model, a document and its segments are modelled as random mixtures of the same set of latent topics, each of which is a distribution over words; and the topic distribution of each segment depends on that of its previous segment, the one for first segment will depend on the document topic distribution. The progressive dependency is captured by using the nested two-parameter Poisson Dirichlet process (PDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the PDP. Our experimental results on patent documents show that by taking into account the sequential structure within a document, our SeqLDA model has a higher fidelity over LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a nicer sequential topic structure than LDA, as we show in experiments on books such as Melville's "The Whale".
【Keywords】: data mining; sampling methods; stochastic processes; word processing; Gibbs sampling; Poisson Dirichlet process; document structure; sequential Latent Dirichlet allocation; topic distribution; topic structure; Latent Dirichlet Allocation; Poisson-Dirichlet process; collapsed Gibbs sampler; document structure
【Paper Link】 【Pages】:158-167
【Authors】: Wouter Duivesteijn ; Arno J. Knobbe ; Ad Feelders ; Matthijs van Leeuwen
【Abstract】: Whenever a dataset has multiple discrete target variables, we want our algorithms to consider not only the variables themselves, but also the interdependencies between them. We propose to use these interdependencies to quantify the quality of subgroups, by integrating Bayesian networks with the Exceptional Model Mining framework. Within this framework, candidate subgroups are generated. For each candidate, we fit a Bayesian network on the target variables. Then we compare the network's structure to the structure of the Bayesian network fitted on the whole dataset. To perform this comparison, we define an edit distance-based distance metric that is appropriate for Bayesian networks. We show interesting subgroups that we experimentally found with our method on datasets from music theory, semantic scene classification, biology and zoogeography.
【Keywords】: belief networks; data mining; pattern classification; Bayesian network; distance-based distance metric; exceptional model mining approach; music theory; semantic scene classification; Bayesian networks; Exceptional Model Mining; Subgroup Discovery
【Paper Link】 【Pages】:168-175
【Authors】: Haytham Elghazel ; Alex Aussem
【Abstract】: In this paper, we propose another extension of the Random Forests paradigm to unlabeled data, leading to localized unsupervised feature selection (FS). We show that the way internal estimates are used to measure variable importance in Random Forests are also applicable to FS in unsupervised learning. We first illustrate the clustering performance of the proposed method on various data sets based on widely used external criteria of clustering quality. We then assess the accuracy and the scalability of the FS procedure on UCI and real labeled data sets and compare its effectiveness against other FS methods.
【Keywords】: feature extraction; pattern clustering; unsupervised learning; FS method; FS procedure; UCI; clustering performance; clustering quality; feature selection; random cluster ensemble; random forest paradigm; real labeled data set; unlabeled data; unsupervised learning; variable importance; Random Forest; Unsupervised learning; feature selection
【Paper Link】 【Pages】:176-185
【Authors】: Zeno Gantner ; Lucas Drumond ; Christoph Freudenthaler ; Steffen Rendle ; Lars Schmidt-Thieme
【Abstract】: Cold-start scenarios in recommender systems are situations in which no prior events, like ratings or clicks, are known for certain users or items. To compute predictions in such cases, additional information about users (user attributes, e.g. gender, age, geographical location, occupation) and items (item attributes, e.g. genres, product categories, keywords) must be used. We describe a method that maps such entity (e.g. user or item) attributes to the latent features of a matrix (or higher-dimensional) factorization model. With such mappings, the factors of a MF model trained by standard techniques can be applied to the new-user and the new-item problem, while retaining its advantages, in particular speed and predictive accuracy. We use the mapping concept to construct an attribute-aware matrix factorization model for item recommendation from implicit, positive-only feedback. Experiments on the new-item problem show that this approach provides good predictive accuracy, while the prediction time only grows by a constant factor.
【Keywords】: feedback; learning (artificial intelligence); matrix decomposition; recommender systems; attribute-to-feature mapping; cold-start recommendation; matrix factorization; new-item problem; new-user problem; recommender system; cold-start; collaborative filtering; factorization models; long tail; matrix factorization; recommender systems
【Paper Link】 【Pages】:186-195
【Authors】: Yuanyuan Guo ; Xiaoda Niu ; Harry Zhang
【Abstract】: Semi-supervised classification methods utilize unlabeled data to help learn better classifiers, when only a small amount of labeled data is available. Many semi-supervised learning methods have been proposed in the past decade. However, some questions have not been well answered, e.g., whether semi-supervised learning methods outperform base classifiers learned only from the labeled data, when different base classifiers are used, whether selecting unlabeled data with efforts is superior to random selection, and how the quality of the learned classifier changes at each iteration of learning process. This paper conducts an extensive empirical study on the performance of several commonly used semi-supervised learning methods when different Bayesian classifiers (NB, NBTree, TAN, HGC, HNB, and DNB) are used as the base classifier, respectively. Results on Transductive SVM and a graph-based semi-supervised learning method LLGC are also studied for comparison. The experimental results on 26 UCI datasets and 6 widely used benchmark datasets show that these semi-supervised learning methods generally do not obtain better performance than classifiers learned only from the labeled data. Moreover, for standard self-training and co-training, when selecting the most confident unlabeled instances during learning process, the performance is not necessarily better than that of random selection of unlabeled instances. We especially discovered interesting outcomes when drawing learning curves for using NB in self-training on some UCI datasets. The accuracy of the learned classifier on the testing set may fluctuate or decrease as more unlabeled instances are used. Also on the mushroom dataset, even when all the selected unlabeled instances are correctly labeled in each iteration, the accuracy on the testing set still goes down.
【Keywords】: Bayes methods; graphs; learning (artificial intelligence); pattern classification; Bayesian classifier; LLGC; UCI dataset; base classifier; benchmark dataset; graph based semisupervised learning method; learning curve; learning process; mushroom dataset; random selection; transductive SVM; unlabeled data; unlabeled instance; Bayesian classifiers; Semi-supervised learning
【Paper Link】 【Pages】:196-205
【Authors】: Wilhelmiina Hämäläinen
【Abstract】: Statistical dependency analysis is the basis of all empirical science. A commonly occurring problem is to find the most significant dependency rules, which describe either positive or negative dependencies between categorical attributes. For example, in medical science one is interested in genetic factors, which can either predispose or prevent diseases. The requirement of statistical significance is essential, because the discoveries should hold also in the future data. Typically, the significance is estimated either by Fisher's exact test or the χ2-measure. The problem is computationally very difficult, because the number of all possible dependency rules increases exponentially with the number of attributes. As a solution, different kinds of restrictions and heuristics have been applied, but a general, scalable search method has been missing. In this paper, we introduce an efficient algorithm for searching for the top-K globally optimal dependency rules using Fisher's exact test as a measure function. The rules can express either positive or negative dependencies between a set of positive attributes and a single consequent attribute. The algorithm is based on an application of the branch- and-bound search strategy, supplemented by several pruning properties. Especially, we prove a new lower-bound for the Fisher's p, and introduce a new effective pruning principle. The general search algorithm is applicable to other goodness measures, like the χ2-measure, as well. According to our experiments on classical benchmark data, the algorithm is well scalable and can efficiently handle even dense and high dimensional data sets. In addition, the quality of rules is significantly better than with the χ2-measure using the same search algorithm.
【Keywords】: data analysis; data mining; set theory; statistical analysis; tree searching; Fisher exact test; branch-and-bound search; consequent attribute; optimal dependency rules; positive attributes; pruning principle; search algorithm; statistical dependency analysis; Fisher's exact test; dependency rule; negative rule; rule discovery; statistical significance
【Paper Link】 【Pages】:206-215
【Abstract】: Besides high accuracy, stability of feature selection has recently attracted strong interest in knowledge discovery from high-dimensional data. In this study, we present a theoretical framework about the relationship between the stability and accuracy of feature selection based on a formal bias-variance decomposition of feature selection error. The framework also suggests a variance reduction approach for improving the stability of feature selection algorithms. Furthermore, we propose an empirical variance reduction framework, margin based instance weighting, which weights training instances according to their influence to the estimation of feature relevance. We also develop an efficient algorithm under this framework. Experiments based on synthetic data and real-world micro array data verify both the theoretical framework and the effectiveness of the proposed algorithm on variance reduction. The proposed algorithm is also shown to be effective at improving subset stability, while maintaining comparable classification accuracy based on selected features.
【Keywords】: Monte Carlo methods; data mining; feature extraction; lab-on-a-chip; stability; classification accuracy; feature selection error; formal bias variance decomposition; high dimensional data; knowledge discovery; margin based instance weighting; real world microarray data; stable feature selection; subset stability; synthetic data; variance reduction framework; bias-variance decomposition; feature selection; high-dimensional data; stability; variance reduction
【Paper Link】 【Pages】:216-225
【Authors】: Kohei Hayashi ; Takashi Takenouchi ; Tomohiro Shibata ; Yuki Kamiya ; Daishi Kato ; Kazuo Kunieda ; Keiji Yamada ; Kazushi Ikeda
【Abstract】: In this paper, we study probabilistic modeling of heterogeneously attributed multi-dimensional arrays. The model can manage the heterogeneity by employing an individual exponential-family distribution for each attribute of the tensor array. These entries are connected by latent variables and are shared information across the different attributes. Because a Bayesian inference for our model is intractable, we cast the EM algorithm approximated by using the Lap lace method and Gaussian process. This approximation enables us to derive a predictive distribution for missing values in a consistent manner. Simulation experiments show that our method outperforms other methods such as PARAFAC and Tucker decomposition in missing-values prediction for cross-national statistics and is also applicable to discover anomalies in heterogeneous office-logging data.
【Keywords】: Bayes methods; Gaussian processes; Laplace equations; belief networks; expectation-maximisation algorithm; inference mechanisms; matrix decomposition; prediction theory; sensor fusion; Bayesian inference; Gaussian process; Laplace method; anomaly detection; cross national statistic; exponential family tensor factorization; missing values prediction; multidimensional array; office logging data; probabilistic modeling; Bayesian probabilistic model; Gaussian process; data fusion; tensor factorization
【Paper Link】 【Pages】:226-235
【Authors】: Jingrui He ; Hanghang Tong ; Jaime G. Carbonell
【Abstract】: Rare categories abound and their characterization has heretofore received little attention. Fraudulent banking transactions, network intrusions, and rare diseases are examples of rare classes whose detection and characterization are of high value. However, accurate characterization is challenging due to high-skewness and non-separability from majority classes, e.g., fraudulent transactions masquerade as legitimate ones. This paper proposes the RACH algorithm by exploring the compactness property of the rare categories. It is based on an optimization framework which encloses the rare examples by a minimum-radius hyper ball. The framework is then converted into a convex optimization problem, which is in turn effectively solved in its dual form by the projected sub gradient method. RACH can be naturally kernelized. Experimental results validate the effectiveness of RACH.
【Keywords】: data handling; optimisation; RACH algorithm; convex optimization problem; fraudulent banking transactions; minimum-radius hyperball; network intrusions; rare category characterization; subgradient method; characterization; compactness; hyperball; minority class; optimization; rare category; subgradient
【Paper Link】 【Pages】:236-245
【Authors】: Zhen Hu ; Raj Bhatnagar
【Abstract】: The concept of Triclusters has been investigated recently in the context of two relational datasets that share labels along one of the dimensions. By simultaneously processing two datasets to unveil triclusters, new useful knowledge and insights can be obtained. However, some recently reported methods are either closely linked to specific problems or constrain datasets to have some specific distributions. Algorithms for generating triclusters whose cell-values demonstrate simple well known statistical properties, such as upper bounds on standard deviations, are needed for many applications. In this paper we present a 3-Clustering algorithm that searches for meaningful combinations of biclusters in two related datasets. The algorithm can handle situations involving: (i) datasets in which a few data objects may be present in only one dataset and not in both datasets, (ii) the two datasets may have different numbers of objects and/or attributes, and (iii) the cell-value distributions in two datasets may be different. In our formulation the cell-values of each selected tricluster, formed by two independent biclusters, are such that the standard deviations in each bicluster obeys an upper bound and the sets of objects in the two biclusters overlap to the maximum possible extent. We present validation of our algorithm by presenting the properties of the 3-Clusters discovered from a synthetic dataset and from a real world cross-species genomic dataset. The results of our algorithm unveil interesting insights for the cross-species genomic domain.
【Keywords】: data mining; pattern clustering; search problems; statistical analysis; cell-value distributions; data mining; low variance cluster; real valued dataset; relational datasets; standard deviation; statistical property; triclusters; Co-clustering; Triclusters
【Paper Link】 【Pages】:246-255
【Authors】: Sergey Ioffe
【Abstract】: We propose a new Consistent Weighted Sampling method, where the probability of drawing identical samples for a pair of inputs is equal to their Jaccard similarity. Our method takes deterministic constant time per non-zero weight, improving on the best previous approach which takes expected constant time. The samples can be used as Weighted Minhash for efficient retrieval and compression (sketching) under Jaccard or L1 metric. A method is presented for using simple data statistics to reduce the running time of hash computation by two orders of magnitude. We compare our method with the random projection method and show that it has better characteristics for retrieval under L1. We present a novel method of mapping hashes to short bit-strings, apply it to Weighted Minhash, and achieve more accurate distance estimates from sketches than existing methods, as long as the inputs are sufficiently distinct. We show how to choose the optimal number of bits per hash for sketching, and demonstrate experimental results which agree with the theoretical analysis.
【Keywords】: cryptography; data compression; file organisation; information retrieval; pattern matching; sampling methods; Jaccard similarity; L1 sketching; bit string; consistent weighted sampling method; data statistics; deterministic constant time; hash computation; random projection method; running time reduction; sample compression; sample retrieval; weighted Minhash; Compression; Hashing; Minhash; Retrieval; Sampling; Sketching
【Paper Link】 【Pages】:256-265
【Authors】: Peng Jiang ; Chunxia Zhang ; Hongping Fu ; Zhendong Niu ; Qing Yang
【Abstract】: Opinion mining is a challenging task to identify the opinions or sentiments underlying user generated contents, such as online product reviews, blogs, discussion forums, etc. Previous studies that adopt machine learning algorithms mainly focus on designing effective features for this complex task. This paper presents our approach based on tree kernels for opinion mining of online product reviews. Tree kernels alleviate the complexity of feature selection and generate effective features to satisfy the special requirements in opinion mining. In this paper, we define several tree kernels for sentiment expression extraction and sentiment classification, which are subtasks of opinion mining. Our proposed tree kernels encode not only syntactic structure information, but also sentiment related information, such as sentiment boundary and sentiment polarity, which are important features to opinion mining. Experimental results on a benchmark data set indicate that tree kernels can significantly improve the performance of both sentiment expression extraction and sentiment classification. Besides, a linear combination of our proposed tree kernels and traditional feature vector kernel achieves the best performances using the benchmark data set.
【Keywords】: behavioural sciences computing; classification; data mining; feature extraction; natural language processing; text analysis; trees (mathematics); blogs; feature selection; feature vector kernel; machine learning; online product reviews; opinion mining; sentiment classification; sentiment expression extraction; syntactic structure information; tree kernels; opinion mining; sentiment analysis; text mining; tree kernels
【Paper Link】 【Pages】:266-273
【Authors】: Md. Enamul Kabir ; Hua Wang ; Yanchun Zhang
【Abstract】: Microdata protection in statistical databases has recently become a major societal concern and has been intensively studied in recent years. Statistical Disclosure Control (SDC) is often applied to statistical databases before they are released for public use. Micro aggregation for SDC is a family of methods to protect micro data from individual identification. SDC seeks to protect micro data in such a way that can be published and mined without providing any private information that can be linked to specific individuals. Micro aggregation works by partitioning the micro data into groups of at least k records and then replacing the records in each group with the centroid of the group. An optimal micro aggregation method must minimize the information loss resulting from this replacement process. The challenge is how to minimize the information loss during the micro aggregation process. This paper presents a pair wise systematic (P-S) micro aggregation method to minimize the information loss. The proposed technique simultaneously forms two distant groups at a time with the corresponding similar records together in a systematic way and then anonymized with the centroid of each group individually. The structure of P-S problem is defined and investigated and an algorithm of the proposed problem is developed. The performance of the P-S algorithm is compared against the most recent micro aggregation methods. Experimental results show that P-S algorithm incurs less than half information loss than the latest micro aggregation methods for all of the test situations.
【Keywords】: data mining; data privacy; security of data; set theory; statistical databases; information loss minimization; k-anonymity; microdata protection; pairwise systematic microaggregation; statistical databases; statistical disclosure control; ?-anonymity; Disclosure control; Microaggregation; Microdata protection; Privacy
【Paper Link】 【Pages】:274-283
【Authors】: Xiangnan Kong ; Philip S. Yu
【Abstract】: Nowadays, the classification of graph data has become an important and active research topic in the last decade, which has a wide variety of real world applications, e.g. drug activity predictions and kinase inhibitor discovery. Current research on graph classification focuses on single-label settings. However, in many applications, each graph data can be assigned with a set of multiple labels simultaneously. Extracting good features using multiple labels of the graphs becomes an important step before graph classification. In this paper, we study the problem of multi-label feature selection for graph classification and propose a novel solution, called gMLC, to efficiently search for optimal sub graph features for graph objects with multiple labels. Different from existing feature selection methods in vector spaces which assume the feature set is given, we perform multi-label feature selection for graph data in a progressive way together with the sub graph feature mining process. We derive an evaluation criterion, named gHSIC, to estimate the dependence between sub graph features and multiple labels of graphs. Then a branch-and-bound algorithm is proposed to efficiently search for optimal sub graph features by judiciously pruning the sub graph search space using multiple labels. Empirical studies on real-world tasks demonstrate that our feature selection approach can effectively boost multi-label graph classification performances and is more efficient by pruning the sub graph search space using multiple labels.
【Keywords】: data mining; feature extraction; graph theory; pattern classification; tree searching; branch-and-bound algorithm; feature extraction; feature mining; graph classification; graph object; multilabel feature selection; search space; vector space; feature selection; graph classification; multi-label learning
【Paper Link】 【Pages】:284-293
【Authors】: Takuro Kutsuna
【Abstract】: We propose a novel approach for one-class classification problems where a logical formula is used to estimate the region that covers all examples. A formula is viewed as a model that represents a region and is approximated with respect to its hierarchical local densities. The approximation is done quite efficiently via direct manipulations of a binary decision diagram that is a compressed representation of a Boolean formula. The proposed method has only one parameter to be tuned, and the parameter can be selected properly with the help of the minimum description length principle, which requires no labeled training data. In other words, a one-class classifier is generated from an unlabeled training data thoroughly and automatically. Experimental results show that the proposed method works quite well with synthetic data and some realistic data.
【Keywords】: approximation theory; binary decision diagrams; pattern classification; unsupervised learning; Boolean formula; approximation method; binary decision diagram; minimum description length principle; one class classifier; parameter tuned; unlabeled training data; binary decision diagram; minimum description length principle; one-class classification
【Paper Link】 【Pages】:294-303
【Authors】: Zhongmou Li ; Hui Xiong ; Yanchi Liu ; Aoying Zhou
【Abstract】: In this paper, we formulate a novel problem for finding black hole and volcano patterns in a large directed graph. Specifically, a black hole pattern is a group which is made of a set of nodes in a way such that there are only in links to this group from the rest nodes in the graph. In contrast, a volcano pattern is a group which only has out links to the rest nodes in the graph. Both patterns can be observed in real world. For instance, in a trading network, a black hole pattern may represent a group of traders who are manipulating the market. In the paper, we first prove that the black hole mining problem is a dual problem of finding volcanoes. Therefore, we focus on finding the black hole patterns. Along this line, we design two pruning schemes to guide the black hole finding process. In the first pruning scheme, we strategically prune the search space based on a set of pattern-size-independent pruning rules and develop an iBlack hole algorithm. The second pruning scheme follows a divide-and-conquer strategy to further exploit the pruning results from the first pruning scheme. Indeed, a target directed graphs can be divided into several disconnected sub graphs by the first pruning scheme, and thus the black hole finding can be conducted in each disconnected sub graph rather than in a large graph. Based on these two pruning schemes, we also develop an iBlackhole-DC algorithm. Finally, experimental results on real-world data show that the iBlackhole-DC algorithm can be several orders of magnitude faster than the iBlackhole algorithm, which has a huge computational advantage over a brute-force method.
【Keywords】: data mining; directed graphs; divide and conquer methods; financial data processing; fraud; stock markets; blackhole mining; blackhole pattern; directed graph; directed network; divide-and-conquer strategy; iBlackhole algorithm; pruning scheme; search space; trading network; volcano pattern; blackhole pattern; fraud detection; graph mining; network model; volcano pattern
【Paper Link】 【Pages】:304-313
【Authors】: Bo Liu ; Jie Yin ; Yanshan Xiao ; Longbing Cao ; Philip S. Yu
【Abstract】: This paper presents a novel hybrid approach to outlier detection by incorporating local data uncertainty into the construction of a global classifier. To deal with local data uncertainty, we introduce a confidence value to each data example in the training data, which measures the strength of the corresponding class label. Our proposed method works in two steps. Firstly, we generate a pseudo training dataset by computing a confidence value of each input example on its class label. We present two different mechanisms: kernel k-means clustering algorithm and kernel LOF-based algorithm, to compute the confidence values based on the local data behavior. Secondly, we construct a global classifier for outlier detection by generalizing the SVDD-based learning framework to incorporate both positive and negative examples as well as their associated confidence values. By integrating local and global outlier detection, our proposed method explicitly handles the uncertainty of the input data and enhances the ability of SVDD in reducing the sensitivity to noise. Extensive experiments on real life datasets demonstrate that our proposed method can achieve a better tradeoff between detection rate and false alarm rate as compared to four state-of-the-art outlier detection algorithms.
【Keywords】: data description; learning (artificial intelligence); pattern clustering; probability; support vector machines; uncertainty handling; SVDD based learning; boost global outlier detection; kernel LOF based algorithm; kernel k- means clustering; local data uncertainty; pseudo training dataset; support vector data description; Data uncertainty; Outlier detection; SVDD
【Paper Link】 【Pages】:314-323
【Authors】: Jie Liu ; Kai Yu ; Yi Zhang ; Yalou Huang
【Abstract】: Recently, combining Conditional Random Fields (CRF) with Neural Network has shown the success of learning high-level features in sequence labeling tasks. However, such models are difficult to train because of the increase of the parameters to tune which needs enormous of labeled data to avoid over fitting. In this paper, we propose a transfer learning framework for the sequence labeling task of gesture recognition. Taking advantage of the frame correlation, we design an unsupervised sequence model as a pseudo auxiliary task to capture the underlying information from both the labeled and unlabeled data. The knowledge learnt by the auxiliary task can be transferred to the main task of CRF with a deep architecture by sharing the hidden layers, which is very helpful for learning meaningful representation and reducing the need of labeled data. We evaluate our model under 3 gesture recognition datasets. The experimental results of both supervised learning and semi-supervised learning show that the proposed model improves the performance of the CRF with Neural Network and other baseline models.
【Keywords】: gesture recognition; unsupervised learning; conditional random field training; frame correlation; gesture recognition; neural network; semisupervised learning; sequence labeling tasks; transfer learning; unsupervised sequence model; Conditional Random Fields; Gesture Recognition; Semi-supervised Learning; Transfer Learning
【Paper Link】 【Pages】:324-333
【Authors】: Tantan Liu ; Fan Wang ; Gagan Agrawal
【Abstract】: In recent years, one mode of data dissemination has become extremely popular, which is the deep web. Like any other data source, data mining on the deep web can produce important insights or summary of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. In this paper, we target two related data mining problems, which are association mining and differential rule mining. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which processes the query space of a deep web data source recursively, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experiment results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
【Keywords】: Internet; data mining; distributed databases; optimisation; query processing; sampling methods; Web database; association mining; data dissemination; data mining; deep Web; differential rule mining; estimation error; query space; sample allocation method; sampling costs; stratified sampling methods; Data Mining; Deep Web; Stratified Sampling
【Paper Link】 【Pages】:334-343
【Authors】: Daniel Lowd ; Jesse Davis
【Abstract】: Traditional Markov network structure learning algorithms perform a search for globally useful features. However, these algorithms are often slow and prone to finding local optima due to the large space of possible structures. Ravikumar et al. recently proposed the alternative idea of applying L1 logistic regression to learn a set of pair wise features for each variable, which are then combined into a global model. This paper presents the DTSL algorithm, which uses probabilistic decision trees as the local model. Our approach has two significant advantages: it is more efficient, and it is able to discover features that capture more complex interactions among the variables. Our approach can also be seen as a method for converting a dependency network into a consistent probabilistic model. In an extensive empirical evaluation on 13 datasets, our algorithm obtains comparable accuracy to three standard structure learning algorithms while running 1-4 orders of magnitude faster.
【Keywords】: Markov processes; decision trees; learning (artificial intelligence); probability; DTSL algorithm; Markov network structure; decision tree structure learner; logistic regression; probabilistic decision tree; structure learning algorithm; Markov networks; decision trees; probabilistic methods; structure learning
【Paper Link】 【Pages】:344-353
【Authors】: Dijun Luo ; Chris H. Q. Ding ; Heng Huang
【Abstract】: In many cases of machine learning or data mining applications, we are not only aimed to establish accurate black box predictors, we are also interested in discovering predictive patterns in data which enhance our interpretation and understanding of underlying physical, biological and other natural processes. Sparse representation is one of the focuses in this direction. More recently, structural sparsity has attracted increasing attentions. The structural sparsity is often achieved by imposing ℓ2/ℓ1 norms. In this paper, we present the explicit ℓ2/ℓ0 norm to directly achieve structural sparsity. To tackle the problem of intractable ℓ2/ℓ0 optimization, we develop a general Lipschitz auxiliary function which leads to simple iterative algorithms. In each iteration, optimal solution is achieved for the induced sub-problem and a guarantee of convergence is provided. Further more, the local convergent rate is also theoretically bounded. We test our optimization techniques in the multi-task feature learning problem. Experimental results suggest that our approaches outperform other approaches in both synthetic and real world data sets.
【Keywords】: convergence of numerical methods; data mining; iterative methods; learning (artificial intelligence); optimisation; sparse matrices; Lipschitz auxiliary function; biological process; black box predictor; convergent rate; data mining; explicit l2/l0 approach; iterative algorithm; machine learning; multitask feature learning; natural process; optimization technique; physical process; predictive data pattern; sparse representation; structural sparsity; ?2/?0-norm; Non-smooth optimization; Structural sparsity
【Paper Link】 【Pages】:354-363
【Authors】: Tengfei Ma ; Xiaojun Wan
【Abstract】: Document summarization plays an important role in the area of natural language processing and text mining. This paper proposes several novel information-theoretic models for multi-document summarization. They consider document summarization as a transmission system and assume that the best summary should have the minimum distortion. By defining a proper distortion measure and a new representation method, the combination of the last two models (the linear representation model and the facility location model) gains good experimental results on the DUC2002 and DUC2004 datasets. Moreover, we also indicate that the model has high interpretability and extensibility.
【Keywords】: data mining; knowledge representation; natural language processing; text analysis; document summarization; information-theoretic model; natural language processing; text mining; J-S Divergence; information-theoretic summarization; linear representation; minimum distortion; multi-document summarization
【Paper Link】 【Pages】:364-373
【Authors】: Aditya Krishna Menon ; Charles Elkan
【Abstract】: In dyadic prediction, labels must be predicted for pairs (dyads) whose members possess unique identifiers and, sometimes, additional features called side-information. Special cases of this problem include collaborative filtering and link prediction. We present a new log-linear model for dyadic prediction that is the first to satisfy several important desiderata: (i) labels may be ordinal or nominal, (ii) side-information can be easily exploited if present, (iii) with or without side-information, latent features are inferred for dyad members, (iv) the model is resistant to sample-selection bias, (v) it can learn well-calibrated probabilities, and (vi) it can scale to large datasets. To our knowledge, no existing method satisfies all the above criteria. In particular, many methods assume that the labels are binary or numerical, and cannot use side-information. Experimental results show that the new method is competitive with previous specialized methods for collaborative filtering and link prediction. Other experimental results demonstrate that the new method succeeds for dyadic prediction tasks where previous methods cannot be used. In particular, the new method predicts nominal labels accurately, and by using side-information it solves the cold-start problem in collaborative filtering.
【Keywords】: groupware; information filtering; probability; Dyadic prediction; collaborative filtering; link prediction; log linear model; Dyadic prediction; collaborative filtering; link prediction; log-linear model
【Paper Link】 【Pages】:374-383
【Authors】: Pradeep Muthukrishnan ; Dragomir R. Radev ; Qiaozhu Mei
【Abstract】: The growth of the web has directly influenced the increase in the availability of relational data. One of the key problems in mining such data is computing the similarity between objects with heterogeneous feature types. For example, publications have many heterogeneous features like text, citations, authorship information, venue information, etc. In most approaches, similarity is estimated using each feature type in isolation and then combined in a linear fashion. However, this approach does not take advantage of the dependencies between the different feature spaces. In this paper, we propose a novel approach to combine the different sources of similarity using a regularization framework over edges in multiple graphs. We show that the objective function induced by the framework is convex. We also propose an efficient algorithm using coordinate descent to solve the optimization problem. We extrinsically evaluate the performance of the proposed unified similarity measure on two different tasks, clustering and classification. The proposed similarity measure outperforms three baselines and a state-of-the-art classification algorithm on a variety of standard, large data sets.
【Keywords】: convex programming; data mining; graph theory; pattern classification; pattern clustering; convex function; data classification; data clustering; data mining; edge weight regularization; multiple graph; optimization; regularization framework; relational data; similarity learning; Classification; Clustering; Heterogeneous Features; Machine Learning; Similarity Learning
【Paper Link】 【Pages】:384-392
【Authors】: Nam Nguyen
【Abstract】: In this paper, we address the problem of multi-instance multi-label learning (MIML) where each example is associated with not only multiple instances but also multiple class labels. In our novel approach, given an MIML example, each instance in the example is only associated with a single label and the label set of the example is the aggregation of all instance labels. Many real-world tasks such as scene classification, text categorization and gene sequence encoding can be properly formalized under our proposed approach. We formulate our MIML problem as a combination of two optimizations: (1) a quadratic programming (QP) that minimizes the empirical risk with L2-norm regularization, and (2) an integer programming (IP) assigning each instance to a single label. We also present an efficient method combining the stochastic gradient decent and alternating optimization approaches to solve our QP and IP optimizations. In our experiments with both an artificially generated data set and real-world applications, i.e. scene classification and text categorization, our proposed method achieves superior performance over existing state-of-the-art MIML methods such as MIMLBOOST, MIMLSVM, M3MIML and MIMLRBF.
【Keywords】: gradient methods; image classification; integer programming; learning (artificial intelligence); quadratic programming; support vector machines; text analysis; L2-norm regularization; M3MIML; MIML method; MIMLBOOST; MIMLRBF; MIMLSVM; SVM approach; alternating optimization approach; gene sequence encoding; integer programing; multiinstance multilabel learning; quadratic programming; real world application; scene classification; stochastic gradient decent; text categorization; Classification; Multi-Instance Multi-Label; SVM
【Paper Link】 【Pages】:393-402
【Authors】: Sunho Park ; Seungjin Choi
【Abstract】: Multiclass classification problems are often decomposed into multiple binary problems that are solved by individual binary classifiers whose results are integrated into a final answer. Various methods have been developed to aggregate binary classifiers, including voting heuristics, loss-based decoding, and probabilistic decoding methods, but a little work on the optimal aggregation has been done. In this paper we present a Bayesian method for optimally aggregating binary classifiers where class membership probabilities are determined by predictive probabilities. We model the class membership probability as a softmax function whose input argument is a linear combination of discrepancies between code words and probability estimates obtained by the binary classifiers. We consider a lower bound on the softmax function, which is represented as a product of logistic sigmoids, and we formulate the problem of learning aggregation weights as a variational logistic regression. Predictive probabilities computed by variational logistic regression yield the class membership probabilities. We stress two notable advantages over existing methods in the viewpoint of complexity and over fitting. Numerical experiments on several datasets confirm its useful behavior.
【Keywords】: Bayes methods; pattern classification; prediction theory; probability; regression analysis; variational techniques; Bayesian method; aggregating binary classifier; logistic sigmoid; multiclass classification; predictive probability; probabilistic decoding; softmax function; variational logistic regression; Classifier aggregation; multiclass classification; variational logistic regression
【Paper Link】 【Pages】:403-410
【Authors】: Sergey M. Plis ; Terran Lane ; Vince D. Calhoun
【Abstract】: Distributions over permutations arise in applications ranging from multi-object tracking to ranking of instances. The difficulty of dealing with these distributions is caused by the size of their domain, which is factorial in the number of considered entities (n!). It makes the direct definition of a multinomial distribution over permutation space impractical for all but a very small n. In this work we propose an embedding of all n! permutations for a given n in a surface of a hyper sphere defined in ℝ(n-1). As a result of the embedding, we acquire ability to define continuous distributions over a hyper sphere with all the benefits of directional statistics. We provide polynomial time projections between the continuous hyper sphere representation and the n!-element permutation space. The framework provides a way to use continuous directional probability densities and the methods developed thereof for establishing densities over permutations. As a demonstration of the benefits of the framework we derive an inference procedure for a state-space model over permutations. We demonstrate the approach with simulations on a large number of objects hardly manageable by the state of the art inference methods, and an application to a real flight traffic control dataset.
【Keywords】: approximation theory; computational complexity; data mining; inference mechanisms; state-space methods; statistical distributions; approximate representation; directional probability densities; directional statistics; factorial space; hypersphere representation; inference procedure; multinomial distribution; object tracking; permutation space; polynomial time projections; state-space model; statistical inference
【Paper Link】 【Pages】:411-420
【Authors】: Marko Pozenel ; Viljan Mahnic ; Matjaz Kukar
【Abstract】: We describe a heuristic search-based method for interleaved HTTP (Web) session reconstruction building upon first order Markov models. An interleaved session is generated by a user who is concurrently browsing the same web site in two or more web sessions (browser tabs or windows). In order to assure data quality for subsequent phases in analyzing user's browsing behavior, such sessions need to be separated in advance. We propose a separating process based on best-first search and trained first order Markov chains. We develop a testing method based on various measures of reconstructed sessions similarity to original ones. We evaluate the developed method on two real world click stream data sources: a web shop and a university student records information system. Preliminary results show that the proposed method performs well.
【Keywords】: Internet; Markov processes; Web sites; hypermedia; pattern matching; records management; search problems; Markov chain; Web shop; Web site; best first search; clickstream data source; data quality; heuristic search based method; interleaved HTTP session; interleaved Web session; separating process; sessions similarity; university student records information system; user browsing behavior; HTTP session; Markov model; clickstream; data quality; interleaved session; session separation process; sessionization; user behavior
【Paper Link】 【Pages】:421-430
【Authors】: Troy Raeder ; T. Ryan Hoens ; Nitesh V. Chawla
【Abstract】: The prevailing approach to evaluating classifiers in the machine learning community involves comparing the performance of several algorithms over a series of usually unrelated data sets. However, beyond this there are many dimensions along which methodologies vary wildly. We show that, depending on the stability and similarity of the algorithms being compared, these sometimes-arbitrary methodological choices can have a significant impact on the conclusions of any study, including the results of statistical tests. In particular, we show that performance metrics and data sets used, the type of cross-validation employed, and the number of iterations of cross-validation run have a significant, and often predictable, effect. Based on these results, we offer a series of recommendations for achieving consistent, reproducible results in classifier performance comparisons.
【Keywords】: learning (artificial intelligence); pattern classification; classifier performance estimation; machine learning; reproducibility; variability; classification; evaluation; reproducibility
【Paper Link】 【Pages】:431-440
【Authors】: Parisa Rashidi ; Diane J. Cook
【Abstract】: In recent years, new emerging application domains have introduced new constraints and methods in data mining field. One of such application domains is activity discovery from sensor data. Activity discovery and recognition plays an important role in a wide range of applications from assisted living to security and surveillance. Most of the current approaches for activity discovery assume a static model of the activities and ignore the problem of mining and discovering activities from a data stream over time. Inspired by the unique requirements of activity discovery application domain, in this paper we propose a new stream mining method for finding sequential patterns over time from streaming non-transaction data using multiple time granularities. Our algorithm is able to find sequential patterns, even if the patterns exhibit discontinuities (interruptions) or variations in the sequence order. Our algorithm also addresses the problem of dealing with rare events across space and over time. We validate the results of our algorithms using data collected from two different smart apartments.
【Keywords】: data mining; sensor fusion; activity discovery; activity recognition; data mining; data stream; sensor data; sequential pattern; static model; Activity Data Mining; Sensor Data; Smart Environments; Stream Sequence Mining
【Paper Link】 【Pages】:441-450
【Authors】: Piotr Rzepakowski ; Szymon Jaroszewicz
【Abstract】: Most classification approaches aim at achieving high prediction accuracy on a given dataset. However, in most practical cases, some action, such as mailing an offer or treating a patient, is to be taken on the classified objects and we should model not the class probabilities themselves, but instead, the change in class probabilities caused by the action. The action should then be performed on those objects for which it will be most profitable. This problem is known as uplift modeling, differential response analysis or true lift modeling, but has received very little attention in Machine Learning literature. In the paper we present a tree based classifier tailored specifically to this task. To this end, we design new splitting criteria and pruning methods. The experiments confirm the usefulness of the proposed approach and show significant improvement over previous uplift modeling techniques.
【Keywords】: decision trees; learning (artificial intelligence); pattern classification; prediction theory; probability; decision tree; information theory; machine learning; uplift modeling; decision trees; information theory; uplift modeling
【Paper Link】 【Pages】:451-460
【Authors】: Eran Shaham ; David Sarne ; Boaz Ben-Moshe
【Abstract】: The paper focuses on mining clusters that are characterized by a lagged relationship between the data objects. We call such clusters lagged co-clusters. A lagged co-cluster of a matrix is a sub matrix determined by a subset of rows and their corresponding lag over a subset of columns. Extracting such subsets (not necessarily successive) may reveal an underlying governing regulatory mechanism. Such a regulatory mechanism is quite common in real life settings. It appears in a variety of fields: meteorology, seismic activity, stock market behavior, neuronal brain activity, river flow and navigation, are but a limited list of examples. Mining such lagged co-clusters not only helps in understanding the relationship between objects in the domain, but assists in forecasting their future behavior. For most interesting variants of this problem, finding an optimal lagged co-cluster is an NP-complete problem. We present a polynomial-time Monte-Carlo algorithm for finding a set of lagged co-clusters whose error does not exceed a pre-specified value, which handles noise, anti-correlations, missing values, and overlapping patterns. Moreover, we prove that the list includes, with fixed probability, a lagged co-cluster which is optimal in its dimensions. The algorithm was extensively evaluated using various environments. First, artificial data, enabling the evaluation of specific, isolated properties of the algorithm. Secondly, real-world data, using river flow and topographic data, enabling the evaluation of the algorithm to efficiently mine relevant and coherent lagged co-clusters in environments that are temporal, i.e., time reading data, and non-temporal, respectively.
【Keywords】: Monte Carlo methods; computational complexity; data mining; matrix algebra; pattern clustering; set theory; Monte Carlo algorithm; NP-complete problem; data mining; lagged cocluster; regulatory mechanism; submatrix; subset; clustering; co-clustering; data mining; lagged clustering; timelagged
【Paper Link】 【Pages】:461-470
【Authors】: Jin Shieh ; Eamonn J. Keogh
【Abstract】: Classification of items taken from data streams requires algorithms that operate in time sensitive and computationally constrained environments. Often, the available time for classification is not known a priori and may change as a consequence of external circumstances. Many traditional algorithms are unable to provide satisfactory performance while supporting the highly variable response times that exemplify such applications. In such contexts, anytime algorithms, which are amenable to trading time for accuracy, have been found to be exceptionally useful and constitute an area of increasing research activity. Previous techniques for improving anytime classification have generally been concerned with optimizing the probability of correctly classifying individual objects. However, as we shall see, serially optimizing the probability of correctly classifying individual objects K times, generally gives inferior results to batch optimizing the probability of correctly classifying K objects. In this work, we show that this simple observation can be exploited to improve overall classification performance by using an anytime framework to allocate resources among a set of objects buffered from a fast arriving stream. Our ideas are independent of object arrival behavior, and, perhaps unintuitively, even in data streams with constant arrival rates our technique exhibits a marked improvement in performance. The utility of our approach is demonstrated with extensive experimental evaluations conducted on a wide range of diverse datasets.
【Keywords】: optimisation; pattern classification; probability; resource allocation; anytime algorithm; batch optimization; computationally constrained environment; constant arrival time; data stream classification; individual object; overall classification performance; probability optimization; satisfactory performance; time sensitive; variable response time; anytime algorithms; classification; nearest neighbor; streaming data
【Paper Link】 【Pages】:471-480
【Authors】: Kelvin Sim ; Zeyar Aung ; Vivekanand Gopalkrishnan
【Abstract】: Subspace clusters represent useful information in high-dimensional data. However, mining significant subspace clusters in continuous-valued 3D data such as stock-financial ratio-year data, or gene-sample-time data, is difficult. Firstly, typical metrics either find subspaces with very few objects, or they find too many insignificant subspaces - those which exist by chance. Besides, typical 3D subspace clustering approaches abound with parameters, which are usually set under biased assumptions, making the mining process a `guessing game'. We address these concerns by proposing an information theoretic measure, which allows us to identify 3D subspace clusters that stand out from the data. We also develop a highly effective, efficient and parameter-robust algorithm, which is a hybrid of information theoretical and statistical techniques, to mine these clusters. From extensive experimentations, we show that our approach can discover significant 3D subspace clusters embedded in 110 synthetic datasets of varying conditions. We also perform a case study on real-world stock datasets, which shows that our clusters can generate higher profits compared to those mined by other approaches.
【Keywords】: data analysis; data mining; pattern clustering; 3D subspace clustering approache; continuous valued 3D data; correlated subspace cluster; information hybrid; mining process; parameter robust algorithm; statistical technique; stock dataset; 3D subspace clustering; financial data mining; information theory
【Paper Link】 【Pages】:481-490
【Authors】: Heli Sun ; Jianbin Huang ; Jiawei Han ; Hongbo Deng ; Peixiang Zhao ; Boqin Feng
【Abstract】: Community detection is an important task for mining the structure and function of complex networks. Many pervious approaches are difficult to detect communities with arbitrary size and shape, and are unable to identify hubs and outliers. A recently proposed network clustering algorithm, SCAN, is effective and can overcome this difficulty. However, it depends on a sensitive parameter: minimum similarity threshold ε, but provides no automated way to find it. In this paper, we propose a novel density-based network clustering algorithm, called gSkeletonClu (graph-skeleton based clustering). By projecting a network to its Core-Connected Maximal Spanning Tree (CCMST), the network clustering problem is converted to finding core-connected components in the CCMST. We discover that all possible values of the parameter ε lie in the edge weights of the corresponding CCMST. By means of tree divisive or agglomerative clustering, our algorithm can find the optimal parameter ε and detect communities, hubs and outliers in large-scale undirected networks automatically without any user interaction. Extensive experiments on both real-world and synthetic networks demonstrate the superior performance of gSkeletonClu over the baseline methods.
【Keywords】: complex networks; pattern clustering; social networking (online); trees (mathematics); agglomerative clustering; community detection; complex networks; core connected maximal spanning tree; density based network clustering; edge weights; gSkeletonClu; graph skeleton based clustering; hubs; large-scale undirected networks; optimal parameter; outliers; real-world networks; structure connected tree division; synthetic networks; Community Discovery; Density-based Network Clustering; Hubs and Outliers; Parameter Selection
【Paper Link】 【Pages】:491-500
【Authors】: Liang Tang ; Tao Li
【Abstract】: Modern computing systems are instrumented to generate huge amounts of system logs and these data can be utilized for understanding and complex system behaviors. One main fundamental challenge in automated log analysis is the generation of system events from raw textual logs. Recent works apply clustering techniques to translate the raw log messages into system events using only the word/term information. In this paper, we first illustrate the drawbacks of existing techniques for event generation from system logs. We then propose Log Tree, a novel and algorithm-independent framework for events generation from raw system log messages. Log Tree utilizes the format and structural information of the raw logs in the clustering process to generate system events with better accuracy. In addition, an indexing data structure, Message Segment Table, is proposed in Log Tree to significantly improve the efficiency of events creation. Extensive experiments on real system logs demonstrate the effectiveness and efficiency of Log Tree.
【Keywords】: fault trees; statistical analysis; system monitoring; LogTree; automated log analysis; clustering techniques; complex system behaviors; event generation; log messages; message segment table; raw textual logs; event creation; log analysis; message clustering
【Paper Link】 【Pages】:501-510
【Authors】: Nikolaj Tatti ; Boris Cule
【Abstract】: Discovering patterns in a sequence is an important aspect of data mining. One popular choice of such patterns are episodes, patterns in sequential data describing events that often occur in the vicinity of each other. Episodes also enforce in which order events are allowed to occur. In this work we introduce a technique for discovering closed episodes. Adopting existing approaches for discovering traditional patterns, such as closed item sets, to episodes is not straightforward. First of all, we cannot define a unique closure based on frequency because an episode may have several closed super episodes. Moreover, to define a closedness concept for episodes we need a subset relationship between episodes, which is not trivial to define. We approach these problems by introducing strict episodes. We argue that this class is general enough, and at the same time we are able to define a natural subset relationship within it and use it efficiently. In order to mine closed episodes we define an auxiliary closure operator. We show that this closure satisfies the needed Galois connection so that we can use the existing framework for mining closed patterns. Discovering the true closed episodes can be done as a post-processing step. We combine these observations into an efficient mining algorithm and demonstrate empirically its performance in practice.
【Keywords】: Galois fields; data mining; graph theory; pattern classification; Galois connection; auxiliary closure operator; closed episode discovery; closed pattern mining; closed superepisode; closedness concept; data mining; postprocessing step; sequential data; Closed Episodes; Frequent Episode Mining; Level-wise Algorithm
【Paper Link】 【Pages】:511-520
【Authors】: Kai Ming Ting ; Jonathan R. Wells
【Abstract】: Mass estimation, an alternative to density estimation, has been shown recently to be an effective base modelling mechanism for three data mining tasks of regression, information retrieval and anomaly detection. This paper advances this work in two directions. First, we generalise the previously proposed one-dimensional mass estimation to multidimensional mass estimation, and significantly reduce the time complexity to O(ψh) from O(ψh)-making it feasible for a full range of generic problems. Second, we introduce the first clustering method based on mass-it is unique because it does not employ any distance or density measure. The structure of the new mass model enables different parts of a cluster to be identified and merged without expensive evaluations. The characteristics of the new clustering method are: (i) it can identify arbitrary-shape clusters; (ii) it is significantly faster than existing density-based or distance-based methods; and (iii) it is noise-tolerant.
【Keywords】: computational complexity; data mining; information retrieval; pattern clustering; anomaly detection; arbitrary-shape cluster; base modelling mechanism; data mining task; density estimation; density-based method; distance-based method; information retrieval; mass-based clustering; multidimensional mass estimation; one dimensional mass estimation; time complexity; Clustering methods; Complexity theory; Data mining; Data models; Equations; Estimation; Runtime; Mass estimation; mass-based clustering
【Paper Link】 【Pages】:521-530
【Authors】: Xuan Vinh Nguyen ; Julien Epps
【Abstract】: Traditional clustering has focused on creating a single good clustering solution, while modern, high dimensional data can often be interpreted, and hence clustered, in different ways. Alternative clustering aims at creating multiple clustering solutions that are both of high quality and distinctive from each other. Methods for alternative clustering can be divided into objective-function-oriented and data-transformation-oriented approaches. This paper presents a novel information theoretic-based, objective-function-oriented approach to generate alternative clusterings, in either an unsupervised or semi-supervised manner. We employ the conditional entropy measure for quantifying both clustering quality and distinctiveness, resulting in an analytically consistent combined criterion. Our approach employs a computationally efficient nonparametric entropy estimator, which does not impose any assumption on the probability distributions. We propose a partitional clustering algorithm, named minCEntropy, to concurrently optimize both clustering quality and distinctiveness. minCEntropy requires setting only some rather intuitive parameters, and performs competitively with existing methods for alternative clustering.
【Keywords】: entropy; nonparametric statistics; pattern clustering; statistical distributions; alternative clustering generation; data-transformation-oriented approach; information theoretic approach; minCEntropy; nonparametric entropy estimator; objective-function-oriented approach; partitional clustering algorithm; probability distributions; alternative clustering; clustering; information theoretic clustering; multi-objective optimization; transformation
【Paper Link】 【Pages】:531-540
【Authors】: Chang-Dong Wang ; Jian-Huang Lai ; Jun-Yong Zhu
【Abstract】: Kernel-based clustering is one of the most popular methods for partitioning nonlinearly separable dataset. However, exhaustive search for the global optimum is NP-hard. Iterative procedure such as k-means can be used to seek one of the local minima. Unfortunately, it is easily trapped into degenerate local minima when the prototypes of clusters are ill-initialized. In this paper, we restate the optimization problem of kernel-based clustering in an on-line learning framework, whereby a conscience mechanism is easily integrated to tackle the ill-initialization problem and faster convergence rate is achieved. Thus, we propose a novel approach termed conscience on-line learning (COLL). For each randomly taken data point, our method selects the winning prototype based on the conscience mechanism to bias the ill-initialized prototype to avoid degenerate local minima, and efficiently updates the winner by the on-line learning rule. Therefore, it can more efficiently obtain smaller distortion error than k-means with the same initialization. Experimental results on synthetic and large-scale real-world datasets, as well as that in the application of video clustering, have demonstrated the significant improvement over existing kernel clustering methods.
【Keywords】: distortion; iterative methods; learning (artificial intelligence); optimisation; pattern clustering; video signal processing; NP-hard problem; conscience online learning; distortion error; ill-initialization problem; iterative method; kernel-based clustering; nonlinearly separable dataset partitioning; optimization; video clustering; conscience mechanism; k-means; kernel-based clustering; on-line learning
【Paper Link】 【Pages】:541-550
【Authors】: Dingding Wang ; Tao Li ; Chris H. Q. Ding
【Abstract】: Keyword (Feature) selection enhances and improves many Information Retrieval (IR) tasks such as document categorization, automatic topic discovery, etc. The problem of keyword selection is usually solved using supervised algorithms. In this paper, we propose an unsupervised approach that combines keyword selection and document clustering (topic discovery) together. The proposed approach extends non-negative matrix factorization (NMF) by incorporating a weight matrix to indicate the importance of the keywords. The proposed approach is further extended to a weighted version in which each document is also assigned a weight to assess its importance in the cluster. This work considers both theoretical and empirical weighted feature subset selection for NMF and draws the connection between unsupervised feature selection and data clustering. We apply our proposed approaches to various document understanding tasks including document clustering, summarization, and visualization. Experimental results demonstrate the effectiveness of our approach for these tasks.
【Keywords】: document handling; information retrieval; matrix decomposition; pattern clustering; unsupervised learning; data clustering; document clustering; information retrieval; keyword selection; nonnegative matrix factorization; unsupervised feature selection; weighted feature subset selection; Non-negative matrix factorization; feature selection; weighted feature subset non-negative matrix factorization
【Paper Link】 【Pages】:551-560
【Authors】: Fei Wang ; Ping Li ; Arnd Christian König
【Abstract】: An idealized clustering algorithm seeks to learn a cluster-adjacency matrix such that, if two data points belong to the same cluster, the corresponding entry would be 1; otherwise the entry would be 0. This integer (1/0) constraint makes it difficult to find the optimal solution. We propose a relaxation on the cluster-adjacency matrix, by deriving a bi-stochastic matrix from a data similarity (e.g., kernel) matrix according to the Bregman divergence. Our general method is named the Bregmanian Bi-Stochastication (BBS) algorithm. We focus on two popular choices of the Bregman divergence: the Euclidian distance and the KL divergence. Interestingly, the BBS algorithm using the KL divergence is equivalent to the Sinkhorn-Knopp (SK) algorithm for deriving bi-stochastic matrices. We show that the BBS algorithm using the Euclidian distance is closely related to the relaxed K-means clustering and can often produce noticeably superior clustering results than the SK algorithm (and other algorithms such as Normalized Cut), through extensive experiments on public data sets.
【Keywords】: data handling; integer programming; learning (artificial intelligence); matrix algebra; pattern clustering; Bregman divergence; Bregmanian bistochastication algorithm; Euclidian distance; Sinkhorn-Knopp algorithm; bistochastic data similarity matrix; cluster adjacency matrix; clustering algorithm; integer constraint; k-means clustering
【Paper Link】 【Pages】:561-568
【Authors】: Xiang Wang ; Ian Davidson
【Abstract】: The technique of spectral clustering is widely used to segment a range of data from graphs to images. Our work marks a natural progression of spectral clustering from the original passive unsupervised formulation to our active semi-supervised formulation. We follow the widely used area of constrained clustering and allow supervision in the form of pair wise relations between two nodes: Must-Link and Cannot-Link. Unlike most previous constrained clustering work, our constraints are specified incrementally by querying an oracle (domain expert). Since in practice, each query comes with a cost, our goal is to maximally improve the result with as few queries as possible. The advantages of our approach include: 1) it is principled by querying the constraints which maximally reduce the expected error, 2) it can incorporate both hard and soft constraints which are prevalent in practice. We empirically show that our method significantly outperforms the baseline approach, namely constrained spectral clustering with randomly selected constraints, on UCI benchmark data sets.
【Keywords】: graph theory; pattern clustering; query processing; unsupervised learning; Cannot-Link; Must-Link; UCI benchmark data /fevwwds-spectral; active semisupervised formulation; active spectral clustering; graphs; natural progression; oracle; original passive unsupervised formulation; pairwise relations; querying; randomly selected constraints; active learning; constrained clustering; spectral clustering
【Paper Link】 【Pages】:569-578
【Authors】: Xufei Wang ; Lei Tang ; Huiji Gao ; Huan Liu
【Abstract】: The increasing popularity of social media is shortening the distance between people. Social activities, e.g., tagging in Flickr, book marking in Delicious, twittering in Twitter, etc. are reshaping people's social life and redefining their social roles. People with shared interests tend to form their groups in social media, and users within the same community likely exhibit similar social behavior (e.g., going for the same movies, having similar political viewpoints), which in turn reinforces the community structure. The multiple interactions in social activities entail that the community structures are often overlapping, i.e., one person is involved in several communities. We propose a novel co-clustering framework, which takes advantage of networking information between users and tags in social media, to discover these overlapping communities. In our method, users are connected via tags and tags are connected to users. This explicit representation of users and tags is useful for understanding group evolution by looking at who is interested in what. The efficacy of our method is supported by empirical evaluation in both synthetic and online social networking data.
【Keywords】: Internet; social networking (online); co-clustering; online social networking; social behavior; social media; tags; users representation; Co-Clustering; Community Detection; Overlapping; Social Media
【Paper Link】 【Pages】:579-588
【Authors】: Adam Woznica ; Alexandros Kalousis
【Abstract】: Recently, there has been a growing interest in learning distances directly from training data. While the previous works focused mainly on adapting distance measures over vectorial data, it is a well-known fact that many real-world data could not be easily represented as fixed length tuples of constants. In this paper we address this limitation and propose a novel class of distance learning techniques for learning problems in which instances are set of vectors, examples of such problems include, among others, automatic image annotation and graph classification. We investigate the behavior of the adaptive set distances on a number of artificial and real-world problems and demonstrate that they improve over the standard set distances.
【Keywords】: learning (artificial intelligence); set theory; vectors; k-nearest neighbor algorithm; learning distances; training data; tuples; vector; vectorial data; adaptive distances; complex objects; distance learning; graphs; sets
【Paper Link】 【Pages】:589-598
【Authors】: Yanshan Xiao ; Bo Liu ; Longbing Cao ; Jie Yin ; Xindong Wu
【Abstract】: Multiple instance learning (MIL) is a generalization of supervised learning which attempts to learn useful information from bags of instances. In MIL, the true labels of the instances in positive bags are not always available for training. This leads to a critical challenge, namely, handling the ambiguity of instance labels in positive bags. To address this issue, this paper proposes a novel MIL method named SMILE (Similarity-based Multiple Instance LEarning). It introduces a similarity weight to each instance in positive bag, which represents the instance similarity towards the positive and negative classes. The instances in positive bags, together with their similarity weights, are thereafter incorporated into the learning phase to build an extended SVM-based predictive classifier. Experiments on three real-world datasets consisting of 12 subsets show that SMILE achieves markedly better classification accuracy than state-of-the-art MIL methods.
【Keywords】: generalisation (artificial intelligence); learning (artificial intelligence); pattern classification; support vector machines; MIL method; SMILE; SVM-based predictive classifier; similarity-based multiple instance learning; supervised learning generalization; Multiple Instance Learning
【Paper Link】 【Pages】:599-608
【Authors】: Jaewon Yang ; Jure Leskovec
【Abstract】: Social media forms a central domain for the production and dissemination of real-time information. Even though such flows of information have traditionally been thought of as diffusion processes over social networks, the underlying phenomena are the result of a complex web of interactions among numerous participants. Here we develop the Linear Influence Model where rather than requiring the knowledge of the social network and then modeling the diffusion by predicting which node will influence which other nodes in the network, we focus on modeling the global influence of a node on the rate of diffusion through the (implicit) network. We model the number of newly infected nodes as a function of which other nodes got infected in the past. For each node we estimate an influence function that quantifies how many subsequent infections can be attributed to the influence of that node over time. A nonparametric formulation of the model leads to a simple least squares problem that can be solved on large datasets. We validate our model on a set of 500 million tweets and a set of 170 million news articles and blog posts. We show that the Linear Influence Model accurately models influences of nodes and reliably predicts the temporal dynamics of information diffusion. We find that patterns of influence of individual participants differ significantly depending on the type of the node and the topic of the information.
【Keywords】: least mean squares methods; multimedia computing; social networking (online); blog posts; least squares problem; linear influence model; modeling information diffusion; social media; social networks; Information diffusion; Node influence; Online media
【Paper Link】 【Pages】:609-618
【Authors】: Zi Yang ; Wei Li ; Jie Tang ; Juanzi Li
【Abstract】: In this paper, we consider a novel problem referred to as term filtering with bounded error to reduce the term (feature) space by eliminating terms without (or with bounded) information loss. Different from existing works, the obtained term space provides a complete view of the original term space. More interestingly, several important questions can be answered such as: 1) how different terms interact with each other and 2) how the filtered terms can be represented by the other terms. We perform a theoretical investigation of the term filtering problem and link it to the Geometric Covering By Discs problem, and prove its NP-hardness. We present two novel approaches for both lossless and lossy term filtering with bounds on the introduced error. Experimental results on multiple text mining tasks validate the effectiveness of the proposed approaches.
【Keywords】: computational complexity; data mining; information filtering; NP-hardness; bounded error; discs problem; feature space reduction; geometric covering; lossless term filtering; lossy term filtering; multiple text mining tasks; term space reduction
【Paper Link】 【Pages】:619-628
【Authors】: Min-Ling Zhang ; Zhi-Hua Zhou
【Abstract】: Ensemble learning aims to improve generalization ability by using multiple base learners. It is well-known that to construct a good ensemble, the base learners should be accurate as well as diverse. In this paper, unlabeled data is exploited to facilitate ensemble learning by helping augment the diversity among the base learners. Specifically, a semi-supervised ensemble method named UDEED is proposed. Unlike existing semi-supervised ensemble methods where error-prone pseudo-labels are estimated for unlabeled data to enlarge the labeled data to improve accuracy, UDEED works by maximizing accuracies of base learners on labeled data while maximizing diversity among them on unlabeled data. Experiments show that UDEED can effectively utilize unlabeled data for ensemble learning and is highly competitive to well-established semi-supervised ensemble methods.
【Keywords】: data analysis; learning (artificial intelligence); optimisation; UDEED; accuracy maximization; ensemble learning; semisupervised ensemble method; unlabeled data; diversity; ensemble learning; unlabeled data
【Paper Link】 【Pages】:629-638
【Authors】: Xianchao Zhang ; Yao Wu ; Yang Qiu
【Abstract】: Clusters are hidden in subspaces of high dimensional data, i.e., only a subset of features is relevant for each cluster. Subspace clustering is challenging since the search for the relevant features of each cluster and the detection of the final clusters are circular dependent and should be solved simultaneously. In this paper, we point out that feature correlation and distance divergence are important to subspace clustering, but both have not been considered in previous works. Feature correlation groups correlated features independently thus helps to reduce the search space for the relevant features search problem. Distance divergence distinguishes distances on different dimensions and helps to find the final clusters accurately. We tackle the two problems with the aid of a small amount domain knowledge in the form of must-links and cannot-links. We then devise a semi-supervised subspace clustering algorithm CDCDD. CDCDD integrates our solutions of the feature correlation and distance divergence problems, and uses an adaptive dimension voting scheme, which is derived from a previous unsupervised subspace clustering algorithm FINDIT. Experimental results on both synthetic data sets and real data sets show that the proposed CDCDD algorithm outperforms FINDIT in terms of accuracy, and outperforms the other constraint based algorithm SCMINER in terms of both accuracy and efficiency.
【Keywords】: constraint handling; correlation methods; feature extraction; pattern clustering; search problems; set theory; CDCDD algorithm; adaptive dimension voting scheme; constraint based dimension correlation; distance divergence; feature correlation; feature search; high-dimensional data clustering; semisupervised subspace clustering algorithm; subset; high-dimensional data; pair-wise constraint; semi-supervised learning; subspace clustering
【Paper Link】 【Pages】:639-648
【Authors】: Yaling Zheng ; Stephen D. Scott ; Kun Deng
【Abstract】: In active learning, where a learning algorithm has to purchase the labels of its training examples, it is often assumed that there is only one labeler available to label examples, and that this labeler is noise-free. In reality, it is possible that there are multiple labelers available (such as human labelers in the online annotation tool Amazon Mechanical Turk) and that each such labeler has a different cost and accuracy. We address the active learning problem with multiple labelers where each labeler has a different (known) cost and a different (unknown) accuracy. Our approach uses the idea of adjusted cost, which allows labelers with different costs and accuracies to be directly compared. This allows our algorithm to find low-cost combinations of labelers that result in high-accuracy labelings of instances. Our algorithm further reduces costs by pruning under-performing labelers from the set under consideration, and by halting the process of estimating the accuracy of the labelers as early as it can. We found that our algorithm often outperforms, and is always competitive with, other algorithms in the literature.
【Keywords】: data mining; learning (artificial intelligence); active learning algorithm; high accuracy labeling; label purchase; low cost combination; multiple noisy labelers; noise-free labeler; under performing labeler pruning; active learning; adjusted cost; algorithms; multiple labelers; noisy labelers
【Paper Link】 【Pages】:649-658
【Authors】: Zeyu Zheng ; Jun Yan ; Shuicheng Yan ; Ning Liu ; Zheng Chen ; Ming Zhang
【Abstract】: The good performances of most classical learning algorithms are generally founded on high quality training data, which are clean and unbiased. The availability of such data is however becoming much harder than ever in many real world problems due to the difficulties in collecting large scale unbiased data and precisely labeling them for training. In this paper, we propose a general Contrast Co-learning (CCL) framework to refine the biased and noisy training data when an unbiased yet unlabeled data pool is available. CCL starts with multiple sets of probably biased and noisy training data and trains a set of classifiers individually. Then under the assumption that the confidently classified data samples may have higher probabilities to be correctly classified, CCL iteratively and automatically filtering out possible data noises as well as adding those confidently classified samples from the unlabeled data pool to correct the bias. Through this process, we can generate a cleaner and unbiased training dataset with theoretical guarantees. Extensive experiments on two public text datasets clearly show that CCL consistently improves the algorithmic classification performance on biased and noisy training data compared with several state-of-the-art classical algorithms.
【Keywords】: noise; pattern classification; set theory; unsupervised learning; contrast colearning; data classification; data collection; high quality training; noisy training data; Co-learning; Contrast Classifier; Noisy training data; Training data bias
【Paper Link】 【Pages】:659-668
【Authors】: Fang Zhou ; Sébastien Mahler ; Hannu Toivonen
【Abstract】: We propose a novel problem to simplify weighted graphs by pruning least important edges from them. Simplified graphs can be used to improve visualization of a network, to extract its main structure, or as a pre-processing step for other data mining algorithms. We define a graph connectivity function based on the best paths between all pairs of nodes. Given the number of edges to be pruned, the problem is then to select a subset of edges that best maintains the overall graph connectivity. Our model is applicable to a wide range of settings, including probabilistic graphs, flow graphs and distance graphs, since the path quality function that is used to find best paths can be defined by the user. We analyze the problem, and give lower bounds for the effect of individual edge removal in the case where the path quality function has a natural recursive property. We then propose a range of algorithms and report on experimental results on real networks derived from public biological databases. The results show that a large fraction of edges can be removed quite fast and with minimal effect on the overall graph connectivity. A rough semantic analysis of the removed edges indicates that few important edges were removed, and that the proposed approach could be a valuable tool in aiding users to view or explore weighted graphs.
【Keywords】: data mining; data visualisation; decision trees; graph theory; network theory (graphs); probability; data mining; distance graphs; flow graphs; graph connectivity function; network visualization; path quality function; probabilistic graphs; pruning least; public biological databases; rough semantic analysis; simplify weighted graphs; connectivity; graph mining; network simplification
【Paper Link】 【Pages】:669-678
【Authors】: Hang Zhou ; Fabio T. Ramos ; Eric Nettleton
【Abstract】: This paper introduces a simple yet powerful data transformation strategy for kernel machines. Instead of adapting the parameters of the kernel function w.r.t. the given data (as in conventional methods), we adjust both the kernel hyper-parameters and the given data itself. Using this approach, the input data is transformed to be more representative of the assumptions encoded in the kernel function. A novel complex mapping is proposed to nonlinearly adjust the data. Optimization of the data transformation parameters is performed in two different manners. Firstly, the complex data mapping parameters and kernel hyper-parameters are selected separately, with the former guided by frequency metrics and the latter under the Bayesian framework. Next, the complex data mapping parameters and kernel hyper-parameters are optimized simultaneously in a Bayesian formulation by creating a new category of "integrated kernel" with the complex data mapping embedded. Experiments using Gaussian Process learning have shown that both methods improve the learning accuracy in either classification or regression tasks, with the complex mapping embedded kernel approach outperforming the separate complex mapping one.
【Keywords】: Bayes methods; Gaussian processes; data handling; learning (artificial intelligence); optimisation; support vector machines; Bayesian formulation; Gaussian process learning; data mapping; data transformation; frequency metrics; kernel hyper parameter; kernel method; regression task; Gaussian Process; complex mapping; frequency domain; kernel methods
【Paper Link】 【Pages】:679-688
【Authors】: Tianyi Zhou ; Dacheng Tao ; Xindong Wu
【Abstract】: Support vector machines (SVMs) are invaluable tools for many practical applications in artificial intelligence, e.g., classification and event recognition. However, popular SVM solvers are not sufficiently efficient for applications with a great deal of samples as well as a large number of features. In this paper, thus, we present NESVM, a fast gradient SVM solver that can optimize various SVM models, e.g., classical SVM, linear programming SVM and least square SVM. Compared against SVM-Perf (whose convergence rate in solving the dual SVM is upper bounded by O(1/√k) where k is the number of iterations) and Pegasos (online SVM that converges at rate O(1/k) for the primal SVM), NESVM achieves the optimal convergence rate at O(1/k2) and a linear time complexity. In particular, NESVM smoothes the nondifferentiable hinge loss and ℓ1-norm in the primal SVM. Then the optimal gradient method without any line search is adopted to solve the optimization. In each iteration round, the current gradient and historical gradients are combined to determine the descent direction, while the Lipschitz constant determines the step size. Only two matrix-vector multiplications are required in each iteration round. Therefore, NESVM is more efficient than existing SVM solvers. In addition, NESVM is available for both linear and nonlinear kernels. We also propose "homotopy NESVM" to accelerate NESVM by dynamically decreasing the smooth parameter and using the continuation method. Our experiments on census income categorization, indoor/outdoor scene classification event recognition and scene recognition suggest the efficiency and the effectiveness of NESVM. The MATLAB code of NESVM will be available on our website for further assessment.
【Keywords】: computational complexity; gradient methods; iterative methods; optimisation; support vector machines; NESVM; Nesterov method; continuation method; convergence rate; gradient method; hinge loss; iteration round; linear time complexity; matrix vector multiplication; support vector machine; $ell_1$ norm; Nesterov's method; Support vector machines; continuation method; hinge loss; smooth
【Paper Link】 【Pages】:689-698
【Authors】: Yang Zhou ; Hong Cheng ; Jeffrey Xu Yu
【Abstract】: In recent years, many networks have become available for analysis, including social networks, sensor networks, biological networks, etc. Graph clustering has shown its effectiveness in analyzing and visualizing large networks. The goal of graph clustering is to partition vertices in a large graph into clusters based on various criteria such as vertex connectivity or neighborhood similarity. Many existing graph clustering methods mainly focus on the topological structures, but largely ignore the vertex properties which are often heterogeneous. Recently, a new graph clustering algorithm, SA-Cluster, has been proposed which combines structural and attribute similarities through a unified distance measure. SA-Cluster performs matrix multiplication to calculate the random walk distances between graph vertices. As the edge weights are iteratively adjusted to balance the importance between structural and attribute similarities, matrix multiplication is repeated in each iteration of the clustering process to recalculate the random walk distances which are affected by the edge weight update. In order to improve the efficiency and scalability of SA-Cluster, in this paper, we propose an efficient algorithm Inc-Cluster to incrementally update the random walk distances given the edge weight increments. Complexity analysis is provided to estimate how much runtime cost Inc-Cluster can save. Experimental results demonstrate that Inc-Cluster achieves significant speedup over SA-Cluster on large graphs, while achieving exactly the same clustering quality in terms of intra-cluster structural cohesiveness and attribute value homogeneity.
【Keywords】: graph theory; iterative methods; matrix multiplication; statistical analysis; Inc-Cluster; SA-cluster; biological networks; complexity analysis; incremental approach; iterative method; large attributed graph clustering; large network analysis; large network visualization; matrix multiplication; sensor networks; social networks; topological structure; graph clustering; incremental computation
【Paper Link】 【Pages】:699-708
【Authors】: Qiang Zhu ; Eamonn J. Keogh
【Abstract】: Initiatives such as the Google Print Library Project and the Million Book Project have already archived more than ten million books in digital format, and within the next decade the majority of world's books will be online. Although most of the data will naturally be text, there will also be tens of millions of pages of images, many in color. While there is an active research community pursuing data mining of text from historical manuscripts, there has been very little work that exploits the rich color information which is often present. In this work we introduce a simple color measure which both addresses and exploits typical features of historical manuscripts. To enable the efficient mining of massive archives, we propose a tight lower bound to the measure. Beyond the fast similarity search, we show how this lower bound allows us to build several higher-level data mining tools, including motif discovery and link analyses. We demonstrate our ideas in several data mining tasks on manuscripts dating back to the fifteenth century.
【Keywords】: Artificial neural networks; Books; Color; Data mining; Histograms; Image color analysis; Pixel; Color Indexing; Historical Manuscripts
【Paper Link】 【Pages】:709-718
【Authors】: Fuzhen Zhuang ; Ping Luo ; Zhiyong Shen ; Qing He ; Yuhong Xiong ; Zhongzhi Shi
【Abstract】: We study what we call semi-defined classification, which deals with the categorization tasks where the taxonomy of the data is not well defined in advance. It is motivated by the real-world applications, where the unlabeled data may also come from some other unknown classes besides the known classes for the labeled data. Given the unlabeled data, our goal is to not only identify the instances belonging to the known classes, but also cluster the remaining data into other meaningful groups. It differs from traditional semi-supervised clustering in the sense that in semi-supervised clustering the supervision knowledge is too far from being representative of a target classification, while in semi-defined classification the labeled data may be enough to supervise the learning on the known classes. In this paper we propose the model of Double-latent-layered LDA (D-LDA for short) for this problem. Compared with LDA with only one latent variable y for word topics, D-LDA contains another latent variable z for (known and unknown) document classes. With this double latent layers consisting of y and z and the dependency between them, D-LDA directly injects the class labels into z to supervise the exploiting of word topics in y. Thus, the semi-supervised learning in D-LDA does not need the generation of pair wise constraints, which is required in most of the previous semi-supervised clustering approaches. We present the experimental results on ten different data sets for semi-defined classification. Our results are either comparable to (on one data sets), or significantly better (on the other nine data set) than the six compared methods, including the state-of-the-art semi-supervised clustering methods.
【Keywords】: learning (artificial intelligence); pattern classification; pattern clustering; D-LDA; constraint generation; data cluster; data taxonomy; double-latent-layered LDA; labeled data; semidefined classification; semisupervised learning; topic modeling approach; unlabeled data; Gibbs Sampling; Semi-defined classification; Semi-supervised clustering; Topic modeling
【Paper Link】 【Pages】:719-724
【Authors】: Mohammad Al Hasan ; Hilmi Yildirim ; Abhirup Chakraborty
【Abstract】: Approximate Nearest Neighbor search over high dimensional data is an important problem with a wide range of practical applications. In this paper, we propose SONNET, a simple multi-core friendly approximate nearest neighbor algorithm that is based on rank aggregation. SONNET is particularly suitable for very high dimensional data, its performance gets better as the dimension increases, whereas the majority of the existing algorithms show a reverse trend. Furthermore, most of the existing algorithms are hard to parallelize either due to the sequential nature of the algorithm or due to the inherent complexity of the algorithm. On the other hand, SONNET has inherent parallelism embedded in the core concept of the algorithm, which earns it almost a linear speed-up as the number of cores increases. Finally, SONNET is very easy to implement and it has an approximation parameter which is intuitively simple.
【Keywords】: approximation theory; data mining; parallel algorithms; tree searching; Sonnet; approximate nearest neighbor search; multi-core; rank aggregation; approximate nearest neighbors; early termination; nearest neighbors; rank aggregation
【Paper Link】 【Pages】:725-730
【Authors】: Suhrid Balakrishnan ; Sumit Chopra
【Abstract】: While latent factor models are built using ratings data, which is typically assumed static, the ability to incorporate different kinds of subsequent user feedback is an important asset. For instance, the user might want to provide additional information to the system in order to improve his personal recommendations. To this end, we examine a novel scheme for efficiently learning (or refining) user parameters from such feedback. We propose a scheme where users are presented with a sequence of pair wise preference questions: "Do you prefer item A over B?". User parameters are updated based on their response, and subsequent questions are chosen adaptively after incorporating the feedback. We operate in a Bayesian framework and the choice of questions is based on an information gain criterion. We validate the scheme on the Netflix movie ratings data set. A user study and automated experiments validate our findings.
【Keywords】: Bayes methods; game theory; information filtering; learning (artificial intelligence); recommender systems; Bayesian framework; Netflix movie ratings data set; adaptive pairwise preference; latent factor model; ratings game; recommender system; Active Learning; Latent factor models; Pairwise preferences; Recommender Systems
【Paper Link】 【Pages】:731-736
【Authors】: Ranieri Baraglia ; Gianmarco De Francisci Morales ; Claudio Lucchese
【Abstract】: Given a collection of objects, the Similarity Self-Join problem requires to discover all those pairs of objects whose similarity is above a user defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real world data shows that our algorithm outperforms the state of the art by a factor 4.5.
【Keywords】: distributed databases; document handling; network operating systems; parallel algorithms; MapReduce-based technique; distributed file system; document similarity; memory bottlenecks; object collection; parallel algorithm; partitioning strategy; pruning strategy; serial algorithm; user defined threshold; MapReduce; Similarity Self-Join; Web Information Retrieval
【Paper Link】 【Pages】:737-742
【Authors】: Antonio Bella ; César Ferri ; José Hernández-Orallo ; M. José Ramírez-Quintana
【Abstract】: Quantification is the name given to a novel machine learning task which deals with correctly estimating the number of elements of one class in a set of examples. The output of a quantifier is a real value, since training instances are the same as a classification problem, a natural approach is to train a classifier and to derive a quantifier from it. Some previous works have shown that just classifying the instances and counting the examples belonging to the class of interest classify count typically yields bad quantifiers, especially when the class distribution may vary between training and test. Hence, adjusted versions of classify count have been developed by using modified thresholds. However, previous works have explicitly discarded (without a deep analysis) any possible approach based on the probability estimations of the classifier. In this paper, we present a method based on averaging the probability estimations of a classifier with a very simple scaling that does perform reasonably well, showing that probability estimators for quantification capture a richer view of the problem than methods based on a threshold.
【Keywords】: learning (artificial intelligence); pattern classification; probability; classifier training; machine learning; probability estimations; quantifier; class imbalance; classification; probability estimators; quantification
【Paper Link】 【Pages】:743-748
【Authors】: Xiongcai Cai ; Michael Bain ; Alfred Krzywicki ; Wayne Wobcke ; Yang Sok Kim ; Paul Compton ; Ashesh Mahidadia
【Abstract】: Predicting people who other people may like has recently become an important task in many online social networks. Traditional collaborative filtering (CF) approaches are popular in recommender systems to effectively predict user preferences for items. One major problem in CF is computing similarity between users or items. Traditional CF methods often use heuristic methods to combine the ratings given to an item by similar users, which may not reflect the characteristics of the active user and can give unsatisfactory performance. In contrast to heuristic approaches we have developed CollabNet, a novel algorithm that uses gradient descent to learn the relative contributions of similar users or items to the ranking of recommendations produced by a recommender system, using weights to represent the contributions of similar users for each active user. We have applied CollabNet to the challenging problem of people to people recommendation in social networks, where people have a dual role as both "users" and "items", e.g., both initiating and receiving communications, to recommend other users to a given user, based on user similarity in terms of both taste (whom they like) and attractiveness (who likes them). Evaluation of CollabNet recommendations on datasets from a commercial online social network shows improved performance over standard CF.
【Keywords】: data mining; groupware; information filtering; learning (artificial intelligence); recommender systems; social networking (online); CollabNet; collaborative filtering; datasets; learning; recommender system; social network; Collaborative Filtering; Data Mining; Machine Learning; Recommender Systems
【Paper Link】 【Pages】:749-754
【Authors】: Toon Calders ; Calin Garboni ; Bart Goethals
【Abstract】: Mining frequent item sets from transactional datasets is a well known problem with good algorithmic solutions. Most of these algorithms assume that the input data is free from errors. Real data, however, is often affected by noise. Such noise can be represented by uncertain datasets in which each item has an existence probability. Recently, Bernecker et al. (2009) proposed the frequentness probability, i.e., the probability that a given item set is frequent, to select item sets in an uncertain database. A dynamic programming approach to evaluate this measure was given as well. We argue, however, that for the setting of Bernecker et al. (2009), that assumes independence between the items, already well-known statistical tools exist. We show how the frequentness probability can be approximated extremely accurately using a form of the central limit theorem. We experimentally evaluated our approximation and compared it to the dynamic programming approach. The evaluation shows that our approximation method is extremely accurate even for very small databases while at the same time it has much lower memory overhead and computation time.
【Keywords】: approximation theory; data mining; dynamic programming; statistical distributions; transaction processing; central limit theorem; dynamic programming; frequent itemset mining; frequentness probability approximation; statistical tool; transactional dataset; uncertain data
【Paper Link】 【Pages】:755-760
【Authors】: Andrea Campagna ; Rasmus Pagh
【Abstract】: Given a directed a cyclic graph with labeled vertices, we consider the problem of finding the most common label sequences ("traces") among all paths in the graph (of some maximum length m). Since the number of paths can be huge, we propose novel algorithms whose time complexity depends only on the size of the graph, and on the frequency varepsilon of the most frequent traces. In addition, we apply techniques from streaming algorithms to achieve space usage that depends only on varepsilon, and not on the number of distinct traces. The abstract problem considered models a variety of tasks concerning finding frequent patterns in event sequences. Our motivation comes from working with a data set of 2 million RFID readings from baggage trolleys at Copenhagen Airport. The question of finding frequent passenger movement patterns is mapped to the above problem. We report on experimental findings for this data set.
【Keywords】: data mining; directed graphs; pattern clustering; radiofrequency identification; sequences; Copenhagen Airport; RFID readings; abstract problem; baggage trolleys; directed acyclic graph; event sequences; frequent patterns; frequent traces; label sequences; passenger movement patterns; pattern mapping; streaming algorithms; time complexity; algorithms; data mining; graphs; patterns discovery; sampling
【Paper Link】 【Pages】:761-766
【Authors】: Nicolas Cebron
【Abstract】: When we think of an object in a supervised learning setting, we usually perceive it as a collection of fixed attribute values. Although this setting may be suited well for many classification tasks, we propose a new object representation and therewith a new challenge in data mining: an object is no longer described by one set of attributes but is represented in a hierarchy of attribute sets in different levels of quality. Obtaining a more detailed representation of an object comes with a cost. This raises the interesting question of which objects we want to enhance under a given budget and cost model. This new setting is very useful whenever resources like computing power, memory or time are limited. We propose a new Active Adaptive Algorithm (AAA) to improve objects in an iterative fashion. We demonstrate how to create a hierarchical object representation and prove the effectiveness of our new selection algorithm on these datasets.
【Keywords】: data mining; feature extraction; learning (artificial intelligence); object detection; pattern classification; active adaptive algorithm; budget constraint; classification task; data mining; hierarchical object feature; object representation; supervised learning; Active vision; Pattern classification
【Paper Link】 【Pages】:767-772
【Authors】: Shing-Kit Chan ; Wai Lam
【Abstract】: Cascaded approach has been used for a long time to conduct sub-tasks in order to accomplish a major task. We put cascaded approach in a probabilistic framework and analyze possible reasons for cascaded errors. To reduce the occurrence of cascaded errors, we need to add a constraint when performing joint training. We suggest a pseudo Conditional Random Field (pseudo-CRF) approach that models two sub-tasks as two Conditional Random Fields (CRFs). We then present the formulation in the context of a linear chain CRF for solving problems on sequence data. In conducting joint training for a pseudo-CRF, we reuse all existing well-developed efficient inference algorithms for a linear chain CRF, which would otherwise require the use of approximate inference algorithms or simulations that involve long computational time. Our experimental results show an interesting fact that a jointly trained CRF model in a pseudo-CRF may perform worse than a separately trained CRF on a sub-task. However the overall system performance of a pseudo-CRF would outperform that of a cascaded approach. We implement the implicit constraint in the form of a soft constraint such that users can define the penalty cost for violating the constraint. In order to work on large-scale datasets, we further suggest a parallel implementation of the pseudo-CRF approach, which can be implemented on a multi-core CPU or GPU on a graphics card that supports multi-threading. Our experimental results show that it can achieve a 12 times increase in speedup.
【Keywords】: data mining; inference mechanisms; parallel algorithms; random processes; GPU; inference algorithms; multicore CPU; multithreading; pseudo conditional random field; sequence data labeling; sequence data segmention; CRF; Cascaded Approach; Joint Training; Noun-phrase Chunking; Sequence Labeling Problem
【Paper Link】 【Pages】:773-778
【Authors】: Bo Chen ; Wai Lam ; Ivor W. Tsang ; Tak-Lam Wong
【Abstract】: Dataset shift from the training data in a source domain to the data in a target domain poses a great challenge for many statistical learning methods. Most algorithms can be viewed as exploiting only the first-order statistics, namely, the empirical mean discrepancy to evaluate the distribution gap. Intuitively, considering only the empirical mean may not be statistically efficient. In this paper, we propose a non-parametric distance metric with a good property which jointly considers the empirical mean (Location) and sample covariance (Scatter) difference. More specifically, we propose an improved symmetric Stein's loss function which combines the mean and covariance discrepancy into a unified Bregman matrix divergence of which Jensen-Shannon divergence between normal distributions is a particular case. Our target is to find a good feature representation which can reduce the distribution gap between different domains, at the same time, ensure that the new derived representation can encode most discriminative components with respect to the label information. We have conducted extensive experiments on several document classification datasets to demonstrate the effectiveness of our proposed method.
【Keywords】: covariance analysis; data mining; feature extraction; nonparametric statistics; Bregman matrix divergence; Jensen Shannon divergence; covariance discrepancy; dataset shift; distribution gap; empirical mean discrepancy; feature representation; location matching; nonparametric distance metric; sample covariance difference; scatter matching; statistical learning; symmetric Stein loss function; text mining; Domain Adaptation; Feature Extraction
【Paper Link】 【Pages】:779-784
【Authors】: Xi Chen ; Bing Bai ; Yanjun Qi ; Qihang Lin ; Jaime G. Carbonell
【Abstract】: We study the retrieval task that ranks a set of objects for a given query in the pair wise preference learning framework. Recently researchers found out that raw features (e.g. words for text retrieval) and their pair wise features which describe relationships between two raw features (e.g. word synonymy or polysemy) could greatly improve the retrieval precision. However, most existing methods can not scale up to problems with many raw features (e.g. English vocabulary), due to the prohibitive computational cost on learning and the memory requirement to store a quadratic number of parameters. In this paper, we propose to learn a sparse representation of the pair wise features under the preference learning framework using the L1 regularization. Based on stochastic gradient descent, an online algorithm is devised to enforce the sparsity using a mini-batch shrinkage strategy. On multiple benchmark datasets, we show that our method achieves better performance with fast convergence, and takes much less memory on models with millions of parameters.
【Keywords】: data mining; feature extraction; gradient methods; learning (artificial intelligence); query processing; stochastic processes; text analysis; L1 regularization; mini-batch shrinkage strategy; multiple benchmark dataset; object set; online algorithm; pairwise feature; pairwise preference learning; query processing; raw feature; retrieval task; sparse representation; stochastic gradient descent algorithm; text mining; learning to rank; online learning; preference learning; sparse model; text mining
【Paper Link】 【Pages】:785-790
【Authors】: Robson Leonardo Ferreira Cordeiro ; Fan Guo ; Donna S. Haverkamp ; James H. Horne ; Ellen K. Hughes ; Gunhee Kim ; Agma J. M. Traina ; Caetano Traina Jr. ; Christos Faloutsos
【Abstract】: Given a large collection of images, very few of which have labels, how can we guess the labels of the remaining majority, and how can we spot those images that need brand new labels, different from the existing ones? Current automatic labeling techniques usually scale super linearly with the data size, and/or they fail when only a tiny amount of labeled data is provided. In this paper, we propose QMAS (Querying, Mining And Summarization of Multi-modal Databases), a fast solution to the following problems: (i) low-labor labeling (L3) - given a collection of images, very few of which are labeled with keywords, find the most suitable labels for the remaining ones, and (ii) mining and attention routing - in the same setting, find clusters, the top-NO outlier images, and the top-NR representative images. We report experiments on real satellite images, two large sets (1.5 GB and 2.25 GB) of proprietary images and a smaller set (17 MB) of public images. We show that QMAS scales linearly with the data size, being up to 40 times faster than top competitors (GCap), obtaining better or equal accuracy. In contrast to other methods, QMAS does low-labor labeling (L3), that is, it works even with tiny initial label sets. It also solves both presented problems and spots tiles that potentially require new labels.
【Keywords】: data mining; image recognition; query processing; visual databases; QMAS; image collection; labeling technique; low labor labeling; multimodal databases mining; multimodal databases querying; multimodal databases summarization; public image; satellite image; Automatic labeling; Clustering; Summarization and Multi-modal databases
【Paper Link】 【Pages】:791-796
【Authors】: Kamalika Das ; Ashok N. Srivastava
【Abstract】: Regression problems on massive data sets are ubiquitous in many application domains including the Internet, earth and space sciences, and finances. In many cases, regression algorithms such as linear regression or neural networks attempt to fit the target variable as a function of the input variables without regard to the underlying joint distribution of the variables. As a result, these global models are not sensitive to variations in the local structure of the input space. Several algorithms, including the mixture of experts model, classification and regression trees (CART), and others have been developed, motivated by the fact that a variability in the local distribution of inputs may be reflective of a significant change in the target variable. While these methods can handle the non-stationarity in the relationships to varying degrees, they are often not scalable and, therefore, not used in large scale data mining applications. In this paper we develop Block-GP, a Gaussian Process regression framework for multimodal data, that can be an order of magnitude more scalable than existing state-of-the-art nonlinear regression algorithms. The framework builds local Gaussian Processes on semantically meaningful partitions of the data and provides higher prediction accuracy than a single global model with very high confidence. The method relies on approximating the covariance matrix of the entire input space by smaller covariance matrices that can be modeled independently, and can therefore be parallelized for faster execution. Theoretical analysis and empirical studies on various synthetic and real data sets show high accuracy and scalability of Block-GP compared to existing nonlinear regression techniques.
【Keywords】: Gaussian processes; pattern clustering; regression analysis; very large databases; Block-GP; Internet; classification and regression trees; covariance matrix; earth sciences; finances; mixture of experts model; multimodal data; scalable Gaussian process regression; semantically meaningful partitions; space sciences; state-of-the-art nonlinear regression algorithms; Gaussian Process; parallel computation; regression
【Paper Link】 【Pages】:797-802
【Authors】: Jun Du ; Charles X. Ling
【Abstract】: When active learning is applied to real-world applications, human experts usually act as oracles to provide labels. However, human make mistakes, thus noise might be introduced during the learning process. Most previous studies simplify the problem by assuming uniformly-distributed noise over the sample space. Such assumption, however, might fail to precisely reflect the human experts' behaviour in real-world situations. In this paper, we therefore study active learning with such human-like oracles, by making a more realistic assumption that the noise is example-dependent (i.e., non-uniformly distributed over the sample space). More specifically, when the human-like oracle is highly confident in labelling examples, it is naturally less likely to provide incorrect answers, whereas when such confidence is low, the noise would be more likely to be introduced. Based on the analysis of such human-like oracle, we propose a generic yet simple active learning algorithm to simultaneously explore the unlabelled data and exploit the labelled data. Empirical study on both synthetic and real-world data sets verifies the superiority of the proposed algorithm, compared with the traditional uncertainty sampling.
【Keywords】: data mining; learning (artificial intelligence); sampling methods; active learning; human experts; labelled data; noise; oracles; uncertainty sampling; unlabelled data; active learning; noise; oracle
【Paper Link】 【Pages】:803-808
【Authors】: Ad Feelders
【Abstract】: In many applications of data mining we know beforehand that the response variable should be increasing (or decreasing) in the attributes. Such relations between response and attributes are called monotone. In this paper we present a new algorithm to compute an optimal monotone classification of a data set for convex loss functions. Moreover, we show how the algorithm can be extended to compute all optimal monotone classifications with little additional effort. Monotone relabeling is useful for at least two reasons. Firstly, models trained on relabeled data sets often have better predictive performance than models trained on the original data. Secondly, relabeling is an important building block for the construction of monotone classifiers. We apply the new algorithm to investigate the effect on the prediction error of relabeling the training sample for k nearest neighbour classification and classification trees. In contrast to previous work in this area, we consider all optimal monotone relabelings. The results show that, for small training samples, relabeling the training data results in significantly better predictive performance.
【Keywords】: data mining; learning (artificial intelligence); pattern classification; classification trees; convex loss functions; data mining; k nearest neighbour classification; monotone classification; monotone relabeling; ordinal classification
【Paper Link】 【Pages】:809-814
【Authors】: George Giannakopoulos ; Themis Palpanas
【Abstract】: In several concept attainment systems, ranging from recommendation systems to information filtering, a sliding window of learning instances has been used in the learning process to allow the learner to follow concepts that change over time. However, no analytic study has been performed on the relation between the size of the sliding window and the performance of a learning system. In this work, we present such an analytic model that describes the effect of the sliding window size on the prediction performance of a learning system based on iterative feedback. Using a signal-to-noise approach to model the learning ability of the underlying machine learning algorithms, we can provide good estimates of the average performance of a modeling system independently of the supervised machine learning algorithm employed. We experimentally validate the effectiveness of the proposed methodology with detailed experiments using synthetic and real datasets, and a variety of learning algorithms, including Support Vector Machines, Naive Bayes, Nearest Neighbor and Decision Trees. The results validate the analysis and indicate very good estimation performance in different settings.
【Keywords】: Bayes methods; decision trees; information filtering; iterative methods; learning (artificial intelligence); recommender systems; support vector machines; decision trees; information filtering; iterative feedback; learning ability; learning system; machine learning algorithm; naive Bayes; nearest neighbor algorithm; recommendation system; signal-to-noise approach; sliding window; support vector machine; adaptive learning; concept drift; demanding lord problem; user modeling
【Paper Link】 【Pages】:815-820
【Authors】: Fabian Gieseke ; Gabriel Moruz ; Jan Vahrenhold
【Abstract】: We develop a k-d tree variant that is resilient to a pre-described number of memory corruptions while still using only linear space. We show how to use this data structure in the context of clustering in high-radiation environments and demonstrate that our approach leads to a significantly higher resiliency rate compared to previous results.
【Keywords】: data analysis; pattern clustering; tree data structures; data structure; high radiation environment; linear space; memory corruption; resiliency rate; resilient k-d tree; revisited space; clustering; k-d tree; k-means; resilient algorithm
【Paper Link】 【Pages】:821-826
【Authors】: Sertan Girgin ; Jérémie Mary ; Philippe Preux ; Olivier Nicol
【Abstract】: We consider the problem of displaying advertisements on web pages in the "cost per click" model, which necessitates to learn the appeal of visitors for the different advertisements in order to maximize the revenue. In a realistic context, the advertisements have constraints such as a certain number of clicks to draw, as well as a lifetime. This problem is thus inherently dynamic, and intimately combines combinatorial and statistical issues. To set the stage, it is also noteworthy that we deal with very rare events of interest, since the base probability of one click is in the order of 10-4. We introduce an adaptive policy learning algorithm based on linear programming, and investigate its performance through simulations on a realistic model designed with an important commercial web actor.
【Keywords】: Web sites; advertising; learning (artificial intelligence); linear programming; Web page; adaptive policy learning algorithm; advertising campaigns management; combinatorial issue; commercial Web actor; cost per click model; linear programming; statistical issue; Advertisement selection; CTR estimation; Exploration/exploitation trade-off; Linear Programming; Non-stationary setting; Optimization
【Paper Link】 【Pages】:827-832
【Authors】: Ben Goodrich ; David W. Albrecht ; Peter E. Tischer
【Abstract】: By considering the geometric properties of the Support Vector Machine (SVM) and Minimal Enclosing Ball (MEB) optimization problems, we show that upper and lower bounds on the radius-margin ratio of an SVM can be efficiently computed at any point during training. We use these bounds to accelerate radius-margin parameter selection by terminating training routines as early as possible, while still obtaining a guarantee that the parameters minimize the radius-margin ratio. Once an SVM has been partially trained on any set of parameters, we also show that these bounds can be used to evaluate and possibly reject neighboring parameter values with little or no additional training required. Empirical results show that, when selecting two parameter values, this process can reduce the number of training iterations required by a factor of 10 or more, while suffering no loss of precision in minimizing the radius-margin ratio.
【Keywords】: computational geometry; iterative methods; optimisation; parameter estimation; support vector machines; MEB optimization problem; SVM; geometric bound; minimal enclosing ball optimization problem; radius-margin parameter selection; support vector machine; training iteration; training routine; computational geometry; parameter selection; support vector machines
【Paper Link】 【Pages】:833-838
【Authors】: Francesco Gullo ; Carlotta Domeniconi ; Andrea Tagarelli
【Abstract】: Projective Clustering Ensembles (PCE) has recently been formulated to solve the problem of deriving a robust projective consensus clustering from an ensemble of projective clustering solutions. PCE is formalized as an optimization problem with either a two-objective or a single-objective function, depending on whether the object-based and the feature-based representations of the clusters in the ensemble are treated separately. A major result in is that single-objective PCE outperforms two-objective PCE in terms of efficiency, at the cost of lower accuracy in consensus clustering. In this paper, we enhance the single-objective PCE formulation, with the ultimate goal of providing more effective formulations capable of reducing the accuracy gap with the two-objective counterpart, while maintaining the efficiency advantages. We provide theoretical insights into the single-objective function, and introduce two heuristics that overcome the major limitations of the previous single-objective PCE formulation. Experimental evidence has demonstrated the significance of our proposed heuristics. In fact, results have not only confirmed a far better efficiency w.r.t. two-objective PCE, but have also shown the claimed improvements in accuracy of the consensus clustering obtained by the new single-objective PCE.
【Keywords】: optimisation; pattern clustering; PCE; objective function; optimization; projective clustering ensembles; robust projective consensus clustering
【Paper Link】 【Pages】:839-844
【Authors】: Francesco Gullo ; Giovanni Ponti ; Andrea Tagarelli
【Abstract】: The increasing demand for dealing with uncertainty in data has led to the development of effective and efficient approaches in the data management and mining contexts. Clustering uncertain data objects has particularly attracted great attention in the data mining community. Most existing clustering methods however have urgently to come up with a number of issues, some of which are related to a poor efficiency mainly due to an expensive computation of the distance between uncertain objects. In this work, we propose a novel formulation to the problem of clustering uncertain objects, which allows for reaching accurate solutions by minimizing the variance of the mixture models that represent the clusters to be identified. We define a heuristic, MMVar, which exploits some analytical properties about the computation of variance for mixture models to compute local minima of the objective function at the basis of the proposed formulation. This characteristic allows MMVar to discard any distance measure between uncertain objects and, therefore, to achieve high efficiency. Experiments have shown that MMVar outperforms state-of-the-art algorithms from an efficiency viewpoint, while achieving better average performance in terms of accuracy.
【Keywords】: data mining; pattern clustering; uncertainty handling; MMVar; cluster mixture model; data management; data mining; data uncertainty; uncertain data object clustering
【Paper Link】 【Pages】:845-850
【Authors】: Stephan Günnemann ; Ines Färber ; Brigitte Boden ; Thomas Seidl
【Abstract】: Today's applications deal with multiple types of information: graph data to represent the relations between objects and attribute data to characterize single objects. Analyzing both data sources simultaneously can increase the quality of mining methods. Recently, combined clustering approaches were introduced, which detect densely connected node sets within one large graph that also show high similarity according to all of their attribute values. However, for attribute data it is known that this full-space clustering often leads to poor clustering results. Thus, subspace clustering was introduced to identify locally relevant subsets of attributes for each cluster. In this work, we propose a method for finding homogeneous groups by joining the paradigms of subspace clustering and dense sub graph mining, i.e. we determine sets of nodes that show high similarity in subsets of their dimensions and that are as well densely connected within the given graph. Our twofold clusters are optimized according to their density, size, and number of relevant dimensions. Our developed redundancy model confines the clustering to a manageable size of only the most interesting clusters. We introduce the algorithm Gamer for the efficient calculation of our clustering. In thorough experiments on synthetic and real world data we show that Gamer achieves low runtimes and high clustering qualities.
【Keywords】: data mining; graph theory; optimisation; pattern clustering; redundancy; data attribute; data mining; dense subgraph mining; graph data; optimization; redundancy; subspace clustering; attribute data; combined clustering approach; dense subgraph mining; graph data; redundancy removal; subspace clustering
【Paper Link】 【Pages】:851-856
【Authors】: Robert Gwadera
【Abstract】: Sliding-window multi-stream join (SWMJ) is a fundamental operation for correlating information from different streams. We provide a solution to the problem of assessing significance of the SWMJ result by focusing on the relative frequency of windows satisfying a given equijoin predicate as the most important parameter of the SWMJ result. In particular, we derive an analytic formula for computing the average relative frequency of windows satisfying a given equijoin predicate that can be evaluated in quadratic time in the window size given a probabilistic model of the multi-stream. In experiments we demonstrated remarkable accuracy of our method, which confirmed our theoretical analysis.
【Keywords】: data mining; probability; query processing; cross stream correlation; data Mining; multistream join answering; probabilistic model; quadratic time; sliding window multistream join
【Paper Link】 【Pages】:857-862
【Authors】: Tsukasa Ishigaki ; Takeshi Takenaka ; Yoichi Motomura
【Abstract】: This paper describes an appropriate category discovery method that simultaneously involves a customer's lifestyle category and item category for the sustainable management of retail services, designated as ``category mining''. Category mining is realized using a large-scale ID-POS data and customer's questionnaire responses with respect to their lifestyle. For the heterogeneous data fusion, we propose a probabilistic double-latent semantic indexing (PdLSI) model that is an extension of PLSI model. In the PdLSI model, customers and items are classified probabilistically into some latent lifestyle categories and latent item category. Then, understanding of relation between the latent categories and various purchased situations is realized using Bayesian network modeling. This method provides useful knowledge based on a large-scale data for efficient customer relationship management and category management, and can be applicable for other service industries.
【Keywords】: belief networks; customer relationship management; data mining; probability; retail data processing; sensor fusion; service industries; Bayesian network; category discovery; category management; category mining; customer relationship management; heterogeneous data fusion; item classification; large scale ID POS data; latent item category; latent lifestyle category; probabilistic double latent semantic indexing method; retail service industries; Bayesian network; heterogeneous data fusion; large-scale ID-POS data; service engineering; topic model
【Paper Link】 【Pages】:863-868
【Authors】: Santosh Kabbur ; Eui-Hong Han ; George Karypis
【Abstract】: Demographic information plays an important role in gaining valuable insights about a web-site's user-base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using information derived from their content and their hyper linked structure and not relying on any information directly or indirectly obtained from the web-site's users. Such methods are important because users are becoming increasingly more concerned about sharing their personal and behavioral information on the Internet. Regression-based approaches are developed and studied for predicting demographic attributes that utilize different content-derived features, different ways of building the prediction models, and different ways of aggregating web-page level predictions that take into account the web's hyper linked structure. In addition, a matrix-approximation based approach is developed for coupling the predictions of individual regression models into a model designed to predict the probability mass function of the attribute. Extensive experiments show that these methods are able to achieve an RMSE of 8-10% and provide insights on how to best train and apply such models.
【Keywords】: Internet; Web sites; advertising data processing; approximation theory; content-based retrieval; demography; learning (artificial intelligence); matrix algebra; regression analysis; Internet; Web hyperlinked structure; Web site demographic attributes prediction; Web-page level predictions; content based method; hyperlinked structure; machine-learning approaches; matrix-approximation based approach; online advertisements; probability mass function; regression based approach; Content Based Models; Demographic Attribute Prediction; Inlink Count; Probability Mass Function; Regression
【Paper Link】 【Pages】:869-874
【Authors】: Faisal Kamiran ; Toon Calders ; Mykola Pechenizkiy
【Abstract】: Recently, the following discrimination aware classification problem was introduced: given a labeled dataset and an attribute B, find a classifier with high predictive accuracy that at the same time does not discriminate on the basis of the given attribute B. This problem is motivated by the fact that often available historic data is biased due to discrimination, e.g., when B denotes ethnicity. Using the standard learners on this data may lead to wrongfully biased classifiers, even if the attribute B is removed from training data. Existing solutions for this problem consist in “cleaning away” the discrimination from the dataset before a classifier is learned. In this paper we study an alternative approach in which the non-discrimination constraint is pushed deeply into a decision tree learner by changing its splitting criterion and pruning strategy. Experimental evaluation shows that the proposed approach advances the state-of-the-art in the sense that the learned decision trees have a lower discrimination than models provided by previous methods, with little loss in accuracy.
【Keywords】: data mining; decision making; decision trees; learning (artificial intelligence); pattern classification; decision tree learning; discrimination aware classification; pruning strategy; splitting criterion; Classification; Data Mining; Discrimination Aware Data Mining
【Paper Link】 【Pages】:875-880
【Authors】: U. Kang ; Mary McGlohon ; Leman Akoglu ; Christos Faloutsos
【Abstract】: How do connected components evolve? What are the regularities that govern the dynamic growth process and the static snapshot of the connected components? In this work, we study patterns in connected components of large, real-world graphs. First, we study one of the largest static Web graphs with billions of nodes and edges and analyze the regularities among the connected components using GFD(Graph Fractal Dimension) as our main tool. Second, we study several time evolving graphs and find dynamic patterns and rules that govern the dynamics of connected components. We analyze the growth rates of top connected components and study their relation over time. We also study the probability that a newcomer absorbs to disconnected components as a function of the current portion of the disconnected components and the degree of the newcomer. Finally, we propose a generative model that explains both the dynamic growth process and the static regularities of connected components.
【Keywords】: Internet; data mining; fractals; graph theory; probability; connected components; dynamic growth process; generative model; graph fractal dimension; real-world graphs; static Web graphs; static regularities; terabyte scale graph; CommunityConnection Model; Evolution of Connected Components; Graph Mining
【Paper Link】 【Pages】:881-886
【Authors】: Brendan Kitts ; Liang Wei ; Dyng Au ; Amanda Powter ; Brian Burdick
【Abstract】: This paper presents a practical method for measuring the impact of multiple marketing events on sales, including marketing events that are not traditionally trackable. The technique infers which of several competing media events are likely to have caused a given conversion. We test the method using hold-out sets, and also a live media experiment in which we test whether the method can accurately predict television-generated web conversions.
【Keywords】: advertising data processing; multimedia systems; attribution; conversion event; marketing event; media events; multichannel media; advertising; attribution; credit assignment
【Paper Link】 【Pages】:887-892
【Authors】: Neal Lathia ; Jon Froehlich ; Licia Capra
【Abstract】: Traveller information, route planning, and service updates have become essential components of public transport systems: they help people navigate built environments by providing access to information regarding delays and service disruptions. However, one aspect that these systems lack is a way of tailoring the information they offer in order to provide personalised trip time estimates and relevant notifications to each traveller. Mining each user's travel history, collected by automated ticketing systems, has the potential to address this gap. In this work, we analyse one such dataset of travel history on the London underground. We then propose and evaluate methods to (a) predict personalised trip times for the system users and (b) rank stations based on future mobility patterns, in order to identify the subset of stations that are of greatest interest to the user and thus provide useful travel updates.
【Keywords】: data mining; public administration; public information systems; rapid transit systems; traffic information systems; London underground; automated ticketing systems; data mining; mobility patterns; personalised intelligent transport systems; personalised trip times; public transport systems; route planning; service disruptions; service updates; travel history; traveller information; Intelligent Transport Systems; Personalization
【Paper Link】 【Pages】:893-898
【Authors】: Guangxia Li ; Steven C. H. Hoi ; Kuiyu Chang ; Ramesh Jain
【Abstract】: We study the online micro-blog sentiment detection problem, which aims to determine whether a micro-blog post expresses emotions. This problem is challenging because a micro-blog post is very short and individuals have distinct ways of expressing emotions. A single classification model trained on the entire corpus may fail to capture characteristics unique to each user. On the other hand, a personalized model for each user may be inaccurate due to the scarcity of training data, especially at the very beginning where users have just posted a few entries. To overcome these challenges, we propose learning a global model over all micro-bloggers, which is then leveraged to continuously refine the individual models through a collaborative online learning way. We evaluate our algorithm on a real-life micro-blog dataset collected from the popular micro-blog site - Twitter. Results show that our algorithm is effective and efficient for timely sentiment detection in real micro-blogging applications.
【Keywords】: Web sites; computer aided instruction; data mining; distance learning; groupware; pattern classification; Twitter; collaborative online learning; data mining; microblog post; microblog site; online microblog sentiment detection problem; real life microblog dataset; training data; classification; data mining; mining methods and algorithms
【Paper Link】 【Pages】:899-904
【Authors】: Junqiang Liu ; Ke Wang
【Abstract】: Web query logs provide a rich wealth of information, but also present serious privacy risks. We consider publishing vocabularies, bags of query-terms extracted from web query logs, which has a variety of applications. We aim at preventing identity disclosure of such bag-valued data. The key feature of such data is the extreme sparsity, which renders conventional anonymization techniques not working well in retaining enough utility. We propose a semantic similarity based clustering approach to address the issue. We measure the semantic similarity between two vocabularies by a weighted bipartite matching and present a greedy algorithm to cluster vocabularies by the semantic similarities. Extensive experiments on the AOL query log show that our approach retains more data utility than existing approaches.
【Keywords】: Internet; greedy algorithms; pattern clustering; publishing; query processing; vocabulary; AOL query log; Web query log; anonymization technique; bag valued data; cluster vocabulary k-anonymity; data utility; greedy algorithm; publishing vocabulary; query-term extraction; semantic similarity based clustering; weighted bipartite matching; Anonymity; bag-valued data; privacy; web query logs
【Paper Link】 【Pages】:905-910
【Authors】: Sen Liu ; Chaolun Xia ; Xiaohong Jiang
【Abstract】: Probabilistic latent semantic analysis is a topic modeling technique to discover the hidden structure in binary and count data. As a mixture model, it performs a probabilistic mixture decomposition on the co-occurrence matrix, which produces two matrices assigned with probabilistic explanations. However, the factorized matrices may be rather smooth, which means we may obtain global feature and topic representations rather than expected local ones. To resolve this problem, one of the solutions is to revise the decomposition process with considerations of sparsity. In this paper, we present an approach that provides direct control over sparsity during the expectation maximization process. Furthermore, by using the log penalty function as sparsity measurement instead of the widely used L2 norm, we can approximate the re-estimation of parameters in linear time, as same as original PLSA does, while many other approaches require much more time. Experiments on face databases are reported to show visual representations on obtaining local features, and detailed improvements in clustering tasks compared with the original process.
【Keywords】: information retrieval; learning (artificial intelligence); matrix decomposition; sparse matrices; visual databases; cooccurrence matrix; expectation maximization process; factorized matrices; probabilistic latent semantic analysis; probabilistic mixture decomposition; sparsity control; topic modeling; data-adaptive representations; opic model; plsa; sparsity; unsupervised learning
【Paper Link】 【Pages】:911-916
【Authors】: Yanchi Liu ; Zhongmou Li ; Hui Xiong ; Xuedong Gao ; Junjie Wu
【Abstract】: Clustering validation has long been recognized as one of the vital issues essential to the success of clustering applications. In general, clustering validation can be categorized into two classes, external clustering validation and internal clustering validation. In this paper, we focus on internal clustering validation and present a detailed study of 11 widely used internal clustering validation measures for crisp clustering. From five conventional aspects of clustering, we investigate their validation properties. Experiment results show that S_Dbw is the only internal validation measure which performs well in all five aspects, while other measures have certain limitations in different application scenarios.
【Keywords】: data mining; pattern clustering; unsupervised learning; S_Dbw; clustering application; crisp clustering; internal clustering validation
【Paper Link】 【Pages】:917-922
【Authors】: Mingsheng Long ; Wei Cheng ; Xiaoming Jin ; Jianmin Wang ; Dou Shen
【Abstract】: Transfer learning targets to leverage knowledge from one domain for tasks in a new domain. It finds abundant applications, such as text/sentiment classification. Many previous works are based on cluster analysis, which assume some common clusters shared by both domains. They mainly focus on the one-to-one cluster correspondence to bridge different domains. However, such a correspondence scheme might be too strong for real applications where each cluster in one domain corresponds to many clusters in the other domain. In this paper, we propose a Cluster Correspondence Inference (CCI) method to iteratively infer many-to-many correspondence among clusters from different domains. Specifically, word clusters and document clusters are exploited for each domain using nonnegative matrix factorization, then the word clusters from different domains are corresponded in a many-to-many scheme, with the help of shared word space as a bridge. These two steps are run iteratively and label information is transferred from source domain to target domain through the inferred cluster correspondence. Experiments on various real data sets demonstrate that our method outperforms several state-of-the-art approaches for cross-domain text classification.
【Keywords】: inference mechanisms; learning (artificial intelligence); matrix decomposition; pattern classification; pattern clustering; text analysis; cluster analysis; cluster correspondence inference; many-to-many scheme; nonnegative matrix factorization; one-to-one scheme; sentiment classification; text classification; transfer learning; Cluster Correspondence Inference; Text Classification; Transfer Learning
【Paper Link】 【Pages】:923-928
【Authors】: Zhengdong Lu ; Berkant Savas ; Wei Tang ; Inderjit S. Dhillon
【Abstract】: Link prediction is a fundamental problem in social network analysis and modern-day commercial applications such as Face book and My space. Most existing research approaches this problem by exploring the topological structure of a social network using only one source of information. However, in many application domains, in addition to the social network of interest, there are a number of auxiliary social networks and/or derived proximity networks available. The contribution of the paper is twofold: (1) a supervised learning framework that can effectively and efficiently learn the dynamics of social networks in the presence of auxiliary networks, (2) a feature design scheme for constructing a rich variety of path-based features using multiple sources, and an effective feature selection strategy based on structured sparsity. Extensive experiments on three real-world collaboration networks show that our model can effectively learn to predict new links using multiple sources, yielding higher prediction accuracy than unsupervised and single-source supervised models.
【Keywords】: learning (artificial intelligence); social networking (online); collaboration networks; information source; social network analysis; supervised learning; supervised link prediction; link prediction; multiple sources; social network; supervised learning
【Paper Link】 【Pages】:929-934
【Authors】: Mohammad M. Masud ; Qing Chen ; Latifur Khan ; Charu C. Aggarwal ; Jing Gao ; Jiawei Han ; Bhavani M. Thuraisingham
【Abstract】: The problem of data stream classification is challenging because of many practical aspects associated with efficient processing and temporal behavior of the stream. Two such well studied aspects are infinite length and concept-drift. Since a data stream may be considered a continuous process, which is theoretically infinite in length, it is impractical to store and use all the historical data for training. Data streams also frequently experience concept-drift as a result of changes in the underlying concepts. However, another important characteristic of data streams, namely, concept-evolution is rarely addressed in the literature. Concept-evolution occurs as a result of new classes evolving in the stream. This paper addresses concept-evolution in addition to the existing challenges of infinite-length and concept-drift. In this paper, the concept-evolution phenomenon is studied, and the insights are used to construct superior novel class detection techniques. First, we propose an adaptive threshold for outlier detection, which is a vital part of novel class detection. Second, we propose a probabilistic approach for novel class detection using discrete Gini Coefficient, and prove its effectiveness both theoretically and empirically. Finally, we address the issue of simultaneous multiple novel class occurrence, and provide an elegant solution to detect more than one novel classes at the same time. We also consider feature-evolution in text data streams, which occurs because new features (i.e., words) evolve in the stream. Comparison with state-of-the-art data stream classification techniques establishes the effectiveness of the proposed approach.
【Keywords】: data analysis; probability; text analysis; adaptive threshold; concept drifting; concept-evolution phenomenon; data stream classification; discrete Gini Coefficient; feature evolution; infinite length; novel class detection; outlier detection; probabilistic approach; text data streams; concept-evolution; data stream; novel class; outlier
【Paper Link】 【Pages】:935-940
【Authors】: Pauli Miettinen
【Abstract】: Matrix factorizations are commonly used methods in data mining. When the input data is Boolean, replacing the standard matrix multiplication with Boolean matrix multiplication can yield more intuitive results. Unfortunately, finding a good Boolean decomposition is known to be computationally hard, with even many sub-problems being hard to approximate. Many real-world data sets are sparse, and it is often required that also the factor matrices are sparse. This requirement has motivated many new matrix decomposition methods and many modifications of the existing methods. This paper studies how Boolean matrix factorizations behave with sparse data: can we assume some sparsity on the factor matrices, and does the sparsity help with the computationally hard problems. The answer to these problems is shown to be positive.
【Keywords】: Boolean algebra; approximation theory; data mining; matrix decomposition; matrix multiplication; sparse matrices; Boolean matrix multiplication; data mining; matrix decomposition; sparse Boolean matrix factorizations; Boolean rank; Matrix decompositions; approximation algorithms
【Paper Link】 【Pages】:941-946
【Authors】: Mario Navas ; Carlos Ordonez ; Veerabhadran Baladandayuthapani
【Abstract】: Computing Bayesian statistics with traditional techniques is extremely slow, specially when large data has to be exported from a relational DBMS. We propose algorithms for large scale processing of stochastic search variable selection (SSVS) for linear regression that can work entirely inside a DBMS. The traditional SSVS algorithm requires multiple scans of the input data in order to compute a regression model. Due to our optimizations, SSVS can be done in either one scan over the input table for large number of records with sufficient statistics, or one scan per iteration for high-dimensional data. We consider storage layouts which efficiently exploit DBMS parallel processing of aggregate functions. Experimental results demonstrate correctness, convergence and performance of our algorithms. Finally, the algorithms show good scalability for data with a very large number of records, or a very high number of dimensions.
【Keywords】: Bayes methods; data mining; iterative methods; optimisation; regression analysis; relational databases; stochastic processes; Bayesian statistics; SSVS algorithm; UDF; data mining; data scanning; linear regression; parallel processing; relational DBMS; stochastic search variable selection; user defined function; Bayesian statistics; UDF; variable selection
【Paper Link】 【Pages】:947-952
【Authors】: Vit Niennattrakul ; Eamonn J. Keogh ; Chotirat Ann Ratanamahatana
【Abstract】: The problem of finding outliers in data has broad applications in areas as diverse as data cleaning, fraud detection, network monitoring, invasive species monitoring, etc. While there are dozens of techniques that have been proposed to solve this problem for static data collections, very simple distance-based outlier detection methods are known to be competitive or superior to more complex methods. However, distance-based methods have time and space complexities that make them impractical for streaming data and/or resource limited sensors. In this work, we show that simple data-editing techniques can make distance-based outlier detection practical for very fast streams and resource limited sensors. Our technique generalizes to produce two algorithms, which, relative to the original algorithm, can guarantee to produce no false positives, or guarantee to produce no false negatives. Our methods are independent of both data type and distance measure, and are thus broadly applicable.
【Keywords】: data acquisition; text editing; data editing techniques; data streaming; distance-based outlier detection; resource limited sensors; static data collections; Anomaly detection; Data editing; Data stream
【Paper Link】 【Pages】:953-958
【Authors】: Keith Noto ; Carla E. Brodley ; Donna K. Slonim
【Abstract】: We present a new approach to semi-supervised anomaly detection. Given a set of training examples believed to come from the same distribution or class, the task is to learn a model that will be able to distinguish examples in the future that do not belong to the same class. Traditional approaches typically compare the position of a new data point to the set of ``normal'' training data points in a chosen representation of the feature space. For some data sets, the normal data may not have discernible positions in feature space, but do have consistent relationships among some features that fail to appear in the anomalous examples. Our approach learns to predict the values of training set features from the values of other features. After we have formed an ensemble of predictors, we apply this ensemble to new data points. To combine the contribution of each predictor in our ensemble, we have developed a novel, information-theoretic anomaly measure that our experimental results show selects against noisy and irrelevant features. Our results on 47 data sets show that for most data sets, this approach significantly improves performance over current state-of-the-art feature space distance and density-based approaches.
【Keywords】: data mining; learning (artificial intelligence); anomaly detection; feature model; feature space; semisupervised learning; anomaly detection; feature selection; machine learning
【Paper Link】 【Pages】:959-964
【Authors】: Markus Ojala
【Abstract】: Randomization is a general technique for evaluating the significance of data analysis results. In randomization-based significance testing, a result is considered to be interesting if it is unlikely to obtain as good result on random data sharing some basic properties with the original data. Recently, the randomization approach has been applied to assess data mining results on binary matrices and limited types of real-valued matrices. In these works, the row and column value distributions are approximately preserved in randomization. However, the previous approaches suffer from various technical and practical shortcomings. In this paper, we give solutions to these problems and introduce a new practical algorithm for randomizing various types of matrices while preserving the row and column value distributions more accurately. We propose a new approach for randomizing matrices containing features measured in different scales. Compared to previous work, our approach can be applied to assess data mining results on different types of real-life matrices containing dissimilar features, nominal values, non-Gaussian value distributions, missing values and sparse structure. We provide an easily usable implementation that does not need problematic manual tuning as theoretically justified parameter values are given. We perform extensive experiments on various real-life datasets showing that our approach produces reasonable results on practically all types of matrices while being easy and fast to use.
【Keywords】: data analysis; data mining; matrix algebra; randomised algorithms; binary matrices; data analysis; data mining; randomization approach
【Paper Link】 【Pages】:965-970
【Authors】: Nishith Pathak ; Arindam Banerjee ; Jaideep Srivastava
【Abstract】: This paper presents a generalized version of the linear threshold model for simulating multiple cascades on a network while allowing nodes to switch between them. The proposed model is shown to be a rapidly mixing Markov chain and the corresponding steady state distribution is used to estimate highly likely states of the cascades' spread in the network. Results on a variety of real world networks demonstrate the high quality of the estimated solution.
【Keywords】: Markov processes; graph theory; social networking (online); Markov chain; linear threshold model; multiple cascade simulation; real world network; steady state distribution; cascading processes; graph theory; network diffusion; rapidly mixing markov chains; social networks
【Paper Link】 【Pages】:971-976
【Authors】: Daniele Quercia ; Neal Lathia ; Francesco Calabrese ; Giusy Di Lorenzo ; Jon Crowcroft
【Abstract】: A city offers thousands of social events a day, and it is difficult for dwellers to make choices. The combination of mobile phones and recommender systems can change the way one deals with such abundance. Mobile phones with positioning technology are now widely available, making it easy for people to broadcast their whereabouts, recommender systems can now identify patterns in people's movements in order to, for example, recommend events. To do so, the system relies on having mobile users who share their attendance at a large number of social events: cold-start users, who have no location history, cannot receive recommendations. We set out to address the mobile cold-start problem by answering the following research question: how can social events be recommended to a cold-start user based only on his home location? To answer this question, we carry out a study of the relationship between preferences for social events and geography, the first of its kind in a large metropolitan area. We sample location estimations of one million mobile phone users in Greater Boston, combine the sample with social events in the same area, and infer the social events attended by 2,519 residents. Upon this data, we test a variety of algorithms for recommending social events. We find that the most effective algorithm recommends events that are popular among residents of an area. The least effective, instead, recommends events that are geographically close to the area. This last result has interesting implications for location-based services that emphasize recommending nearby events.
【Keywords】: mobile computing; mobile handsets; mobility management (mobile radio); recommender systems; social sciences computing; cold start user; location estimations; location-based service; metropolitan area; mobile cold start problem; mobile phone location; mobile phone users; positioning technology; recommender system; social events; mobile; recommender systems; web 2.0
【Paper Link】 【Pages】:977-982
【Authors】: Romain Quere ; Hoel Le Capitaine ; N. Fraisseix ; Carl Frélicot
【Abstract】: Most already existing indices used to compare two strict partitions with different number of clusters are based on coincidence matrices. To extend such indices to fuzzy partitions, one can define fuzzy coincidence matrices by means of triangular norms. It has been shown this can require some kind of normalization to reinforce the corresponding indices. We propose in this paper a generic solution to perform this normalization considering the generators of the used triangular norms. Although the solution is not index-dependant, we focus on the Rand index and some of its fuzzy counterparts.
【Keywords】: fuzzy set theory; matrix algebra; pattern classification; unsupervised learning; Rand index; fuzzy coincidence matrices normalization; possibilistic partitions; triangular norms; Rand index; coincidence matrix; fuzzy partition; triangular norms
【Paper Link】 【Pages】:983-988
【Authors】: Han Qin ; Dejing Dou ; Yue Fang
【Abstract】: Financial forecasting is the basis for budgeting activities and estimating future financing needs. Applying machine learning and data mining models to financial forecasting is both effective and efficient. Among different kinds of machine learning models, kernel methods are well accepted since they are more robust and accurate than traditional models, such as neural networks. However, learning from multiple data sources is still one of the main challenges in the financial forecasting area. In this paper, we focus on applying the multiple kernel learning models to the multiple major international stock indexes. Our experiment results indicate that applying multiple kernel learning to the financial forecasting problem suffers from both the short training period problem and non-stationary problem. Therefore we propose a novel multiple kernel learning model to address the challenge by introducing the Gompertz model and considering a non-linear combination of different kernel matrices. The experiment results show that our Gompertz multiple kernel learning model addresses the challenges and achieves better performance than the original multiple kernel learning model and single SVM models.
【Keywords】: budgeting; data mining; economic forecasting; learning (artificial intelligence); support vector machines; SVM model; budgeting; data mining; financial forecasting; gompertz multiple kernel learning; kernel matrix; machine learning; nonstationary problem; stock index; training period problem; financial forecasting; multiple kernel learning; non-linear kernel combination
【Paper Link】 【Pages】:989-994
【Authors】: Matthew J. Rattigan ; David Jensen
【Abstract】: Testing for marginal and conditional independence is a common task in machine learning and knowledge discovery applications. Prior work has demonstrated that conventional independence tests suffer from dramatically increased rates of Type I errors when naively applied to relational data. We use graphical models to specify the conditions under which these errors occur, and use those models to devise novel and accurate conditional independence tests.
【Keywords】: data mining; learning (artificial intelligence); relational databases; D-separation; conditional independence; graphical models; knowledge discovery; machine learning; relational data sets; testing marginal
【Paper Link】 【Pages】:995-1000
【Authors】: Steffen Rendle
【Abstract】: In this paper, we introduce Factorization Machines (FM) which are a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models. Like SVMs, FMs are a general predictor working with any real valued feature vector. In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail. We show that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly. So unlike nonlinear SVMs, a transformation in the dual form is not necessary and the model parameters can be estimated directly without the need of any support vector in the solution. We show the relationship to SVMs and the advantages of FMs for parameter estimation in sparse settings. On the other hand there are many different factorization models like matrix factorization, parallel factor analysis or specialized models like SVD++, PITF or FPMC. The drawback of these models is that they are not applicable for general prediction tasks but work only with special input data. Furthermore their model equations and optimization algorithms are derived individually for each task. We show that FMs can mimic these models just by specifying the input data (i.e. the feature vectors). This makes FMs easily applicable even for users without expert knowledge in factorization models.
【Keywords】: data mining; matrix decomposition; support vector machines; SVM; factorization machine; feature vector; model parameters; parameter estimation; sparse data; support vector machine; tensor factorization; factorization machine; sparse data; support vector machine; tensor factorization
【Paper Link】 【Pages】:1001-1006
【Authors】: Doruk Sart ; Abdullah Mueen ; Walid A. Najjar ; Eamonn J. Keogh ; Vit Niennattrakul
【Abstract】: Many time series data mining problems require subsequence similarity search as a subroutine. Dozens of similarity/distance measures have been proposed in the last decade and there is increasing evidence that Dynamic Time Warping (DTW) is the best measure across a wide range of domains. Given DTW's usefulness and ubiquity, there has been a large community-wide effort to mitigate its relative lethargy. Proposed speedup techniques include early abandoning strategies, lower-bound based pruning, indexing and embedding. In this work we argue that we are now close to exhausting all possible speedup from software, and that we must turn to hardware-based solutions. With this motivation, we investigate both GPU (Graphics Processing Unit) and FPGA (Field Programmable Gate Array) based acceleration of subsequence similarity search under the DTW measure. As we shall show, our novel algorithms allow GPUs to achieve two orders of magnitude speedup and FPGAs to produce four orders of magnitude speedup. We conduct detailed case studies on the classification of astronomical observations and demonstrate that our ideas allow us to tackle problems that would be untenable otherwise.
【Keywords】: computer graphic equipment; coprocessors; data mining; field programmable gate arrays; search problems; time series; DTW subsequence search; FPGA; GPU; astronomical observation classification; dynamic time warping subsequence search; early abandoning strategy; field programmable gate array; graphics processing unit; lower-bound based pruning; subroutine; subsequence similarity search; time series data mining; FPGA; GPU; dynamic time warping; similarity search; time series
【Paper Link】 【Pages】:1007-1012
【Authors】: Tim Schlüter ; Stefan Conrad
【Abstract】: This paper presents an application of data mining to the medical domain sleep research, i.e. an approach for automatic sleep stage scoring and apnea-hypopnea detection. By several combined techniques (Fourier and wavelet transform, DDTW and waveform recognition), our approach extracts meaningful features (frequencies and special patterns) from EEG, ECG, EOG and EMG data, on which a decision trees classifier is built for classifying epochs into their sleep stages (according to the rules by Rechtschaffen and Kales) and annotating occurrences of apnea-hypopnea (total or partial cessation of respiration). After that, case-based reasoning is applied to improve quality. We evaluated our approach on 3 large public databases from PhysioBank, which showed an overall accuracy of 95.2% for sleep stage scoring and 94.5% for classifying apneic/non-apneic minutes.
【Keywords】: decision trees; electro-oculography; electrocardiography; electroencephalography; electromyography; feature extraction; medical computing; pattern classification; sleep; Apnea-Hypopnea detection; ECG; EEG; EMG; EOG; PhysioBank; automatic sleep stage scoring; case-based reasoning; classifier; data mining; decision trees; feature extraction; medical domain sleep research; public databases; Biomedical signal processing; Data processing; Feature extraction; Pattern classification; Sleep; Time series
【Paper Link】 【Pages】:1013-1018
【Authors】: Stephan Seufert ; Srikanta J. Bedathur ; Julián Mestre ; Gerhard Weikum
【Abstract】: Graphs are increasingly used to model a variety of loosely structured data such as biological or social networks and entity-relationships. Given this profusion of large-scale graph data, efficiently discovering interesting substructures buried within is essential. These substructures are typically used in determining subsequent actions, such as conducting visual analytics by humans or designing expensive biomedical experiments. In such settings, it is often desirable to constrain the size of the discovered results in order to directly control the associated costs. In this paper, we address the problem of finding cardinality-constrained connected sub trees in large node-weighted graphs that maximize the sum of weights of selected nodes. We provide an efficient constant-factor approximation algorithm for this strongly NP-hard problem. Our techniques can be applied in a wide variety of application settings, for example in differential analysis of graphs, a problem that frequently arises in bioinformatics but also has applications on the web.
【Keywords】: approximation theory; bioinformatics; optimisation; tree data structures; NP-hard problem; approximation algorithm; bioinformatics; cardinality-constrained connected subtrees; data structure; differential analysis; grpahs; Cardinalty Constrained Subgraphs; Graph Mining; Subgraph Discovery; Visual Analytics
【Paper Link】 【Pages】:1019-1024
【Authors】: Mahdi Shafiei ; Hugh A. Chipman
【Abstract】: Transactional network data can be thought of as a list of one-to-many communications (e.g., email) between nodes in a social network. Most social network models convert this type of data into binary relations between pairs of nodes. We develop a latent mixed membership model capable of modeling richer forms of transactional network data, including relations between more than two nodes. The model can cluster nodes and predict transactions. The block-model nature of the model implies that groups can be characterized in very general ways. This flexible notion of group structure enables discovery of rich structure in transactional networks. Estimation and inference are accomplished via a variational EM algorithm. Simulations indicate that the learning algorithm can recover the correct generative model. Interesting structure is discovered in the Enron email dataset and another dataset extracted from the Reddit website. Analysis of the Reddit data is facilitated by a novel performance measure for comparing two soft clusterings. The new model is superior at discovering mixed membership in groups and in predicting transactions.
【Keywords】: data mining; learning (artificial intelligence); social networking (online); stochastic processes; transaction processing; Enron email dataset; Reddit Website; binary relation; generative model; learning algorithm; mixed membership stochastic block model; one-to-many communication; social network; soft clustering; transaction prediction; transactional network data; variational EM algorithm; Clustering; Email Data; Mixedmembership; Social Network Analysis; Variational EM
【Paper Link】 【Pages】:1025-1030
【Authors】: Hanhuai Shan ; Arindam Banerjee
【Abstract】: Probabilistic matrix factorization (PMF) methods have shown great promise in collaborative filtering. In this paper, we consider several variants and generalizations of PMF framework inspired by three broad questions: Are the prior distributions used in existing PMF models suitable, or can one get better predictive performance with different priors? Are there suitable extensions to leverage side information? Are there benefits to taking into account row and column biases? We develop new families of PMF models to address these questions along with efficient approximate inference algorithms for learning and prediction. Through extensive experiments on movie recommendation datasets, we illustrate that simpler models directly capturing correlations among latent factors can outperform existing PMF models, side information can benefit prediction accuracy, and accounting for row/column biases leads to improvements in predictive performance.
【Keywords】: groupware; inference mechanisms; information filtering; learning (artificial intelligence); matrix decomposition; pattern clustering; approximate inference algorithm; collaborative filtering; learning; probabilistic matrix factorization; probabilistic matrix factorization; topic models; variational inference
【Paper Link】 【Pages】:1031-1036
【Authors】: Zhiyong Shen ; Ping Luo ; Shengwen Yang ; Xukun Shen
【Abstract】: In this paper we propose a framework of topic modeling ensembles, a novel solution to combine the models learned by topic modeling over each partition of the whole corpus. It has the potentials for applications such as distributed topic modeling for large corpora, and incremental topic modeling for rapidly growing corpora. Since only the base models, not the original documents, are required in the ensemble, all these applications can be performed in a privacy preserving manner. We explore the theoretical foundation of the proposed framework, give its geometric interpretation, and implement it for both PLSA and LDA. The evaluation of the implementations over the synthetic and real-life data sets shows that the proposed framework is much more efficient than modeling the original corpus directly while achieves comparable effectiveness in terms of perplexity and classification accuracy.
【Keywords】: data privacy; document handling; learning (artificial intelligence); LDA; PLSA; distributed topic modeling; incremental topic modeling; latent Dirichlet allocation; privacy preserving manner; probabilistic latent semantic analysis; topic modeling ensemble; Ensemble; Topic model
【Paper Link】 【Pages】:1037-1042
【Authors】: Zhiyong Shen ; Liang Du ; Xukun Shen ; Yi-Dong Shen
【Abstract】: In this paper, we propose the Interval-valued Matrix Factorization (IMF) framework. Matrix Factorization (MF) is a fundamental building block of data mining. MF techniques, such as Nonnegative Matrix Factorization (NMF) and Probabilistic Matrix Factorization (PMF), are widely used in applications of data mining. For example, NMF has shown its advantage in Face Analysis (FA) while PMF has been successfully applied to Collaborative Filtering (CF). In this paper, we analyze the data approximation in FA as well as CF applications and construct interval-valued matrices to capture these approximation phenomenons. We adapt basic NMF and PMF models to the interval-valued matrices and propose Interval-valued NMF (I-NMF) as well as Interval-valued PMF (I-PMF). We conduct extensive experiments to show that proposed I-NMF and I-PMF significantly outperform their single-valued counterparts in FA and CF applications.
【Keywords】: data mining; face recognition; matrix decomposition; probability; MF techniques; NMF; PMF; collaborative filtering; data mining; face analysis; interval-valued matrices; nonnegative matrix factorization; probabilistic matrix factorization; Matrix factorization; uncertainty
【Paper Link】 【Pages】:1043-1048
【Authors】: Xiaoxiao Shi ; Wei Fan ; Philip S. Yu
【Abstract】: Co-clustering was proposed to simultaneously cluster objects and features to explore inter-correlated patterns. For example, by analyzing the blog click-through data, one finds the group of users who are interested in a specific group of blogs in order to perform applications such as recommendations. However, it is usually very difficult to achieve good co-clustering quality by just analyzing the object-feature correlation data due to the sparsity of the data and the noise. Meanwhile, one may have some prior knowledge that indicates the internal structure of the co-clusters. For instance, one may find user cluster information from the social network system, and the blog-blog similarity from the social tags or contents. This prior information provides some supervision toward the co-cluster structures, and may help reduce the effect of sparsity and noise. However, most co-clustering algorithms do not use this information and may produce unmeaningful results. In this paper we study the problem of finding the optimal co-clusters when some objects and features are believed to be in the same cluster a priori. A matrix decomposition based approach is proposed to formulate as a trace minimization problem, and solve it efficiently with the selected eigenvectors. The asymptotic complexity of the proposed approach is the same as co-clustering without constraints. Experiments include graph-pattern co-clustering and document-word co-clustering. For instance, in graph-pattern data set, the proposed model can improve the normalized mutual information by as much as 5.5 times and 10 times faster than two naive solutions that expand the edges and vertices in the graphs.
【Keywords】: constraint handling; correlation methods; eigenvalues and eigenfunctions; graph theory; matrix decomposition; pattern clustering; social networking (online); asymptotic complexity; data sparsity; document-word co-clustering; graph-pattern co-clustering; matrix decomposition based approach; normalized mutual information; object feature analysis; semisupervised spectral coclustering algorithm; social network system; trace minimization problem; Co-clustering; Semi-supervised Learning; Spectral
【Paper Link】 【Pages】:1049-1054
【Authors】: Xiaoxiao Shi ; Qi Liu ; Wei Fan ; Philip S. Yu ; Ruixin Zhu
【Abstract】: Labeled examples are often expensive and time-consuming to obtain. One practically important problem is: can the labeled data from other related sources help predict the target task, even if they have (a) different feature spaces (e.g., image vs. text data), (b) different data distributions, and (c) different output spaces? This paper proposes a solution and discusses the conditions where this is possible and highly likely to produce better results. It works by first using spectral embedding to unify the different feature spaces of the target and source data sets, even when they have completely different feature spaces. The principle is to cast into an optimization objective that preserves the original structure of the data, while at the same time, maximizes the similarity between the two. Second, a judicious sample selection strategy is applied to select only those related source examples. At last, a Bayesian-based approach is applied to model the relationship between different output spaces. The three steps can bridge related heterogeneous sources in order to learn the target task. Among the 12 experiment data sets, for example, the images with wavelet-transformed-based features are used to predict another set of images whose features are constructed from color-histogram space. By using these extracted examples from heterogeneous sources, the models can reduce the error rate by as much as ~50%, compared with the methods using only the examples from the target task.
【Keywords】: Bayes methods; data mining; feature extraction; image colour analysis; learning (artificial intelligence); optimisation; spectral analysis; wavelet transforms; Bayesian-based approach; color histogram space; data distribution; heterogeneous feature space; image data; labeled data; spectral transformation; text data; wavelet transform; feature space; heterogeneous; spectral; transfer learning
【Paper Link】 【Pages】:1055-1060
【Authors】: Vikas Sindhwani ; Serhat Selcuk Bucak ; Jianying Hu ; Aleksandra Mojsilovic
【Abstract】: Consider a typical recommendation problem. A company has historical records of products sold to a large customer base. These records may be compactly represented as a sparse customer-times-product ``who-bought-what" binary matrix. Given this matrix, the goal is to build a model that provides recommendations for which products should be sold next to the existing customer base. Such problems may naturally be formulated as collaborative filtering tasks. However, this is a {it one-class} setting, that is, the only known entries in the matrix are one-valued. If a customer has not bought a product yet, it does not imply that the customer has a low propensity to {it potentially} be interested in that product. In the absence of entries explicitly labeled as negative examples, one may resort to considering unobserved customer-product pairs as either missing data or as surrogate negative instances. In this paper, we propose an approach to explicitly deal with this kind of ambiguity by instead treating the unobserved entries as optimization variables. These variables are optimized in conjunction with learning a weighted, low-rank non-negative matrix factorization (NMF) of the customer-product matrix, similar to how Transductive SVMs implement the low-density separation principle for semi-supervised learning. Experimental results show that our approach gives significantly better recommendations in comparison to various competing alternatives on one-class collaborative filtering tasks.
【Keywords】: groupware; information filtering; matrix decomposition; optimisation; recommender systems; collaborative filtering; customer-product pairs; low density factorization; matrix completion; matrix factorization; missing data; optimization variables; recommendation problem; Collaborative Filtering; Implicit Feedback; Matrix Completion; Non-negative Matrix Factorizations; Recommender Systems
【Paper Link】 【Pages】:1061-1066
【Authors】: Jimeng Sun ; Daby M. Sow ; Jianying Hu ; Shahram Ebadollahi
【Abstract】: We present a mining system that can predict the future health status of the patient using the temporal trajectories of health status of a set of similar patients. The main novelties of this system are its use of stream processing technology for handling the incoming physiological time series data and incorporating domain knowledge in learning the similarity metric between patients represented by their temporal data. The proposed approach and system were tested using the MIMIC II database, which consists of physiological waveforms, and accompanying clinical data obtained for ICU patients. The study was carried out on 1500 patients from this database. In the experiments we report the efficiency and throughput of the stream processing unit for feature extraction, the effectiveness of the supervised similarity measure both in the context of classification and retrieval tasks compared to unsupervised approaches, and the accuracy of the temporal projections of the patient data.
【Keywords】: data mining; feature extraction; information retrieval; learning (artificial intelligence); medical administrative data processing; patient monitoring; time series; ICU patient; MIMIC II database; advanced prognostic decision support; classification task; clinical data; feature extraction; patient health status; physiological time series data; retrieval task; supervised similarity measure; temporal physiological data stream mining; Patient similarity; Physiological streams
【Paper Link】 【Pages】:1067-1072
【Authors】: Xu Sun ; Hisashi Kashima ; Takuya Matsuzaki ; Naonori Ueda
【Abstract】: On large datasets, the popular training approach has been stochastic gradient descent (SGD). This paper proposes a modification of SGD, called averaged SGD with feedback (ASF), that significantly improves the performance (robustness, accuracy, and training speed) over the traditional SGD. The proposal is based on three simple ideas: averaging the weight vectors across SGD iterations, feeding the averaged weights back into the SGD update process, and deciding when to perform the feedback (linearly slowing down feedback). Theoretically, we demonstrate the reasonable convergence properties of the ASF. Empirically, the ASF outperforms several strong baselines in terms of accuracy, robustness over the noise, and the training speed. To our knowledge, this is the first study of ``feedback'' in stochastic gradient learning. Although we choose latent conditional models for verifying the ASF in this paper, the ASF is a general purpose technique just like SGD, and can be directly applied to other models.
【Keywords】: classification; data mining; feedback; gradient methods; learning (artificial intelligence); stochastic processes; very large databases; averaged stochastic gradient descent; classification; data mining; feedback; large datasets; weight vectors
【Paper Link】 【Pages】:1073-1078
【Authors】: Daniel Svonava ; Michail Vlachos
【Abstract】: We present a novel visualization methodology for graphs and high-dimensional data which combines the neighborhood preservation characteristics of the minimum spanning trees, with the grouping properties of dendrograms. We call the method `minimum spanning dendrogram'. We highlight the ability of the mapping to accurately capture both neighborhood and cluster structures. The technique accommodates the interactive cluster formation at progressively more granular levels, allowing the user to explore data relationships at different resolutions. We also compare our work with other visualization methodologies, such as ISOMAP, and highlight the distinct merits of our approach.
【Keywords】: data analysis; data visualisation; graphs; pattern clustering; tree data structures; ISOMAP; cluster structure; granular level; graph visualization; high dimensional data; interactive cluster formation; minimum spanning dendrogram; minimum spanning tree; neighborhood preservation characteristics; dendrogram; distance preservation; minimum spanning tree; single linkage hierarchical clustering
【Paper Link】 【Pages】:1079-1084
【Authors】: Lu An Tang ; Xiao Yu ; Sangkyum Kim ; Jiawei Han ; Chih-Chieh Hung ; Wen-Chih Peng
【Abstract】: A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras) with cyber (or informational)components to form a situation-integrated analytical system that responds intelligently to dynamic changes of the real-world scenarios. One key issue in CPS research is trustworthiness analysis of the observed data: Due to technology limitations and environmental influences, the CPS data are inherently noisy that may trigger many false alarms. It is highly desirable to sift meaningful information from a large volume of noisy data. In this paper, we propose a method called Tru-Alarm which finds out trustworthy alarms and increases the feasibility of CPS. Tru-Alarm estimates the locations of objects causing alarms, constructs an object-alarm graph and carries out trustworthiness inferences based on linked information in the graph. Extensive experiments show that Tru-Alarm filters out noises and false information efficiently and guarantees not missing any meaningful alarms.
【Keywords】: alarm systems; computer network security; network theory (graphs); wireless sensor networks; Cyber physical system; Tru-Alarm; object-alarm graph; sensor networks trustworthiness analysis; situation-integrated analytical system; Alarm Detection; Cyber Physical System
【Paper Link】 【Pages】:1085-1090
【Authors】: Kilian Thiel ; Michael R. Berthold
【Abstract】: In this paper we propose two methods to derive two different kinds of node similarities in a network based on their neighborhood. The first similarity measure focuses on the overlap of direct and indirect neighbors. The second similarity compares nodes based on the structure of their - possibly also very distant - neighborhoods. Instead of using standard node measures, both similarities are derived from spreading activation patterns over time. Whereas in the first method the activation patterns are directly compared, in the second method the relative change of activation over time is compared. We apply both methods to a real-world graph dataset and discuss the results.
【Keywords】: data analysis; graph theory; graph dataset; node similarity; spreading activation; graph analysis; node signatures; node similarities; spreading activation
【Paper Link】 【Pages】:1091-1096
【Authors】: Hanghang Tong ; B. Aditya Prakash ; Charalampos E. Tsourakakis ; Tina Eliassi-Rad ; Christos Faloutsos ; Duen Horng Chau
【Abstract】: Given a large graph, like a computer network, which k nodes should we immunize (or monitor, or remove), to make it as robust as possible against a computer virus attack? We need (a) a measure of the 'Vulnerability' of a given network, (b) a measure of the 'Shield-value' of a specific set of k nodes and (c) a fast algorithm to choose the best such k nodes. We answer all these three questions: we give the justification behind our choices, we show that they agree with intuition as well as recent results in immunology. Moreover, we propose NetShield a fast and scalable algorithm. Finally, we give experiments on large real graphs, where NetShield achieves tremendous speed savings exceeding 7 orders of magnitude, against straightforward competitors.
【Keywords】: computer network security; graph theory; set theory; NetShield; fast algorithm; large graph vulnerability; scalable algorithm; shield value; Vulnerability; graph mining; immunization; scalability
【Paper Link】 【Pages】:1097-1102
【Authors】: Niko Vuokko ; Petteri Kaski
【Abstract】: Clustering is one of the basic operations in data analysis, and the cluster structure of a dataset often has a marked effect on observed patterns in data. Testing whether a data mining result is implied by the cluster structure can give substantial information on the formation of the dataset. We propose a new method for empirically testing the statistical significance of patterns in real-valued data in relation to the cluster structure. The method relies on principal component analysis and is based on the general idea of decomposing the data for the purpose of isolating the null model. We evaluate the performance of the method and the information it provides on various real datasets. Our results show that the proposed method is robust and provides nontrivial information about the origin of patterns in data, such as the source of classification accuracy and the observed correlations between attributes.
【Keywords】: data analysis; data mining; pattern clustering; principal component analysis; classification accuracy; cluster structure; data analysis; data mining; dataset formation; nontrivial information; observed correlation; principal component analysis; statistical significance; clustering; principal component analysis; randomization; significance testing
【Paper Link】 【Pages】:1103-1108
【Abstract】: Sparse Coding (SC), which models the data vectors as sparse linear combinations over basis vectors, has been widely applied in machine learning, signal processing and neuroscience. In this paper, we propose a dual random projection method to provide an efficient solution to Nonnegative Sparse Coding (NSC) using small memory. Experiments on real world data demonstrate the effectiveness of the proposed method.
【Keywords】: data reduction; matrix decomposition; source coding; vectors; Nonnegative Sparse Coding; data vector; dual random projection method; sparse linear combination
【Paper Link】 【Pages】:1109-1114
【Authors】: Ke Wang ; Yabo Xu ; Raymond Chi-Wing Wong ; Ada Wai-Chee Fu
【Abstract】: Temporal data are time-critical in that the snapshot at each timestamp must be made available to researchers in a timely fashion. However, due to the limited data, each snapshot likely has a skewed distribution on sensitive values, which renders classical anonymization methods not possible. In this work, we propose the “reposition model” to allow a record to be published within a close proximity of original timestamp. We show that reposition over a small proximity of timestamp is sufficient for reducing the skewness of a snapshot, therefore, minimizing the impact on window queries. We formalize the optimal reposition problem and present a linear-time solution. The contribution of this work is that it enables classical methods on temporal data.
【Keywords】: data privacy; publishing; temporal databases; classical anonymization; linear-time solution; reposition model; skewed distribution; snapshot; temporal data; timestamp; window queries; Anonymization; Privacy; Temporal Data
【Paper Link】 【Pages】:1115-1120
【Authors】: Zheng Wang ; Yangqiu Song ; Changshui Zhang
【Abstract】: In this paper, we present a homotopy regularization algorithm for boosting. We introduce a regularization term with adaptive weight into the boosting framework and compose a homotopy objective function. Optimization of this objective approximately composes a solution path for the regularized boosting. Following this path, we can find suitable solution efficiently using early stopping. Experiments show that this adaptive regularization method gives a more efficient parameter selection strategy than regularized boosting and semi supervised boosting algorithms, and significantly improves the performances of traditional AdaBoost and related methods.
【Keywords】: learning (artificial intelligence); optimisation; AdaBoost; homotopy regularization algorithm; optimization; parameter selection strategy; regularized boosting; semisupervised boosting; boosting; homotopy regularization
【Paper Link】 【Pages】:1121-1126
【Authors】: Jianshu Weng ; Ee-Peng Lim ; Qi He ; Cane Wing-ki Leung
【Abstract】: When micro logging becomes a very popular social media, finding interesting posts from high volume stream of user posts is a challenging research problem. To organize large number of posts, users can assign tags to posts so that these posts can be navigated and searched by tag. In this paper, we focus on modeling the interestingness of hash tags in Twitter, the largest and most active micro logging site. We propose to first construct communities based on both follow links and tagged interactions. We then measure the dispersion and divergence of users and tweets using hash tags among the constructed communities. The interestingness of hash tags are then derived from these community-based dispersion and divergence features. We further introduce a supervised approach to rank hash tags by interestingness. Our experiments on a Twitter dataset show that the proposed approach achieves a fairly good performance.
【Keywords】: social networking (online); Twitter; community-based dispersion; divergence features; hashtags; high volume stream; microblogging site; social media; supervised approach; Twitter; hashtag; interestingness; ranking
【Paper Link】 【Pages】:1127-1132
【Authors】: Raymond Chi-Wing Wong ; Ada Wai-Chee Fu ; Ke Wang ; Yabo Xu ; Jian Pei ; Philip S. Yu
【Abstract】: Background knowledge is an important factor in privacy preserving data publishing. Probabilistic distribution-based background knowledge is a powerful kind of background knowledge which is easily accessible to adversaries. However, to the best of our knowledge, there is no existing work that can provide a privacy guarantee under adversary attack with such background knowledge. The difficulty of the problem lies in the high complexity of the probability computation and the non-monotone nature of the privacy condition. The only solution known to us relies on approximate algorithms with no known error bound. In this paper, we propose a new bounding condition that overcomes the difficulties of the problem and gives a privacy guarantee. This condition is based on probability deviations in the anonymized data groups, which is much easier to compute and which is a monotone function on the grouping sizes.
【Keywords】: computational complexity; data analysis; data privacy; inference mechanisms; anonymized data; background knowledge; computation complexity; nonmonotone nature; probabilistic distribution; probabilistic inference protection; probability deviation; k-anonymity; l-diversity; privacy preserving data publishing
【Paper Link】 【Pages】:1133-1138
【Authors】: Jun Wu ; Mingyu Lu ; Chun-Li Wang
【Abstract】: Similarity measure is a critical component in image retrieval systems, and learning similarity measure from the relevance feedback has become a promising way to enhance retrieval performance. Existing approaches mainly focus on learning the visual similarity measure from online feedbacks or constructing the semantic similarity measure depended on historical feedbacks log. However, there is still a big room to elevate the retrieval performance, because few works take the relationship between the visual similarity and the semantic similarity into account. This paper proposes the collaborative learning similarity measure, CoSim, which focuses on the collaborative learning between the visual content of images and the hidden semantic in log. Concretely, the semantic similarity is first learned from log data and serves as prior knowledge. Then, the visual similarity is learned from a mixture of labeled and unlabeled images. In particular, unlabeled images are exploited for the relevant and irrelevant classes in different ways. Finally, the collaborative learning similarity is produced by integrating the visual similarity and the semantic similarity in a nonlinear way. An empirical study shows that the proposed CoSim is significantly more effective than some existing approaches.
【Keywords】: groupware; image retrieval; learning (artificial intelligence); relevance feedback; CoSim; collaborative learning similarity measure; hidden semantic; image retrieval system; image visual content; learning similarity measure; relevance feedback; semantic similarity measure; collaborative learning; image retrieval; long-term learning; relevance feedback; short-term learning
【Paper Link】 【Pages】:1139-1144
【Authors】: Yan Xie ; Philip S. Yu
【Abstract】: Frequent pattern mining is a fundamental problem in data mining research. We note that almost all state-of-the art algorithms may not be able to mine very long patterns in a large database with a huge set of frequent patterns. In this paper, we point our research to solve this difficult problem from a different perspective: we focus on mining top-k long maximal frequent patterns because long patterns are in general more interesting ones. Different from traditional level-wise mining or tree-growth strategies, our method works in a top-down manner. We pull large maximal cliques from a pattern graph constructed after some fast initial processing, and directly use such large-sized maximal cliques as promising candidates for long frequent patterns. A separate refinement stage is needed to further transform these candidates into true maximal patterns.
【Keywords】: data mining; tree data structures; data mining; frequent pattern mining; large sized maximal clique; level wise mining; max clique; top down graph based approach; top-k long maximal frequent patterns; tree growth strategies; frequent pattern mining; pattern graph; top-down
【Paper Link】 【Pages】:1145-1150
【Authors】: Qingyan Yang ; Ju Fan ; Jianyong Wang ; Lizhu Zhou
【Abstract】: Web-page recommendation is to predict the next request of pages that Web users are potentially interested in when surfing the Web. This technique can guide Web users to find more useful pages without asking for them explicitly and has attracted much attention in the community of Web mining. However, few studies on Web page recommendation consider personalization, which is an indispensable feature to meet various preferences of users. In this paper, we propose a personalized Web page recommendation model called PIGEON (abbr. for PersonalIzed web paGe rEcommendatiON) via collaborative filtering and a topic-aware Markov model. We propose a graph-based iteration algorithm to discover users' interested topics, based on which user similarities are measured. To recommend topically coherent pages, we propose a topic-aware Markov model to learn users' navigation patterns which capture both temporal and topical relevance of pages. A thorough experimental evaluation conducted on a large real dataset demonstrates PIGEON's effectiveness and efficiency.
【Keywords】: Internet; Markov processes; graph theory; groupware; information filtering; iterative methods; recommender systems; PIGEON; collaborative filtering; graph based iteration algorithm; personalized Web page recommendation model; requested Web pages prediction; topic aware Markov model; Collaborative Filtering; Markov model; Personalized Recommendation; Web Page Clustering
【Paper Link】 【Pages】:1151-1156
【Authors】: Hwanjo Yu ; Sungchul Kim
【Abstract】: Active sampling (also called active learning or selective sampling) has been extensively researched for classification and rank learning methods, which is to select the most informative samples from unlabeled data such that, once the samples are labeled, the accuracy of the function learned from the samples is maximized. While active sampling methods require learning a function at each iteration to find the most informative samples, this paper proposes passive sampling techniques for regression, which find the informative samples not based on the learned function but based on the samples' geometric characteristics in the feature space. Passive sampling is more efficient than active sampling, as it does not require, at each iteration, learning and validating the regression functions and evaluating the unlabeled data using the function. For regression, passive sampling is also more effective, Active sampling for regression suffers from serious performance fluctuations in practice, because it selects the samples of highest regression errors and such samples are likely noisy. Passive sampling, on the other hand, shows more stable performance. We observe from our extensive experiments that our passive sampling methods perform even better than the ``omniscient'' active sampling that knows the labels of unlabeled data.
【Keywords】: learning (artificial intelligence); regression analysis; sampling methods; active learning; active sampling; passive sampling; rank learning method; regression; selective sampling; active learning; active sampling; passive sampling; regression; selective sampling
【Paper Link】 【Pages】:1157-1162
【Authors】: Jun Yu ; Weng-Keen Wong ; Rebecca A. Hutchinson
【Abstract】: Citizen scientists, who are volunteers from the community that participate as field assistants in scientific studies, enable research to be performed at much larger spatial and temporal scales than trained scientists can cover. Species distribution modeling, which involves understanding species-habitat relationships, is a research area that can benefit greatly from citizen science. The eBird project is one of the largest citizen science programs in existence. By allowing birders to upload observations of bird species to an online database, eBird can provide useful data for species distribution modeling. However, since birders vary in their levels of expertise, the quality of data submitted to eBird is often questioned. In this paper, we develop a probabilistic model called the Occupancy-Detection-Expertise (ODE) model that incorporates the expertise of birders submitting data to eBird. We show that modeling the expertise of birders can improve the accuracy of predicting observations of a bird species at a site. In addition, we can use the ODE model for two other tasks: predicting birder expertise given their history of eBird checklists and identifying bird species that are difficult for novices to detect.
【Keywords】: biology computing; probability; zoology; birder expertise prediction; citizen science data; citizen scientist; eBird checklists; eBird project; field assistant; modeling expert; modeling novices; occupancy detection expertise model; species distribution modeling; Applications; Citizen Science; Contrast Mining; Graphical Models; Species Distribution Modeling
【Paper Link】 【Pages】:1163-1168
【Authors】: Kui Yu ; Xindong Wu ; Hao Wang ; Wei Ding
【Abstract】: In this paper, we study a new research problem of causal discovery from streaming features. A unique characteristic of streaming features is that not all features can be available before learning begins. Feature generation and selection often have to be interleaved. Managing streaming features has been extensively studied in classification, but little attention has been paid to the problem of causal discovery from streaming features. To this end, we propose a novel algorithm to solve this challenging problem, denoted as CDFSF (Causal Discovery From Streaming Features) which consists of two phases: growing and shrinking. In the growing phase, CDFSF finds candidate parents or children for each feature seen so far, while in the shrinking phase the algorithm dynamically removes false positives from the current sets of candidate parents and children. In order to improve the efficiency of CDFSF, we present S-CDFSF, a faster version of CDFSF, using two symmetry theorems. Experimental results validate our algorithms in comparison with other state-of-art algorithms of causal discovery.
【Keywords】: belief networks; causality; feature extraction; S-CDFSF; causal discovery; feature generation; feature selection; streaming feature; Bayesian networks; causal discovery; streaming features
【Paper Link】 【Pages】:1169-1174
【Authors】: Chongsheng Zhang ; Florent Masseglia ; Yves Lechevallier
【Abstract】: Usage data mining is an important research area with applications in various fields. However, usage data is usually considered streaming, due to its high volumes and rates. Because of these characteristics, we only have access, at any point in time, to a small fraction of the stream. When the data is observed through such a limited window, it is challenging to give a reliable description of the recent usage data. We study the important consequences of these constraints, through the “bounce rate” problem and the clustering of usage data streams. Then, we propose the ABS (Anti-Bouncing Stream) model which combines the advantages of previous models but discards their drawbacks. First, under the same resource constraints as existing models in the literature, ABS can better model the recent data. Second, owing to its simple but effective management approach, the data in ABS is available at any time for analysis. We demonstrate its superiority through a theoretical study and experiments on two real-world data sets.
【Keywords】: data handling; pattern clustering; anti bouncing model; bounce rate problem; data stream; data streams clustering; bounce rate; clustering; data streams; usage
【Paper Link】 【Pages】:1175-1180
【Authors】: Peng Zhang ; Xingquan Zhu ; Jianlong Tan ; Li Guo
【Abstract】: Ensemble learning is a commonly used tool for building prediction models from data streams, due to its intrinsic merits of handling large volumes stream data. Despite of its extraordinary successes in stream data mining, existing ensemble models, in stream data environments, mainly fall into the ensemble classifiers category, without realizing that building classifiers requires labor intensive labeling process, and it is often the case that we may have a small number of labeled samples to train a few classifiers, but a large number of unlabeled samples are available to build clusters from data streams. Accordingly, in this paper, we propose a new ensemble model which combines both classifiers and clusters together for mining data streams. We argue that the main challenges of this new ensemble model include (1) clusters formulated from data streams only carry cluster IDs, with no genuine class label information, and (2) concept drifting underlying data streams makes it even harder to combine clusters and classifiers into one ensemble framework. To handle challenge (1), we present a label propagation method to infer each cluster's class label by making full use of both class label information from classifiers, and internal structure information from clusters. To handle challenge (2), we present a new weighting schema to weight all base models according to their consistencies with the up-to-date base model. As a result, all classifiers and clusters can be combined together, through a weighted average mechanism, for prediction. Experiments on real-world data streams demonstrate that our method outperforms simple classifier ensemble and cluster ensemble for stream data mining.
【Keywords】: data handling; data mining; learning (artificial intelligence); pattern classification; class label information; concept drifting data stream; ensemble classifier category; ensemble clustering; ensemble learning; internal structure information; label propagation method; stream data mining; weighted average mechanism; Classification; Concept Drifting; Data Stream Mining; Ensemble Learning
【Paper Link】 【Pages】:1181-1186
【Authors】: Xianchao Zhang ; Yansheng Jiang ; Wenxin Liang ; Xin Han
【Abstract】: Graph-based semi-supervised learning algorithms have attracted a lot of attention. Constructing a good graph is playing an essential role for all these algorithms. Many existing graph construction methods(e.g. Gaussian Kernel etc.) require user input parameter, which is hard to configure manually. In this paper, we propose a parameter-free similarity measure Adaptive Similarity Estimation (ASE), which constructs the graph by adaptively optimizing linear combination of its neighbors. Experimental results show the effectiveness of our proposed method.
【Keywords】: adaptive estimation; graphs; learning (artificial intelligence); optimisation; pattern classification; pattern matching; adaptive similarity estimation; adaptively optimizing linear combination; graph based semisupervised learning; graph construction method; parameter free similarity; adaptive similarity estimation; classification; semi-supervised learning
【Paper Link】 【Pages】:1187-1192
【Authors】: Xiangliang Zhang ; Wei Wang ; Kjetil Nørvåg ; Michèle Sebag
【Abstract】: The Affinity Propagation (AP) clustering algorithm proposed by Frey and Dueck (2007) provides an understandable, nearly optimal summary of a data set. However, it suffers two major shortcomings: i) the number of clusters is vague with the user-defined parameter called self-confidence, and ii) the quadratic computational complexity. When aiming at a given number of clusters due to prior knowledge, AP has to be launched many times until an appropriate setting of self-confidence is found. The re-launched AP increases the computational cost by one order of magnitude. In this paper, we propose an algorithm, called K-AP, to exploit the immediate results of K clusters by introducing a constraint in the process of message passing. Through theoretical analysis and experimental validation, K-AP was shown to be able to directly generate K clusters as user defined, with a negligible increase of computational cost compared to AP. In the meanwhile, K-AP preserves the clustering quality as AP in terms of the distortion. K-AP is more effective than k-medoids w.r.t. the distortion minimization and higher clustering purity.
【Keywords】: computational complexity; message passing; pattern clustering; K-AP; affinity propagation clustering algorithm; distortion minimization; message passing; quadratic computational complexity; specified K clusters generation; theoretical analysis; affinity propagation; clustering; k-medoids
【Paper Link】 【Pages】:1193-1198
【Authors】: Yuan Zhang ; Jie Tang ; Jimeng Sun ; Yiran Chen ; Jinghai Rao
【Abstract】: Human emotion is one important underlying force affecting and affected by the dynamics of social networks. An interesting question is “can we predict a person's mood based on his historic emotion log and his social network?”. In this paper, we propose a Mood Cast method based on a dynamic continuous factor graph model for modeling and predicting users' emotions in a social network. Mood Cast incorporates users' dynamic status information (e.g., locations, activities, and attributes) and social influence from users' friends into a unified model. Based on the historical information (e.g., network structure and users' status from time 0 to t-1), Mood Cast learns a discriminative model for predicting users' emotion status at time t. To the best of our knowledge, this work takes the first step in designing a principled model for emotion prediction in social networks. Our experimental results on both real social network and virtual web-based network show that we can accurately predict emotion status of more than 62% of users and 8+% improvement than the baseline methods.
【Keywords】: emotion recognition; graph theory; human computer interaction; prediction theory; social networking (online); MoodCast; dynamic continuous factor graph model; emotion prediction; historical information; social networks; virtual Web based network; Emotion dynamics; Predictive model; Social influence; Social network
【Paper Link】 【Pages】:1199-1204
【Authors】: Li Zheng ; Tao Li ; Chris H. Q. Ding
【Abstract】: Ensemble clustering has emerged as an important elaboration of the classical clustering problems. Ensemble clustering refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Many approaches have been developed to solve ensemble clustering problems over the last few years. However, most of these ensemble techniques are designed for partitional clustering methods. Few research efforts have been reported for ensemble hierarchical clustering methods. In this paper, we propose a hierarchical ensemble clustering framework which can naturally combine both partitional clustering and hierarchical clustering results. We notice the importance of ultra-metric distance for hierarchical clustering and propose a novel method for learning the ultra-metric distance from the aggregated distance matrices and generating final hierarchical clustering with enhanced cluster separation. Experimental results demonstrate the effectiveness of our proposed approaches.
【Keywords】: data mining; matrix algebra; pattern clustering; aggregated distance matrix; ensemble clustering; hierarchical clustering method; partitional clustering method; ultrametric distance; Hierarchical ensemble clustering; Ultra-metric
【Paper Link】 【Pages】:1205-1210
【Authors】: Jia Zou ; Jing Xiao ; Rui Hou ; Yanqi Wang
【Abstract】: When parallelism and heterogeneity has become the trend for computer system design, both the size and the complexity of the hardware sample data generated by Performance Monitoring Unit (PMU) increase fast, thus automatic analysis methods, i.e. data mining methods, are urgently needed to accelerate hardware sample data analysis. We are the first to study instruction sequential pattern mining for hardware sample data. It is a challenging task due to the implicit sequential relationship contained in the data and due to the importance of low frequency patterns. As a solution, we i) provide a novel algorithm ProfSpan, ii) adapt two existing algorithms, which are based on candidate generation and projected database generation, to hardware sample data. Our evaluation results show ProfSpan can reduce up to 75% and 80% of execution time compared with other two algorithms. Particularly, up to 50% of frequent patterns mined by ProfSpan in hardware sample data are crossing basic block boundaries and can not be found by existing methods for source code or disassembly code. We also analyze three example patterns identified by ProfSpan: consecutive loads, JIT entry sequence, and conditional code dependency sequence, to illustrate how ProfSpan can benefit performance analysis. Finally, we apply patterns to module classification and obtain promising results.
【Keywords】: data analysis; data mining; database management systems; parallel processing; source coding; JIT entry sequence; ProfSpan; automatic analysis methods; candidate generation; computer system design; conditional code dependency sequence; data mining; database generation; disassembly code; hardware sample data analysis; instruction sequential pattern mining; pattern classification; performance monitoring unit; source code; hardware sample data; performance analysis; sequential patter mining
【Paper Link】 【Pages】:1211-1216
【Authors】: Huisheng Zhu ; Peng Wang ; Xianmang He ; Yujia Li ; Wei Wang ; Baile Shi
【Abstract】: Frequent serial episodes within an event sequence describe the behavior of users or systems about the application. Existing mining algorithms calculate the frequency of an episode based on overlapping or non-minimal occurrences, which is prone to over-counting the support of long episodes or poorly characterizing the followed-by-closely relationship over event types. In addition, due to utilizing the Apriori-style level wise approach, these algorithms are computationally expensive. In this paper, we propose an efficient algorithm MANEPI (Minimal And Non-overlapping EPIsode) for mining more interesting frequent episodes within the given event sequence. The proposed frequency measure takes both minimal and non-overlapping occurrences of an episode into consideration and ensures better mining quality. The introduced depth first search strategy with the Apriori Property for performing episode growth greatly improves the efficiency of mining long episodes because of scanning the given sequence only once and not generating candidate episodes. Moreover, an optimization technique is presented to narrow down search space and speed up the mining process. Experimental evaluation on both synthetic and real-world datasets demonstrates that our algorithms are more efficient and effective.
【Keywords】: data mining; Apriori style level wise approach; MANEPI algorithm; Minimal And Non-overlapping EPIsode; efficient episode mining; event sequence; frequent serial episode; nonoverlapping occurrence; Data mining; Event sequence; Frequent episode; Minimal and non-overlapping occurrences; Prefix tree
【Paper Link】 【Pages】:1217
【Authors】: Vania Bogorny ; Shashi Shekhar
【Abstract】: Summary form only given. The recent advances and price reduction of technologies for collecting spatial and spatio-temporal data like Satellite Images, Cellular Phones, Sensor Networks, and GPS devices has facilitated the collection of data referenced in space and time. These huge collections of data often hide interesting information which conventional systems and classical data mining techniques are unable to discover. Spatial and spatio-temporal data are embedded in continuous space, whereas classical datasets (e.g. transactions) are often discrete. Spatial and spatio-temporal data require complex data preprocessing, transformation, data mining, and post-processing techniques to extract novel, useful, and understandable patterns. The importance of spatial and spatio-temporal data mining is growing with the increasing incidence and importance of large geo-spatial datasets such as maps, repositories of remote-sensing images, trajectories of moving objects generated by mobile devices, etc. Applications include Mobile-commerce industry (location-based services), climatologically effects of El Nino, land-use classification and global change using satellite imagery, finding crime hot spots, local instability in traffic, migration of birds, fishing control, pedestrian behavior analysis, and so on. Thus, new methods are needed to analyze spatial and spatio-temporal data to extract interesting, useful, and non-trivial patterns. The main goal of this tutorial is to disseminate this research field, giving an overview of the current state of the art and the main methodologies and algorithms for spatial and spatio-temporal data mining. This tutorial is directed to researches and practitioners, experts in data mining, analysts of spatial and spatio-temporal data, as well as knowledge engineers and domain experts from different application areas.
【Keywords】: data encapsulation; data mining; electronic commerce; geographic information systems; image motion analysis; mobile computing; mobile handsets; remote sensing; spatiotemporal phenomena; GPS device; cellular phone; climatologically effect; conventional system; data preprocessing; fishing control; geospatial dataset; information hide; knowledge engineer; land use classification; mobile commerce industry; mobile device; moving object trajectory; pattern extraction; pedestrian behavior analysis; price reduction; remote sensing image; satellite image; satellite imagery; sensor network; spatiotemporal data mining; semantic trajectory data mining; semantic trajectory pattern mining; spatial data mining; trajectory data mining
【Paper Link】 【Pages】:1218
【Authors】: Jun Huan
【Abstract】: In United State several universities and research institutes including the national health institute (NIH) recently started programs aiming for drug discovery. With the initiatives, huge volumes of data have been collected and shared with public free of charge. Those initiatives provide an unprecedented opportunity for data miner and machine learner to study knowledge discovery problems associated with drug design. In this tutorial, the presenter will review the knowledge discovery and management needs in the drug discovery process. Latest methodology development, primarily those from data mining, machine learning, and statistical learning will be discussed.
【Keywords】: data mining; learning (artificial intelligence); pharmaceuticals; Academic Drug Discovery Programs; data mining; drug design; drug discovery process; knowledge discovery; machine learning; national health institute; statistical learning
【Paper Link】 【Pages】:1219
【Authors】: Eamonn J. Keogh
【Abstract】: While ICDM has traditionally enjoyed an unusually high quality of reviewing, there is no doubt that publishing in ICDM is very challenging. In this tutorial Dr. Keogh will demonstrate some simple ideas to enhance the probability of success in getting your paper published in a top data mining conference, and after the work is published, getting it highly cited.
【Keywords】: data mining; ICDM; data mining conference; data mining research; Experimentation; Research
【Paper Link】 【Pages】:1220
【Authors】: Emmanuel Müller ; Stephan Günnemann ; Ines Färber ; Thomas Seidl
【Abstract】: Traditional clustering algorithms identify just a single clustering of the data. Today's complex data, however, allow multiple interpretations leading to several valid groupings hidden in different views of the database. Each of these multiple clustering solutions is valuable and interesting as different perspectives on the same data and several meaningful groupings for each object are given. Especially for high dimensional data where each object is described by multiple attributes, alternative clusters in different attribute subsets are of major interest. In this tutorial, we describe several real world application scenarios for multiple clustering solutions. We abstract from these scenarios and provide the general challenges in this emerging research area. We describe state-of-the-art paradigms, we highlight specific techniques, and we give an overview of this topic by providing a taxonomy of the existing methods. By focusing on open challenges, we try to attract young researchers for participating in this emerging research field.
【Keywords】: pattern clustering; grouping object; high dimensional data; multiple clustering solution; multiple interpretation; alternative clustering; data mining; multiple perspectives; orthogonal clustering; subspace clustering