2015 IEEE International Conference on Data Mining, ICDM 2015, Atlantic City, NJ, USA, November 14-17, 2015. IEEE 【DBLP Link】
【Paper Link】 【Pages】:1-10
【Authors】: Nesreen K. Ahmed ; Jennifer Neville ; Ryan A. Rossi ; Nick G. Duffield
【Abstract】: From social science to biology, numerous applications rely on graphlets for intuitive and meaningful characterization of networks at both the global macro-level and the local micro-level. While graphlets have witnessed tremendous success and impact in a variety of domains, there has yet to be a fast and efficient approach for computing the frequencies of these subgraph patterns; existing methods do not scale to large networks with millions of nodes and edges, which impedes the application of graphlets to new problems that require large-scale network analysis. To address these problems, we propose a fast, efficient, and parallel algorithm for counting graphlets of size k = {3,4} nodes that takes only a fraction of the time required by current methods. The proposed graphlet counting algorithm leverages a number of proven combinatorial arguments for different graphlets. For each edge, we count a few graphlets, and from these counts together with the combinatorial arguments, we obtain the exact counts of the others in constant time. On a large collection of 300+ networks from a variety of domains, our graphlet counting strategies are on average 460x faster than current methods. This brings new opportunities to investigate the use of graphlets on much larger networks and newer applications, as we show in the experiments. To the best of our knowledge, this paper provides the largest graphlet computations to date as well as the largest systematic investigation, over 300+ networks from a variety of domains.
【Keywords】: graph theory; mathematics computing; parallel algorithms; graphlet counting algorithm; graphlet counting strategy; large-scale network analysis; parallel algorithm; social science; Algorithm design and analysis; Frequency-domain analysis; Kernel; Peer-to-peer computing; Scalability; Systematics; graph classification; graph kernel; graphlet; motif counting; parallel method; visual analytics
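【Sketch】: Not the authors' parallel k = {3,4} algorithm -- a minimal Python illustration of the general per-edge counting idea for k = 3: count triangles on each edge via a neighbor-set intersection, then derive the open 2-path (wedge) count per edge in constant time from degrees alone.

    def count_3node_graphlets(adj):
        # adj: dict mapping each node to its set of neighbors (undirected, no self-loops)
        tri_total, path_total = 0, 0
        done = set()
        for u in adj:
            for v in adj[u]:
                if (v, u) in done:
                    continue
                done.add((u, v))
                tri_e = len(adj[u] & adj[v])  # common neighbors close a triangle on (u, v)
                tri_total += tri_e
                # neighbors of exactly one endpoint form open 2-paths through (u, v)
                path_total += (len(adj[u]) - 1 - tri_e) + (len(adj[v]) - 1 - tri_e)
        # a triangle is seen once per its 3 edges, a 2-path once per its 2 edges
        return tri_total // 3, path_total // 2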
【Paper Link】 【Pages】:11-20
【Authors】: Grey Ballard ; Tamara G. Kolda ; Ali Pinar ; C. Seshadhri
【Abstract】: Given two sets of vectors, A = {a1→, . . . , am→} and B = {b1→, . . . , bn→}, our problem is to find the top-t dot products, i.e., the largest |ai→ · bj→| among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all mn dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of A and B. The probability of selecting a diamond corresponding to pair (i, j) is proportional to (ai→ · bj→)2, amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and obtain significantly better results than state-of-the-art hashing methods.
【Keywords】: sampling methods; search problems; vectors; approximate MAD search; diamond sampling; maximum all-pairs dot-product; mn dot products; vectors; weighted tripartite representation; Collaboration; Data mining; Diamonds; Indexes; Manganese; Search problems; Sparse matrices
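【Sketch】: A simplified illustration of the sampling idea for nonnegative data: "wedge" sampling draws pairs (i, j) with probability proportional to ai→ · bj→, whereas the paper's diamond sampling uses four-cycles to make the probability proportional to the square, sharpening the focus on large entries. The function below is an assumed toy interface, not the authors' implementation.

    import numpy as np

    def wedge_sample_candidates(A, B, num_samples, seed=0):
        # A: m x d with rows a_i; B: n x d with rows b_j; both nonnegative.
        rng = np.random.default_rng(seed)
        col_a = A.sum(axis=0)              # mass of each coordinate k in A
        col_b = B.sum(axis=0)              # mass of each coordinate k in B
        p_k = (col_a * col_b) / np.dot(col_a, col_b)
        counts = {}
        for _ in range(num_samples):
            k = rng.choice(A.shape[1], p=p_k)                 # pick a coordinate
            i = rng.choice(A.shape[0], p=A[:, k] / col_a[k])  # endpoint through k
            j = rng.choice(B.shape[0], p=B[:, k] / col_b[k])
            counts[(i, j)] = counts.get((i, j), 0) + 1
        # frequently sampled pairs are candidates for the top-t dot products
        return sorted(counts, key=counts.get, reverse=True)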
【Paper Link】 【Pages】:21-30
【Authors】: Biao Chang ; Feida Zhu ; Enhong Chen ; Qi Liu
【Abstract】: The problem of information source detection, whose goal is to identify the source of a piece of information from a diffusion process (e.g., computer virus, rumor, epidemic, and so on), has attracted ever-increasing attention from the research community in recent years. Although various methods have been proposed, such as those based on centrality, spectral methods, and belief propagation, the existing solutions still suffer from high time complexity and inadequate effectiveness. To this end, we revisit this problem in this paper and present a comprehensive study from the perspective of likelihood approximation. Unlike many previous works, we consider both infected and uninfected nodes to estimate the likelihood for the detection. Specifically, we propose a Maximum A Posteriori (MAP) estimator to detect the information source for general graphs with rumor centrality as the prior. To further improve the efficiency, we design two approximate estimators, namely Brute Force Search Approximation (BFSA) and Greedy Search Bound Approximation (GSBA). BFSA traverses the permitted permutations and directly computes the likelihood, while GSBA exploits a greedy search strategy to find a surrogate upper bound of the probabilities of the permitted permutations for a given node, and derives an approximate MAP estimator. Extensive experiments on several network data sets clearly demonstrate the effectiveness of our methods in detecting the single information source.
【Keywords】: greedy algorithms; information networks; maximum likelihood estimation; search problems; BFSA; GSBA; MAP estimator; brute force search approximation; greedy search bound approximation; information source detection; maximum a posteriori estimation; Conferences; Data mining; greedy search; information source detection; likelihood approximation; maximum a posteriori
【Paper Link】 【Pages】:31-40
【Authors】: Chien-Wei Chang ; Po-An Yang ; Ming-Han Lyu ; Kun-Ta Chuang
【Abstract】: In this paper, we study a novel paradigm of viral marketing whose goal is to sustain influential effectiveness in the network. We study real cases such as the Ice Bucket Challenge for ALS awareness, and identify the "easy come, easy go" phenomenon in marketing promotion. This natural property is unexplored in the literature, yet it undermines the many marketing applications that seek perpetual attention and support. We thus highlight the problem of Influential Sustainability: pursuing long-term and effective influence on the network. Given a set of initial seeds S and a threshold ρ, the goal of Influential Sustainability is to decide the timing for activating each seed in S so as to maximize the number of iterations in which more than ρ inactive nodes are activated. The Influential Sustainability problem is challenging due to its #P-hard nature. In addition to the greedy idea, we present three strategies to heuristically decide the activation timing for each seed. As demonstrated in the empirical study on real data, beyond providing the flexibility to strike a compromise between execution efficiency and result quality, these heuristic algorithms can be executed highly efficiently while sustaining a longer period of continuously and effectively activating inactive nodes. The results demonstrate their prominent advantage as practical algorithms for this promising viral marketing paradigm.
【Keywords】: marketing; social networking (online); ALS awareness; heuristic algorithms; ice bucket challenges; inactive nodes; influential sustainability problem; marketing applications; marketing promotion; social networks; viral marketing; Algorithm design and analysis; Approximation algorithms; Heuristic algorithms; Integrated circuit modeling; Market research; Social network services; Timing; Influential Sustainability; Network Diffusion; Social Networks
【Paper Link】 【Pages】:41-50
【Authors】: Yifan Chen ; Xiang Zhao ; Xuemin Lin ; Yang Wang
【Abstract】: Uncertainty is intrinsic to a wide spectrum of real-life applications, which inevitably applies to graph data. Representative uncertain graphs are seen in bio-informatics, social networks, etc. This paper motivates the problem of frequent subgraph mining on single uncertain graphs. We present an enumeration-evaluation algorithm to solve the problem. After showing that support computation on an uncertain graph is #P-hard, we develop an approximation algorithm with an accuracy guarantee for this purpose. To enhance the solution, we devise optimization techniques to achieve better mining performance. Experimental results on real-life data confirm the usability of the algorithm.
【Keywords】: approximation theory; computational complexity; data mining; graph theory; #P-hard; approximation algorithm; enumeration-evaluation algorithm; frequent subgraph mining; optimization techniques; single uncertain graphs; Conferences; Data mining; FSM; single uncertain graph
【Paper Link】 【Pages】:51-60
【Authors】: Yi-Ling Chen ; Ming-Syan Chen ; Philip S. Yu
【Abstract】: Previous research has aimed to lower the cost of handling large networks by reducing the network size via sparsification. However, when many edges are removed from the network, the information that can be used for link prediction becomes rather limited, and the prediction accuracy thereby drops significantly. To address this issue, we propose a framework called Diverse Ensemble of Drastic Sparsification (DEDS), which constructs ensemble classifiers with good accuracy while keeping the prediction time short. DEDS includes various sparsification methods that are designed to preserve different measures of a network. Therefore, DEDS can generate sparsified networks with significant structural differences and increase the diversity of the ensemble classifier, which is key to improving prediction performance. When a network is drastically sparsified to 0.1% of the original one, DEDS effectively relieves the drop in prediction accuracy and raises the AUC value from 0.52 to 0.70. With a larger sparsification ratio, DEDS is even able to outperform the classifier trained on the original network. As for efficiency, more than 95% of the prediction cost can be saved when the network is sparsified to 1% of the original one. If the original network is disk-resident but can fit into main memory after being sparsified, as much as 99.94% of the prediction cost can be saved.
【Keywords】: graph theory; network theory (graphs); DEDS; diverse ensemble of drastic sparsification; large-scale network; link prediction; network analysis; Algorithm design and analysis; Approximation algorithms; Approximation methods; Clustering algorithms; Computational efficiency; Prediction algorithms; Size measurement; ensemble classifier; link prediction; network sparsification
【Paper Link】 【Pages】:61-70
【Authors】: Yong Chen ; Hui Zhang ; Junjie Wu ; Xingguang Wang ; Rui Liu ; Mengxiang Lin
【Abstract】: Dynamic topic models (DTM) are of great use for analyzing the evolution of the unobserved topics of a text collection over time. Recent years have witnessed the explosive growth of streaming text data emerging from online media, which creates an unprecedented need for DTMs for timely event analysis. While there have been some matrix factorization methods in the literature for dynamic topic modeling, further study is still needed to model emerging, evolving, and fading topics in a more natural and effective way. In light of this, we first propose a matrix factorization model called SONMFSR (Soft Orthogonal NMF with Sparse Representation), which makes full use of soft orthogonal and sparsity constraints for static topic modeling. Furthermore, by introducing constraints for emerging, evolving, and fading topics to SONMFSR, we easily obtain a novel DTM called SONMFSRd for dynamic event analysis. Extensive experiments on two public corpora demonstrate the superiority of SONMFSRd over some state-of-the-art DTMs in both topic detection and tracking. In particular, SONMFSRd shows great potential in real-world applications, where popular topics in the Two Sessions 2015 are captured and traced dynamically for possible insights.
【Keywords】: data mining; matrix decomposition; text analysis; DTM; SONMFSRd; dynamic event analysis; dynamic topic models; emerging topics; evolving topics; fading topics; matrix factorization model; soft orthogonal NMF with sparse representation; soft orthogonal constraints; sparsity constraints; static topic mining; static topic modeling; topic detection; topic tracking; Biological system modeling; Data mining; Data models; Fading; Probabilistic logic; Sparse matrices; Vocabulary; Dynamic Topic Model (DTM); Non-negative Matrix Factorization (NMF); Soft Orthogonality; Sparse Representation; Topic Detection and Tracking (TDT)
【Paper Link】 【Pages】:71-80
【Authors】: Zhenyu Chen ; Yiqiang Chen ; Xingyu Gao ; Shuangquan Wang ; Lisha Hu ; Chenggang Clarence Yan ; Nicholas D. Lane ; Chunyan Miao
【Abstract】: By capturing characteristics of surrounding contexts through the widely used Bluetooth sensor, user-centric social contexts can be effectively sensed and discovered from dynamic Bluetooth information. At present, state-of-the-art approaches for building classifiers can basically recognize only the limited set of classes trained in the learning phase; however, due to the complex diversity of social contextual behavior, the built classifier seldom deals with newly appearing contexts, which greatly degrades recognition performance. To address this problem, we propose an OSELM (online sequential extreme learning machine) based class incremental learning method for continuous and unobtrusive sensing of new classes of social contexts from dynamic Bluetooth data alone. We integrate a fuzzy clustering technique and OSELM to discover and recognize social contextual behaviors from real-world Bluetooth sensor data. Experimental results show that our method can automatically cope with incremental classes of social contexts that appear unpredictably in the real world. Further, our proposed method recognizes both the original known classes and newly appearing unknown classes effectively.
【Keywords】: Bluetooth; learning (artificial intelligence); pattern clustering; social aspects of automation; ubiquitous computing; captured characteristics; class incremental learning method; dynamic Bluetooth data alone; dynamic Bluetooth information; fuzzy clustering technique; learning phase; online sequential extreme learning machine; real-world Bluetooth sensor data; social contextual behaviors; unobtrusive sensing incremental social contexts; user-centric social contexts; Bluetooth; Context; Data mining; Feature extraction; Learning systems; Mobile handsets; Sensors; Bluetooth; Class incremental learning; Context-aware; Online Sequential Extreme Learning Machine (OSELM); Social Contextual behavior
【Paper Link】 【Pages】:81-90
【Authors】: Xuan Hong Dang ; Hongyuan You ; Petko Bogdanov ; Ambuj K. Singh
【Abstract】: Learning a succinct set of substructures that predicts global network properties plays a key role in understanding complex network data. Existing approaches address this problem by sampling the exponential space of all possible subnetworks to find ones of high prediction accuracy. In this paper, we develop a novel framework that avoids sampling by formulating the problem of predictive subnetwork learning as node selection, subject to network-constrained regularization. Our framework involves two steps: (i) subspace learning, and (ii) predictive substructures discovery with network regularization. The framework is developed based upon two mathematically sound techniques of spectral graph learning and gradient descent optimization, and we show that their solutions converge to a global optimum solution - a desired property that cannot be guaranteed by sampling approaches. Through experimental analysis on a number of real world datasets, we demonstrate the performance of our framework against state-of-the-art algorithms, not only based on prediction accuracy but also in terms of domain relevance of the discovered substructures.
【Keywords】: data handling; gradient methods; graph theory; learning (artificial intelligence); complex network data; global network properties; gradient descent optimization; learning predictive substructures; mathematically sound techniques; network data regularization; network-constrained regularization; node selection; predictive subnetwork learning; spectral graph learning; subspace learning; Algorithm design and analysis; Diseases; Electronic mail; Optimization; Prediction algorithms; Silicon; Yttrium; convex optimization; graph theory
【Paper Link】 【Pages】:91-100
【Authors】: Vinay Deolalikar
【Abstract】: Naïve Bayes (NB) classifiers are well-suited to several applications owing to their easy interpretability and maintainability. However, text classification is often hampered by the lack of adequate training data. This motivates the question: how can we train NB more effectively when training data is very scarce? In this paper, we introduce an established subsampling technique from statistics -- the jackknife -- into machine learning. Our approach jackknifes documents themselves to create new "pseudo-documents." The underlying idea is that although these pseudo-documents do not have semantic meaning, they are equally representative of the underlying distribution of terms. Therefore, they could be used to train any classifier that learns this underlying distribution, namely, any parametric classifier such as NB (but not, for example, non-parametric classifiers such as SVM and k-NN). Furthermore, the marginal value of this additional training data should be the highest precisely when the original data is inadequate. We then show that our jackknife technique is related to the question of additively smoothing NB via an appropriately defined notion of "adjointness." This relation is surprising since it connects a statistical technique for handling scarce data to a question about the NB model. Accordingly, we are able to shed light on optimal values of the smoothing parameter for NB in the very scarce data regime. We validate our approach on a wide array of standard benchmarks -- both binary and multi-class -- for two event models of multinomial NB. We show that the jackknife technique can dramatically improve the accuracy for both event models of NB in the regime of very scarce training data. In particular, our experiments show that the jackknife can make NB more accurate than SVM for binary problems in the very scarce training data regime. We also provide a comprehensive characterization of the accuracy of these important classifiers (for both binary and multiclass) in the very scarce data regime for benchmark text datasets, without feature selection and class imbalance.
【Keywords】: Bayes methods; document handling; feature selection; learning (artificial intelligence); pattern classification; sampling methods; statistical analysis; NB classifiers; NB training; Naive Bayes classifiers; SVM; additive smoothing; binary problems; class imbalance; document jackknife technique; feature selection; machine learning; multinomial NB model; parametric classifier; scarce data; statistical technique; subsampling technique; text classification; Additives; Data models; Niobium; Smoothing methods; Support vector machines; Training; Training data; Comparison between Naive Bayes and SVM; Jackknife Subsampling; Multinomial Event Models; Naive Bayes; Scarce Data
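【Sketch】: One plausible reading of the document jackknife, hedged -- the paper's exact resampling unit may differ. Leave-one-token-out resampling yields pseudo-documents that carry no semantic meaning yet follow the same term distribution, which is all a parametric learner such as multinomial NB needs.

    def jackknife_pseudo_docs(tokens):
        # Each pseudo-document drops exactly one token of the original document
        # (an assumed leave-one-out scheme, for illustration only).
        return [tokens[:i] + tokens[i + 1:] for i in range(len(tokens))]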
【Paper Link】 【Pages】:101-110
【Authors】: Thang N. Dinh ; Xiang Li ; My T. Thai
【Abstract】: Many social networks and complex systems are found to be naturally divided into clusters of densely connected nodes, known as community structure (CS). Finding CS is one of the fundamental yet challenging topics in network science. One of the most popular classes of methods for this problem is to maximize Newman's modularity. However, little is understood about how well we can approximate the maximum modularity, or about the implications of finding community structure with provable guarantees. In this paper, we definitively settle the approximability of modularity clustering, proving that approximating the problem within any (multiplicative) positive factor is intractable, unless P = NP. Yet we propose the first additive approximation algorithm for modularity clustering with a constant factor. Moreover, we provide a rigorous proof that a CS with modularity arbitrarily close to the maximum modularity QOPT might bear no similarity to the optimal CS of maximum modularity. Thus, even when a CS with near-optimal modularity is found, other verification methods are needed to confirm the significance of the structure.
【Keywords】: approximation theory; computational complexity; network theory (graphs); pattern clustering; CS; Newman's modularity; QOPT; additive approximation algorithm; approximation algorithms; community structure; complex systems; maximizing modularity; maximum modularity; modularity clustering approximability; multiplicative positive factor; near-optimal modularity; network clustering; network science; provable guarantees; social networks; theoretical limits; verification methods; Algorithm design and analysis; Approximation algorithms; Approximation methods; Cascading style sheets; Clustering algorithms; Electronic mail; Partitioning algorithms; Community structure; Newman's modularity; approximation algorithm; complex networks; inapproximability
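【Sketch】: For reference, Newman's modularity of a partition that assigns node i to community c_i, on a graph with adjacency matrix A, degrees k_i, and m edges, is

    Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j),

i.e., the fraction of edges inside communities minus its expectation under degree-preserving random rewiring; the paper's hardness and additive-approximation results concern maximizing this Q.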
【Paper Link】 【Pages】:111-120
【Authors】: Lennart Downar ; Wouter Duivesteijn
【Abstract】: Exceptional Model Mining strives to find coherent subgroups of the dataset where multiple target attributes interact in an unusual way. One previously investigated form of interaction is Pearson's correlation coefficient between two targets: EMM then finds subgroups with an exceptionally linear relation between the targets. In this paper, we enrich the EMM toolbox by developing the more general rank correlation model class. We find subgroups with an exceptionally monotone relation between the targets. Apart from catering for this richer set of relations, the rank correlation model class does not necessarily require the assumption of target normality, which is implicitly invoked in the Pearson correlation model class. Furthermore, it is less sensitive to outliers.
【Keywords】: data mining; EMM toolbox; Pearson correlation coefficient; coherent subgroups; exceptional model mining; exceptionally-monotone models; general rank correlation model; linear relation; multiple target attributes; Analytical models; Cancer; Correlation; Data mining; Lungs; Microwave integrated circuits; Numerical models
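【Sketch】: Spearman's rank correlation is the Pearson correlation applied to ranks; for tie-free data it equals 1 - 6 Σ d_i² / (n(n² - 1)), with d_i the per-instance rank differences. Below is a minimal sketch of an EMM-style quality measure built on it (the paper's exact measure may differ): how far a subgroup's rank correlation deviates from that of its complement.

    import numpy as np
    from scipy.stats import spearmanr

    def subgroup_quality(y1, y2, in_subgroup):
        # y1, y2: the two target attributes; in_subgroup: boolean membership mask
        rho_sub, _ = spearmanr(y1[in_subgroup], y2[in_subgroup])
        rho_rest, _ = spearmanr(y1[~in_subgroup], y2[~in_subgroup])
        return abs(rho_sub - rho_rest)   # large deviation = exceptional subgroup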
【Paper Link】 【Pages】:121-130
【Authors】: Olof Görnerup ; Daniel Gillblad ; Theodore Vasiloudis
【Abstract】: Appropriately defining and then efficiently calculating similarities from large data sets are often essential in data mining, both for building tractable representations and for gaining understanding of data and generating processes. Here we rely on the premise that given a set of objects and their correlations, each object is characterized by its context, i.e. its correlations to the other objects, and that the similarity between two objects therefore can be expressed in terms of the similarity between their respective contexts. Resting on this principle, we propose a data-driven and highly scalable approach for discovering similarities from large data sets by representing objects and their relations as a correlation graph that is transformed to a similarity graph. Together these graphs can express rich structural properties among objects. Specifically, we show that concepts -- representations of abstract ideas and notions -- are constituted by groups of similar objects that can be identified by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of domains, and will here be demonstrated for three distinct types of objects: codons, artists and words, where the numbers of objects and correlations range from small to very large.
【Keywords】: data mining; graph theory; correlation graph; data mining; domain-agnostic scheme; similarity discovery; similarity graph; Computational linguistics; Computer science; Context; Correlation; Data mining; Electronic mail; Semantics
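【Sketch】: A minimal sketch of the context premise, assuming cosine similarity as the context comparison (the paper may use a different measure): object i's context is row i of the correlation matrix, and the similarity graph compares those rows.

    import numpy as np

    def similarity_graph(C):
        # C: n x n correlation matrix; row i is object i's "context",
        # i.e., its correlations to all other objects.
        norms = np.linalg.norm(C, axis=1, keepdims=True)
        X = C / np.maximum(norms, 1e-12)
        return X @ X.T   # cosine similarity between contexts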
【Paper Link】 【Pages】:131-140
【Authors】: Aubrey Gress ; Ian Davidson
【Abstract】: Active learning is a crucial method in settings where human labeling of instances is challenging to obtain. The typical active learning loop builds a model from a few labeled instances, chooses informative unlabeled instances, asks an oracle (i.e., a human) to label them, and then rebuilds the model. Active learning is widely used, with much research attention focused on determining which instances to ask the human to label. However, an understudied problem is estimating the accuracy of the learner when instances are added actively. This is a problem because regular cross validation methods may not work well due to the bias in selecting instances to label. We show that existing methods to address the issue of estimating performance are not suitable for practitioners since the scaling coefficients can have high variance, the estimators can produce nonsensical results, and the estimates are empirically inaccurate in the classification setting. We propose a new general active learning method which more accurately estimates generalization performance through a sampling step and a new weighted cross validation estimator. Our method can be used with a variety of query strategies and learners. We empirically illustrate the benefits of our method to the practitioner by showing it is more accurate than the standard weighted cross validation estimator and, when used as part of a termination criterion, obtains more accurate estimates of generalization error while having comparable generalization performance.
【Keywords】: generalisation (artificial intelligence); learning (artificial intelligence); sampling methods; general active learning method; generalization performance estimation; sampling step; weighted cross validation estimator; Conferences; Labeling; Learning systems; Logistics; Standards; Training; Training data; Active Learning; Cross Validation
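【Sketch】: Not the paper's estimator -- a minimal sketch of the generic inverse-propensity idea behind weighted cross validation: reweighting each loss by the inverse of the probability that its example was actively selected corrects the selection bias that breaks ordinary cross validation.

    import numpy as np

    def weighted_error_estimate(losses, selection_probs):
        # losses: per-example validation losses of actively labeled examples
        # selection_probs: probability each example had of being selected
        w = 1.0 / np.asarray(selection_probs, dtype=float)
        return float(np.sum(w * np.asarray(losses)) / np.sum(w))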
【Paper Link】 【Pages】:141-150
【Authors】: Chu Guan ; Xinjiang Lu ; Xiaolin Li ; Enhong Chen ; Wenjun Zhou ; Hui Xiong
【Abstract】: College students with financial difficulties are those whose families can hardly afford their high tuition in universities, and who should be supported by a modern funding system. Indeed, students' economic plight negatively impacts their mental health, academic performance, and personal and social life. While funding students in financial hardship is widely accepted, there is limited understanding of, and research on, the effective identification of qualifying students. Traditional approaches relying on advisers' personal assessments are inefficient, and such subjective judgements may not reflect the truth. To this end, in this paper, we explore data mining techniques for identifying students who are qualified for financial support. Specifically, we investigate students' complex behaviors on campus from multiple perspectives, and develop a learning framework, named Dis-HARD, that jointly incorporates heterogeneous features to predict the portfolio of stipends a given student should be awarded. Our framework formalizes the above problem as a multi-label learning problem. Along this line, we first extract discriminative features from three perspectives: (i) smartcard usage behavior, (ii) internet usage behavior, and (iii) trajectory on campus. Then, we develop a linear loss function with regularization to solve this multi-label classification problem. In addition, to effectively exploit the students' similarity and label dependency, we incorporate the graph Laplacian and the composite l2,1-norm into the regularization of our model, and develop a re-weighted algorithm to achieve effective optimization. Finally, experiments on real-world data demonstrate that our method consistently provides better performance than the existing state-of-the-art methods.
【Keywords】: Internet; data mining; educational institutions; financial management; graph theory; optimisation; pattern classification; psychology; smart cards; Dis-HARD; Internet usage behavior; academic performance; campus; college student discovery; composite ℓ2,1-norm; discriminative feature extraction; financial difficulties; financial hardship; financial support; graph Laplacian; heterogeneous features; label dependency; learning framework; linear loss function; mental health; model regularization; multilabel classification problem; multilabel learning problem; personal assessments; personal life; portfolio predict; reweighted algorithm; smartcard usage behavior; social life; stipends; student complex behaviors; student similarity; tuition; universities; Correlation; Data mining; Feature extraction; Portfolios; Trajectory; Web and internet services; Student behavior analysis; multi-label classification; sequential pattern mining
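【Sketch】: For reference, the composite l2,1-norm mentioned above is, for a weight matrix W with rows w_i,

    \|W\|_{2,1} = \sum_i \|w_i\|_2 = \sum_i \sqrt{ \sum_j W_{ij}^2 },

an l1 norm over row-wise l2 norms; as a regularizer it drives entire rows to zero, selecting features jointly across all labels.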
【Paper Link】 【Pages】:151-160
【Authors】: Nika Haghtalab ; Aron Laszka ; Ariel D. Procaccia ; Yevgeniy Vorobeychik ; Xenofon D. Koutsoukos
【Abstract】: Starting with the seminal work by Kempe et al., a broad variety of problems, such as targeted marketing and the spread of viruses and malware, have been modeled as selecting a subset of nodes to maximize diffusion through a network. In cyber-security applications, however, a key consideration largely ignored in this literature is stealth. In particular, an attacker often has a specific target in mind, but succeeds only if the target is reached (e.g., by malware) before the malicious payload is detected and corresponding countermeasures deployed. The dual side of this problem is deployment of a limited number of monitoring units, such as cyber-forensics specialists, so as to limit the likelihood of such targeted and stealthy diffusion processes reaching their intended targets. We investigate the problem of optimal monitoring of targeted stealthy diffusion processes, and show that a number of natural variants of this problem are NP-hard to approximate. On the positive side, we show that if stealthy diffusion starts from randomly selected nodes, the defender's objective is submodular, and a fast greedy algorithm has provable approximation guarantees. In addition, we present approximation algorithms for the setting in which an attacker optimally responds to the placement of monitoring nodes by adaptively selecting the starting nodes for the diffusion process. Our experimental results show that the proposed algorithms are highly effective and scalable.
【Keywords】: computational complexity; greedy algorithms; invasive software; network theory (graphs); social networking (online); NP-hard problem; cyber-forensics specialists; cyber-security applications; fast greedy algorithm; malicious payload detection; malware spreading; monitoring units; optimal stealthy diffusion process monitoring; randomly selected nodes; stealthy diffusion process; virus spreading; Approximation algorithms; Approximation methods; Computational modeling; Diffusion processes; Grippers; Malware; Monitoring; Monitoring Diffusion; Monitoring influence maximization; Security Games
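【Sketch】: The standard greedy scaffold behind the submodular result, as a minimal sketch: objective is an assumed callable (e.g., the probability of detecting a randomly started stealthy diffusion before it reaches the target), not the paper's implementation.

    def greedy_placement(candidates, objective, budget):
        # For a monotone submodular objective, repeatedly adding the monitor
        # with the largest marginal gain gives a (1 - 1/e) approximation.
        chosen = set()
        for _ in range(budget):
            best = max((c for c in candidates if c not in chosen),
                       key=lambda c: objective(chosen | {c}) - objective(chosen))
            chosen.add(best)
        return chosen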
【Paper Link】 【Pages】:161-170
【Authors】: Zheng Han ; Haifeng Chen ; Tan Yan ; Geoff Jiang
【Abstract】: An accurate and automated identification of operational behavior switching is critical to the autonomic management of complex systems. In this paper, we collect sensor readings from those systems, which are treated as time series, and propose a solution to discover switching behaviors by inferring the relationship changes among massive time series. The method first learns a sequence of local relationship models that can best fit the time series data, and then combines the changes of local relationships to identify the system level behavior switching. In the local relationship modeling, we formulate the underlying switching identification as a segmentation problem, and propose a sophisticated optimization algorithm to accurately discover different segments in time series. In addition, we develop a hierarchical optimization strategy to further improve the efficiency of segmentation. To unveil the system level behavior switching, we present a density estimation and mode search algorithm to effectively aggregate the segmented local relationships so that the global switch points can be captured. Our method has been evaluated on both synthetic data and datasets from real systems. Experimental results demonstrate that it can successfully discover behavior switching in different systems.
【Keywords】: large-scale systems; optimisation; pattern classification; search problems; time series; complex physical systems; density estimation; global switch points; hierarchical optimization; mode search algorithm; segmentation problem; switching identification; system level behavior switching; time series data; time series segmentation; Aggregates; Complex systems; Complexity theory; Mathematical model; Optimization; Switches; Time series analysis; ADMM; Physical systems; optimization; segmentation; time series
【Paper Link】 【Pages】:171-180
【Authors】: Juhua Hu ; Qi Qian ; Jian Pei ; Rong Jin ; Shenghuo Zhu
【Abstract】: Multi-clustering, which tries to find multiple independent ways to partition a data set into groups, has enjoyed many applications, such as customer relationship management, bioinformatics and healthcare informatics. This paper addresses two fundamental questions in multi-clustering: how to model the quality of clusterings and how to find multiple stable clusterings. We introduce to multi-clustering the notion of clustering stability based on Laplacian eigengap, which was originally used in the regularized spectral learning method for similarity matrix learning. We mathematically prove that the larger the eigengap, the more stable the clustering. Consequently, we propose a novel multi-clustering method MSC (for Multiple Stable Clustering). An advantage of our method comparing to the existing multi-clustering methods is that our method does not need any parameter about the number of alternative clusterings in the data set. Our method can heuristically estimate the number of meaningful clusterings in a data set, which is infeasible in the existing multi-clustering methods. We report an empirical study that clearly demonstrates the effectiveness of our method.
【Keywords】: pattern clustering; Laplacian eigengap; MSC method; clustering quality model; eigengap; multiple stable clustering method; multiple stable clusterings; regularized spectral learning method; similarity matrix learning; Clustering algorithms; Clustering methods; Customer relationship management; Eigenvalues and eigenfunctions; Image color analysis; Laplace equations; Stability analysis; clustering stability; feature subspace; multi-clustering
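【Sketch】: A minimal sketch of the stability score, assuming the normalized graph Laplacian (the paper's regularized spectral learning differs in its details): the gap between the k-th and (k+1)-th smallest eigenvalues, where a larger gap indicates a more stable k-way clustering.

    import numpy as np

    def laplacian_eigengap(W, k):
        # W: symmetric nonnegative similarity matrix
        d = W.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
        vals = np.sort(np.linalg.eigvalsh(L))
        return vals[k] - vals[k - 1]   # eigengap for k clusters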
【Paper Link】 【Pages】:181-190
【Authors】: Jun Huang ; Guorong Li ; Qingming Huang ; Xindong Wu
【Abstract】: Binary relevance (BR) is a well-known framework for multi-label classification. It decomposes multi-label classification into binary (one-vs-rest) classification subproblems, one for each label. The BR approach is a simple and straightforward way to perform multi-label classification, but it still has several drawbacks. First, it does not consider label correlations. Second, each binary classifier may suffer from the issue of class imbalance. Third, it can become computationally unaffordable for data sets with many labels. Several remedies have been proposed to solve these problems by exploiting correlations between labels and performing label space dimension reduction. Meanwhile, inconsistency, another potential drawback of BR, is often ignored by researchers when they construct multi-label classification models. Inconsistency refers to the phenomenon that if an example belongs to more than one class label, then during the binary training stage it is considered as both a positive and a negative example simultaneously. This will mislead binary classifiers into learning suboptimal decision boundaries. In this paper, we seek to solve this problem by learning label-specific features for each label. We assume that each label is only associated with a subset of features from the original feature set, and that any two strongly correlated class labels can share more features with each other than two uncorrelated or weakly correlated ones. The proposed method can be applied as a feature selection method for multi-label learning and as a general strategy to improve multi-label classification algorithms comprising a number of binary classifiers. Comparison with state-of-the-art approaches demonstrates the competitive performance of our proposed method.
【Keywords】: feature selection; learning (artificial intelligence); pattern classification; BR approach; Binary relevance; binary classification; binary classifier; binary training stage; feature selection method; learning label specific feature; multilabel classification model; space dimension reduction; suboptimal decision boundary; Algorithm design and analysis; Bayes methods; Computers; Correlation; Data mining; Prediction algorithms; Training
【Paper Link】 【Pages】:191-200
【Authors】: Tsuyoshi Idé ; Amit Dhurandhar
【Abstract】: Supporting human decision making is a major goal of data mining. The more decision making is critical, the more interpretability is required in the predictive model. This paper proposes a new framework to build a fully interpretable predictive model for questionnaire data, while maintaining high prediction accuracy with regards to the final outcome. Such a model has applications in project risk assessment, in health care, in sentiment analysis and presumably in any real world application that relies on questionnaire data for informative and accurate prediction. Our framework is inspired by models in Item Response Theory (IRT), which were originally developed in psychometrics with applications to standardized tests such as SAT. We first extend these models, which are essentially unsupervised, to the supervised setting. We then derive a distance metric from the trained model to define the informativeness of individual question items. On real-world questionnaire data obtained from information technology projects, we demonstrate the power of this approach in terms of interpretability as well as predictability. To the best of our knowledge, this is the first work that leverages the IRT framework to provide informative and accurate prediction on ordinal questionnaire data.
【Keywords】: computability; data mining; decision making; psychometric testing; IRT; SAT; accurate prediction; data mining; health care; human decision making; information technology project; informative prediction; interpretability; item response theory; ordinal questionnaire data; predictability; prediction accuracy; predictive model; project risk assessment; psychometrics; sentiment analysis; Data mining; Decision making; Measurement; Medical services; Predictive models; Sentiment analysis; Standards; item response theory; metric learning; psychometrics; questionnaire data
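【Sketch】: For reference, the classic two-parameter logistic IRT model that this line of work starts from (the paper's supervised extension differs): the probability that respondent i answers item j positively is

    P(y_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}},

where θ_i is the respondent's latent trait, b_j the item's difficulty, and a_j its discrimination.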
【Paper Link】 【Pages】:201-210
【Authors】: Mohamed Kafsi ; Matthias Grossglauser ; Patrick Thiran
【Abstract】: We are interested in inferring the set of waypoints (or intermediate destinations) of a mobility trajectory in the absence of timing information. We find that, by mining a dataset of real mobility traces, computing the entropy of conditional Markov trajectories enables us to uncover waypoints, even though neither timing information nor absolute geographic location is provided. We build on this observation and design an efficient algorithm for trajectory segmentation. Our empirical evaluation demonstrates that the entropy-based heuristic used by our segmentation algorithm outperforms alternative approaches, as it is 43% more accurate than a geometric approach and 20% more accurate than a path-stretch-based approach. We further explore the link between trajectory entropy, mobility predictability, and the nature of intermediate locations using a route choice model on real city maps.
【Keywords】: Markov processes; entropy; travelling salesman problems; conditional Markov entropy; conditional Markov trajectory; mobility predictability; mobility trajectory; route choice model; trajectory entropy; trajectory segmentation; traveling salesman; Computational modeling; Entropy; Markov processes; Random variables; Timing; Trajectory; Uncertainty; Markov trajectories; Mobility; conditional entropy; trajectory segmentation
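【Sketch】: Only the local ingredient, as a hedged sketch: the paper works with the entropy of full conditional Markov trajectories between endpoints, whereas the snippet below merely scores the next-step uncertainty at each visited node, the intuition being that highly predictable stretches separate candidate waypoints.

    import numpy as np

    def stepwise_entropy(P, trajectory):
        # P: row-stochastic transition matrix; trajectory: list of node indices
        def H(p):
            p = p[p > 0]
            return float(-np.sum(p * np.log2(p)))   # entropy in bits
        return [H(P[node]) for node in trajectory[:-1]]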
【Paper Link】 【Pages】:211-220
【Authors】: Zhao Kang ; Chong Peng ; Qiang Cheng
【Abstract】: Numerous applications in data mining and machine learning require recovering a matrix of minimal rank. Robust principal component analysis (RPCA) is a general framework for handling this kind of problem. The nuclear norm based convex surrogate of the rank function in RPCA is widely investigated. Under certain assumptions, it can recover the underlying true low rank matrix with high probability. However, those assumptions may not hold in real-world applications. Since the nuclear norm approximates the rank by adding all singular values together, which is essentially an l1-norm of the singular values, the resulting approximation error is not trivial and thus the resulting matrix estimator can be significantly biased. To seek a closer approximation and to alleviate the above-mentioned limitations of the nuclear norm, we propose a nonconvex rank approximation. This approximation to the matrix rank is tighter than the nuclear norm. To solve the associated nonconvex minimization problem, we develop an efficient augmented Lagrange multiplier based optimization algorithm. Experimental results demonstrate that our method outperforms current state-of-the-art algorithms in both accuracy and efficiency.
【Keywords】: approximation theory; minimisation; pattern classification; principal component analysis; PCA; RPCA; augmented Lagrange multiplier; nonconvex minimization problem; nonconvex rank approximation; nuclear norm based convex surrogate; optimization algorithm; robust principal component analysis; Conferences; Data mining
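【Sketch】: The bias argument in singular-value terms: with σ(X) the vector of singular values,

    \mathrm{rank}(X) = \|\sigma(X)\|_0, \qquad \|X\|_* = \|\sigma(X)\|_1 = \sum_i \sigma_i(X),

so the nuclear norm is the l1 relaxation of an l0 quantity and over-penalizes large singular values; the paper's nonconvex surrogate (its exact form is not reproduced here) sits between the two, approximating the rank more tightly.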
【Paper Link】 【Pages】:221-230
【Authors】: Tobias Kötter ; Stephan Günnemann ; Michael R. Berthold ; Christos Faloutsos
【Abstract】: Given a large bipartite graph that represents objects and their properties, how can we automatically extract semantic information that provides an overview of the data and -- at the same time -- enables us to drill down to specific parts for an in-depth analysis? In this work, we propose extracting a taxonomy that models the relation between the properties via an is-a hierarchy. The extracted taxonomy arranges the properties from general to specific, providing different levels of abstraction. Our proposed method has the following desirable properties: (a) it requires no user-defined parameters, by exploiting the principle of minimum description length, (b) it is effective, by utilizing the inheritance of objects when representing the hierarchy, and (c) it is scalable, being linear in the number of edges. We demonstrate the effectiveness and scalability of our method on a broad spectrum of real, publicly available graphs from drug-property graphs to social networks with up to 22 million vertices and 286 million edges.
【Keywords】: feature extraction; graph theory; social networking (online); automatic taxonomy extraction; bipartite graphs; drug-property graphs; semantic information extraction; social networks; Animals; Bipartite graph; Chemistry; Data mining; Encoding; Physics; Taxonomy; MDL; graphs; taxonomies
【Paper Link】 【Pages】:231-240
【Authors】: Huayu Li ; Richang Hong ; Shiai Zhu ; Yong Ge
【Abstract】: With the rapid development of Location-based Social Network (LBSN) services, a large number of Points-Of-Interest (POIs) have become available, which consequently raises a great demand for building personalized POI recommender systems. A personalized POI recommender system can significantly assist users in finding their preferred POIs and help POI owners attract more customers. However, it is very challenging to develop a personalized POI recommender system because a user's checkin decision making process is very complex and can be influenced by many factors, such as the social network and geographical distance. In the literature, a variety of methods have been proposed to tackle this problem. Most of these methods model users' preference for POIs with integrated approaches and consider all candidate POIs as a whole space. However, by carefully examining longitudinal real-world checkin data, we find that the whole space of users' checkins actually consists of two parts: social friend space and user interest space. The social friend space denotes the set of POI candidates that users' friends have checked in at before, and the user interest space refers to the set of POI candidates that are similar to users' historical checkins but have not been visited by their friends yet. Along this line, we develop separate models for both spaces to recommend POIs. Specifically, in social friend space, we assume users would repeat their friends' historical POIs due to preference propagation through social networks, and propose a new Social Friend Probabilistic Matrix Factorization (SFPMF) model. In user interest space, we propose a new User Interest Probabilistic Matrix Factorization (UIPMF) model to capture the correlations between a new POI and a user's historical POIs. To evaluate the proposed models, we conduct extensive experiments with many state-of-the-art baseline methods and evaluation metrics on a real-world data set. The experimental results firmly demonstrate the effectiveness of our proposed models.
【Keywords】: decision making; matrix decomposition; probability; recommender systems; social networking (online); LBSN service; SFPMF model; UIPMF model; decision making process; evaluation metrics; location-based social network service; personalized POI recommender system; point-of-interest recommender system; social friend probabilistic matrix factorization model; social friend space; user interest probabilistic matrix factorization model; user interest space; Computational modeling; Correlation; Data models; Mathematical model; Probabilistic logic; Recommender systems; Social network services; Matrix Factorization; POI Recommendation; SFPMF; UIPMF
【Paper Link】 【Pages】:241-250
【Authors】: Huayu Li ; Rongcheng Lin ; Richang Hong ; Yong Ge
【Abstract】: A large number of online reviews have accumulated on the Web, on sites such as Amazon.com and Cnet.com. It is increasingly challenging for both consumers and firms to digest these reviews as their volume grows. A promising direction to ease this burden is to automatically identify aspects of a product and reveal each individual's ratings on them from these reviews. The identified and rated aspects can help consumers understand the pros and cons of a product and make purchase decisions, and help firms learn from user feedback and improve their products and marketing strategy. While different methods have been introduced to tackle this problem in the past, few of them successfully model the intrinsic connection between an aspect and its rating, particularly in short reviews. To this end, in this paper, we first propose the Aspect Identification and Rating (AIR) model to model observed textual reviews and overall ratings in a generative way, where the sampled aspect rating influences the sampling of sentimental words on this aspect. Furthermore, we enhance the AIR model to address one unique characteristic of short reviews -- that the aspects mentioned in reviews may be quite unbalanced -- and develop another model, namely AIRS. Within the AIRS model, we allow an aspect to directly affect the sampling of a latent rating on this aspect, in order to capture the mutual influence between an aspect and its rating through the whole generative process. Finally, we examine our two models and compare them with other methods on multiple real-world data sets, including hotel reviews, beer reviews, and app reviews. Experimental results clearly demonstrate the effectiveness and improvement of our models. Other potential applications driven by our results are also shown in the experiments.
【Keywords】: Web sites; data mining; marketing data processing; reviews; sentiment analysis; AIRS model; Web reviews; app reviews; aspect identification and rating model; aspect rating; beer reviews; generative models; hotel reviews; individual ratings; latent aspect mining; marketing strategy; online reviews; sampled aspect rating; sentimental word sampling; textual reviews; user feedbacks; Atmospheric modeling; Cameras; Data mining; Data models; Histograms; Indexes; Predictive models; Aspect Identification; Rating; Reviews
【Paper Link】 【Pages】:251-260
【Authors】: Yu-Kun Li ; Min-Ling Zhang ; Xin Geng
【Abstract】: In multi-label learning, each training example is represented by a single instance while being associated with multiple labels, and the task is to predict a set of relevant labels for the unseen instance. Existing approaches learn from multi-label data by assuming equal labeling-importance, i.e., all the associated labels are regarded as relevant while their relative importance for the training example is not differentiated. Nonetheless, this assumption fails to reflect the fact that the importance degree of each associated label is generally different, though the importance information is not explicitly accessible from the training examples. In this paper, we show that effective multi-label learning can be achieved by leveraging the implicit relative labeling-importance (RLI) information. Specifically, RLI degrees are formalized as a multinomial distribution over the label space, which is estimated by adapting an iterative label propagation procedure. After that, the multi-label prediction model is learned by fitting the estimated multinomial distribution, regularized with a popular multi-label empirical loss. Comprehensive experiments clearly validate the usefulness of leveraging implicit RLI information to learn from multi-label data.
【Keywords】: data analysis; iterative methods; learning (artificial intelligence); RLI degrees; implicit relative labeling-importance information; iterative label propagation; multilabel learning; multilabel prediction model; multinomial distribution; Estimation; Predictive models; Reliability; Semantics; Symmetric matrices; Training; Yttrium; label distribution; multi-label learning; relative labeling-importance
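【Sketch】: A generic iterative label propagation step in the spirit described above (a minimal sketch; the paper's adapted procedure may differ): smooth the 0/1 label matrix over a similarity graph, then renormalize each row into a multinomial over the label space as the RLI estimate.

    import numpy as np

    def propagate_label_importance(S, Y, alpha=0.5, iters=50):
        # S: n x n nonnegative similarity matrix; Y: n x q binary label matrix
        # (each example assumed to have at least one relevant label)
        P = S / S.sum(axis=1, keepdims=True)
        F = Y.astype(float)
        for _ in range(iters):
            F = alpha * (P @ F) + (1 - alpha) * Y   # propagate, then clamp to Y
        return F / F.sum(axis=1, keepdims=True)     # per-example label distribution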
【Paper Link】 【Pages】:261-270
【Authors】: Defu Lian ; Yong Ge ; Fuzheng Zhang ; Nicholas Jing Yuan ; Xing Xie ; Tao Zhou ; Yong Rui
【Abstract】: Location recommendation plays an essential role in helping people find places they are likely to enjoy. Though some recent research has studied how to recommend locations in the presence of social network and geographical information, few studies have addressed the cold-start problem, specifically, recommending locations for new users. Because visits to locations are often shared on social networks, rich semantics (e.g., tweets) that reveal a person's interests can be leveraged to tackle this challenge. A typical way is to feed them into traditional explicit-feedback content-aware recommendation methods (e.g., LibFM). As a user's negative preferences are not explicitly observable in most human mobility data, these methods need to draw negative samples for better learning performance. However, prior studies have empirically shown that sampling-based methods do not perform as well as a method that considers all unvisited locations as negative but assigns them a lower confidence. To this end, we propose an Implicit-feedback based Content-aware Collaborative Filtering (ICCF) framework to incorporate semantic content and steer clear of negative sampling. For efficient parameter learning, we develop a scalable optimization algorithm, scaling linearly with the data size and the feature size. Furthermore, we offer a good explanation of ICCF, showing that the semantic content is actually used to refine user similarity based on mobility. Finally, we evaluate ICCF on a large-scale LBSN dataset in which users have profiles and text content. The results show that ICCF outperforms LibFM under its best configuration, and that user profiles and text content are not only effective at improving recommendation but also helpful for coping with the cold-start problem.
【Keywords】: collaborative filtering; mobile computing; optimisation; recommender systems; social networking (online); ICCF; cold-start problem; data size; feature size; geographical information; human mobility data; implicit-feedback based content-aware collaborative filtering framework; large-scale LBSN dataset; learning performance; location recommendation; negative sampling; parameter learning; person interests; sampling-based methods; scalable optimization algorithm; semantic content; social network; text content; user negative preferences; user similarity; Collaboration; Filtering; Optimization; Semantics; Sparse matrices; Twitter; Content aware; Implicit feedback; Location recommendation
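【Sketch】: The "all unvisited locations as negative with lower confidence" approach referenced above is the standard weighted implicit-feedback factorization objective,

    \min_{U,V} \sum_{u,i} c_{ui} \left( p_{ui} - \mathbf{u}_u^\top \mathbf{v}_i \right)^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right),

with preference p_{ui} = 1 for visited locations and 0 otherwise, and confidence c_{ui} large for visits but small (not zero) for non-visits; per the abstract, ICCF additionally incorporates semantic content features into this implicit-feedback setting.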
【Paper Link】 【Pages】:271-280
【Authors】: Liyuan Liu ; Linli Xu ; Zhen Wang ; Enhong Chen
【Abstract】: With the recent advances in information networks, the problem of identifying group structure or communities has received a significant amount of attention. Most of the existing principles of community detection or clustering mainly focus on either the topological structure of a network or the node attributes separately, while both of the two aspects provide valuable information to characterize the nature of communities. In this paper we combine the topological structure of a network as well as the content information of nodes in the task of detecting communities in information networks. Specifically, we treat a network as a dynamic system and consider its community structure as a consequence of interactions among nodes. To model the interactions we introduce the principle of content propagation and integrate the aspects of structure and content in a network naturally. We further describe the interactions among nodes in two different ways, including a linear model to approximate influence propagation, and modeling the interactions directly with random walk. Based on interaction modeling, the nature of communities is described by analyzing the stable status of the dynamic system. Extensive experimental results on benchmark datasets demonstrate the superiority of the proposed framework over the state of the art.
【Keywords】: graph theory; information networks; community detection; community structure; content propagation; information network; interaction modeling; topological structure; Analytical models; Benchmark testing; Computational modeling; Data mining; Integrated circuit modeling; Probabilistic logic; Probability; community detection; content propagation; information networks
【Paper Link】 【Pages】:281-290
【Authors】: Qi Liu ; Xianyu Zeng ; Chuanren Liu ; Hengshu Zhu ; Enhong Chen ; Hui Xiong ; Xing Xie
【Abstract】: In the retail market, consumers' indecisiveness refers to the inability to make quick and assertive decisions when choosing among competing product options. Indecisiveness has indeed been investigated in a number of fields, such as economics and psychology. However, those studies are usually based on subjective customer survey data with manually defined questions. Instead, in this paper, we provide a focused study on automatically mining indecisiveness from massive customer behaviors in online stores. Specifically, we first give a general definition to measure the observed indecisiveness in each behavior session. From the observed indecisiveness, we then learn the latent factors/reasons with a probabilistic factor-based model. These two factors are the indecisive indexes of the customers and of the product bundles, respectively. Next, we demonstrate that this indecisiveness mining process can be useful in several potential applications, such as competitive product detection and personalized product bundle recommendation. Finally, we perform extensive experiments on large-scale behavioral logs of online customers in a distributed environment. The results reveal that our measurement of indecisiveness agrees with common sense assessments, and the discoveries are useful in predicting customer behaviors and providing better recommendation services for both customers and online retailers.
【Keywords】: Internet; consumer behaviour; data mining; probability; retail data processing; common sense assessment; competing product options; competitive product detection; consumer indecisiveness; customer behavior indecisiveness mining; distributed environment; large-scale behavioral logs; latent factors; online customers; online retailers; online stores; personalized product bundle recommendation; probabilistic factor-based model; retail market; subjective customer survey data; Data mining; Decision making; Indexes; Manuals; Probabilistic logic; Psychology; Competition detection; Customer Behaviors; Data-driven; Indecisiveness; Item Bundle; Recommendation
【Paper Link】 【Pages】:291-300
【Authors】: Rui Liu ; Wei Cheng ; Hanghang Tong ; Wei Wang ; Xiang Zhang
【Abstract】: Network clustering is an important problem that has recently drawn a lot of attention. Most existing work focuses on clustering nodes within a single network. In many applications, however, there exist multiple related networks, in which each network may be constructed from a different domain and instances in one domain may be related to instances in other domains. In this paper, we propose a robust algorithm, MCA, for multi-network clustering that takes into account cross-domain relationships between instances. MCA has several advantages over existing single-network clustering methods. First, it is able to detect associations between clusters from different domains, which is not addressed by any existing method. Second, it achieves more consistent clustering results on multiple networks by leveraging the duality between clustering individual networks and inferring cross-network cluster alignment. Finally, it provides a multi-network clustering solution that is more robust to noise and errors. We perform extensive experiments on a variety of real and synthetic networks to demonstrate the effectiveness and efficiency of MCA.
【Keywords】: duality (mathematics); pattern clustering; MCA; cross-domain relationships; cross-network cluster alignment; duality; joint cross-domain cluster alignment; multinetwork clustering via cluster alignment; Algorithm design and analysis; Clustering algorithms; Clustering methods; Computer science; Linear programming; Optimization; Robustness; Graph Clustering; Multi-network
【Paper Link】 【Pages】:301-309
【Authors】: Chunchuan Lyu ; Kaizhu Huang ; Hai-Ning Liang
【Abstract】: Adversarial examples are augmented data points generated by imperceptible perturbation of input samples. They have recently drawn much attention within the machine learning and data mining communities. Being difficult to distinguish from real examples, such adversarial examples can change the prediction of many of the best learning models, including state-of-the-art deep learning models. Recent attempts have been made to build robust models that take adversarial examples into account. However, these methods either suffer performance drops or lack mathematical motivation. In this paper, we propose a unified framework to build robust machine learning models against adversarial examples. More specifically, using the unified framework, we develop a family of gradient regularization methods that effectively penalize the gradient of the loss function w.r.t. the inputs. Our proposed framework is appealing in that it offers a unified view for dealing with adversarial examples, and it incorporates another recently proposed perturbation-based approach as a special case. In addition, we present visual effects that reveal semantic meaning in these perturbations, which supports our regularization method and provides another explanation for the generalizability of adversarial examples. By applying this technique to Maxout networks, we conduct a series of experiments and achieve encouraging results on two benchmark datasets. In particular, we attain the best accuracy on MNIST data (without data augmentation) and competitive performance on CIFAR-10 data.
【Keywords】: data mining; learning (artificial intelligence); CIFAR-10 data; MNIST data; Maxout networks; augmented data points; data augmentation; data mining community; deep learning models; gradient regularization methods; imperceptible perturbation; loss function; machine learning models; mathematical motivations; perturbation based approach; unified framework; unified gradient regularization family; Approximation methods; Data mining; Mathematical model; Optimization; Predictive models; Robustness; Training; Adversarial examples; Deep learning; Regularization; Robust classification
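The penalty at the heart of this framework, the squared norm of the loss gradient with respect to the inputs, can be written out exactly for logistic regression. The following NumPy sketch is illustrative only; the plain-logistic setting, function names, and step size are assumptions, not the paper's Maxout setup.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_reg_step(w, X, y, lam=0.1, lr=0.1):
        """One gradient step on logistic loss plus lam * ||d loss / d input||^2.
        X: (n, d) inputs; y: (n,) labels in {-1, +1}; w: (d,) weights."""
        m = y * (X @ w)                  # margins
        s = sigmoid(-m)                  # per-sample |d loss / d margin|
        g_loss = -(s * y) @ X / len(y)   # usual logistic-loss gradient
        # For logistic loss, d loss / d x_i = -y_i * s_i * w, so the penalty
        # is s_i^2 * ||w||^2; differentiate it w.r.t. w by the chain rule:
        w_sq = w @ w
        g_pen = (-2.0 * ((s**2 * (1 - s) * y) @ X) * w_sq / len(y)
                 + 2.0 * np.mean(s**2) * w)
        return w - lr * (g_loss + lam * g_pen)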
【Paper Link】 【Pages】:310-319
【Authors】: Qi Mao ; Wei Zheng ; Li Wang ; Yunpeng Cai ; Volker Mai ; Yijun Sun
【Abstract】: The rapid development of sequencing technology has led to an explosive accumulation of genomics data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, the standard hierarchical clustering method scales poorly due to its quadratic time and space complexities, stemming mainly from the need to compute and store a pairwise distance matrix. It is thus necessary to minimize the number of pairwise distances computed without degrading clustering performance. On the other hand, as high-performance computing systems become widely accessible, it is highly desirable that a clustering method can be easily adapted to parallel computing environments for further speedup, which is not a trivial task for hierarchical clustering. We propose a new hierarchical clustering method that achieves good clustering performance and high scalability on large sequence datasets. It consists of two stages. In the first stage, a new landmark-based active hierarchical divisive clustering method partitions a large-scale sequence dataset into groups; in the second stage, a fast hierarchical agglomerative clustering method is applied within each group. By assembling hierarchies from both stages, the hierarchy of the data can be easily recovered. Theoretical results show that our method can recover the true hierarchy with high probability under mild conditions and has a linearithmic time complexity with respect to the number of input sequences. The proposed method also facilitates an efficient parallel implementation. Empirical results on various datasets show that our method achieves clustering accuracy comparable to ESPRIT-Tree and runs faster than greedy heuristic methods.
【Keywords】: bioinformatics; computational complexity; data analysis; genomics; parallel programming; pattern clustering; probability; ESPRIT-Tree; data hierarchy recovery; fast hierarchical agglomerative clustering method; genomics data; landmark-based active hierarchical divisive clustering method; large-scale sequence analysis; large-scale sequence dataset partitioning; linearithmic time complexity; pairwise distance matrix; parallel computing environment; parallel hierarchical clustering; parallel implementation; probability; quadratic time complexity; sequencing technology; space complexity; Algorithm design and analysis; Clustering algorithms; Clustering methods; Parallel processing; Sequences; Time complexity
【Paper Link】 【Pages】:320-329
【Authors】: Takazumi Matsumoto ; Man Lung Yiu
【Abstract】: In recent years, the use of Graphics Processing Units (GPUs) for data mining tasks has become popular. With modern processors integrating both CPUs and GPUs, it is also important to consider which tasks benefit from GPU processing and which do not, and to apply a heterogeneous processing approach to improve efficiency where applicable. Similarity search, also known as k-nearest neighbor search, is a key part of data mining applications and is also used extensively in applications such as multimedia search, where only a small subset of the possible results is needed. Our contribution is a new exact kNN algorithm with a compressed partial heapsort that outperforms other state-of-the-art exact kNN algorithms by leveraging both the GPU and CPU.
【Keywords】: data mining; graphics processing units; learning (artificial intelligence); multiprocessing systems; pattern classification; search problems; CPU-GPU systems; GPU processing; compressed partial heapsort; data mining applications; data mining tasks; graphics processing units; heterogeneous processing approach; k-nearest neighbor search; kNN algorithm; modern processors; multimedia search; similarity search; Approximation methods; Data mining; Force; Graphics processing units; Instruction sets; Sorting; GPU; data mining; heapsort; heterogeneous processing; nearest neighbor; parallel; similarity search
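On the CPU side, the top-k selection that motivates a partial heapsort can be illustrated with a bounded heap. The sketch below is a plain sequential analogue with illustrative names; the paper's compressed, GPU-resident variant is more involved.

    import heapq
    import numpy as np

    def knn_partial_heap(query, data, k):
        """Exact kNN that maintains a k-element max-heap of candidates
        instead of sorting all n distances."""
        heap = []                                  # entries: (-distance, index)
        for i, x in enumerate(data):
            d = float(np.sum((query - x) ** 2))
            if len(heap) < k:
                heapq.heappush(heap, (-d, i))
            elif d < -heap[0][0]:                  # closer than current k-th
                heapq.heapreplace(heap, (-d, i))
        return sorted((-nd, i) for nd, i in heap)  # (distance, index), ascending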
【Paper Link】 【Pages】:330-338
【Authors】: Yun-Qian Miao ; Ahmed K. Farahat ; Mohamed S. Kamel
【Abstract】: Kernel Mean Matching (KMM) is an elegant algorithm that produces density ratios between training and test data by minimizing their maximum mean discrepancy in a kernel space. The applicability of KMM to large-scale problems is, however, hindered by the quadratic complexity of calculating and storing the kernel matrices over the training and test data. To address this problem, this paper proposes a novel ensemble algorithm for KMM, which divides the test samples into smaller partitions, estimates a density ratio for each partition, and then fuses these local estimates with a weighted sum. Our theoretical analysis shows that the ensemble KMM has a lower error bound than the centralized KMM, which uses all the test data at once to estimate the density ratio. Considering its suitability for distributed implementation, the proposed algorithm is also favorable in terms of time and space complexity. Experiments on benchmark datasets confirm the superiority of the proposed algorithm in terms of estimation accuracy and running time.
【Keywords】: computational complexity; data mining; pattern matching; KMM; data mining; density ratio; elegant algorithm; kernel mean matching; large-scale problem; quadratic complexity; space complexity; theoretical analysis; time complexity; Algorithm design and analysis; Complexity theory; Density functional theory; Estimation; Kernel; Partitioning algorithms; Training; Density ratio estimation; Distributed algorithm; Ensemble method; Kernel mean matching
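The ensemble structure is easy to sketch. In the snippet below, the local solver is a simplified, ridge-regularized, unconstrained mean-matching step rather than the full KMM quadratic program, and all names and parameter choices are illustrative assumptions.

    import numpy as np

    def rbf(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def ensemble_kmm(X_tr, X_te, n_parts=4, lam=1e-3, gamma=1.0, seed=0):
        """Partition the test set, estimate a density ratio per partition,
        and fuse the local estimates with size-proportional weights."""
        n_tr = len(X_tr)
        K = rbf(X_tr, X_tr, gamma)                          # built only once
        rng = np.random.default_rng(seed)
        parts = np.array_split(rng.permutation(len(X_te)), n_parts)
        beta = np.zeros(n_tr)
        for idx in parts:
            # kappa_i = (n_tr / |part|) * sum_j k(x_i_train, x_j_test)
            kappa = (n_tr / len(idx)) * rbf(X_tr, X_te[idx], gamma).sum(axis=1)
            beta_p = np.linalg.solve(K + lam * np.eye(n_tr), kappa)
            beta += (len(idx) / len(X_te)) * beta_p         # weighted fusion
        return np.maximum(beta, 0.0)                        # ratios are nonnegative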
【Paper Link】 【Pages】:339-348
【Authors】: Leto Peel ; Aaron Clauset
【Abstract】: Professional team sports provide an excellent domain for studying the dynamics of social competitions. These games are constructed with simple, well-defined rules and payoffs that admit a high-dimensional set of possible actions and nontrivial scoring dynamics. The resulting gameplay and efforts to predict its evolution are the object of great interest to both sports professionals and enthusiasts. In this paper, we consider two online prediction problems for team sports: given a partially observed game, who will score next? and, ultimately, who will win? We present novel interpretable generative models of within-game scoring that allow for dependence on lead size (restoration) and on the last team to score (anti-persistence). We then apply these models to comprehensive within-game scoring data for four sports leagues over a ten-year period. By assessing these models' relative goodness-of-fit, we shed new light on the underlying mechanisms driving the observed scoring dynamics of each sport. Furthermore, in both predictive tasks our models consistently outperform baseline models, and they provide quantitative assessments of latent team skill over time.
【Keywords】: data handling; sport; statistical analysis; goodness-of-fit; online prediction problems; professional team sports; quantitative assessment; social competition dynamics; sports scoring dynamics prediction; within-game scoring data; Clocks; Computational modeling; Games; History; Lead; Predictive models; Probabilistic logic
【Paper Link】 【Pages】:349-358
【Authors】: Chenwei Ran ; Wei Shen ; Jianyong Wang ; Xuan Zhu
【Abstract】: A knowledge base is a machine-readable collection of knowledge. More and more multi-domain, large-scale knowledge bases have emerged in recent years, and they play an essential role in many information systems and semantic annotation tasks. However, no knowledge base is complete, and perhaps none ever will be, because all knowledge bases have limited coverage while new knowledge continues to emerge. Therefore, populating and enriching existing knowledge bases are important tasks. Traditional knowledge base population usually leverages the information embedded in unstructured free text. Recently, researchers found that the massive structured tables on the Web are high-quality relational data that are easier to utilize than unstructured text. Our goal in this paper is to enrich the knowledge base using Wikipedia tables. Here, knowledge means binary relations between entities, and we focus on relations in specific domains. There are two basic types of information that can be used in this task: the existing relation instances and the connection between types and relations. We first propose two basic probabilistic models based on these two types of information, respectively. Then we propose a light-weight aggregated model that combines the advantages of the basic models. The experimental results show that our method is an effective approach to enriching the knowledge base with both high precision and recall.
【Keywords】: Web sites; knowledge based systems; probability; text analysis; Wikipedia tables; binary relations; domain-specific knowledge base enrichment; high-quality relational data; light-weight aggregated model; probabilistic models; unstructured free text; Databases; Electronic publishing; Encyclopedias; Internet; Knowledge based systems; Semantics; Knowledge base enrichment; Relation extraction; Web tables
【Paper Link】 【Pages】:359-368
【Authors】: Saber Salah ; Reza Akbarinia ; Florent Masseglia
【Abstract】: The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only a few solutions scale. This is particularly the case when i) the data set is massive, calling for large-scale distribution, and/or ii) the length k of the informative itemset to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative K-ItemSet), a highly scalable, parallel miki mining algorithm. PHIKS renders the mining process of large-scale databases (up to terabytes of data) succinct and effective. Its mining process is made up of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of miki having different sizes, which drastically reduces the execution time of the mining process. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal through the significant scale-up obtained with high itemset lengths and over very large databases.
【Keywords】: Big Data; data analysis; data mining; entropy; parallel processing; PHIKS; big data; data analytics; information retrieval; large-scale distribution; maximally informative-itemset discovery; parallel maximally informative k-itemset mining; parallel miki mining algorithm; parallel mining process; real-world data sets; Data mining; Entropy; Information retrieval; Itemsets; Optimization; informative itemsets; joint entropy; mapreduce; massive distribution; spark
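The objective being maximized here, the joint entropy of a k-itemset over the transaction data, is straightforward to compute for a single candidate itemset; the function and variable names below are illustrative.

    import numpy as np
    from collections import Counter

    def joint_entropy(D, itemset):
        """Joint entropy (bits) of a k-itemset over a binary transaction
        matrix D (rows are transactions, columns are items)."""
        patterns = map(tuple, D[:, list(itemset)])    # 0/1 pattern per row
        counts = np.array(list(Counter(patterns).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    # A maximally informative k-itemset splits the transactions most evenly:
    D = np.random.randint(0, 2, size=(1000, 6))
    print(joint_entropy(D, (0, 1, 2)))   # near 3 bits for independent items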
【Paper Link】 【Pages】:369-378
【Authors】: Shahriar Shariat ; Burkay Orten ; Ali Dasdan
【Abstract】: Online media provides opportunities for marketers through which they can deliver effective brand messages to a wide range of audiences at scale. Advertising technology platforms enable advertisers to reach their target audience by delivering ad impressions to online users in real time. In order to identify the best marketing message for a user and to purchase impressions at the right price, we rely heavily on bid prediction and optimization models. Even though the bid prediction models are well studied in the literature, the equally important subject of model evaluation is usually overlooked or not discussed in detail. Effective and reliable evaluation of an online bidding model is crucial for making faster model improvements as well as for utilizing the marketing budgets more efficiently. In this paper, we present an experimentation framework for bid prediction models where our focus is on the practical aspects of model evaluation. Specifically, we outline the unique challenges we encounter in our platform due to a variety of factors such as heterogeneous goal definitions, varying budget requirements across different campaigns, high seasonality and the auction-based environment for inventory purchasing. Then, we introduce return on investment (ROI) as a unified model performance (i.e., success) metric and explain its merits over more traditional metrics such as click-through rate (CTR) or conversion rate (CVR). Most importantly, we discuss commonly used evaluation and metric summarization approaches in detail and propose a more accurate method for online evaluation of new experimental models against the baseline. Our meta-analysis-based approach addresses various shortcomings of other methods and yields statistically robust conclusions that allow us to conclude experiments more quickly in a reliable manner. We demonstrate the effectiveness of our evaluation strategy on real campaign data through some experiments.
【Keywords】: advertising data processing; cost-benefit analysis; investment; optimisation; statistical analysis; tendering; ROI; advertising technology platform; auction-based environment; bid optimization model; bid prediction model; brand message; inventory purchasing; large-scale computational advertising platform; marketing budget; marketing message; meta-analysis-based approach; metric summarization approach; online bidding model; online model evaluation; return-on-investment; Adaptation models; Advertising; Analytical models; Measurement; Predictive models; Real-time systems; Computational Advertising; Multiple Statistical Test; RTB; Statistical Test
【Paper Link】 【Pages】:379-388
【Authors】: Lei Shi ; Hanghang Tong ; Xinzhu Mu
【Abstract】: Why are some people more creative than others? How do human brain networks evolve over time? A key stepping stone to both mysteries, and many more, is to compare weighted brain networks. In contrast to networks arising from other application domains, the brain network exhibits its own characteristics (e.g., high density, indistinguishability), which make off-the-shelf data mining algorithms as well as visualization tools sub-optimal or even misleading. In this paper, we propose a shift from the current mining-then-visualization paradigm to jointly modeling these two core building blocks (i.e., mining and visualization) for brain network comparison. The key idea is to integrate the human perception constraint into the mining block early, so as to guide the analysis process. We formulate this as a multi-objective feature selection problem, and propose an integrated framework, BrainQuest, to solve it. We perform extensive empirical evaluations, both quantitative and qualitative, to demonstrate the effectiveness and efficiency of our approach.
【Keywords】: data mining; data visualisation; feature selection; medical computing; visual perception; BrainQuest; data mining algorithm; mining-then-visualization paradigm; multiobjective feature selection problem; perception-guided brain network comparison; Brain modeling; Correlation; Data mining; Indexes; Network topology; Topology; Visualization
【Paper Link】 【Pages】:389-398
【Authors】: Dongjin Song ; David A. Meyer ; Dacheng Tao
【Abstract】: Inferring potential links is a fundamental problem in social networks. In the link recommendation problem, the aim is to suggest a list of potential people to each user, ordered by the preferences of the user. Although various approaches have been developed to solve this problem, the difficulty of producing a ranking list with high precision at the top -- the most important consideration for real world applications -- remains largely an open problem. In this work, we propose two top-k link recommendation algorithms which focus on optimizing the top ranked links. For this purpose, we define a cost-sensitive ranking loss which penalizes the mistakes at the top of a ranked list more than the mistakes at the bottom. In particular, we propose a log loss, derive its surrogate, and formulate a top-k link recommendation model by optimizing this surrogate loss function based upon latent features. Moreover, we extend this top-k link recommendation model by incorporating both the latent features and explicit features of the network. Finally, an efficient learning scheme to learn the model parameters is provided. We conduct empirical studies based upon four real world datasets, i.e., Wikipedia, CondMat, Epinions, and MovieLens 1M, of which the largest network contains more than 70 thousand nodes and over one million links. Our experiments demonstrate that the proposed algorithms outperform several state-of-the-art methods.
【Keywords】: learning (artificial intelligence); recommender systems; social networking (online); CondMat; Epinions; MovieLens 1M; Wikipedia; cost-sensitive ranking; social network; top-k link recommendation algorithm; Electronic mail; Electronic publishing; Encyclopedias; Feature extraction; Internet; Social network services; Link recommendation; cost-sensitive ranking loss; social networks; top-k learning to rank
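One way to see what "penalizing mistakes at the top more" means is a pairwise log loss whose pair weights decay with the rank of the offending non-link. The sketch below only illustrates this weighting idea with assumed names; the paper's actual log loss and its derived surrogate differ.

    import numpy as np

    def top_heavy_pairwise_loss(scores, labels):
        """Pairwise log loss in which a violated (link, non-link) pair is
        weighted by the non-link's current rank, so top mistakes cost more."""
        order = np.argsort(-scores)                        # best-ranked first
        rank = np.empty_like(order)
        rank[order] = np.arange(1, len(scores) + 1)
        pos = np.where(labels == 1)[0]
        neg = np.where(labels == 0)[0]
        loss = 0.0
        for j in neg:
            w = 1.0 / np.log2(rank[j] + 1.0)               # largest near the top
            for i in pos:
                loss += w * np.log1p(np.exp(scores[j] - scores[i]))
        return loss / max(len(pos) * len(neg), 1)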
【Paper Link】 【Pages】:399-408
【Authors】: Constantine Alexander Vitt ; Hui Xiong
【Abstract】: Patent data has been used to generate patent-based indicators for tracking the technology development of high-tech companies. In this paper, we further show the promise of exploiting patent data for the analysis and prospecting of high-tech companies in the stock market. Specifically, we aim at investigating the relationship between the patent activities of high-tech companies and the dynamics of their stock price movement. While stock forecasting is a topic of general interest and has been studied extensively in the literature, the most popular forecasting models do not facilitate the discovery of the patent-activity impact on the essential characteristics of the market performance of a given stock. To this end, we propose a new approach to analyze the relationships between patent activities and the statistical characteristics of stock prices. To the best of our knowledge, we are the first to propose a model of this nature, relating patent data mining and financial modeling. Also, we demonstrate the relationships between the market-adjusted stock returns and the number of patent applications as well as the diversity of the corresponding patent categories. Moreover, we establish relationships between the monthly drift and volatility of the market-adjusted stock returns and those patent activity indicators. Here, we exploit a widely accepted diffusion model of the stock returns and estimate its parameters. By adopting the moving window technique, we create fitted models by introducing various lagged terms of patent activity characteristics. For each company, we consider the coefficients of each significant term over the entire time horizon and perform further statistical testing on the overall significance of the corresponding indicator. The analysis has been performed on real-world stock trading data as well as patent data. The results confirm the significant impact of patent activity on stock movement and on its essential statistical characteristics of drift and volatility.
【Keywords】: data analysis; data mining; economic indicators; financial data processing; patents; statistical analysis; stock markets; diffusion model; financial modeling; high-tech companies; market-adjusted stock returns; monthly drift; monthly volatility; moving window technique; parameter estimation; patent activity characteristics; patent activity indicators; patent applications; patent categories; patent data mining; patent-activity impact discovery; real-world stock trading data; statistical characteristics; stock forecasting; stock market performance; stock price movement; stock returns; technology development; Biological system modeling; Companies; Data mining; Data models; Patents; Stock markets; Technological innovation; Factor Analysis; Patent Data Mining; Time-series Analysis
【Paper Link】 【Pages】:409-418
【Authors】: Isabel Valera ; Manuel Gomez-Rodriguez
【Abstract】: The emergence and wide-spread use of online social networks has led to a dramatic increase in the availability of social activity data. Importantly, this data can be exploited to investigate, at a microscopic level, some of the problems that have captured the attention of economists, marketers and sociologists for decades, such as product adoption, usage and competition. In this paper, we propose a continuous-time probabilistic model, based on temporal point processes, for the adoption and frequency of use of competing products, where the frequency of use of one product can be modulated by those of others. This model allows us to efficiently simulate the adoption and recurrent usage of competing products, and to generate traces in which we can easily recognize the effects of social influence, recency and competition. We then develop an inference method to efficiently fit the model parameters by solving a convex program. The problem decouples into a collection of smaller subproblems, thus scaling easily to networks with hundreds of thousands of nodes. We validate our model over synthetic and real diffusion data gathered from Twitter, and show that the proposed model not only provides a good fit to the data and more accurate predictions than alternatives, but also yields interpretable model parameters, which allow us to gain insight into some of the factors driving product adoption and frequency of use.
【Keywords】: social networking (online); Twitter; competing products; continuous-time probabilistic model; convex program; microscopic level; modeling adoption; online social networks; product adoption; real diffusion data; social activity data; social influence; temporal point processes; Computational modeling; Data models; History; Mathematical model; Microscopy; Predictive models; Social network services
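A single-product, exponential-kernel member of this model family can be simulated with Ogata's thinning algorithm, as sketched below. Parameter names are illustrative, and the paper's full model additionally couples the intensities of competing products.

    import numpy as np

    def simulate_usage(mu, alpha, omega, T, seed=0):
        """Ogata thinning for the self-exciting intensity
        lambda(t) = mu + alpha * sum over past events of exp(-omega*(t - t_i))."""
        rng = np.random.default_rng(seed)
        t, events = 0.0, []
        while True:
            # Just after t, the (decaying) intensity upper-bounds itself
            # until the next event, so it is a valid thinning bound.
            lam_bar = mu + alpha * sum(np.exp(-omega * (t - ti)) for ti in events)
            t += rng.exponential(1.0 / lam_bar)
            if t > T:
                return np.array(events)
            lam_t = mu + alpha * sum(np.exp(-omega * (t - ti)) for ti in events)
            if rng.uniform() * lam_bar <= lam_t:    # accept w.p. lam_t / lam_bar
                events.append(t)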
【Paper Link】 【Pages】:419-428
【Authors】: Hanmo Wang ; Liang Du ; Peng Zhou ; Lei Shi ; Yuhua Qian ; Yi-Dong Shen
【Abstract】: In classification tasks, labeled data is a necessity but sometimes difficult or expensive to obtain. On the contrary, unlabeled data is usually abundant. Recently, different active learning algorithms have been proposed to alleviate this issue by selecting the most informative data points to label. One family of active learning methods comes from Optimum Experimental Design (OED) in statistics. Instead of selecting data points one by one iteratively, OED-based approaches select data in a one-shot manner, that is, a fixed-size subset is selected from the unlabeled dataset for manual labeling. These methods usually use kernels to represent pair-wise similarities between different data points. It is well known that choosing optimal kernel types (e.g., Gaussian kernel) and kernel parameters (e.g., kernel width) is tricky, and a common way to resolve this is Multiple Kernel Learning (MKL), i.e., constructing a few candidate kernels and merging them to form a consensus kernel. There are different ways to combine multiple kernels; one of them, called the globalized approach, assigns a weight to each candidate kernel. In practice, different data points in the same candidate kernel may not contribute equally to the consensus kernel; this requires assigning different weights to different data points within the same candidate kernel, leading to the localized approach. In this paper, we introduce MKL to OED-based active learning; specifically, we propose globalized and localized multiple kernel active learning methods, respectively. Our experiments on six benchmark datasets demonstrate that the proposed methods perform better than existing OED-based active learning methods.
【Keywords】: Gaussian processes; data handling; design of experiments; learning (artificial intelligence); pattern classification; Gaussian kernel; MKL; OED-based active learning; active learning algorithm; candidate kernel weight assignment; classification task; consensus kernel; informative data points; kernel parameter; kernel width; labeled data; multiple kernel learning; optimal kernel type; optimum experimental design; pair-wise similarity representation; statistics; Algorithm design and analysis; Data mining; Kernel; Learning systems; Robustness; Training; Uncertainty; Active learning; Experimental Design; Multiple Kernels
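The two combination schemes can be sketched directly. In the localized case below, the per-point weights are applied symmetrically so the consensus kernel stays symmetric and PSD; this symmetric instantiation is an assumption for illustration, not necessarily the paper's exact construction.

    import numpy as np

    def consensus_kernel(kernels, global_w=None, local_W=None):
        """Merge candidate kernels K_1..K_m into one consensus kernel.
        global_w: (m,) one weight per kernel (globalized combination).
        local_W : (m, n) one weight per kernel per data point (localized
        combination), applied symmetrically on both index positions."""
        if global_w is not None:
            return sum(w * K for w, K in zip(global_w, kernels))
        K_con = np.zeros_like(kernels[0])
        for K, w in zip(kernels, local_W):
            K_con += np.outer(w, w) * K          # w_i * w_j * K_ij
        return K_con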
【Paper Link】 【Pages】:429-438
【Authors】: Jingjing Wang ; Wenzhu Tong ; Hongkun Yu ; Min Li ; Xiuli Ma ; Haoyan Cai ; Tim Hanratty ; Jiawei Han
【Abstract】: A major event often has repercussions on both news media and microblogging sites such as Twitter. Reports from mainstream news agencies and discussions from Twitter complement each other to form a complete picture. An event can have multiple aspects (sub-events) describing it from multiple angles, each of which attracts opinions/comments posted on Twitter. Mining such reflections is interesting to both policy makers and ordinary people seeking information. In this paper, we propose a unified framework to mine multi-aspect reflections of news events in Twitter. We propose a novel and efficient dynamic hierarchical entity-aware event discovery model to learn news events and their multiple aspects. The aspects of an event are linked to their reflections in Twitter by a bootstrapped dataless classification scheme, which elegantly handles the challenges of selecting informative tweets under overwhelming noise and bridging the vocabularies of news and tweets. In addition, we demonstrate that our framework naturally generates an informative presentation of each event with entity graphs, time spans, news summaries and tweet highlights to facilitate user digestion.
【Keywords】: data mining; electronic publishing; graph theory; social networking (online); vocabulary; Twitter; Twitter complement; bootstrapped dataless classification scheme; dynamic hierarchical entity-aware event discovery model; entity graphs; informative presentation; microblogging sites; multiaspect reflection mining; news events; news media; news summaries; news vocabularies; policy makers; time spans; tweet highlights; user digestion; Computer hacking; Data mining; Joining processes; Media; Twitter; Vocabulary; Xenon; event discovery; news; twitter
【Paper Link】 【Pages】:439-448
【Authors】: Lijun Wang ; Ming Dong ; Alexander Kotov
【Abstract】: Clustering is the task of finding natural groups in datasets based on measured or perceived similarity between data points. Spectral clustering is a well-known graph-theoretic approach, which is capable of capturing non-convex geometries of datasets. However, it generally becomes infeasible for analyzing large datasets due to its relatively high time and space complexity. In this paper, we propose Multi-level Approximate Spectral (MAS) clustering to enable efficient analysis of large datasets. By integrating a series of low-rank matrix approximations (i.e., approximations to the affinity matrix and its subspace, as well as those for the Laplacian matrix and the Laplacian subspace), MAS achieves great computational and spatial efficiency. MAS provides a general framework for fast and accurate spectral clustering, which works with any kernel, various fast sampling strategies and different low-rank approximation algorithms. In addition, it can be easily extended for distributed computing. From a theoretical perspective, we provide rigorous analysis of its approximation error in addition to its correctness and computational complexity. Through extensive experiments we demonstrate the superior performance of the proposed method relative to several well-known approximate spectral clustering algorithms.
【Keywords】: approximation theory; computational complexity; data analysis; graph theory; group theory; pattern clustering; MAS clustering; approximation error; computational complexity; dataset analysis; distributed computing; graph-theoretic approach; low-rank approximation algorithms; low-rank matrix approximations; multilevel approximate spectral clustering; natural groups; nonconvex geometries; sampling strategies; space complexity; time complexity; Algorithm design and analysis; Approximation algorithms; Approximation methods; Clustering algorithms; Complexity theory; Kernel; Laplace equations
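One standard instance of the low-rank affinity approximations such frameworks build on is the Nystrom method, sketched here with an RBF kernel; the landmark count m, kernel width gamma, and function names are illustrative choices.

    import numpy as np

    def nystrom_affinity(X, m, gamma=1.0, seed=0):
        """Nystrom low-rank factor L with K ~= L @ L.T for an RBF affinity
        matrix, costing O(n * m^2) instead of O(n^2)."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=m, replace=False)       # landmark sample
        d2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1)
        C = np.exp(-gamma * d2)                               # (n, m) cross block
        W = C[idx]                                            # (m, m) landmark block
        vals, vecs = np.linalg.eigh(W)                        # K ~= C W^-1 C.T
        vals = np.maximum(vals, 1e-10)
        return C @ (vecs / np.sqrt(vals))                     # the factor L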
【Paper Link】 【Pages】:449-458
【Authors】: Yu Ting Wen ; Kae-Jer Cho ; Wen-Chih Peng ; Jinyoung Yeo ; Seung-won Hwang
【Abstract】: With the popularity of social media (e.g., Facebook and Flickr), users can easily share their check-in records and photos during their trips. In view of the huge amount of check-in data and photos in social media, we intend to discover travel experiences to facilitate trip planning. Prior work has elaborated on mining and ranking existing travel routes from check-in data. We observe that when planning a trip, users may have keywords expressing preferences for the trip, and that a diverse set of travel routes is needed. To provide a diverse set of travel routes, we claim that more features of Places of Interest (POIs) should be extracted. Therefore, in this paper, we propose a Keyword-aware Skyline Travel Route (KSTR) framework that uses knowledge extraction from historical mobility records and the users' social interactions. Explicitly, we model the "Where, When, Who" issues by featurizing the geographical mobility pattern, temporal influence and social influence. Then we propose a keyword extraction module to classify the POI-related tags automatically into different types, for effective matching with query keywords. We further design a route reconstruction algorithm to construct route candidates that fulfill the query inputs. To provide diverse query results, we explore Skyline concepts to rank the routes. To evaluate the effectiveness and efficiency of the proposed algorithms, we have conducted extensive experiments on real location-based social network datasets, and the experimental results show that KSTR does indeed demonstrate good performance compared to state-of-the-art works.
【Keywords】: classification; feature extraction; mobile computing; mobility management (mobile radio); query processing; social networking (online); travel industry; Facebook; Flicker; KSTR; POI-related tag classification; check-in records; feature extraction; geographical mobility pattern; historical mobility records; keyword-aware skyline travel route recommendation; knowledge extraction; places-of-interests; query keywords; real location-based social network datasets; route ranking; route reconstruction algorithm; social influence; social media; temporal influence; travel experience discovery; trip planning; user social interactions; when issue; where issue; who issue; Feature extraction; Media; Planning; Radio frequency; Semantics; Social network services; Trajectory
【Paper Link】 【Pages】:459-468
【Authors】: Fangzhao Wu ; Yongfeng Huang
【Abstract】: Sentiment classification is a hot research topic in both industrial and academic fields. Mainstream sentiment classification methods are based on machine learning and treat sentiment classification as a text classification problem. However, sentiment classification is widely recognized as a highly domain-dependent task: a sentiment classifier trained in one domain may not perform well in another. A simple solution to this problem is training a domain-specific sentiment classifier for each domain, but it is difficult to label enough data for every domain, since domains are numerous; in addition, this approach ignores the sentiment information in other domains. In this paper, we propose to train sentiment classifiers for multiple domains in a collaborative way based on multi-task learning. Specifically, we decompose the sentiment classifier in each domain into two components, a general one and a domain-specific one. The general sentiment classifier captures the global sentiment information and is trained across various domains to obtain better generalization ability. The domain-specific sentiment classifier is trained using the labeled data in one domain to capture the domain-specific sentiment information. In addition, we explore two kinds of relations between domains, one based on textual content and the other based on sentiment word distribution. We build a domain similarity graph using these domain relations and encode it into our approach as regularization over the domain-specific sentiment classifiers. Besides, we incorporate the sentiment knowledge extracted from sentiment lexicons to help train the general sentiment classifier more accurately. Moreover, we introduce an accelerated optimization algorithm to train the sentiment classifiers efficiently. Experimental results on two benchmark sentiment datasets show that our method outperforms baseline methods significantly and consistently.
【Keywords】: generalisation (artificial intelligence); graph theory; knowledge acquisition; learning (artificial intelligence); pattern classification; sentiment analysis; text analysis; accelerated optimization algorithm; collaborative multidomain sentiment classification; domain similarity graph; domain-specific sentiment classifier; domain-specific sentiment information; generalization ability; global sentiment information; machine learning; mainstream sentiment classification method; multitask learning; sentiment information; sentiment knowledge extraction; sentiment lexicons; sentiment word distribution; text classification problem; textual content; Adaptation models; Benchmark testing; Collaboration; Data mining; Learning systems; Motion pictures; Training; multi-domain; multi-task learning; sentiment classification
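A toy version of the decomposition conveys the core idea: each domain's classifier is the sum of a general part trained on all domains and a domain-specific part regularized toward similar domains through the similarity graph. Everything below (plain logistic losses, the step sizes, the simple penalty) is an illustrative assumption; the paper additionally uses lexicon knowledge and an accelerated solver.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_multi_domain(Xs, ys, S, lam=0.1, lr=0.1, epochs=200):
        """w_g is the general classifier trained on every domain; V[d] is the
        domain-specific part, pulled toward similar domains through S.
        Xs: list of (n_d, p) matrices; ys: list of (n_d,) labels in {-1, +1};
        S: (D, D) symmetric domain-similarity matrix with zero diagonal."""
        D, p = len(Xs), Xs[0].shape[1]
        w_g, V = np.zeros(p), np.zeros((D, p))
        for _ in range(epochs):
            g_shared = np.zeros(p)
            for d in range(D):
                m = ys[d] * (Xs[d] @ (w_g + V[d]))
                g = -(sigmoid(-m) * ys[d]) @ Xs[d] / len(ys[d])
                g_shared += g                  # the general part sees all domains
                # gradient of lam * sum_e S[d, e] * ||V[d] - V[e]||^2 w.r.t. V[d]
                g_graph = 2.0 * lam * (S[d].sum() * V[d] - S[d] @ V)
                V[d] -= lr * (g + g_graph)
            w_g -= lr * g_shared / D
        return w_g, V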
【Paper Link】 【Pages】:469-478
【Authors】: Wei Xiong ; Bo Du ; Lefei Zhang ; Ruimin Hu ; Wei Bian ; Jialie Shen ; Dacheng Tao
【Abstract】: The human visual system proves smart in extracting both global and local features. Can we design a similar way for unsupervised feature learning? In this paper, we propose a novel pooling method within an unsupervised feature learning framework, named Rich and Robust Feature Pooling (R2FP), to better explore rich and robust representation from sparse feature maps of the input data. Both local and global pooling strategies are further considered to instantiate such a method and intensively studied. The former selects the most conductive features in the sub-region and summarizes the joint distribution of the selected features, while the latter is utilized to extract multiple resolutions of features and fuse the features with a feature balancing kernel for rich representation. Extensive experiments on several image recognition tasks demonstrate the superiority of the proposed techniques.
【Keywords】: data mining; data visualisation; feature extraction; image recognition; learning (artificial intelligence); R2FP; feature extraction; human visual system; image recognition; rich and robust feature pooling; unsupervised feature learning; visual data mining; Data mining; Electronic mail; Feature extraction; Kernel; Robustness; Spatial resolution; autoencoder; pooling; representation learning
【Paper Link】 【Pages】:479-488
【Authors】: Yi Xu ; Yilin Zhu ; Zhongfei Zhang ; Yaqing Zhang ; Philip S. Yu
【Abstract】: For decades, parameter estimation for mixture models has been known to yield only locally optimal solutions. In this paper, we propose a functional estimation for mixture models using step functions. We show that the proposed functional inference yields a convex formulation and consequently makes global-optimum inference feasible for mixture models. The proposed approach further unifies the existing isolated exemplar-based clustering techniques at a higher level of generality: e.g., it provides a theoretical justification for the heuristics of clustering by affinity propagation (Frey & Dueck, 2007), and it reproduces Lashkari & Golland (2007)'s convex formulation as a special case under this step-function construction. Empirical studies also verify the theoretical justifications.
【Keywords】: approximation theory; convex programming; inference mechanisms; parameter estimation; pattern clustering; affinity propagation; convex approximation; functional inference; integral mixture models; isolated exemplar-based clustering techniques; parameter estimation; step function; Bayes methods; Electronic mail; Estimation; Function approximation; Inference algorithms; Mixture models; Convex Approximation; mixture models; step functions
【Paper Link】 【Pages】:489-498
【Authors】: Junyu Xuan ; Jie Lu ; Guangquan Zhang ; Richard Yi Da Xu ; Xiangfeng Luo
【Abstract】: Incorporating the side information of a text corpus, i.e., authors, time stamps, and emotional tags, into traditional text mining models has gained significant interest in the areas of information retrieval, statistical natural language processing, and machine learning. One branch of these works is the so-called Author Topic Model (ATM), which incorporates the authors' interests as side information into the classical topic model. However, the existing ATM needs to predefine the number of topics, which is difficult and inappropriate in many real-world settings. In this paper, we propose an Infinite Author Topic (IAT) model to resolve this issue. Instead of assigning a discrete probability to a fixed number of topics, we use a stochastic process to determine the number of topics from the data itself. To be specific, we extend a gamma-negative binomial process to three levels in order to capture the author-document-keyword hierarchical structure. Furthermore, each document is assigned a mixed gamma process that accounts for the multiple authors' contributions to the document. An efficient Gibbs sampling inference algorithm, with each conditional distribution in closed form, is developed for the IAT model. Experiments on several real-world datasets show the capability of our IAT model to learn the hidden topics, the authors' interests in these topics and the number of topics simultaneously.
【Keywords】: data mining; gamma distribution; inference mechanisms; sampling methods; text analysis; ATM; Gibbs sampling inference algorithm; author-document-keyword hierarchical structure; discrete probability; infinite author topic model; information retrieval; machine learning; mixed gamma-negative binomial process; statistical natural language processing; text corpus side information; text mining models; Bayes methods; Data models; Hidden Markov models; Probability distribution; Stochastic processes; Text mining; Weight measurement; Bayesian nonparametric learning; Text mining; Topic models
【Paper Link】 【Pages】:499-508
【Authors】: Jingyuan Yang ; Chuanren Liu ; Mingfei Teng ; Hui Xiong ; March Liao ; Vivian Zhu
【Abstract】: Business to Business (B2B) marketing aims at meeting the needs of other businesses instead of individual consumers. In B2B markets, the buying processes usually involve series of different marketing campaigns providing necessary information to multiple decision makers with different interests and motivations. The dynamic and complex nature of these processes imposes significant challenges to analyze the process logs for improving the B2B marketing practice. Indeed, most of the existing studies only focus on the individual consumers in the markets, such as movie/product recommender systems. In this paper, we exploit the temporal behavior patterns in the buying processes of the business customers and develop a B2B marketing campaign recommender system. Specifically, we first propose the temporal graph as the temporal knowledge representation of the buying process of each business customer. The key idea is to extract and integrate the campaign order preferences of the customer using the temporal graph. We then develop the low-rank graph reconstruction framework to identify the common graph patterns and predict the missing edges in the temporal graphs. We show that the prediction of the missing edges is effective to recommend the marketing campaigns to the business customers during their buying processes. Moreover, we also exploit the community relationships of the business customers to improve the performances of the graph edge predictions and the marketing campaign recommendations. Finally, we have performed extensive empirical studies on real-world B2B marketing data sets and the results show that the proposed method can effectively improve the quality of the campaign recommendations for challenging B2B marketing tasks.
【Keywords】: graph theory; knowledge representation; marketing data processing; recommender systems; B2B marketing campaign recommendations; B2B marketing campaign recommender system; B2B marketing data sets; B2B marketing practice; business customers; business to business marketing; buying processes; campaign order preferences; graph pattern identification; low-rank graph reconstruction framework; missing graph edge predictions; social factors; temporal behavior patterns; temporal graph; temporal knowledge representation; Companies; Electronic mail; Knowledge representation; Markov processes; Recommender systems; Webinars; Community Networks; Graph Reconstruction; Recommender Systems; Temporal Graph; Temporal Patterns
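The core step here, reconstructing a temporal graph's weighted adjacency matrix at low rank and reading recommendations off the large reconstructed values on missing edges, can be sketched with a truncated SVD; the community-relationship refinement the paper adds is omitted, and all names are illustrative.

    import numpy as np

    def recommend_edges(G, rank=10, top=5):
        """Rank-r SVD reconstruction of a weighted adjacency matrix; the
        largest reconstructed values on currently absent edges are returned
        as recommendation candidates."""
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        G_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        cand = np.where(G > 0, -np.inf, G_hat)       # score only missing edges
        flat = np.argsort(cand, axis=None)[::-1][:top]
        return [np.unravel_index(i, G.shape) for i in flat]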
【Paper Link】 【Pages】:509-518
【Authors】: Peng Yang ; Peilin Zhao ; Vincent W. Zheng ; Xiaoli Li
【Abstract】: Traditional online learning algorithms are designed for vector data only and assume that the labels of all training examples are provided. In this paper, we study graph classification where only limited nodes are chosen for labeling by selective sampling. In particular, we first adapt a spectral-based graph regularization technique to derive a novel online learning linear algorithm that can handle graph data; however, it still queries the labels of all nodes and thus is not preferred, as labeling is typically time-consuming. To address this issue, we then propose a new confidence-based query method for selective sampling. The theoretical result shows that our online learning algorithm, with only a fraction of the labels queried, can achieve a mistake bound comparable to that of learning on all node labels. In addition, the algorithm based on our proposed query strategy achieves a better mistake bound than those based on other query methods. However, this algorithm is conservative, updating the model only when an error happens, which wastes training labels that are valuable to the model. To take advantage of these labels, we further propose a novel aggressive algorithm, which can update the model aggressively even if no error occurs. The theoretical analysis shows that our aggressive approach can achieve a better mistake bound than its conservative and fully-supervised counterparts, with substantially fewer labels queried. We empirically evaluate our algorithms on several real-world graph datasets, and the experimental results demonstrate that our method is highly effective.
【Keywords】: graph theory; learning (artificial intelligence); pattern classification; query processing; sampling methods; aggressive algorithm; aggressive graph-based selective sampling algorithm; confidence-based query method; graph classification; graph data handling; online learning algorithms; online learning linear algorithm; spectral-based graph regularization technique; vector data; Algorithm design and analysis; Kernel; Laplace equations; Prediction algorithms; Predictive models; Training; Uncertainty; Graph Node Classification; Online Learning; Selective Sampling
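The flavor of a confidence-based query rule can be shown with a classical selective-sampling perceptron that queries a label with probability b/(b + |margin|). This is a known generic rule used here for illustration on plain vectors; the paper derives its own query condition for a graph-regularized learner.

    import numpy as np

    def selective_perceptron(stream, b=1.0, seed=0):
        """Query a label with probability b / (b + |margin|): the lower the
        confidence, the likelier the query. stream yields (x, y), y in {-1, +1}."""
        rng = np.random.default_rng(seed)
        w, queried, mistakes = None, 0, 0
        for x, y in stream:
            if w is None:
                w = np.zeros(len(x))
            p = w @ x                                 # |p| acts as confidence
            if rng.uniform() < b / (b + abs(p)):      # low confidence: ask for y
                queried += 1
                if y * p <= 0:                        # conservative update on error
                    mistakes += 1
                    w = w + y * np.asarray(x, dtype=float)
        return w, queried, mistakes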
【Paper Link】 【Pages】:519-528
【Authors】: Yang Yang ; Jie Tang
【Abstract】: Users often fail to find the right keywords to precisely describe their queries in the information seeking process. Techniques such as user intention predictions and personalized recommendations are designed to help the users figure out how to formalize their queries. In this work, we aim to help users identify their search targets using a new approach called Interactive User Intention Understanding. In particular, we construct an automatic questioner that generates yes-or-no questions for the user. Then we infer user intention according to the corresponding answers. In order to generate "smart" questions in an optimal sequence, we propose the IHS algorithm based on heuristic search. We prove an error bound for the proposed algorithm on the ranking of target items given the questions and answers. We conduct experiments on three datasets and compare our result with two baseline methods. Experimental results show that IHS outperforms the baseline methods by 27.83% and 25.98% respectively.
【Keywords】: interactive systems; query processing; question answering (information retrieval); recommender systems; search problems; IHS algorithm; automatic questioner; error bound; heuristic search; information seeking process; interactive user intention understanding; optimal sequence; personalized recommendations; target item ranking; user intention predictions; yes-or-no questions; Algorithm design and analysis; Companies; Engines; Games; Mobile communication; Partitioning algorithms; Search engines; heuristic algorithm; interactive; user intention
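The single-step criterion behind such a questioner is easy to sketch: among the available yes/no questions, pick the one whose answer splits the remaining candidate targets most evenly, i.e., maximizes expected entropy reduction. The code below illustrates only this greedy step (IHS additionally searches over question sequences heuristically), and all names are assumptions.

    import math

    def best_question(candidates, weights, questions):
        """Pick the yes/no question whose answer splits the remaining
        candidates most evenly (maximum expected entropy reduction).
        questions[q] is the set of candidates whose answer to q is yes."""
        total = sum(weights[c] for c in candidates)
        best_q, best_gain = None, -1.0
        for q, yes_set in questions.items():
            p_yes = sum(weights[c] for c in candidates if c in yes_set) / total
            if 0.0 < p_yes < 1.0:
                gain = -(p_yes * math.log2(p_yes)
                         + (1.0 - p_yes) * math.log2(1.0 - p_yes))
                if gain > best_gain:
                    best_q, best_gain = q, gain
        # the answer then shrinks `candidates` to the matching half; repeat
        return best_q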
【Paper Link】 【Pages】:529-538
【Authors】: Dezhong Yao ; Peilin Zhao ; Chen Yu ; Hai Jin ; Bin Li
【Abstract】: For many data mining and machine learning tasks, the quality of a similarity measure is the key to their performance. To automatically find a good similarity measure from datasets, metric learning and similarity learning have been proposed and studied extensively. Metric learning learns a Mahalanobis distance based on a positive semi-definite (PSD) matrix to measure the distances between objects, while similarity learning aims to directly learn a similarity function without the PSD constraint, which makes it more attractive. Most existing similarity learning algorithms are online similarity learning methods, since online learning is more scalable than offline learning. However, most existing online similarity learning algorithms learn a full matrix with d^2 parameters, where d is the dimension of the instances. This is clearly inefficient for high-dimensional tasks due to its high memory and computational complexity. To solve this issue, we introduce several Sparse Online Relative Similarity (SORS) learning algorithms, which learn a sparse model during the learning process, so that the memory and computational cost can be significantly reduced. We theoretically analyze the proposed algorithms, and evaluate them on real-world high-dimensional datasets. Encouraging empirical results demonstrate the advantages of our approach in terms of efficiency and efficacy.
【Keywords】: data mining; learning (artificial intelligence); SORS learning algorithms; data mining; learning process; machine learning tasks; sparse model; sparse online relative similarity learning; Algorithm design and analysis; Correlation; Data mining; Image retrieval; Machine learning algorithms; Measurement; Sparse matrices; Data stream mining; High-dimensional similarity learning; Sparse Online learning
【Paper Link】 【Pages】:539-548
【Authors】: Quanming Yao ; James T. Kwok ; Wenliang Zhong
【Abstract】: Low-rank modeling has many important applications in machine learning, computer vision and social network analysis. While the matrix rank is often approximated by the convex nuclear norm, the use of nonconvex low-rank regularizers has demonstrated better recovery performance. However, the resulting optimization problem is much more challenging. A very recent state-of-the-art method is based on the proximal gradient algorithm, but it requires an expensive full SVD in each proximal step. In this paper, we show that for many commonly used nonconvex low-rank regularizers, a cutoff can be derived to automatically threshold the singular values obtained from the proximal operator. This allows the use of the power method to approximate the SVD efficiently. Besides, the proximal operator can be reduced to that of a much smaller matrix projected onto this leading subspace. Convergence, with a rate of O(1/T) where T is the number of iterations, can be guaranteed. Extensive experiments are performed on matrix completion and robust principal component analysis. The proposed method achieves significant speedup over the state-of-the-art. Moreover, the matrix solution obtained is more accurate and has a lower rank than that of the traditional nuclear norm regularizer.
【Keywords】: convergence; learning (artificial intelligence); mathematics computing; singular value decomposition; SVD; convergence; low-rank matrix learning; machine learning; matrix completion; nonconvex low-rank regularizers; proximal operator; robust principal component analysis; singular value decomposition; Approximation algorithms; Closed-form solutions; Matrix decomposition; Minimization; Optimization; Robustness; Sparse matrices; Low-rank matrix; Matrix completion; Nonconvex optimization; Proximal gradient; Robust PCA
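The approximate proximal step can be sketched as: find the top-k singular triplets by subspace (power) iteration, then pass the singular values through a thresholding rule. The sketch below is a generic instance under assumed names; soft-thresholding stands in for whichever (possibly nonconvex) shrinkage the regularizer induces.

    import numpy as np

    def approx_prox_step(A, k, thresh, iters=3, seed=0):
        """Top-k singular triplets by subspace (power) iteration, then pass
        the singular values through a thresholding rule, avoiding a full SVD."""
        rng = np.random.default_rng(seed)
        Q, _ = np.linalg.qr(A @ rng.standard_normal((A.shape[1], k)))
        for _ in range(iters):                  # refine the leading subspace
            Q, _ = np.linalg.qr(A @ (A.T @ Q))
        U_s, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)  # small SVD
        s = thresh(s)                           # cut off / shrink singular values
        keep = s > 0
        return (Q @ U_s[:, keep] * s[keep]) @ Vt[keep]

    # e.g. nuclear-norm soft-thresholding as the shrinkage rule:
    demo = approx_prox_step(np.random.randn(50, 40), k=5,
                            thresh=lambda s: np.maximum(s - 0.5, 0.0))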
【Paper Link】 【Pages】:549-558
【Authors】: Muhammad Yousefnezhad ; Daoqiang Zhang
【Abstract】: Clustering explores meaningful patterns in non-labeled data sets. Cluster Ensemble Selection (CES) is a new approach that combines individual clustering results to increase the performance of the final result. Although CES can achieve better final results than individual clustering algorithms and cluster ensemble methods, its performance can be dramatically affected by its consensus diversity metric and thresholding procedure. There are two problems in CES: 1) most diversity metrics are based on heuristic Shannon entropy, and 2) estimating threshold values is hard in practice. The main goal of this paper is to propose a robust approach for solving these problems. Accordingly, this paper develops a novel framework for clustering problems, called Weighted Spectral Cluster Ensemble (WSCE), by exploiting concepts from the community detection arena and graph-based clustering. Under this framework, a new version of spectral clustering, called Two Kernels Spectral Clustering, is used for generating graph-based individual clustering results. Further, by using modularity, a well-known metric in community detection, on the transformed graph representation of the individual clustering results, our approach provides an effective diversity estimation for individual clustering results. Moreover, this paper introduces a new approach for combining the evaluated individual clustering results without the thresholding procedure. An experimental study on varied data sets demonstrates that the proposed approach achieves superior performance to state-of-the-art methods.
【Keywords】: graph theory; pattern clustering; CES; WSCE; community detection arena; data sets; diversity metrics; graph based clustering; graph based individual clustering; heuristic Shannon's entropy; nonlabeled data sets; two-kernels spectral clustering; weighted spectral cluster ensemble selection; Clustering algorithms; Correlation; Diversity reception; Entropy; Mathematical model; Measurement; Yttrium; cluster ensemble; normalized modularity; spectral clustering; weighted evidence accumulation clustering
【Paper Link】 【Pages】:559-568
【Authors】: Linyun Yu ; Peng Cui ; Fei Wang ; Chaoming Song ; Shiqiang Yang
【Abstract】: Cascades are ubiquitous in various network environments. Predicting these cascades is highly nontrivial in several vital applications, such as viral marketing, epidemic prevention and traffic management. Most previous works mainly focus on predicting the final cascade sizes. As cascades are typical dynamic processes, it is always interesting and important to predict the cascade size at any time, or to predict the time when a cascade will reach a certain size (e.g., a threshold for outbreak). In this paper, we unify all these tasks into a fundamental problem: cascading process prediction. That is, given the early stage of a cascade, how do we predict its cumulative cascade size at any later time? For such a challenging problem, understanding the micro mechanism that drives and generates the macro phenomena (i.e., the cascading process) is essential. Here we introduce behavioral dynamics as the micro mechanism to describe the dynamic process of a node's neighbors getting infected by a cascade after the node itself gets infected (i.e., one-hop subcascades). Through data-driven analysis, we find common principles and patterns lying in behavioral dynamics and propose a novel Networked Weibull Regression model for behavioral dynamics modeling. We then propose a novel method for predicting cascading processes by effectively aggregating behavioral dynamics, and present a scalable solution to approximate the cascading process with a theoretical guarantee. We extensively evaluate the proposed method on a large-scale social network dataset. The results demonstrate that the proposed method significantly outperforms other state-of-the-art baselines in multiple tasks including cascade size prediction, outbreak time prediction and cascading process prediction.
【Keywords】: behavioural sciences computing; marketing; social networking (online); behavioral dynamics modeling; cumulative cascade size; data-driven analysis; dynamic process; epidemic prevention; information cascading process prediction; micromechanism; network environments; networked Weibull regression model; outbreak time prediction; social network dataset; traffic management; viral marketing; Additives; Analytical models; Computational modeling; Computer science; Data models; Predictive models; Social network services; Dynamic Processes Prediction; Information Cascades; Social Network
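【Example】: The abstract describes fitting Weibull-distributed behavioral dynamics per node and aggregating them into a cascade-size prediction. Below is a minimal sketch of one plausible aggregation, assuming hypothetical per-node Weibull parameters and neighbor counts; the paper's regression model and covariates are not reproduced.

```python
import numpy as np

def weibull_cdf(t, shape, scale):
    """F(t) = 1 - exp(-(t/scale)^shape) for t >= 0."""
    return 1.0 - np.exp(-(np.maximum(t, 0.0) / scale) ** shape)

def expected_cascade_size(t, infections, n_seed=1):
    """Expected cumulative cascade size at time t: sum, over each already-
    infected node, the expected fraction of its neighbors infected by time t
    according to that node's fitted Weibull delay distribution."""
    size = float(n_seed)
    for t_i, n_i, shape_i, scale_i in infections:
        size += n_i * weibull_cdf(t - t_i, shape_i, scale_i)
    return size

# (infection time, #neighbors, Weibull shape, Weibull scale) -- hypothetical values
infected = [(0.0, 50, 0.9, 5.0), (2.0, 20, 1.2, 3.0), (4.0, 80, 0.8, 10.0)]
for t in (5.0, 10.0, 50.0):
    print(t, round(expected_cascade_size(t, infected), 1))
```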
【Paper Link】 【Pages】:569-578
【Authors】: Wenchao Yu ; Ariyam Das ; Justin Wood ; Wei Wang ; Carlo Zaniolo ; Ping Luo
【Abstract】: In a sponsored search market, the problem of measuring the intensity of competition among advertisers is gaining prominence. Usually, search providers want to monitor the advertiser communities that share common bidding keywords, so that they can intervene when competition slackens. However, to the best of our knowledge, little research has been conducted on identifying advertiser communities and understanding competition within these communities. In this paper we introduce a novel approach to detect competitive communities in a weighted bipartite network formed by advertisers and their bidding keywords. The proposed approach is based on an advertiser vertex metric called the intensity score, which takes two factors into consideration: the competitors that bid on the same keywords, and the advertisers' consumption proportion within the community. Evidence shows that when market competition rises, the revenue for a search provider also increases. Our community detection algorithm, Max-Intensity, is designed to detect communities that have the maximum intensity score. We conduct experiments and validate the performance of Max-Intensity on sponsored search advertising data. Compared to baseline methods, the communities detected by our algorithm have a low Herfindahl-Hirschman index (HHI) and comprehensive concentration index (CCI), which demonstrates that the communities given by Max-Intensity can capture the structure of the competitive communities.
【Keywords】: advertising data processing; network theory (graphs); search engines; CCI; HHI; Herfindahl-Hirschman index; Max-Intensity detection algorithm; advertiser vertex metric; bidding keywords; competitive advertiser community detection; comprehensive concentration index; intensity score; search providers; sponsored search market; weighted bi-partite network; Correlation; Data models; Detection algorithms; Indexes; Measurement; Monitoring; Search engines; Competition Coefficient; Competition Community Detection; Max-Intensity; Sponsored Search
【Paper Link】 【Pages】:579-588
【Authors】: Tao Zeng ; Shuiwang Ji
【Abstract】: Multi-instance learning studies problems in which labels are assigned to bags that contain multiple instances. In these settings, the relations between instances and labels are usually ambiguous. In contrast, multi-task learning focuses on the output space, in which an input sample is associated with multiple labels. In the real world, a sample may be associated with multiple labels derived from observing multiple aspects of the problem. Thus many real-world applications are naturally formulated as multi-instance multi-task (MIMT) problems. A common approach to MIMT is to solve it task-by-task independently under the multi-instance learning framework. On the other hand, convolutional neural networks (CNN) have demonstrated promising performance in single-instance single-label image classification tasks. However, how CNNs deal with multi-instance multi-label tasks remains an open problem, mainly due to the complex many-to-many relations between the input and output spaces. In this work, we propose a deep learning model, known as multi-instance multi-task convolutional neural networks (MIMT-CNN), where a number of images representing a multi-task problem are taken as the inputs. A shared sub-CNN is connected to each input image to form instance representations. These sub-CNN outputs are subsequently aggregated as inputs to additional convolutional layers and fully connected layers to produce the ultimate multi-label predictions. This CNN model, through transfer learning from other domains, enables transfer of prior knowledge at the image level learned from large single-label single-task data sets. The bag-level representations in this model are hierarchically abstracted by multiple layers from instance-level representations. Experimental results on mouse brain gene expression pattern annotation data show that the proposed MIMT-CNN model achieves superior performance.
【Keywords】: image classification; image representation; learning (artificial intelligence); neural nets; MIMT-CNN problem; bag level representations; complex multiple-to-multiple relations; deep convolutional neural networks; deep learning model; image representation; mouse brain gene expression pattern annotation data; multiinstance multilabel tasks; multiinstance multitask convolutional neural networks; multiinstance multitask learning; single-instance single-label image classification tasks; single-label single-task data sets; transfer learning; ultimate multilabel predictions; Biological system modeling; Brain models; Data models; Gene expression; Standards; Deep learning; bioinformatics; multi-instance learning; multi-task learning; transfer learning
【Paper Link】 【Pages】:589-598
【Authors】: Dell Zhang ; Jun Wang ; Xiaoxue Zhao ; Xiaoling Wang
【Abstract】: In multi-class text classification, the performance (effectiveness) of a classifier is usually measured by micro-averaged and macro-averaged F1 scores. However, the scores themselves do not tell us how reliable they are in forecasting the classifier's future performance on unseen data. In this paper, we propose a novel approach to explicitly modelling the uncertainty of average F1 scores through Bayesian reasoning, and demonstrate that it can provide a much more comprehensive performance comparison between text classifiers than traditional frequentist null hypothesis significance testing (NHST).
【Keywords】: Bayes methods; inference mechanisms; pattern classification; text analysis; Bayesian hierarchical model; Bayesian reasoning; average F1 scores comparison; frequentist NHST; null hypothesis significance testing; text classifiers; Bayes methods; Computational modeling; Data models; Electronic mail; Estimation; Testing; Uncertainty; Bayesian inference; hypothesis testing; model comparison; performance evaluation; text classification
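【Example】: The abstract argues for posteriors over average F1 rather than point scores. Here is a minimal sketch of that idea under a simple Dirichlet-multinomial model of the confusion matrix (the paper's hierarchical model is more elaborate): sample cell probabilities from the posterior and push each sample through the macro-F1 formula.

```python
import numpy as np

def macro_f1_posterior(confusion, n_samples=10_000, prior=1.0, seed=0):
    """Posterior samples of macro-averaged F1 from a Dirichlet posterior
    over the flattened confusion-matrix cell probabilities."""
    rng = np.random.default_rng(seed)
    C = confusion.shape[0]
    alpha = confusion.flatten() + prior             # Dirichlet posterior parameters
    out = np.empty(n_samples)
    for s in range(n_samples):
        P = rng.dirichlet(alpha).reshape(C, C)      # P[i, j] = Pr(true=i, pred=j)
        tp = np.diag(P)
        prec = tp / np.maximum(P.sum(axis=0), 1e-12)
        rec = tp / np.maximum(P.sum(axis=1), 1e-12)
        f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
        out[s] = f1.mean()
    return out

conf = np.array([[80, 10, 5], [8, 60, 7], [2, 9, 40]])
s = macro_f1_posterior(conf)
print(round(s.mean(), 3), np.percentile(s, [2.5, 97.5]).round(3))  # 95% credible interval
```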
【Paper Link】 【Pages】:599-608
【Authors】: Jiawei Zhang ; Philip S. Yu
【Abstract】: Users nowadays are normally involved in multiple (usually more than two) online social networks simultaneously to enjoy more social network services. Some of the networks that users are involved in can share common structures, either due to analogous network construction purposes or because of similar social network features. However, the social network datasets available in research are usually pre-anonymized, and the accounts of shared users in different networks are mostly isolated without any known connections. In this paper, we want to identify such connections between the shared users' accounts in multiple social networks (i.e., the anchor links), which is formally defined as the M-NASA (Multiple Anonymized Social Networks Alignment) problem. M-NASA is very challenging to address due to (1) the lack of known anchor links to build models, (2) the fact that the studied networks are anonymized, so no user profile or attribute information is available, and (3) the "transitivity law" and "one-to-one property" constraints on anchor links. To resolve these challenges, a novel two-phase network alignment framework UMA (Unsupervised Multi-network Alignment) is proposed in this paper. Extensive experiments conducted on multiple real-world partially aligned social networks demonstrate that UMA performs very well in solving the M-NASA problem.
【Keywords】: graph theory; network theory (graphs); social networking (online); unsupervised learning; M-NASA; UMA; anchor link; multiple anonymized social networks alignment; one-to-one property; online social network; transitivity law; unsupervised multinetwork alignment; Conferences; Data mining; Indexes; Joining processes; Optimization; Social network services; Data Mining; Multiple Heterogeneous Social Networks; Partial Network Alignment
【Paper Link】 【Pages】:609-618
【Authors】: Li Zhang ; Keli Xiao ; Qi Liu ; Yefan Tao ; Yuefan Deng
【Abstract】: With the rapid growth of social network usage, the patterns, scale, and rate of information exchange have had profound impacts on research and practice in finance. One important topic is stock market efficiency analysis. Traditional schemes in finance focus on identifying significant abnormal returns triggered by important events. However, those events are identified only from regular financial announcements such as mergers, equity issuances, and financial reports. Related data-driven approaches mainly focus on developing trading strategies using social media data, while the results usually lack theoretical explanations. In this paper, we fill the gap between the usage of social media data and financial theories. We propose a Degree of Social Attention (DSA) framework for stock analysis based on an influence propagation model. Specifically, we define the self-influence of users in a social network and the DSA of stocks. A recursive process is also designed for dynamic value updating. Furthermore, we provide two modified approaches to reduce the computational cost. Our testing results from the Chinese stock market suggest that the proposed framework effectively captures stock abnormal returns based on the related social media data, and DSA is verified to be a key factor linking social media activities to the stock market.
【Keywords】: social networking (online); stock markets; DSA; data-driven approach; degree of social attention; finance; financial theories; influence propagation; information exchange; social media data; social network; stock market efficiency analysis; Analytical models; Computational modeling; Finance; Integrated circuit modeling; Media; Social network services; Stock markets; Influence Propagation; Market Efficiency; Social Network; Stock Social Attention
【Paper Link】 【Pages】:619-628
【Authors】: Yao Zhang ; Abhijin Adiga ; Anil Vullikanti ; B. Aditya Prakash
【Abstract】: Given a network with groups, such as a contact network grouped by ages, which are the best groups to immunize to control the epidemic? Equivalently, how should one choose communities in social networks like Facebook to stop rumors from spreading? Immunization is an important problem in multiple domains, including epidemiology, public health, cyber security and social media. Additionally, immunization at group scale (e.g., schools and communities) is clearly more realistic due to constraints on implementation and compliance (e.g., it is hard to ensure that specific individuals take the adequate vaccine). Hence efficient algorithms for such a "group-based" problem can help public-health experts make more practical decisions. However, most prior work has looked into individual-scale immunization. In this paper, we study the problem of controlling propagation at group scale. We formulate novel so-called Group Immunization problems for multiple natural settings (for both threshold and cascade-based contagion models, under both node-level and edge-level interventions) and develop multiple efficient algorithms, including provably approximate solutions. Finally, we show the effectiveness of our methods via extensive experiments on real and synthetic datasets.
【Keywords】: social networking (online); Facebook; contact network; controlling propagation; cyber security; epidemiology; group immunization problems; group scale; group-based problem; individual-scale immunization; public health experts; social media; social networks; synthetic datasets; Diffusion processes; Facebook; Immune system; Integrated circuit modeling; Public healthcare; Resource management; Vaccines
【Paper Link】 【Pages】:629-638
【Authors】: Yu Zhang
【Abstract】: In this paper, we develop parallel algorithms for a family of regularized multi-task methods that can model task relations under the regularization framework. Since those multi-task methods cannot be parallelized directly, we solve them with the FISTA algorithm, which in each iteration constructs a surrogate function of the original problem by exploiting the Lipschitz structure of the objective function around the solution from the previous iteration. Specifically, we investigate the dual form of the objective function in those methods by adopting the hinge, ε-insensitive, and square losses to handle multi-task classification and regression problems, and then utilize the Lipschitz structure to construct the surrogate function for the dual forms. The surrogate functions constructed in the FISTA algorithm are found to be decomposable, leading to parallel designs for those multi-task methods. Experiments on several benchmark datasets show that the convergence of the proposed algorithms is as fast as that of SMO-style algorithms and that the parallel design can speed up the computation.
【Keywords】: learning (artificial intelligence); parallel algorithms; FISTA algorithm; Lipschitz structure; SMO-style algorithms; iteration constructs; multitask classification; parallel algorithms; parallel multitask learning; regression problems; regularization framework; regularized multitask methods; surrogate functions; Algorithm design and analysis; Convergence; Covariance matrices; Fasteners; Kernel; Linear programming; Parallel algorithms; Multi-Task Learning; Parallel Algorithm
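【Example】: The abstract's parallel designs rest on the generic FISTA iteration: a gradient step on a Lipschitz surrogate followed by a momentum step. Below is a minimal sketch on plain least squares; the paper applies the same loop to the duals of regularized multi-task objectives, which is not reproduced here.

```python
import numpy as np

def fista(grad, L, x0, n_iter=500):
    """Generic FISTA loop for a smooth objective with L-Lipschitz gradient."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_next = y - grad(y) / L                          # minimize the surrogate
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)    # momentum extrapolation
        x, t = x_next, t_next
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
L = np.linalg.eigvalsh(A.T @ A).max()                     # Lipschitz constant of grad
x = fista(lambda z: A.T @ (A @ z - b), L, np.zeros(10))
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-5))  # converged
```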
【Paper Link】 【Pages】:639-648
【Authors】: Liang Zhao ; Jiangzhuo Chen ; Feng Chen ; Wei Wang ; Chang-Tien Lu ; Naren Ramakrishnan
【Abstract】: Infectious disease epidemics such as influenza and Ebola pose a serious threat to global public health. It is crucial to characterize the disease and the evolution of an ongoing epidemic efficiently and accurately. Computational epidemiology can model the disease progress and the underlying contact network, but suffers from a lack of real-time, fine-grained surveillance data. Social media, on the other hand, provides timely and detailed disease surveillance, but is blind to the underlying contact network and disease model. This paper proposes a novel semi-supervised deep learning framework that integrates the strengths of computational epidemiology and social media mining. Specifically, the framework learns social media users' health states and intervention actions in real time, regularized by the underlying disease model and contact network. Conversely, the knowledge learned from social media is fed into the computational epidemic model to improve the efficiency and accuracy of disease diffusion modeling. We propose an online optimization algorithm that carries out this interactive learning process iteratively until the two components reach a consistent state. Extensive experimental results demonstrate that our approach effectively characterizes spatiotemporal disease diffusion, outperforming competing methods by a substantial margin on multiple metrics.
【Keywords】: data mining; diseases; learning (artificial intelligence); social networking (online); Ebola; SimNest; accuracy improvement; computational epidemiology; contact network; disease diffusion modeling; disease progress; efficiency improvement; global public health; infectious disease epidemics; influenza; interactive learning process; intervention actions; online optimization algorithm; online semisupervised-deep learning; social media mining techniques; social media nested epidemic simulation; social media user health states; spatiotemporal disease diffusion; Computational modeling; Data models; Diseases; Media; Sociology; Statistics; Surveillance; Twitter; deep learning; epidemic simulation
【Paper Link】 【Pages】:649-658
【Authors】: Peilin Zhao ; Furen Zhuang ; Min Wu ; Xiao-Li Li ; Steven C. H. Hoi
【Abstract】: Cost-Sensitive Online Classification has recently been proposed to directly optimize, in an online fashion, two well-known cost-sensitive measures: (i) maximization of the weighted sum of sensitivity and specificity, and (ii) minimization of the weighted misclassification cost. However, existing learning algorithms utilize only the first-order information of the data stream. This is insufficient, as recent studies have shown that incorporating second-order information can yield significant improvements in the prediction model. Hence, we propose a novel cost-sensitive online classification algorithm with adaptive regularization. We theoretically analyze the proposed algorithm and empirically validate its effectiveness with extensive experiments. We also demonstrate the application of the proposed technique to several online anomaly detection tasks, showing that it can be an effective tool for cost-sensitive online classification in various application domains.
【Keywords】: gradient methods; learning (artificial intelligence); pattern classification; ARCSOGD; adaptively regularized cost-sensitive online gradient descent algorithm; cost-sensitive online classification algorithm; machine learning; online anomaly detection; sensitivity weighted sum maximization; specificity weighted sum maximization; weighted misclassification cost minimization; Adaptation models; Classification algorithms; Data mining; Machine learning algorithms; Prediction algorithms; Sensitivity; Training; Adaptive Regularization; Cost-Sensitive Classification; Online Learning
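【Example】: As a minimal first-order baseline of the kind the abstract says is insufficient on its own, the sketch below runs cost-sensitive online gradient descent with a weighted hinge loss. The paper's contribution adds second-order information via adaptive regularization, which is not reproduced here; the cost weights are hypothetical.

```python
import numpy as np

def cs_ogd(stream, dim, eta=0.1, c_pos=5.0, c_neg=1.0):
    """Cost-sensitive online gradient descent with a weighted hinge loss.
    c_pos > c_neg penalizes missed positives more (favoring sensitivity)."""
    w, mistakes = np.zeros(dim), 0
    for x, y in stream:                       # y in {-1, +1}
        if np.sign(w @ x) != y:
            mistakes += 1
        cost = c_pos if y > 0 else c_neg
        if y * (w @ x) < 1:                   # weighted hinge loss is active
            w += eta * cost * y * x           # (sub)gradient step
    return w, mistakes

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.standard_normal(1000))
w, m = cs_ogd(zip(X, y), dim=5)
print(m, w.round(2))
```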
【Paper Link】 【Pages】:659-668
【Authors】: Rong Zhu ; Zhaonian Zou ; Jianzhong Li
【Abstract】: Uncertain graphs have been widely used to represent graph data with inherent uncertainty in structure. Reliability search is a fundamental problem in uncertain graph analytics. This paper studies a new problem, the top-k reliability search problem on uncertain graphs: finding the k vertices v with the highest reliabilities of connection from a source vertex s to v. The existing algorithm for the threshold-based reliability search problem is inefficient for the top-k problem. We propose a new algorithm to solve the top-k reliability search problem efficiently. The algorithm adopts two important techniques, namely BFS sharing and offline sampling. BFS sharing exploits overlaps among different sampled possible worlds of the input uncertain graph and performs a single BFS on all possible worlds simultaneously. Offline sampling samples possible worlds offline and stores them in a compact structure. The algorithm also takes advantage of bit vectors and bitwise operations to improve efficiency. Moreover, we generalize the top-k reliability search problem to the multi-source case and show that it can be equivalently converted to the single-source case. Extensive experiments on both real and synthetic datasets verify that the optimized algorithm outperforms the baselines by 1-2 orders of magnitude in execution time while achieving comparable accuracy, and that it exhibits linear scalability with respect to the size of the input uncertain graph.
【Keywords】: data analysis; data structures; graph theory; sampling methods; search problems; vectors; BFS sharing technique; bit vectors; bitwise operations; graph data representation; input uncertain graph; linear scalability; multisource case; offline sampling technique samples; real datasets; single-source case; source vertex; synthetic datasets; threshold-based reliability search problem; top-k reliability search problem; uncertain graph analytics; Approximation algorithms; Approximation methods; Monte Carlo methods; Proteins; Reliability; Search problems; Uncertainty
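【Example】: A minimal sketch of the bit-vector idea behind BFS sharing, under simplifying assumptions not in the abstract (undirected edges, all sampled worlds drawn up front): each edge holds one presence bit per sampled world, and a single mask-propagating BFS estimates reachability in all worlds at once.

```python
import numpy as np
from collections import deque

def topk_reliability(n, edges, s, k, n_worlds=1024, seed=0):
    """Estimate the k vertices most reliably reachable from s in an
    uncertain graph given as (u, v, p) edges with existence probability p."""
    rng = np.random.default_rng(seed)
    adj = [[] for _ in range(n)]
    for u, v, p in edges:
        mask = rng.random(n_worlds) < p        # presence bit per sampled world
        adj[u].append((v, mask))
        adj[v].append((u, mask))
    reach = np.zeros((n, n_worlds), dtype=bool)
    reach[s][:] = True
    queue = deque([s])
    while queue:                               # relax until all masks stabilize
        u = queue.popleft()
        for v, mask in adj[u]:
            new = reach[u] & mask & ~reach[v]  # worlds newly reaching v via u
            if new.any():
                reach[v] |= new
                queue.append(v)
    rel = sorted(((reach[v].mean(), v) for v in range(n) if v != s), reverse=True)
    return rel[:k]

edges = [(0, 1, 0.9), (1, 2, 0.5), (0, 2, 0.3), (2, 3, 0.8)]
print(topk_reliability(4, edges, s=0, k=3))    # (estimated reliability, vertex)
```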
【Paper Link】 【Pages】:669-678
【Authors】: Yuan Zuo ; Junjie Wu ; Hui Zhang ; Deqing Wang ; Hao Lin ; Fei Wang ; Ke Xu
【Abstract】: Aspect-based opinion mining finds elaborate opinions toward an underlying theme, perspective or viewpoint of a subject such as a product or an event. With the rapid growth of opinionated text on the Web, mining aspect-level opinions has become a promising means for online public opinion analysis. In particular, the boom in various types of online media provides diverse yet complementary information, bringing unprecedented opportunities for public opinion analysis across different populations. Along this line, we propose CAMEL, a novel topic model for complementary aspect-based opinion mining across asymmetric collections. CAMEL gains complementarity by modeling both common and specific aspects across different collections, and keeps all the corresponding opinions for contrastive study. To further boost CAMEL, we propose AME, an automatic labeling scheme for the maximum entropy model, to help discriminate aspect and opinion words without heavy human labeling. Extensive experiments on synthetic multi-collection data sets demonstrate the superiority of CAMEL over baseline methods in leveraging cross-collection complementarity to find higher-quality aspects, more coherent opinions, and aspect-opinion relationships. This is particularly true when the collections are seriously imbalanced. Experimental results also show that the AME model outperforms manual labeling in suggesting true opinion words. Finally, case studies on two public events further demonstrate the practical value of CAMEL for real-world public opinion analysis.
【Keywords】: aspect-oriented programming; data mining; entropy; CAMEL; aspect-level opinion mining; aspect-opinion relationships; asymmetric collections; complementary aspect-based opinion mining; cross-collection complementarity; maximum entropy model; online public opinion analysis; synthetic multicollection data sets; topic model; Analytical models; Data mining; Entropy; Labeling; Manuals; Media; Switches; Aspect-based Opinion Mining; LDA Model; Maximum Entropy Model; Topic Detection and Tracking
【Paper Link】 【Pages】:679-684
【Authors】: Kais Allab ; Lazhar Labiod ; Mohamed Nadif
【Abstract】: Cluster analysis is often carried out in combination with dimension reduction. Semi-Non-negative Matrix Factorization (Semi-NMF), which learns a low-dimensional representation of a data set, lends itself to a clustering interpretation. In this work we propose a novel approach to finding an optimal subspace of multi-dimensional variables for identifying a partition of the set of objects. The use of a low-dimensional representation can help provide simpler and more interpretable solutions. We show that by doing so, our model learns low-dimensional representations that are better suited for clustering, outperforming not only Semi-NMF but also other NMF variants.
【Keywords】: data reduction; data structures; learning (artificial intelligence); matrix decomposition; pattern clustering; principal component analysis; PCA; SemiNMF; cluster analysis; data representation learning; dimension reduction; principal component analysis; seminonnegative matrix factorization; Clustering algorithms; Laplace equations; Linear programming; Manifolds; Matrix decomposition; Optimization; Principal component analysis; Clustering; Dimension Reduction; Semi-NMF
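【Example】: For reference, a minimal sketch of the Semi-NMF baseline the abstract builds on, using the multiplicative updates of Ding, Li and Jordan (2010): X ≈ F G^T with G >= 0 and F unconstrained, where the argmax over each row of G gives a cluster assignment. The paper's subspace-learning extension is not shown.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, seed=0):
    """Semi-NMF: X (features x samples) ~= F @ G.T with G >= 0."""
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[1], k)) + 0.1
    pos = lambda M: (np.abs(M) + M) / 2
    neg = lambda M: (np.abs(M) - M) / 2
    for _ in range(n_iter):
        F = X @ G @ np.linalg.pinv(G.T @ G)      # closed-form update for F
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /
                     (neg(XtF) + G @ pos(FtF) + 1e-12))  # keeps G nonnegative
    return F, G

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.3, (20, 5)) for c in (-2, 2)])  # two blobs
F, G = semi_nmf(data.T, k=2)                     # columns of X are data points
print(G.argmax(axis=1))                          # cluster read off largest entry
```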
【Paper Link】 【Pages】:685-690
【Authors】: Aris Anagnostopoulos ; Mara Sorella
【Abstract】: A fundamental open question, studied by sociologists since the 70s and recently taken up by the computer-science community, is understanding the roles that influence and selection play in shaping the evolution of socio-cultural systems. Quantifying these forces in real settings is still a big challenge, especially at large scale, where the entire social network among the users may not be known and only longitudinal data in terms of the masses of cultural groups (e.g., political affiliation, product adoption, market share, cultural tastes) may be available. We propose an influence and selection model encompassing an explicit characterization of the feature space of the different cultural groups, in the form of a natural equation-based macroscopic model, following the approach of Kempe et al. [EC 2013]. Our main goal is to estimate edge influence strengths and selection parameters from an observed time series. For an experimental evaluation on real data, we perform learning on real datasets from Last.FM and Wikipedia.
【Keywords】: cultural aspects; feature selection; humanities; social sciences computing; time series; Last.FM; Wikipedia; computer-science community; cultural groups; edge influence strengths; feature selection model; feature space; longitudinal data; natural equation-based macroscopic model; social network; sociocultural systems; Computational modeling; Cultural differences; Data models; Mathematical model; Microscopy; Sociology; Statistics; cultural dynamics
【Paper Link】 【Pages】:691-696
【Authors】: Christos Anagnostopoulos ; Peter Triantafillou
【Abstract】: Distance-based nearest-neighbour (dNN) queries and aggregations over their answer sets are important for exploratory data analytics. We focus on the Set Cardinality Prediction (SCP) problem for the answer sets of dNN queries. We contribute a novel, query-driven perspective on this problem, whereby answers to previous dNN queries are used to learn the answers to incoming dNN queries. The proposed machine learning (ML) model learns the dynamically changing query-pattern space and can thus focus only on the portion of the data being queried. The model enjoys several comparative advantages in prediction error and space requirements. In addition, it is applicable in environments with sensitive data and/or environments where data accesses are too costly to execute, settings in which the data-centric state of the art is inapplicable and/or too costly. We conduct a comprehensive performance evaluation of our model, assessing its comparative advantages versus acclaimed methods (different self-tuning histograms, sampling, multidimensional histograms, and the power method).
【Keywords】: data analysis; learning (artificial intelligence); query processing; ML model; SCP problem; dNN query; data access; distance-based nearest neighbour query; exploratory data analytics; learning set cardinality; machine learning model; query pattern space; set cardinality prediction problem; Adaptation models; Estimation; Histograms; Prototypes; Quantization (signal); Solid modeling; Yttrium; Query-driven set cardinality prediction; distance nearest neighbors analytics; hetero-associative competitive learning; local regression vector quantization
【Paper Link】 【Pages】:697-702
【Authors】: Forough Arabshahi ; Furong Huang ; Animashree Anandkumar ; Carter T. Butts ; Sean M. Fitzhugh
【Abstract】: Scalable probabilistic modeling and prediction in high dimensional multivariate time-series, such as dynamic social networks with co-evolving nodes and edges, is a challenging problem, particularly for systems with hidden sources of dependence and/or homogeneity. Here, we address this problem through the discovery of hierarchical latent groups. We introduce a family of Conditional Latent Tree Models (CLTM), in which tree-structured latent variables incorporate the unknown groups. The latent tree itself is conditioned on observed covariates such as seasonality, historical activity, and node attributes. We propose a statistically efficient framework for learning both the hierarchical tree structure and the parameters of the CLTM. We demonstrate competitive performance on two real world datasets, one from the students' attempts at answering questions in a psychology MOOC and the other from Twitter users participating in an emergency management discussion and interacting with one another. In addition, our modeling framework provides valuable and interpretable information about the hidden group structures and their effect on the evolution of the time series.
【Keywords】: computer aided instruction; educational courses; emergency management; probability; psychology; question answering (information retrieval); social networking (online); time series; tree data structures; CLTM parameters; Twitter users; conditional latent tree model; emergency management discussion; hidden group structures; hierarchical latent group discovery; hierarchical tree structure; high dimensional multivariate time-series; observed covariates; psychology MOOC; question answering; real world datasets; scalable probabilistic modeling; scalable probabilistic prediction; student attempts; tree-structured latent variables; Context modeling; Mathematical model; Maximum likelihood estimation; Predictive models; Random variables; Time series analysis; Twitter; Multivariate time series; conditional latent tree models; dynamic networks; hierarchical latent groups
【Paper Link】 【Pages】:703-708
【Authors】: Abhishek Bafna ; Jenna Wiens
【Abstract】: When the amount of training data is limited, the successful application of machine learning techniques typically hinges on the ability to identify useful features or abstractions. Expert knowledge often plays a crucial role in this feature engineering process. However, manual creation of such abstractions can be labor intensive and expensive. In this paper, we propose a feature learning framework that takes advantage of the vast amount of expert knowledge available in unstructured form on the Web. We explore the use of unsupervised learning techniques and non-Euclidean distance measures to automatically incorporate such expert knowledge when building feature representations. We demonstrate the utility of our proposed approach on the task of learning useful abstractions from a list of over two thousand patient medications. Applied to three clinically relevant patient risk stratification tasks, the classifiers built using the learned abstractions outperform several baselines including one based on a manually curated feature space.
【Keywords】: data mining; learning (artificial intelligence); abstraction identification; automated feature learning; feature representation; machine learning techniques; non-Euclidean distance measures; patient medication; unstructured data mining; unsupervised learning techniques; Buildings; Correlation; Data models; Hospitals; Kernel; Knowledge engineering; Taxonomy
【Paper Link】 【Pages】:709-714
【Authors】: Bokai Cao ; Xiangnan Kong ; Jingyuan Zhang ; Philip S. Yu ; Ann B. Ragin
【Abstract】: Mining discriminative subgraph patterns from graph data has attracted great interest in recent years. It has a wide variety of applications in disease diagnosis, neuroimaging, etc. Most research on subgraph mining focuses on the graph representation alone. However, in many real-world applications, the side information is available along with the graph data. For example, for neurological disorder identification, in addition to the brain networks derived from neuroimaging data, hundreds of clinical, immunologic, serologic and cognitive measures may also be documented for each subject. These measures compose multiple side views encoding a tremendous amount of supplemental information for diagnostic purposes, yet are often ignored. In this paper, we study the problem of discriminative subgraph selection using multiple side views and propose a novel solution to find an optimal set of subgraph features for graph classification by exploring a plurality of side views. We derive a feature evaluation criterion, named gSide, to estimate the usefulness of subgraph patterns based upon side views. Then we develop a branch-and-bound algorithm, called gMSV, to efficiently search for optimal subgraph features by integrating the subgraph mining process and the procedure of discriminative feature selection. Empirical studies on graph classification tasks for neurological disorders using brain networks demonstrate that subgraph patterns selected by the multi-side-view guided subgraph selection approach can effectively boost graph classification performances and are relevant to disease diagnosis.
【Keywords】: brain; data mining; diseases; graph theory; medical disorders; neurophysiology; patient diagnosis; pattern classification; brain network mining; branch-and-bound algorithm; clinical measures; cognitive measures; discriminative feature selection; discriminative subgraph pattern mining; discriminative subgraph selection; disease diagnosis; feature evaluation criterion; gMSV; gSide; graph classification; graph data; graph representation; immunologic measures; neuroimaging data; neurological disorder identification; neurological disorders; serologic measures; subgraph features; subgraph pattern usefulness estimation; supplemental information; Computer science; Data mining; Diffusion tensor imaging; Diseases; Kernel; Neuroimaging; Testing; brain network; graph mining; side information; subgraph pattern
【Paper Link】 【Pages】:715-720
【Authors】: Chen Chen ; Jingrui He ; Nadya Bliss ; Hanghang Tong
【Abstract】: Networks appear naturally in many high-impact real-world applications. In an increasingly connected and coupled world, the networks arising from many application domains are often collected from different channels, forming so-called multi-layered networks, such as cyber-physical systems, organization-level collaboration platforms, critical infrastructure networks and many more. Compared with single-layered networks, multi-layered networks are more vulnerable, as even a small disturbance on one supporting layer/network might cause a ripple effect across all the dependent layers, leading to a catastrophic/cascading failure of the entire system. The state of the art has largely focused on modeling and manipulating the cascading effect in two-layered interdependent network systems for specific types of network connectivity measures. This paper generalizes the challenge to multiple dimensions. First, we propose a new data model for multi-layered networks, MULAN, which admits an arbitrary number of layers with a much more flexible dependency structure among different layers, beyond the current pair-wise dependency. Second, we unify a wide range of classic network connectivity measures into a family called SUBLINE. Third, we show that any connectivity measure in the SUBLINE family enjoys the diminishing-returns property, which in turn lends itself to a family of provably near-optimal control algorithms with linear complexity. Finally, we conduct extensive empirical evaluations on real network data to validate the effectiveness of the proposed algorithms.
【Keywords】: critical infrastructures; graph theory; network theory (graphs); optimal control; MULAN; SUBLINE family; cascading failure; catastrophic failure; critical infrastructure networks; cyber-physical systems; linear complexity; multilayered networks; near-optimal control algorithms; network connectivity measure; network connectivity measures; organization-level collaboration platforms; pair-wise dependency; single-layered networks; two-layered interdependent network systems; Atmospheric modeling; Collaboration; Current measurement; Data models; Optimal control; Physical layer; Silicon; connectivity control; multi-layered networks
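【Example】: The diminishing-returns property mentioned above is exactly what makes plain greedy selection provably near-optimal (the classic (1 - 1/e) guarantee for monotone submodular objectives). A minimal generic sketch on a toy edge-coverage objective, standing in here for a SUBLINE connectivity measure:

```python
def greedy_select(candidates, gain, budget):
    """Greedy maximization for a monotone diminishing-returns set function:
    repeatedly add the element with the largest marginal gain."""
    chosen = []
    for _ in range(budget):
        best = max((v for v in candidates if v not in chosen),
                   key=lambda v: gain(chosen, v))
        chosen.append(best)
    return chosen

# toy objective: number of edges touched by the chosen vertex set
edges = {(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)}
covered = lambda S: {e for e in edges if e[0] in S or e[1] in S}
gain = lambda S, v: len(covered(S + [v])) - len(covered(S))
print(greedy_select(range(4), gain, budget=2))   # picks the two hub vertices
```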
【Paper Link】 【Pages】:721-726
【Authors】: Edward Choi ; Nan Du ; Robert Chen ; Le Song ; Jimeng Sun
【Abstract】: Modeling disease relationships and temporal progression are two key problems in health analytics that have not been studied together due to data and technical challenges. Thanks to the increasing adoption of Electronic Health Records (EHR), rich patient information is being collected over time. Using EHR data as input, we propose a multivariate context-sensitive Hawkes process, or cHawkes, which simultaneously infers the disease relationship network and models the temporal progression of patients. Besides learning the disease network and the temporal progression model, cHawkes can predict when a specific patient might develop other related diseases in the future given the patient's history, which in turn has many potential applications in predictive health analytics, public health policy development and customized patient care. Extensive experiments on real EHR data demonstrate that cHawkes not only uncovers meaningful disease relations and accurately models the temporal progression of patients, but also has significantly better predictive performance than several baseline models.
【Keywords】: diseases; electronic health records; health care; inference mechanisms; EHR data; cHawkes; customized patient care; disease network construction; disease network learning; disease relations; disease relationship network inference; disease relationships modeling; electronic health records; health analytics; multivariate context-sensitive Hawkes process; patient history; patient information; predictive health analytics; public health policy development; temporal patient progression; temporal progression modeling; Computational modeling; Context modeling; Data models; Diseases; Hidden Markov models; Hypertension; Predictive models; Disease Prediction; Disease Relation; EHR; Hawkes Process; Health Analytics; Point Process
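【Example】: A minimal sketch of the multivariate Hawkes intensity underlying cHawkes, with an exponential kernel and a hypothetical 2-by-2 excitation matrix; in the paper's setting, dimensions would correspond to diseases and the excitation matrix encodes the disease-relation network (the context-sensitive extension is not shown).

```python
import numpy as np

def hawkes_intensity(t, history, mu, alpha, beta=1.0):
    """lambda_d(t) = mu_d + sum over past events (d_i, t_i < t) of
    alpha[d_i, d] * beta * exp(-beta * (t - t_i))."""
    lam = mu.copy()
    for d_i, t_i in history:
        if t_i < t:
            lam += alpha[d_i] * beta * np.exp(-beta * (t - t_i))
    return lam

mu = np.array([0.10, 0.05])            # base rates of two event types
alpha = np.array([[0.2, 0.5],          # type 0 strongly excites type 1
                  [0.0, 0.3]])
history = [(0, 1.0), (1, 2.5)]         # (event type, timestamp)
print(hawkes_intensity(3.0, history, mu, alpha).round(3))
```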
【Paper Link】 【Pages】:727-732
【Authors】: Peter Christen ; Dinusha Vatsalan ; Qing Wang
【Abstract】: Entity resolution (ER) is the task of deciding which records in one or more databases refer to the same real-world entities. A crucial step in ER is the accurate classification of pairs of records into matches and non-matches. In most practical ER applications, obtaining high-quality training data is costly and time-consuming. Various techniques have been proposed for ER to interactively generate training data and learn an accurate classifier. We propose an approach to training data selection for ER that exploits the cluster structure of the weight vectors (similarities) calculated from compared record pairs. Our approach adaptively selects an optimal number of informative training examples for manual labeling based on a user-defined sampling error margin, and recursively splits the set of weight vectors to find subsets pure enough for training. We consider two aspects of ER that are highly significant in practice: a limited budget on the number of manual labels that can be obtained, and a noisy oracle whose manual labels might be incorrect. Experiments on four real public data sets show that our approach can significantly reduce the manual labeling effort for training an ER classifier while achieving matching quality comparable to fully supervised classifiers.
【Keywords】: interactive systems; pattern classification; pattern matching; ER applications; ER classifier; adaptive training data selection; cluster structure; databases; entity resolution; fully supervised classifiers; interactive training data selection; manual labeling; manual labels; matching quality; noisy oracle; public data sets; user defined sampling error margin; weight vectors; Clustering algorithms; Erbium; Labeling; Manuals; Silicon; Training; Training data; Data matching; active learning; hierarchical clustering; interactive labeling; noisy oracle; record linkage
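【Example】: A heavily simplified sketch of the selection idea, under assumptions not in the abstract (scalar similarities instead of full weight vectors, a fixed probe count per subset): recursively split the sorted similarity values and spend a few oracle labels per subset, stopping where the sampled labels look pure.

```python
import numpy as np

def select_training(sims, oracle, budget, purity=0.9, probe=3, seed=0):
    """Budgeted, purity-driven selection of record pairs for manual labeling."""
    rng = np.random.default_rng(seed)
    labeled = []

    def recurse(idx):
        nonlocal budget
        if budget <= 0 or len(idx) == 0:
            return
        k = min(probe, len(idx), budget)
        picks = rng.choice(idx, size=k, replace=False)
        batch = [(i, oracle(i)) for i in picks]       # spend oracle labels
        budget -= k
        labeled.extend(batch)
        frac = np.mean([lab for _, lab in batch])
        if min(frac, 1 - frac) <= 1 - purity or len(idx) <= probe:
            return                                    # pure enough: stop splitting
        mid = len(idx) // 2                           # split at similarity median
        recurse(idx[:mid])
        recurse(idx[mid:])

    recurse(np.argsort(sims))
    return labeled

sims = np.random.default_rng(1).random(1000)          # pairwise similarities
noisy = lambda i: int(sims[i] > 0.5) ^ (np.random.rand() < 0.05)  # noisy oracle
print(len(select_training(sims, noisy, budget=60)))
```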
【Paper Link】 【Pages】:733-738
【Authors】: Rocco De Rosa ; Francesco Orabona ; Nicolò Cesa-Bianchi
【Abstract】: Stream mining poses unique challenges to machine learning: predictive models are required to be scalable, incrementally trainable, bounded in size, and nonparametric in order to achieve high accuracy even in complex and dynamic environments. Moreover, the learning system must be parameterless - traditional tuning methods are problematic in streaming settings - and must avoid requiring prior knowledge of the number of distinct class labels occurring in the stream. In this paper, we introduce a new algorithmic approach for nonparametric learning in data streams. Our approach addresses all of the above challenges by learning a model that covers the input space using simple local classifiers. The distribution of these classifiers dynamically adapts to the local (unknown) complexity of the classification problem, achieving a good balance between model complexity and predictive accuracy. By means of an extensive empirical evaluation against standard nonparametric baselines, we show state-of-the-art results in terms of accuracy versus model size. Our empirical analysis is complemented by a theoretical performance guarantee that does not rely on any stochastic assumption about the source generating the stream.
【Keywords】: data mining; learning (artificial intelligence); nonparametric statistics; pattern classification; data stream mining; input space; local classifiers; machine learning; model complexity; nonparametric classification problem; nonparametric learning; predictive accuracy; predictive models; stochastic assumption; Adaptation models; Algorithm design and analysis; Data mining; Learning systems; Measurement; Prediction algorithms; Predictive models; Constant Budget Model Size; Data Stream; High-Speed Data; Nonparametric Classification
【Paper Link】 【Pages】:739-744
【Authors】: Ye Ding ; Yanhua Li ; Ke Deng ; Haoyu Tan ; Mingxuan Yuan ; Lionel M. Ni
【Abstract】: The impact of inclement weather on urban traffic has been widely observed and studied for many years, with the focus primarily on individual road segments analyzed using data from roadside monitors. However, two fundamental questions remain open: (i) how to identify a regional weather-traffic sensitivity index throughout a city that indicates the degree to which regional traffic is impacted by weather changes, and (ii) among complex regional features, such as road structure and population density, how to dissect the most influential ones, those that make urban region traffic most vulnerable to weather changes. Answering these questions is highly important for urban planners seeking to understand the functional characteristics of various urban regions, improve traffic prediction, and learn the key factors in urban planning. However, these questions are nontrivial to answer, because urban traffic changes dynamically over time and is affected by many other factors, which may dominate the overall impact. In this work, we make the first study of these questions by developing a weather-traffic index (WTI) system. The system includes two main components: WTI establishment and key factor analysis. Using the proposed system, we conducted a comprehensive empirical study in Shanghai, and the extracted WTI has been validated to be remarkably consistent with real-world observations. Further regional key factor analysis yields interesting results; for example, house age has a significant impact on WTI, which sheds light on future urban planning and reconstruction.
【Keywords】: data analysis; road traffic; sensitivity analysis; town and country planning; traffic engineering computing; weather forecasting; Shanghai; WTI system; city region traffic; complex regional features; data analysis; population density; regional key factor analysis; regional weather-traffic sensitivity dissection; regional weather-traffic sensitivity index; road segments; road structure; traffic prediction; urban planners; urban planning; urban reconstruction; urban region traffic; Cities and towns; Correlation; Indexes; Meteorology; Public transportation; Roads; Sensitivity; Trajectories; city dynamics; urban computing; weather-traffic index
【Paper Link】 【Pages】:745-750
【Authors】: Elias Egho ; Dominique Gay ; Marc Boullé ; Nicolas Voisine ; Fabrice Clérot
【Abstract】: Sequential data is generated in many domains of science and technology. Although many studies have been carried out on sequence classification in the past decade, the problem remains a challenge, particularly for pattern-based methods. We identify two important issues with pattern-based sequence classification that motivate the present work: the curse of parameter tuning and the instability of common interestingness measures. To alleviate these issues, we suggest a new approach and framework for mining sequential rule patterns for classification purposes. We introduce a space of rule pattern models and a prior distribution defined on this model space. From this, we derive a Bayesian criterion for evaluating the interest of sequential patterns. We also develop a parameter-free algorithm to efficiently mine sequential patterns from the model space. Extensive experiments show that (i) the new criterion identifies interesting and robust patterns, and (ii) directly using the mined rules as new features in a classification process demonstrates higher inductive performance than state-of-the-art sequential-pattern-based classifiers.
【Keywords】: Bayes methods; data mining; pattern classification; statistical distributions; Bayesian criterion; a prior distribution; common interestingness measures instability; parameter tuning curse; parameter-free approach; pattern-based methods; pattern-based sequence classification; robust sequential classification rule mining; rule pattern models; sequential data; Bayes methods; Data mining; Data models; Feature extraction; Frequency measurement; Robustness; Tuning; Data Mining; Sequential Classification Rules
【Paper Link】 【Pages】:751-756
【Authors】: Tao-Yang Fu ; Zhen Lei ; Wang-Chien Lee
【Abstract】: There is a consensus that U.S. patent examiners, who are responsible for identifying prior art relevant to adjudicating the patentability of patent applications, often lack the time, resources and/or experience necessary to conduct an adequate prior art search. This study aims to build an automatic and effective system for patent citation recommendation for patent examiners. In addition to content and bibliographic information, our proposed system considers another important piece of information known to patent examiners, namely applicant citations. We integrate applicant citations and bibliographic information of patents into a heterogeneous citation bibliographic network. Based on this network, we explore meta-path-based relationships between a query patent application and a candidate prior patent and classify them into two categories: (1) bibliographic meta-paths and (2) applicant bibliographic meta-paths. We propose a framework based on a two-phase ranking approach: the first phase selects a candidate subset from the whole body of U.S. patent data, and the second phase uses supervised learning models to rank prior patents in the candidate subset. The results show that both bibliographic information and applicant citation information are very useful for examiner citation recommendation, and that our approach significantly outperforms a search engine.
【Keywords】: citation analysis; learning (artificial intelligence); patents; recommender systems; US patent examiners; United States; applicant bibliographic metapath; applicant citations; bibliographic information; bibliographic meta-path; content information; patent applications; patent citation recommendation; search engine; supervised learning models; two-phase ranking approach; Art; Computer science; Data mining; Patents; Search engines; Search problems; Supervised learning; Patent mining; Recommendation
【Paper Link】 【Pages】:757-762
【Authors】: Ming Gao ; Ee-Peng Lim ; David Lo ; Feida Zhu ; Philips Kokoh Prasetyo ; Aoying Zhou
【Abstract】: The popularity of social media has led many users to create accounts with different online social networks. Identifying the multiple accounts belonging to the same user is of critical importance to user profiling, community detection, user behavior understanding and product recommendation. Nevertheless, linking users across heterogeneous social networks is challenging due to large network sizes, heterogeneous user attributes and behaviors in different networks, and noise in user-generated data. In this paper, we propose an unsupervised method, Collective Network Linkage (CNL), to link users across heterogeneous social networks. CNL incorporates heterogeneous attributes and social features unique to social network users, handles missing data, and performs in a collective manner. CNL is highly accurate and efficient even without training data. We evaluate CNL on linking users across different social networks. Our experimental results on a Twitter network and a Foursquare network demonstrate that CNL performs very well and that its accuracy is superior to the supervised Mobius approach.
【Keywords】: social networking (online); CNL; Foursquare network; Twitter network; collective network linkage; community detection; heterogeneous social networks; heterogeneous user attributes; heterogeneous user behaviors; missing data handling; network sizes; online social networks; product recommendation; social features; social media; unsupervised method; user behavior understanding; user profiling; Couplings; Joining processes; Media; Numerical models; Prediction algorithms; Probability distribution; Social network services
【Paper Link】 【Pages】:763-768
【Authors】: Naoki Hamada ; Katsumi Homma ; Hiroyuki Higuchi ; Hideyuki Kikuchi
【Abstract】: The recent development of multi-agent simulations brings about a need for population synthesis: the task of reconstructing the entire population from a sampling survey of limited size (1% or so), supplying the initial conditions from which simulations begin. This paper presents a new kernel density estimator for this task. Our method is an analogue of the classical Breiman-Meisel-Purcell estimator, but employs novel techniques that harness the huge degrees of freedom required to model high-dimensional, nonlinearly correlated datasets: the crossover kernel, the k-nearest-neighbor restriction of the kernel construction set, and the bagging of kernels. The performance as a statistical estimator is examined on real and synthetic datasets. We provide an "optimization-free" parameter selection rule for our method, a theory of how our method works, and a computational cost analysis. To demonstrate its usefulness as a population synthesizer, our method is applied to a household synthesis task for an urban micro-simulator.
【Keywords】: multi-agent systems; statistical analysis; town and country planning; Breiman-Meisel-Purcell estimator; computational cost analysis; k-nearest neighbor crossover kernel; k-nearest neighbor restriction; multiagent simulation; optimization-free parameter selection rule; population synthesis; statistical estimator; urban microsimulator; Bagging; Bandwidth; Computational modeling; Estimation; Kernel; Sociology; Statistics; crossover kernel; k-nearest neighbor; kernel density estimation; multi-agent simulation; population synthesis
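【Example】: A minimal sketch of the classical variable-bandwidth idea the paper starts from: each survey record gets a Gaussian kernel whose bandwidth scales with its k-th nearest-neighbor distance, and synthetic records are drawn from the resulting mixture. (The paper's crossover kernel and bagging refinements for high-dimensional correlated data are not reproduced here.)

```python
import numpy as np

def synthesize(data, n_new, k=10, scale=0.5, seed=0):
    """Draw n_new synthetic records from a kNN-bandwidth Gaussian KDE."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # k-th nearest-neighbor distance per point (brute force; fine for small n)
    dists = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))
    h = scale * np.sort(dists, axis=1)[:, k]        # per-point bandwidth
    idx = rng.integers(0, n, size=n_new)            # resample survey records
    return data[idx] + h[idx, None] * rng.standard_normal((n_new, d))

survey = np.random.default_rng(1).normal(size=(200, 3))   # the "1% sample"
population = synthesize(survey, n_new=20_000)             # synthetic population
print(population.shape, population.mean(axis=0).round(2))
```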
【Paper Link】 【Pages】:769-774
【Authors】: Kun He ; Yiwei Sun ; David Bindel ; John E. Hopcroft ; Yixuan Li
【Abstract】: Based on the definition of a local spectral subspace, we propose a novel approach called LOSP for local overlapping community detection. Using a few steps of the power method, LOSP finds an approximate invariant subspace that depicts the embedding of the local neighborhood structure around the seeds of interest. LOSP then identifies the local community expanded from the given seeds by seeking a sparse indicator vector in the subspace whose support contains the seeds. We provide a systematic investigation of LOSP and thoroughly evaluate it on large real-world networks across multiple domains. With prior knowledge of very few seed members, LOSP can detect the remaining members of a target community with high accuracy. Experiments demonstrate that LOSP outperforms Heat Kernel and PageRank diffusions. Using LOSP as a subroutine, we further address the problem of multiple membership identification, which aims to find all the communities a single vertex belongs to. High F1 scores are achieved in detecting multiple local communities with respect to an arbitrary single seed on various large real-world networks.
【Keywords】: Web sites; graph theory; network theory (graphs); F1 scores; LOSP approach; approximate invariant subspace; local neighborhood structure; local overlapping community detection; local spectral subspace; multiple membership identification; real world networks; sparse indicator vector; target community; Algorithms; Approximation methods; Electronic mail; Heating; Kernel; Systematics; Clustering; Community detection; Local spectral subspace; Seed set expansion
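【Example】: A one-vector simplification of the intuition above: run a few power steps of the row-normalized walk matrix from the seed set and keep the highest-probability vertices. LOSP itself spans a local spectral subspace with several such vectors and extracts a sparse indicator from it, which this sketch does not do.

```python
import numpy as np

def local_cluster(A, seeds, steps=3, size=4):
    """Short-random-walk heuristic for seed-based local community detection."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # walk matrix
    x = np.zeros(A.shape[0])
    x[seeds] = 1.0 / len(seeds)
    for _ in range(steps):
        x = x @ P                      # one step of the power method
    return np.argsort(-x)[:size]

# two 4-cliques joined by a single bridge edge
A = np.zeros((8, 8))
for block in (range(4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0
print(local_cluster(A, seeds=[0]))     # expect vertices 0-3 (the seed's clique)
```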
【Paper Link】 【Pages】:775-780
【Authors】: Jin Huang ; Rui Zhang ; Jeffrey Xu Yu
【Abstract】: A hypergraph allows a hyperedge to connect more than two vertices, capturing high-order relationships; exploiting this, many hypergraph learning algorithms have proven highly effective in various applications. When learning large hypergraphs, a common approach is to convert them to graphs so as to employ distributed graph frameworks, yet this results in major efficiency drawbacks, including an inflated problem size, excessive replicas, and unbalanced workloads. To avoid these drawbacks, we take a different approach and propose HyperX, a thin layer built upon Spark. To preserve the problem size, HyperX operates directly on a distributed hypergraph. To reduce replicas, HyperX replicates the vertices but not the hyperedges. To balance the workloads, we investigate the hypergraph partitioning problem, aiming to minimize the space and communication cost subject to two separate constraints on the hyperedge and vertex workloads. With experiments on both real and synthetic datasets, we verify that HyperX significantly improves the efficiency of the learning algorithms compared with the graph conversion approach.
【Keywords】: graph theory; learning (artificial intelligence); HyperX; Spark; distributed graph frameworks; graph conversion; hyperedge; scalable hypergraph learning; vertices; Algorithm design and analysis; Approximation algorithms; Machine learning algorithms; Optimization; Partitioning algorithms; Proteins; Sparks; Distributed Processing; Hypergraph Learning; Scalability
【Paper Link】 【Pages】:781-786
【Authors】: Meng Jiang ; Alex Beutel ; Peng Cui ; Bryan Hooi ; Shiqiang Yang ; Christos Faloutsos
【Abstract】: Which seems more suspicious: 5,000 tweets from 200 users on 5 IP addresses, or 10,000 tweets from 500 users on 500 IP addresses but all with the same trending topic and all in 10 minutes? The literature has many methods that try to find dense blocks in matrices and, recently, tensors, but no method gives a principled way to score the suspiciousness of dense blocks with different numbers of modes and rank them to draw human attention accordingly. Dense blocks are worth inspecting, typically indicating fraud, emerging trends, or some other noteworthy deviation from the usual. Our main contribution is to show how to unify these methods and give a principled answer to questions like the above. Specifically, (a) we give a list of axioms that any metric of suspiciousness should satisfy, (b) we propose an intuitive, principled metric that satisfies the axioms and is fast to compute, and (c) we propose CROSSSPOT, an algorithm to spot dense regions and sort them in order of importance ("suspiciousness"). Finally, we apply CROSSSPOT to real data, where it improves the F1 score over previous techniques by 68% and finds retweet-boosting in a real social dataset spanning 0.3 billion posts.
【Keywords】: data handling; CROSSSPOT; dense blocks; general suspiciousness metric; human attention; multimodal data; retweet-boosting; social dataset; trending topic; Data mining; Facebook; IP networks; Inspection; Measurement; Tensile stress; Twitter; dense block; multimodal data; suspicious behavior
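【Example】: One common form of a Poisson-based block score, sketched here only to make the abstract's ranking idea concrete (the paper's exact metric and constants may differ): the negative log-likelihood of observing mass c in a sub-block of volume v when total mass C is spread over total volume V.

```python
import numpy as np

def suspiciousness(c, v, C, V):
    """Poisson surprise of a block: v*p - c + c*log(c / (v*p)), with p = C/V."""
    p = C / V                          # expected mass per cell
    return v * p - c + c * np.log(c / (v * p))

# a block 100x denser than the background scores far higher than a 1.5x one
print(round(suspiciousness(c=10_000, v=1_000_000, C=1e8, V=1e12)))  # very high
print(round(suspiciousness(c=150, v=1_000_000, C=1e8, V=1e12)))     # near zero
```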
【Paper Link】 【Pages】:787-792
【Authors】: Anuj Karpatne ; Vipin Kumar
【Abstract】: We consider binary classification problems where each of the two classes shows a multi-modal distribution in the feature space, and the classification has to be performed over different test scenarios, where every test scenario only involves a subset of the positive and negative modes in the data. In such conditions, there may exist certain pairs of positive and negative modes, termed as pairs of confusing modes, which may not appear together in the same test scenario but can be highly overlapping in the feature space. Determining the class labels at such pairs of confusing modes is challenging as the labeling decisions depend not only on the feature values but also on the context of the test scenario. To overcome this challenge, we present the Adaptive Heterogeneous Ensemble Learning (AHEL) algorithm, which constructs an ensemble of classifiers in accordance with the multi-modality within the classes, and further assigns adaptive weights to classifiers based on their relevance in the context of a test scenario. We demonstrate the effectiveness of our approach in comparison with baseline approaches on a synthetic dataset and a real-world application involving global water monitoring.
【Keywords】: data analysis; learning (artificial intelligence); AHEL algorithm; adaptive heterogeneous ensemble learning; adaptive weights; baseline approaches; binary classification problems; confusing modes; feature space; global water monitoring; multimodal distribution; negative modes; positive modes; real-world application; synthetic dataset; test instances; Algorithm design and analysis; Context; Earth; Labeling; Monitoring; Training; Training data; adaptive learning; binary classification; data heterogeneity; multi-modality
【Paper Link】 【Pages】:793-798
【Authors】: Saurabh Kataria ; Arvind Agarwal
【Abstract】: In this paper we present a topic model based approach for classifying micro-blog posts into a given set of topics of interest. The short nature of micro-blog posts makes them challenging for directly learning a classification model. To overcome this limitation, we use the content of the links embedded in these posts to improve topic learning. The hypothesis is that since the link content is far richer than the content of the post itself, using the link content along with the content of the post will help learning. However, how this link content can be used to construct features for classification remains a challenging issue. Furthermore, previous methods utilize user-based information in an ad-hoc manner that only works for certain types of classification, such as characterizing the content of micro-blogs. In this paper, we propose a supervised topic model, User-Labeled-LDA, and its nonparametric variant, which avoid the ad-hoc feature construction task and model the topics in a discriminative way. Our experiments on a Twitter dataset show that modeling user interests and link information helps in learning quality topics for sparse tweets and helps significantly in the classification task. Our experiments further show that modeling this information in a principled way through topic models helps more than simply adding it through features.
【Keywords】: pattern classification; social networking (online); Twitter; latent Dirichlet allocation; microblog classification; supervised topic model; topic classification; tweets; user-labeled-LDA; Blogs; Data mining; Data models; Electronic publishing; Encyclopedias; Internet
【Paper Link】 【Pages】:799-804
【Authors】: Ankush Khandelwal ; Varun Mithal ; Vipin Kumar
【Abstract】: Classification of instances into different categories in various real-world applications suffers from inaccuracies due to the lack of representative training data, limitations of classification models, and noise and outliers in the input data. In this paper we propose a new post-classification label refinement method for scenarios where data instances have an inherent ordering among them that can be leveraged to correct inconsistencies in class labels. We show that by using the ordering constraint, more robust algorithms can be developed than traditional methods. Moreover, in most applications where this ordering among instances exists, it is not directly observed. The proposed approach simultaneously estimates the latent ordering among instances and corrects the class labels. We demonstrate the utility of the approach for the application of monitoring the dynamics of lakes and reservoirs. The proposed approach has been evaluated on synthetic datasets with different noise structures and noise levels.
【Keywords】: data handling; lakes; pattern classification; reservoirs; class labels; classification models; data instances; implicit ordering constraint; lakes; noise levels; noise structures; post classification label refinement; representative training data; reservoirs; robust algorithms; synthetic datasets; Colored noise; Earth; Image color analysis; Lakes; Monitoring; Noise measurement; Training data; post classification label refinement; preference based ordering; rank aggregation
【Paper Link】 【Pages】:805-810
【Authors】: Keigo Kimura ; Mineichi Kudo
【Abstract】: Nonnegative Tensor Factorization (NTF) has become a popular tool for extracting informative patterns from tensor data. However, NTF has a high computational cost in both space and time, mostly in the iterative calculation of the gradient. In this paper, we consider variable selection to reduce this cost, assuming sparsity of the factor matrices. In fact, the factor matrices are known to be very sparse in many applications, such as network analysis, text analysis, and image analysis. We update only a small subset of important variables in each iterative step. We show the effectiveness of the algorithm analytically and experimentally in comparison with conventional NTF algorithms. The algorithm was five times faster than the naive algorithm in the best case and required one to five hundred times less memory while keeping the same approximation accuracy.
【Keywords】: matrix decomposition; tensors; NTF algorithm; factor matrices; iterative calculation; naive algorithm; nonnegative tensor factorization; variable selection algorithm; Algorithm design and analysis; Approximation algorithms; Indexes; Input variables; Sparse matrices; Tensile stress; Yttrium; CANDECOMP/PARAFAC; Nonnegative Tensor Factorization; Tensor Factorization
【Paper Link】 【Pages】:811-816
【Authors】: Raksha Kumaraswamy ; Phillip Odom ; Kristian Kersting ; David Leake ; Sriraam Natarajan
【Abstract】: Transfer learning is typically performed between problem instances within the same domain. We consider the problem of transferring across domains. To this end, we adopt a probabilistic logic approach. First, our approach automatically identifies predicates in the target domain that are similar in their relational structure to predicates in the source domain. Second, it transfers the logic rules and learns the parameters of the transferred rules using target data. Finally, it refines the rules as necessary using theory refinement. Our experimental evidence shows that this transfer method finds models as good as or better than those found with state-of-the-art methods, with and without transfer, and in a fraction of the time.
【Keywords】: data mining; learning (artificial intelligence); probabilistic logic; automatic predicate identification; logic rules; probabilistic logic approach; problem instances; relational structure; relational type matching; source domain; target domain; transfer learning; transferred rule parameter learning; Data mining; Learning systems; Markov processes; Probabilistic logic; Probability distribution; Proteins; Uncertainty; Relational Learning; Transfer Learning
【Paper Link】 【Pages】:817-822
【Authors】: Ludmila I. Kuncheva ; Mikel Galar
【Abstract】: We aim to dispel the blind faith in theoretical criteria for optimisation of the edited nearest neighbour classifier and its version called the Voronoi classifier. Three criteria from past and recent literature are considered: two bounds using Vapnik-Chervonenkis (VC) dimension and a probabilistic criterion derived by a Bayesian approach. We demonstrate the shortcomings of these criteria for selecting the best reference set, and summarise alternative empirical criteria found in the literature.
【Keywords】: Bayes methods; computational geometry; pattern classification; Bayesian approach; Vapnik-Chervonenkis dimension; Voronoi classifier; edited nearest neighbour classifier; empirical criteria; probabilistic criterion; theoretical criteria; Bayes methods; Data mining; Electronic mail; Probabilistic logic; Prototypes; Training; Upper bound
【Paper Link】 【Pages】:823-828
【Authors】: Guang-He Lee ; Shou-De Lin
【Abstract】: This paper focuses on optimizing ranking measures in a recommendation problem. Since ranking measures are non-differentiable, previous works deal with this problem via approximations or lower/upper bounds of the loss. However, the mismatch between ranking measures and their approximations/bounds can lead to non-optimal ranking results. To solve this problem, we propose to model the gradient of a non-differentiable ranking measure with a virtual gradient, called lambda in learning to rank. In addition, noting the difference between learning to rank and recommendation models, we prove that under certain circumstances the existence of popular items can lead to unbounded norm growth of the latent factors in a matrix factorization model. We further create a novel regularization term to remedy this concern. Finally, we demonstrate that our model, LambdaMF, outperforms several state-of-the-art methods. We further show in experiments that in all cases our model achieves the global optimum of normalized discounted cumulative gain during training. Detailed implementation and supplementary material can be found at (http://www.csie.ntu.edu.tw/~b00902055/).
【Keywords】: learning (artificial intelligence); matrix decomposition; pattern classification; recommender systems; LambdaMF; lambda matrix factorization; latent factors; nondifferentiable ranking measure gradient; nonsmooth ranking functions learning; normalized discount cumulative gain; recommendation problem; regularization term; virtual gradient; Approximation methods; Collaboration; Data models; Loss measurement; Matrix decomposition; Optimization; Predictive models
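The lambda idea can be pictured as a small stochastic update on matrix factorization scores. A minimal sketch, assuming a RankNet-style weighting |ΔNDCG| · sigmoid(−margin), which is one common choice in learning to rank and not necessarily the paper's exact rule:

    import numpy as np

    def lambda_step(P_u, Q, pos, neg, delta_ndcg, lr=0.01):
        # One virtual-gradient update for user factors P_u and item factor
        # matrix Q, given a pair where item `pos` should rank above `neg`
        # and swapping them would change NDCG by |delta_ndcg|.
        s_diff = P_u @ (Q[pos] - Q[neg])            # current score margin
        lam = delta_ndcg / (1.0 + np.exp(s_diff))   # lambda: large when mis-ranked
        g_u = lam * (Q[pos] - Q[neg])
        g_pos, g_neg = lam * P_u, -lam * P_u
        P_u += lr * g_u                             # widen the margin
        Q[pos] += lr * g_pos
        Q[neg] += lr * g_neg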
【Paper Link】 【Pages】:829-834
【Authors】: Jaekoo Lee ; Gunn Kim ; Sungroh Yoon
【Abstract】: 'By how much is a large-scale graph transformed over time or by a significant event?' and 'How structurally similar are two large-scale graphs?' are the two questions this paper attempts to address. The proposed method computes graph similarity efficiently and accurately. Our approach is based on the well-known random walk with restart (RWR) algorithm, which quantifies relevance between nodes to express the structural and connection characteristics of graphs. Intergraph compression, inspired by interframe compression, merges two input graphs and reorders their nodes, improving process-data storage efficiency and processing convenience. This is a boon to the RWR algorithm on large-scale graphs. The representation of a graph transformed via intergraph compression can be used to accurately measure similarity because sub-matrix blocks are reordered to concentrate nonzero elements. When performing the RWR algorithm, which quantifies inter-node relevance, the intergraph-compressed representation is space-efficient and produces results more quickly and accurately than conventional graph transformation schemes. We demonstrate the validity of our method through experiments and apply it to usage data of a public transportation SmartCard in a large metropolitan area to suggest the usefulness of the proposed algorithm.
【Keywords】: data compression; graph theory; random processes; RICom; RWR algorithm; SmartCard; data processing improvement; graph connection characteristics; graph structural characteristics; input graph merging; interframe compression; intergraph compression; internode relevance; large metropolitan area; large-scale dynamic graph similarity measurement; node reordering; process-data storage efficiency improvement; public transportation; random walk-with-restart algorithm; space requirement; structurally-similar graphs; submatrix blocks; Algorithm design and analysis; Atmospheric measurements; Heuristic algorithms; Image coding; Particle measurements; Public transportation; Time measurement; dynamic graph; graph similarity; random walk with restart (RWR)
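For reference, the RWR primitive the method builds on can be sketched with a standard power iteration; this is a generic dense textbook version for readability, whereas the paper's contribution lies in running it efficiently over the intergraph-compressed representation:

    import numpy as np

    def rwr(A, seed, restart=0.15, tol=1e-9, max_iter=1000):
        # Random-walk-with-restart relevance scores from node `seed`,
        # given a float adjacency matrix A (dense, for simplicity).
        col_sums = A.sum(axis=0)
        col_sums[col_sums == 0] = 1.0          # guard against dangling nodes
        W = A / col_sums                       # column-stochastic transitions
        n = A.shape[0]
        r = np.full(n, 1.0 / n)
        e = np.zeros(n); e[seed] = 1.0
        for _ in range(max_iter):
            r_new = (1 - restart) * (W @ r) + restart * e
            if np.abs(r_new - r).sum() < tol:  # L1 convergence check
                break
            r = r_new
        return r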
【Paper Link】 【Pages】:835-840
【Authors】: Mu-Chu Lee ; Wei-Lin Chiang ; Chih-Jen Lin
【Abstract】: Shared-memory systems such as regular desktops now possess enough memory to store large data. However, the training process for data classification can still be slow if we do not fully utilize the power of multi-core CPUs. Many existing works propose parallel machine learning algorithms by modifying serial ones, but their convergence analysis may be complicated. Instead, we do not modify machine learning algorithms, but consider those that can take advantage of parallel matrix operations. We particularly investigate the use of parallel sparse matrix-vector multiplications in a Newton method for large-scale logistic regression. Various implementations, from easy to sophisticated, are analyzed and compared. Results indicate that, under suitable settings, excellent speedup can be achieved.
【Keywords】: learning (artificial intelligence); pattern classification; shared memory systems; Newton method; convergence analysis; large data classification; large-scale logistic regression; multicore CPU; parallel machine learning algorithms; parallel matrix operations; parallel sparse matrix-vector multiplications; shared memory systems; Algorithm design and analysis; Instruction sets; Logistics; Machine learning algorithms; Newton method; Sparse matrices; Training; Newton method; classification; parallel matrix-vector multiplication; sparse matrix
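The computational core being parallelized is the pair of sparse matrix-vector products inside the Newton step's Hessian-vector product. A minimal serial sketch using the standard L2-regularized logistic regression algebra (variable names are illustrative; X may be a scipy.sparse matrix):

    import numpy as np

    def hessian_vec(X, w, v, C=1.0):
        # Compute (I + C * X^T D X) v for logistic regression, where
        # D = diag(sigma(Xw) * (1 - sigma(Xw))). The two products X @ v
        # and X.T @ d are the sparse matrix-vector operations that
        # dominate the cost and are the ones to parallelize.
        sigma = 1.0 / (1.0 + np.exp(-(X @ w)))
        d = sigma * (1.0 - sigma) * (X @ v)
        return v + C * (X.T @ d)

In a truncated Newton solver this routine is typically called once per conjugate-gradient iteration, which is why speeding up these two products drives the overall training time.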
【Paper Link】 【Pages】:841-846
【Authors】: Bo Li ; Yevgeniy Vorobeychik ; Muqun Li ; Bradley Malin
【Abstract】: Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose identities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text) and machine learning approaches have been developed to identify and remove sensitive entities in such data. Learning-based approaches are never perfect and relying upon them to sanitize data can leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of data published and the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted to be non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably executes no more than a linear number of iterations, and ensures a low utility for a resource-limited adversary. Moreover, using several real world natural language corpora, we illustrate that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
【Keywords】: greedy algorithms; iterative methods; learning (artificial intelligence); natural language processing; pattern classification; ubiquitous computing; data sanitization; iterative classification; iterative greedy algorithm; large scale datasets; leaked sensitive entities; leaked sensitive information; learning algorithm; machine learning; natural language corpora; personal data; resource-limited adversary; ubiquitous computing; Data models; Inspection; Manuals; Natural languages; Predictive models; Publishing; Yttrium; Privacy preserving; game theory; weak structured data sanitization
【Paper Link】 【Pages】:847-852
【Authors】: Yuemeng Li ; Xintao Wu ; Aidong Lu
【Abstract】: The eigenspace of the adjacency matrix of a graph possesses important information about the network structure. However, analyzing the spectral space properties for directed graphs is challenging due to complex-valued decompositions. In this paper, we explore the adjacency eigenspaces of directed graphs. With the aid of graph perturbation theory, we focus on deriving rigorous mathematical results to explain several phenomena related to the eigenspace projection patterns that are unique to directed graphs. Furthermore, we relax the community structure assumption and generalize the theories to the perturbed Perron-Frobenius simple invariant subspace, so that the theories can adapt to a much broader range of network structural types. We also develop a graph partitioning algorithm and conduct evaluations to demonstrate its potential.
【Keywords】: directed graphs; matrix algebra; network theory (graphs); perturbation theory; Perron-Frobenius subspace; adjacency matrix eigenspace; complex valued decomposition; directed graph; graph partitioning algorithm; matrix perturbation theory; network structure; spectral space property; Clustering algorithms; Eigenvalues and eigenfunctions; Linear programming; Mathematical model; Partitioning algorithms; Symmetric matrices; Yttrium; Asymmetric adjacency matrices; Directed graphs; Graph partition; Matrix perturbation; Spectral projection
【Paper Link】 【Pages】:853-858
【Authors】: Xiang Li ; Charles X. Ling ; Huaimin Wang
【Abstract】: Large and sparse datasets with many missing values are common in the big data era. Naive Bayes is a good classification algorithm for such datasets, as its time and space complexity scales with the number of non-missing values. However, several important questions about the behavior of naive Bayes are yet to be answered. For example, how do different missing-data mechanisms, data sparseness, and the number of attributes systematically affect the learning curves and convergence? Recent work on classifying large and sparse real-world datasets still could not address these questions, mainly because the data missing mechanisms of these datasets are not taken into account. In this paper, we propose two novel data missing and expansion mechanisms to answer these questions. We use the data missing mechanisms to generate large and sparse data with various properties, and study the entire learning curve and convergence behavior of naive Bayes. We make several observations, which we verify through a detailed theoretical study. Our results are useful for learning from large sparse data in practice.
【Keywords】: Bayes methods; Big Data; computational complexity; convergence; learning (artificial intelligence); pattern classification; convergence behavior; data expansion mechanisms; data missing mechanisms; data sparseness; large real-world datasets classification; learning curves; naive Bayes behavior; space complexity scales; sparse real-world datasets classification; time complexity scales; Big data; Complexity theory; Convergence; Motion pictures; Prototypes; Training; Upper bound
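The scaling property the abstract relies on, cost proportional to the non-missing values, is easy to see in a sketch of multinomial-style naive Bayes over sparse rows; this is a generic illustration, not the paper's experimental code:

    from collections import defaultdict
    import math

    def train_nb(rows, labels):
        # rows: list of sparse instances, each a dict attr -> value.
        # Training touches only observed entries, so cost scales with the
        # number of non-missing values, not with n_rows * n_attrs.
        class_count = defaultdict(int)
        feat_count = defaultdict(lambda: defaultdict(float))
        for row, y in zip(rows, labels):
            class_count[y] += 1
            for attr, val in row.items():
                feat_count[y][attr] += val
        return class_count, feat_count

    def predict_nb(row, class_count, feat_count, n_attrs, alpha=1.0):
        # Laplace-smoothed log-posterior, again over observed entries only.
        n = sum(class_count.values())
        scores = {}
        for y, cy in class_count.items():
            total = sum(feat_count[y].values())
            lp = math.log(cy / n)
            for attr, val in row.items():
                p = (feat_count[y].get(attr, 0.0) + alpha) / (total + alpha * n_attrs)
                lp += val * math.log(p)
            scores[y] = lp
        return max(scores, key=scores.get)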
【Paper Link】 【Pages】:859-864
【Authors】: Xiaoyi Li ; Xiaowei Jia ; Hui Li ; Houping Xiao ; Jing Gao ; Aidong Zhang
【Abstract】: Sequential data modeling has received growing interest due to its impact on real-world problems. Sequential data is ubiquitous: financial transactions, ad conversions, and disease evolution are examples of sequential data. A long-standing challenge in sequential data modeling is how to capture the strong hidden correlations among complex features in high volumes. The sparsity and skewness of the features extracted from sequential data also add to the complexity of the problem. In this paper, we address these challenges from both discriminative and generative perspectives, and propose novel stochastic learning algorithms to model nonlinear variances from static time frames and their transitions. The proposed model, the Deep Recurrent Network (DRN), can be trained in an unsupervised fashion to capture transitions, or in a discriminative fashion to conduct sequential labeling. We analyze the conditional independence of each functional module and tackle the diminishing gradient problem by developing a two-pass training algorithm. Extensive experiments on both simulated and real-world dynamic networks show that the trained DRN outperforms all baselines in the sequential classification task and obtains excellent performance in the regression task.
【Keywords】: feature extraction; pattern classification; recurrent neural nets; regression analysis; unsupervised learning; advertise conversions; complex feature extraction; deep recurrent network; discriminative perspectives; disease evolution; financial transactions; generative perspectives; nonlinear variances; real-world dynamic networks; regression task; sequential classification task; sequential data modeling; sequential labeling; simulated dynamic networks; static time frames; stochastic learning algorithms; trained DRN; two-pass training algorithm; unsupervised training; Computational modeling; Data mining; Data models; Heuristic algorithms; Hidden Markov models; Mathematical model; Training
【Paper Link】 【Pages】:865-870
【Authors】: Xin Li ; Guandong Xu ; Enhong Chen ; Lin Li
【Abstract】: With the pervasive use of mobile devices, Location Based Social Networks (LBSNs) have emerged in recent years. These LBSNs, which allow their users to share personal experiences and opinions on visited merchants, contain very rich and useful information that enables a new breed of location-based services, namely, Merchant Recommendation. Existing techniques for merchant recommendation simply treat each merchant as an item and apply conventional recommendation algorithms, e.g., Collaborative Filtering, to recommend merchants to a target user. However, they do not differentiate a user's real preferences on various aspects, and thus can achieve only limited success. In this paper, we aim to address this problem by utilizing and analyzing user reviews to discover user preferences on different aspects. Following the intuition that a user rating represents a personalized rational choice, we propose a novel utility-based approach that combines collaborative and individual views to estimate user preference (i.e., rating). An optimization algorithm based on a Gaussian model is developed to train our merchant recommendation approach. Lastly, we evaluate the proposed approach in terms of effectiveness, efficiency, and cold-start performance using two real-world datasets. The experimental results show that our approach outperforms state-of-the-art methods. Meanwhile, a real mobile application is implemented to demonstrate the practicability of our method.
【Keywords】: Gaussian processes; business data processing; learning (artificial intelligence); mobile computing; recommender systems; utility theory; Gaussian model; learning algorithm; location-based services; merchant recommendation; optimization algorithm; recommendation algorithms; user preference learning; user review analysis; utility-based approach; Analytical models; Collaboration; Context modeling; Economics; Mobile applications; Recommender systems; Social network services; Location based social network; recommender system
【Paper Link】 【Pages】:871-876
【Authors】: Qing Liao ; Naiyang Guan ; Qian Zhang
【Abstract】: Non-negative matrix factorization (NMF) decomposes a non-negative matrix into the product of two low-dimensional non-negative matrices. Since NMF learns an effective parts-based representation, it has been widely applied in computer vision and data mining. However, traditional NMF risks learning a rank-deficient basis on high-dimensional datasets with few examples, especially when some examples are heavily corrupted by outliers. In this paper, we propose a Logdet divergence based sparse NMF method (LDS-NMF) to deal with the rank-deficiency problem. In particular, LDS-NMF reduces the risk of rank deficiency by minimizing the Logdet divergence between the product of the basis matrix with its transpose and the identity matrix, while penalizing the density of the coefficients. Since the objective function of LDS-NMF is nonconvex, it is difficult to optimize. In this paper, we develop a multiplicative update rule to optimize LDS-NMF in the framework of block coordinate descent, and theoretically prove its convergence. Experimental results on popular datasets show that LDS-NMF can learn more stable representations than those learned by representative NMF methods.
【Keywords】: image representation; matrix decomposition; LDS-NMF method; Logdet divergence; NMF; block coordinate descent; computer vision; data mining; identity matrix; parts-based representation; risk learning rank-deficiency problem; sparse nonnegative matrix factorization; transpose matrix; Computer science; Data mining; Linear programming; Nuclear magnetic resonance; Principal component analysis; Sparse matrices; Training; Logdet divergence; Non-negative matrix factorization; robust matrix decomposition
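A minimal sketch of an LDS-NMF-style objective, using an assumed form for illustration rather than the paper's exact formulation: reconstruction error, plus the Logdet divergence D_ld(WᵀW, I) = tr(WᵀW) − log det(WᵀW) − k that pushes the k basis columns of W away from rank deficiency, plus an L1 density penalty on the coefficients H:

    import numpy as np

    def lds_nmf_objective(X, W, H, alpha=1.0, beta=0.1):
        # Squared reconstruction error + Logdet rank-deficiency penalty
        # + L1 sparsity on the coefficient matrix (assumed form).
        k = W.shape[1]
        G = W.T @ W
        _, logdet = np.linalg.slogdet(G)   # log det(W^T W), numerically stable
        return (np.linalg.norm(X - W @ H) ** 2
                + alpha * (np.trace(G) - logdet - k)
                + beta * np.abs(H).sum())

The Logdet term is minimized exactly when WᵀW = I, so it penalizes bases whose columns collapse toward linear dependence.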
【Paper Link】 【Pages】:877-882
【Authors】: Hongfu Liu ; Yun Fu
【Abstract】: Constrained clustering uses pre-given knowledge to improve clustering performance. The existing literature mostly focuses on Must-Link and Cannot-Link pairwise constraints. However, pairwise constraints not only disobey the way we make decisions, but also suffer from the vulnerability of noisy constraints and sensitivity to the order of constraints. In light of this, we use partition-level side information instead of pairwise constraints to guide the clustering process. Compared with pairwise constraints, partition-level side information keeps consistency within the partial structure and avoids self-contradiction and the impact of constraint order. Generally speaking, only a small part of the data instances are given labels by human workers, and these are used to supervise the clustering procedure. Inspired by the success of ensemble clustering, we aim to find a clustering solution that captures the intrinsic structure of the data itself and agrees with the partition-level side information as much as possible. We then derive the objective function and equivalently transform it into a K-means-like optimization problem. Extensive experiments on several real-world datasets demonstrate the effectiveness and efficiency of our method compared to pairwise constrained clustering and consensus clustering, which verifies the superiority of partition-level side information over pairwise constraints. Besides, our method is highly robust to noisy side information.
【Keywords】: optimisation; pattern clustering; cannot-link pairwise constraint; clustering performance; clustering solution; consensus clustering; data instance; ensemble clustering; human worker; k-mean-like optimization problem; must-link pairwise constraint; noisy constraint; noisy side information; objective function; pairwise constrained clustering; partition level side information; self-contradictory; Clustering algorithms; Labeling; Linear programming; Noise measurement; Optimization; Robustness; Symmetric matrices; Clustering; K-means; Partition level side information; Utility function
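As a rough sketch of how partition-level side information can steer a K-means-like assignment step, under the simplifying assumption that each labeled instance's side information names a target cluster directly (the paper's utility-based formulation is more general):

    import numpy as np

    def assign_with_side_info(X, centers, labels, lam=1.0):
        # X: (n, d) data; centers: (k, d); labels: (n,) int array with the
        # side-information cluster id, or -1 for unlabeled instances.
        # Labeled instances pay an extra cost `lam` for joining any cluster
        # other than their labeled one, softly enforcing agreement.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k)
        penalty = np.zeros_like(d)
        labeled = labels >= 0
        penalty[labeled] = lam
        penalty[labeled, labels[labeled]] = 0.0
        return (d + penalty).argmin(axis=1)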
【Paper Link】 【Pages】:883-888
【Authors】: Junming Liu ; Qiao Li ; Meng Qu ; Weiwei Chen ; Jingyuan Yang ; Hui Xiong ; Hao Zhong ; Yanjie Fu
【Abstract】: Bike sharing systems, which aim to provide the missing links in public transportation systems, are becoming popular in cities. In an ideal bike sharing network, the station locations are selected in a way that yields balanced pick-ups and drop-offs among stations. This can help avoid expensive re-balancing operations and maintain high user satisfaction. However, it is a challenging task to develop such an efficient bike sharing system with appropriate station locations. Indeed, bike station demand is influenced by multiple factors of the surrounding environment and by complex public transportation networks. Limited efforts have been made to develop demand-and-balance prediction models for bike sharing systems that consider all these factors. To this end, in this paper, we propose a bike sharing network optimization approach that considers multiple influential factors. The goal is to enhance the quality and efficiency of the bike sharing service by selecting the right station locations. Along this line, we first extract fine-grained discriminative features from human mobility data, points of interest (POI), and station network structures. Then, prediction models based on Artificial Neural Networks (ANN) are developed for predicting station demand and balance. In addition, based on the learned patterns of station demand and balance, a genetic algorithm based optimization model is built to choose a set of stations from a large number of candidates such that station usage is maximized and the number of unbalanced stations is minimized. Finally, extensive experimental results on the NYC CitiBike sharing system show the advantages of our approach for optimizing station site allocation in terms of bike usage as well as the required re-balancing efforts.
【Keywords】: bicycles; genetic algorithms; neural nets; transportation; ANN; NYC CitiBike sharing system; POI; artificial neural networks; balanced pick-ups; bike sharing network optimization model; bike sharing service; bike sharing systems; bike station demand; bike usage; complex public transportation networks; demand-and-balance prediction models; fine-grained discriminative features; genetic algorithm; human mobility data; point of interests; public transportation systems; re-balancing operations; station locations; station network structures; station site allocation; station site optimization; station usage; unbalanced stations; urban cities; user satisfaction; Bicycles; Feature extraction; Neural networks; Optimization; Predictive models; Public transportation; Bike Sharing System; Neural Network Prediction; Site location optimization
【Paper Link】 【Pages】:889-894
【Authors】: Yu Liu ; Martin Ester ; Bo Hu ; David W. Cheung
【Abstract】: Twitter, together with other online social networks such as Facebook and Gowalla, has begun to collect hundreds of millions of check-ins. Check-in data capture the spatial and temporal information of user movements and interests. To model and analyze the spatio-temporal aspect of check-in data and discover temporal topics and regions, we propose two spatio-temporal topic models: the Downstream Spatio-Temporal Topic Model (DSTTM) and the Upstream Spatio-Temporal Topic Model (USTTM). Both models can discover temporal topics and regions. We use continuous time to model check-in data, rather than discretized time, avoiding the loss of information through discretization. In order to capture the property that a user's interests and activity space change over time, we propose the USTTM, where users have different region and topic distributions at different times. We conduct experiments on Twitter and Gowalla data sets. In our quantitative analysis, we evaluate the effectiveness of our models by perplexity, the accuracy of POI recommendations, and user prediction, demonstrating that our models achieve better performance than state-of-the-art models.
【Keywords】: data acquisition; social networking (online); spatiotemporal phenomena; DSTTM; Facebook; Gowalla data sets; POI recommendations; Twitter data sets; USTTM; check-in data model; downstream spatiotemporal topic model; online social networks; quantitative analysis; spatial information; spatiotemporal aspect analysis; spatiotemporal topic model; temporal information; temporal region discovery; temporal topic discovery; upstream spatiotemporal topic model; user activity space; user interests; user movements; user prediction; Cities and towns; Data models; Gaussian distribution; Predictive models; Probabilistic logic; Probability distribution; Twitter; Check-in; Spatio-temporal; Topic Model
【Paper Link】 【Pages】:895-900
【Authors】: Zitao Liu ; Yan Yan ; Jian Yang ; Milos Hauskrecht
【Abstract】: Hierarchical time series (HTS) is a special class of multivariate time series in which many related time series are organized in a hierarchical tree structure and are consistent across hierarchy levels. HTS modeling is crucial and serves as the basis for business planning and management in many areas such as manufacturing inventory, energy, and traffic management. However, due to machine failures, network disturbances, or human maloperation, HTS data suffer from missing values across different hierarchical levels. In this paper, we study the missing value estimation problem in hierarchical web traffic settings, where user-visit traffic is organized in various hierarchical structures, such as geographical structure and website structure. We develop an efficient algorithm, HTSImpute, to accurately estimate the missing values in multivariate noisy web traffic time series with specific hierarchical consistency in HTS settings. Our HTSImpute is able to (1) utilize the temporal dependence information within each individual time series, (2) exploit the intra-relations between time series through the hierarchy, and (3) guarantee the satisfaction of hierarchical consistency constraints. Results on three synthetic HTS datasets and three real-world hierarchical web traffic datasets demonstrate that our approach provides more accurate and hierarchically consistent estimates than other baselines.
【Keywords】: Internet; telecommunication traffic; time series; HTSImpute; hierarchical Web traffic settings; hierarchical consistency constraints; hierarchical time series; missing value estimation problem; multivariate noisy Web traffic time series; multivariate time series; user-visit traffic; Business; Estimation; Forecasting; High-temperature superconductors; Predictive models; Time series analysis; Yttrium; Forecasting; Hierarchical Time Series; Missing Value Estimation; Time Series Analysis
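The hierarchical-consistency constraint, that children sum to their parent at every level of the tree, is the distinctive ingredient here. A minimal post-hoc reconciliation sketch, which is one simple way to satisfy the constraint, whereas HTSImpute enforces it within the estimation itself:

    def reconcile(parent_value, child_estimates):
        # Rescale imputed child values so they sum exactly to the parent's
        # value, preserving their relative proportions.
        total = sum(child_estimates)
        if total == 0:
            n = len(child_estimates)
            return [parent_value / n] * n
        return [c * parent_value / total for c in child_estimates]

For example, reconcile(100.0, [30.0, 50.0]) returns [37.5, 62.5], so the two children again sum to the parent's 100.0.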
【Paper Link】 【Pages】:901-906
【Authors】: Charalampos Mavroforakis ; Michael Mathioudakis ; Aristides Gionis
【Abstract】: We study a new notion of graph centrality based on absorbing random walks. Given a graph G = (V, E) and a set of query nodes Q ⊆ V, we aim to identify the k most central nodes in G with respect to Q. Specifically, we consider central nodes to be absorbing for random walks that start at the query nodes Q. The goal is to find the set of k central nodes that minimizes the expected length of a random walk until absorption. The proposed measure, which we call k absorbing random-walk centrality, favors diverse sets, as it is beneficial to place the k absorbing nodes in different parts of the graph so as to “intercept” random walks that start from different query nodes. Although similar problem definitions have been considered in the literature, e.g., in information-retrieval settings where the goal is to diversify web-search results, in this paper we study the problem formally and prove some of its properties. We find that the problem is NP-hard, while the objective function is monotone and supermodular, implying that a greedy algorithm provides solutions with an approximation guarantee. On the other hand, the greedy algorithm involves expensive matrix operations that make it prohibitive to employ on large datasets. To confront this challenge, we explore the performance of efficient heuristics.
【Keywords】: approximation theory; computational complexity; graph theory; greedy algorithms; NP-hard problem; Web-search result; approximation guarantee; graph centrality; greedy algorithm; matrix operation; objective function; query node; random-walk centrality; Approximation algorithms; Computer science; Greedy algorithms; Length measurement; Linear programming; Probability distribution; Q measurement; graph mining; node centrality; random walks
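The objective being minimized has a closed form via the fundamental matrix of the absorbing chain; a small dense sketch using standard absorbing-Markov-chain algebra, with the greedy algorithm then choosing the k-node set that minimizes this value:

    import numpy as np

    def expected_absorption_time(P, absorbing, query):
        # Expected walk length until absorption, for walks started uniformly
        # over `query` nodes on a row-stochastic transition matrix P, when
        # the nodes in `absorbing` are made absorbing.
        n = P.shape[0]
        transient = [v for v in range(n) if v not in set(absorbing)]
        idx = {v: i for i, v in enumerate(transient)}
        Q = P[np.ix_(transient, transient)]
        # t = (I - Q)^(-1) 1 gives expected steps-to-absorption per start node.
        t = np.linalg.solve(np.eye(len(transient)) - Q, np.ones(len(transient)))
        # Query nodes that are themselves absorbing contribute 0 steps.
        return float(np.mean([t[idx[q]] if q in idx else 0.0 for q in query]))

The supermodularity result means greedily adding the node that most reduces this expectation enjoys the usual (1 − 1/e)-style approximation guarantee, at the price of the matrix solves the abstract flags as expensive.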
【Paper Link】 【Pages】:907-912
【Authors】: Yannick Meier ; Jie Xu ; Onur Atan ; Mihaela van der Schaar
【Abstract】: To increase efficacy in traditional classroom courses as well as in Massive Open Online Courses (MOOCs), automated systems supporting the instructor are needed. One important problem is to automatically detect students that are going to do poorly in a course early enough to be able to take remedial actions. This paper proposes an algorithm that predicts the final grade of each student in a class. It issues a prediction for each student individually, when the expected accuracy of the prediction is sufficient. The algorithm learns online what the optimal prediction and time to issue a prediction are, based on the past history of students' performance in a course. We demonstrate the performance of our algorithm on a dataset based on the performance of approximately 700 undergraduate students who have taken an introductory digital signal processing course over the past 7 years. Using data obtained from a pilot course, our methodology suggests that it is effective to perform early in-class assessments such as quizzes, which result in timely performance prediction for each student, thereby enabling timely interventions by the instructor (at the student or class level) when necessary.
【Keywords】: computer aided instruction; data mining; educational courses; further education; MOOC; data mining approach; early in-class assessments; introductory digital signal processing; massive open online courses; personalized grade prediction; student final grade prediction; student performance; traditional classroom courses; undergraduate students; Algorithm design and analysis; Data mining; Education; Measurement; Prediction algorithms; Signal processing algorithms; Yttrium; Forecasting algorithms; data mining; digital signal processing education; grade prediction; online learning
【Paper Link】 【Pages】:913-918
【Authors】: Rui Meng ; Yongxin Tong ; Lei Chen ; Caleb Chen Cao
【Abstract】: Recently, taxonomy construction has attracted much attention. Both automatic construction solutions and human-based computation approaches have been proposed. The automatic methods suffer from either low precision or low recall, while human computation is not suitable for large-scale tasks. Motivated by the shortcomings of both approaches, we present a hybrid framework that combines the power of machine-based approaches and human computation (the crowd) to construct a more complete and accurate taxonomy. Specifically, our framework consists of two steps: we first construct a complete but noisy taxonomy automatically, and then the crowd is introduced to adjust the entity positions in the constructed taxonomy. However, the adjustment is challenging, as the budget (money) for asking the crowd is often limited. In our work, we formulate the problem of finding the optimal adjustment as an entity selection optimization (ESO) problem, which is proved to be NP-hard. We then propose an exact algorithm and a more efficient approximation algorithm with an approximation ratio of 1/2(1-1/e). We conduct extensive experiments on real datasets; the results show that our hybrid approach largely improves the recall of the taxonomy with little impairment to precision.
【Keywords】: approximation theory; computational complexity; information management; optimisation; CrowdTC; ESO problem; NP-hard; approximation algorithm; automatic construction solutions; crowdsourced taxonomy construction; entity selection optimization; human computation; human-based computation approaches; hybrid framework; information management; machine-based approaches; Approximation algorithms; Approximation methods; Noise measurement; Optimization; Syntactics; Taxonomy; Uncertainty; Crowdsourcing; Taxonomy Construction
【Paper Link】 【Pages】:919-924
【Authors】: Robert Moskovitch ; Colin Walsh ; Fei Wang ; George Hripcsak ; Nicholas P. Tatonetti
【Abstract】: The increasing availability of multivariate temporal data in many domains, such as biomedicine and security, provides exceptional opportunities for temporal knowledge discovery, classification, and prediction, but also challenges. Temporal variables are often sparse, and many domains, such as biomedical data, have a huge number of variables. In recent decades, biomedical events such as conditions, drugs, and procedures have been stored as time intervals, which enables the discovery of Time Intervals Related Patterns (TIRPs) and their use for classification or prediction. In this study we present a framework for outcome event prediction, called Maitreya, which includes an algorithm for TIRP discovery called KarmaLegoD, designed to handle a huge number of symbols. Three indexing strategies for pairs of symbolic time intervals are proposed and compared, showing that FullyHashed indexing is only slightly slower but consumes minimal memory. We evaluated Maitreya on eight real datasets for the prediction of clinical procedures as outcome events. The use of TIRPs outperforms the use of symbols, especially with horizontal support (number of instances) as the TIRP feature representation.
【Keywords】: bioinformatics; data mining; indexing; medical computing; pattern classification; KarmaLegoD; Maitreya; TIRP; biomedical data; biomedical domain events; clinical procedure prediction; drugs; multivariate temporal data availability; outcome events prediction; temporal knowledge classification; temporal knowledge discovery; temporal knowledge prediction; temporal variables; time intervals related patterns; Algorithm design and analysis; Data mining; Indexing; Prediction algorithms; Time series analysis; Prediction; Time Intervals Mining
【Paper Link】 【Pages】:925-930
【Authors】: Subhabrata Mukherjee ; Hemank Lamba ; Gerhard Weikum
【Abstract】: Current recommender systems exploit user and item similarities by collaborative filtering. Some advanced methods also consider the temporal evolution of item ratings as a global background process. However, all prior methods disregard the individual evolution of a user's experience level and how this is expressed in the user's writing in a review community. In this paper, we model the joint evolution of user experience, interest in specific item facets, writing style, and rating behavior. This way we can generate individual recommendations that take into account the user's maturity level (e.g., recommending art movies rather than blockbusters for a cinematography expert). As only item ratings and review texts are observable, we capture the user's experience and interests in a latent model learned from her reviews, vocabulary, and writing style. We develop a generative HMM-LDA model to trace user evolution, where the Hidden Markov Model (HMM) traces her latent experience progressing over time, with solely user reviews and ratings as observables over time. The facets of a user's interest are drawn from a Latent Dirichlet Allocation (LDA) model derived from her reviews, as a function of her (again latent) experience level. In experiments with four real-world datasets, we show that our model improves rating prediction over state-of-the-art baselines by a substantial margin. In addition, our model can also give some interpretation of a user's experience level.
【Keywords】: collaborative filtering; hidden Markov models; recommender systems; user interfaces; collaborative filtering; evolving review community; experience-aware item recommendation; generative HMM-LDA model; global background process; hidden Markov model; item ratings; item similarities; latent dirichlet allocation; latent model; recommender systems; user evolution; user experience level; user maturity level; Hidden Markov models; Lenses; Motion pictures; Predictive models; Resource management; Vocabulary; Writing; Experience; Language Model; Recommender; User Model
【Paper Link】 【Pages】:931-936
【Authors】: Ahmed Najjar ; Christian Gagné ; Daniel Reinharz
【Abstract】: Dealing with real-life databases often implies handling sets of heterogeneous variables. In this paper we propose a methodology for exploring and analyzing such databases, with an application in the specific domain of healthcare data analytics. We propose a two-step heterogeneous finite mixture model: a first step involving a joint mixture of Gaussian and multinomial distributions to handle numerical (i.e., real and integer) and categorical (i.e., discrete) variables, and a second step featuring a mixture of hidden Markov models to handle sequences of categorical values (e.g., series of events). This approach is evaluated on a real-world application, the clustering of administrative healthcare databases from Québec, with results illustrating the good performance of the proposed method.
【Keywords】: Gaussian processes; data analysis; data mining; health care; hidden Markov models; medical administrative data processing; mixture models; pattern clustering; Gaussian; administrative healthcare database; healthcare data analytics; healthcare database mining; heterogeneous finite mixture model clustering; heterogeneous variables; hidden Markov models; multinomial distribution; Clustering algorithms; Databases; Hidden Markov models; Medical services; Mixture models; Numerical models; Partitioning algorithms; Administrative health care databases; Clustering; Finite mixture model; Mixed attributes; Multivalued categorical variables
【Paper Link】 【Pages】:937-942
【Authors】: Naoki Otani ; Yukino Baba ; Hisashi Kashima
【Abstract】: Repeated labeling is a widely adopted quality control method in crowdsourcing. This method is based on selecting one reliable label from multiple labels collected by workers, because a single label from only one worker has a wide variance of accuracy. Hierarchical classification, where the classes have a hierarchical relationship, is a typical task in crowdsourcing. However, directly applying existing methods designed for multi-class classification has the disadvantage of having to discriminate among a large number of classes. In this paper, we propose a label aggregation method for hierarchical classification tasks. Our method takes the hierarchical structure into account to handle a large number of classes and estimate worker abilities more precisely. Our method is inspired by the steps model based on item response theory, which models responses of examinees to sequentially dependent questions. We consider hierarchical classification to be a question consisting of a sequence of subquestions and build a worker response model for hierarchical classification. We conducted experiments using real crowdsourced hierarchical classification tasks and demonstrated the benefit of incorporating the hierarchical structure to improve label aggregation accuracy.
【Keywords】: pattern classification; quality control; crowdsourced hierarchical classification tasks; hierarchical relationship; hierarchical structure; item response theory; label aggregation method accuracy; multiclass classification; quality control method; worker abilities; worker response model; Crowdsourcing; Electronic mail; Labeling; Pharmaceuticals; Probabilistic logic; Quality control; Reliability; crowdsoucring; hierarchical classification; quality control
【Paper Link】 【Pages】:943-948
【Authors】: Ioakeim Perros ; Robert Chen ; Richard W. Vuduc ; Jimeng Sun
【Abstract】: We propose a new tensor factorization method, called the Sparse Hierarchical-Tucker (Sparse H-Tucker), for sparse and high-order data tensors. Sparse H-Tucker is inspired by its namesake, the classical Hierarchical Tucker method, which aims to compute a tree-structured factorization of an input data set that may be readily interpreted by a domain expert. However, Sparse H-Tucker uses a nested sampling technique to overcome a key scalability problem in Hierarchical Tucker, namely the creation of an unwieldy intermediate dense core tensor; the result is a faster, more space-efficient, and more accurate method. We test our method on a real healthcare dataset, which is collected from 30K patients and results in an 18th-order sparse data tensor. Unlike competing methods, Sparse H-Tucker can analyze the full data set on a single multi-threaded machine. It can also do so more accurately and in less time than the state of the art: on a 12th-order subset of the input data, Sparse H-Tucker is 18x more accurate and 7.5x faster than a previously state-of-the-art method. Moreover, we observe that Sparse H-Tucker scales nearly linearly in the number of non-zero tensor elements. The resulting model also provides an interpretable disease hierarchy, which is confirmed by a clinical expert.
【Keywords】: diseases; health care; matrix decomposition; medical administrative data processing; multi-threading; tensors; 18th order sparse data tensor; clinical expert; domain expert; healthcare dataset; input data set; intermediate dense core tensors; interpretable disease hierarchy; key scalability problem; multithreaded machine; sampling technique; sparse H-Tucker; sparse hierarchical tucker factorization; sparse high-order data tensors; tensor factorization method; tree-structured factorization; Approximation algorithms; Approximation methods; Computational modeling; Diseases; Mathematical model; Sparse matrices; Tensile stress
【Paper Link】 【Pages】:949-954
【Authors】: Habibur Rahman ; Senjuti Basu Roy ; Saravanan Thirumuruganathan ; Sihem Amer-Yahia ; Gautam Das
【Abstract】: A number of emerging applications, such as collaborative document editing, sentence translation, and citizen journalism, require workers with complementary skills and expertise to form groups and collaborate on complex tasks. While existing research has investigated task assignment for knowledge-intensive crowdsourcing, it often ignores the aspect of collaboration among workers that is central to the success of such tasks. Research in behavioral psychology has indicated that large groups hinder successful collaboration. Taking that into consideration, our work is one of the first to investigate and formalize the notion of collaboration among workers and present theoretical analyses to understand the hardness of optimizing task assignment. We propose efficient approximation algorithms with provable theoretical guarantees and demonstrate the superiority of our algorithms through a comprehensive set of experiments using real-world and synthetic datasets. Finally, we conduct a real-world collaborative sentence translation application using Amazon Mechanical Turk that we hope provides a template for evaluating collaborative crowdsourcing tasks in micro-task based crowdsourcing platforms.
【Keywords】: approximation theory; language translation; psychology; Amazon Mechanical Turk; approximation algorithms; behavioral psychology; citizen journalism; collaborative crowdsourcing tasks; collaborative document editing; collaborative sentence translation application; complementary skills; knowledge intensive crowdsourcing; microtask based crowdsourcing platforms; real-world datasets; sentence translation; synthetic datasets; task assignment; Algorithm design and analysis; Approximation algorithms; Approximation methods; Collaboration; Crowdsourcing; Human factors; Optimization; algorithms; collaborative crowdsourcing; crowdsourcing; optimization
【Paper Link】 【Pages】:955-960
【Authors】: Santu Rana ; Sunil Kumar Gupta ; Svetha Venkatesh
【Abstract】: Privacy-preserving data mining has become an active focus of the research community in domains where data are sensitive and personal in nature. For example, highly sensitive digital repositories of medical or financial records offer enormous value for risk prediction and decision making. However, prediction models derived from such repositories should maintain strict privacy of individuals. We propose a novel random forest algorithm under the framework of differential privacy. Unlike previous works that strictly follow differential privacy and keep the complete data distribution approximately invariant to a change in one data instance, we keep only the necessary statistics (e.g., the variance of the estimate) invariant. This relaxation results in significantly higher utility. To realize our approach, we propose a novel differentially private decision tree induction algorithm and use it to create an ensemble of decision trees. We also propose feasible adversary models for inferring the attribute and class label of unknown data given knowledge of all other data. Under these adversary models, we derive bounds on the maximum number of trees that are allowed in the ensemble while maintaining privacy. We focus on the binary classification problem and demonstrate our approach on four real-world datasets. Compared to existing privacy-preserving approaches, we achieve significantly higher utility.
【Keywords】: data mining; data privacy; learning (artificial intelligence); pattern classification; adversary models; attribute inference; binary classification problem; complete data distribution; data instance; decision tree ensemble; differential privacy framework; differentially private random forest; differentially-private decision tree induction algorithm; privacy-preserving data mining; real-world datasets; unknown data class label; Data privacy; Decision trees; Privacy; Sensitivity; Standards; Vegetation; decision trees; differential privacy; privacy preserving data mining; random forest
【Paper Link】 【Pages】:961-966
【Authors】: Suhas Ranganath ; Suhang Wang ; Xia Hu ; Jiliang Tang ; Huan Liu
【Abstract】: Social media is being increasingly used to request information and help in situations like natural disasters, where time is a critical commodity. However, generic social media platforms are not explicitly designed for timely information seeking, making it difficult for users to obtain prompt responses. Algorithms to ensure prompt responders for questions in social media have to understand the factors affecting their response time. In this paper, we draw from sociological studies on information seeking and organizational behavior to model the future availability and past response behavior of the candidate responders. We integrate these criteria with their interests to identify users who can provide timely and relevant responses to questions posted in social media. We propose a learning algorithm to derive optimal rankings of responders for a given question. We present questions posted on Twitter as a form of information seeking activity in social media. Our experiments demonstrate that the proposed framework is useful in identifying timely and relevant responders for questions in social media.
【Keywords】: Internet; learning (artificial intelligence); social networking (online); Twitter; critical commodity; generic social media platforms; information seeking activity; learning algorithm; natural disasters; optimal rankings; organizational behavior; prompt responders; time-critical responses; Algorithm design and analysis; Hurricanes; Media; Real-time systems; Time factors; Training; Twitter; Q&A; Situational Awareness; Timely Information
【Paper Link】 【Pages】:967-972
【Authors】: Avijit Saha ; Ayan Acharya ; Balaraman Ravindran ; Joydeep Ghosh
【Abstract】: The Factorization Machine (FM) provides a generic framework that combines the prediction quality of factorization models with the flexibility of feature engineering that discriminative models like SVMs offer. The Bayesian Factorization Machine [11], with its impressive predictive performance and the convenience of automatic parameter tuning, has been one of the most successful and efficient approaches within this framework. However, this model has two major drawbacks. First, it assumes that the data are generated from Gaussian distributions, which may not be the best assumption for count data such as integer-valued ratings. Second, to get the best performance, one needs to cross-validate over the number of latent factors used for modeling the pairwise interactions in FM, a process that is computationally intensive. This paper introduces the Nonparametric Poisson Factorization Machine (NPFM), which models count data using the Poisson distribution, providing both modeling and computational advantages for sparse data. The ideal number of latent factors is estimated from the data itself, thereby addressing a key limitation of existing approaches to FM. Additionally, NPFM has linear time complexity with respect to the number of non-zero observations.
【Keywords】: Poisson distribution; learning (artificial intelligence); matrix decomposition; Bayesian factorization machine; FM; NPFM; Poisson distribution; factorization prediction quality; feature engineering; linear time complexity; nonparametric Poisson factorization machine; parameter tuning; Bayes methods; Data models; Frequency modulation; Predictive models; Recommender systems; Sparse matrices; Standards; factorization machine; gamma process; matrix factorization; recommender systems; tensor factorization
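A minimal sketch of the modeling switch the abstract describes, scoring a count under a Poisson likelihood whose rate is the FM score; the direct rate link here is an assumption for illustration, whereas the paper's nonparametric construction works through a gamma process that lets the data choose the number of factors:

    import numpy as np
    from math import lgamma, log

    def poisson_fm_loglik(y, x, w0, w, V):
        # Log P(y) for count y under Poisson(rate), with rate given by the
        # FM score of feature vector x: bias + linear term + pairwise
        # interactions through the factor matrix V (n_features x k).
        lin = w0 + x @ w
        # Standard O(nk) FM identity for the pairwise interaction sum.
        pair = 0.5 * (((x @ V) ** 2).sum() - ((V ** 2) * (x ** 2)[:, None]).sum())
        rate = max(lin + pair, 1e-10)   # clamp: a real model keeps the rate positive
        return y * log(rate) - rate - lgamma(y + 1)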
【Paper Link】 【Pages】:973-978
【Authors】: Jianqiang Shen ; Burkay Orten ; Sahin Cem Geyik ; Daniel Liu ; Shahriar Shariat ; Fang Bian ; Ali Dasdan
【Abstract】: Real-Time Bidding allows an advertiser to purchase media inventory through an auction system that unfolds in the order of milliseconds. Media providers are increasingly being integrated into such programmatic buying platforms. It is typical for a contemporary Real-Time Bidding system to receive millions of bid requests per second at peak time, with a large portion of these being irrelevant to any advertiser. Meanwhile, given a valuable bid request, tens of thousands of advertisements might be qualified for scoring. We present our efforts in building selection models for both bid requests and advertisements to handle this scalability challenge. Our bid request model treats the system load as a hierarchical resource allocation problem and directs traffic based on the estimated quality of bid requests. Next, our exploration/exploitation advertisement model selects a limited number of qualified advertisements for thorough scoring based on the expected value of a bid request to the advertiser given its features. Our combined bid request and advertisement model is able to win more auctions and bring more value to clients by stabilizing the bidding pipeline. We empirically show that our deployed system is capable of handling 5x more bid requests.
【Keywords】: advertising data processing; electronic commerce; resource allocation; tendering; advertisements; advertiser; auction system; bid request model; contemporary real-time bidding system; hierarchical resource allocation problem; media inventory; programmatic buying platforms; Data mining; Digital signal processing; Load modeling; Media; Real-time systems; Resource management; Servers; Online advertising; modeling; real-time bidding; scalability
【Paper Link】 【Pages】:979-984
【Authors】: Olivia Simpson ; C. Seshadhri ; Andrew McGregor
【Abstract】: The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph with small storage. We design an algorithm, headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams with millions of edges using less than 1% storage and get extremely accurate approximations for all scales in the degree distribution. We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.
【Keywords】: algorithm theory; graph theory; mathematical analysis; degree distribution; graph properties; headtail; large streaming graph; mathematical analysis; relative Hausdorff distance measures deviations; streaming algorithm; truncated geometric random variables; Algorithm design and analysis; Approximation algorithms; Approximation methods; Frequency estimation; Histograms; Mathematical analysis; Standards; Degree distribution; Streaming algorithms; graph sampling; small-space algorithms
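【Code Sketch】: The Relative Hausdorff distance is defined precisely in the paper; the sketch below is one plausible formalization consistent with the abstract: every point of one degree histogram must be matched by a point of the other within relative error eps in both the degree and the count coordinate (and vice versa), and RH is the smallest such eps. Relative rather than absolute error is what keeps the sparse tail from being ignored. The O(n^2) scan and the ccdh representation are illustrative choices.

import numpy as np

def relative_hausdorff(F, G):
    # F and G are complementary cumulative degree histograms:
    # F[d-1] = number of vertices with degree >= d.
    def directed(F, G):
        eps = 0.0
        for d in range(len(F)):
            if F[d] == 0:
                continue
            best = min(
                max(abs(d - e) / (d + 1), abs(F[d] - G[e]) / F[d])
                for e in range(len(G))
            )
            eps = max(eps, best)
        return eps
    return max(directed(F, G), directed(G, F))

rng = np.random.default_rng(0)
deg1 = np.minimum(rng.zipf(2.0, 2000), 300)   # heavy-tailed toy degrees,
deg2 = np.minimum(rng.zipf(2.1, 2000), 300)   # capped to keep the demo small
top = max(deg1.max(), deg2.max())
F = np.array([(deg1 >= d).sum() for d in range(1, top + 1)], float)
G = np.array([(deg2 >= d).sum() for d in range(1, top + 1)], float)
print(relative_hausdorff(F, G))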
【Paper Link】 【Pages】:985-990
【Authors】: Shivam Srivastava ; Shiladitya Pande ; Sayan Ranu
【Abstract】: In this paper, we develop an algorithm to cluster places based not only on their locations but also on their semantics. Specifically, two places are considered similar if they are spatially close and visited by people of similar communities. With the explosion in the availability of location-tracking technologies, it has become easy to track locations and movements of users through user "check-ins". These check-ins provide insights into the community structure of people visiting a place, which is leveraged and integrated into the proposed geo-social clustering framework called GeoScop. While community detection is typically done on social networks, in our problem we lack any network data. Rather, two people belong to the same community if they visit similar geo-social clusters. We tackle this chicken-and-egg problem through an iterative procedure of expectation maximization and DBSCAN. Extensive experiments on real check-in data demonstrate that GeoScop mines semantically meaningful clusters that cannot be found by any of the existing clustering techniques. Furthermore, GeoScop is up to 6 times more pure in social quality than the state-of-the-art technique. The executables for the tool are available at http://www.cse.iitm.ac.in/~simsayan/software.html.
【Keywords】: data mining; expectation-maximisation algorithm; iterative methods; pattern clustering; social networking (online); DBSCAN; GeoScop framework; check-in data; chicken-and-egg problem; clustering techniques; community detection; expectation maximization; iterative procedure; location-tracking technologies; places geo-social clustering; social clustering framework; social networks; Business; Clustering algorithms; Libraries; Pipelines; Semantics; Social network services; Standards; check-ins; dbscan; expectation maximization; social network; spatio-temporal
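【Code Sketch】: A schematic of the chicken-and-egg iteration described in the abstract, not the actual GeoScop objective: alternately (a) derive each user's community profile from the clusters of the places they visit and (b) re-cluster places with DBSCAN on location concatenated with the mean visitor profile. All feature constructions, parameters, and names are assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

def geo_social_clusters(place_xy, checkins, n_iters=5, eps=0.5):
    # checkins: list of (user, place) pairs; place_xy: (n_places, 2) array.
    n_places = len(place_xy)
    users = sorted({u for u, _ in checkins})
    uidx = {u: i for i, u in enumerate(users)}
    labels = np.zeros(n_places, dtype=int)          # start: one cluster
    for _ in range(n_iters):
        k = labels.max() + 2                        # +1 bucket for DBSCAN noise
        # E-like step: user profile = distribution over visited clusters
        prof = np.zeros((len(users), k))
        for u, p in checkins:
            prof[uidx[u], labels[p] + 1] += 1
        prof /= prof.sum(axis=1, keepdims=True) + 1e-12
        # place feature = location plus mean profile of its visitors
        feat = np.zeros((n_places, 2 + k))
        feat[:, :2] = place_xy
        cnt = np.zeros(n_places)
        for u, p in checkins:
            feat[p, 2:] += prof[uidx[u]]
            cnt[p] += 1
        feat[:, 2:] /= np.maximum(cnt, 1)[:, None]
        # M-like step: density-based clustering on the joint features
        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(feat)
    return labels

rng = np.random.default_rng(0)
xy = np.vstack([rng.normal(0, .1, (20, 2)), rng.normal(3, .1, (20, 2))])
ck = [(u, p) for u in range(30) for p in rng.integers(0, 40, 5)]
print(geo_social_clusters(xy, ck))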
【Paper Link】 【Pages】:991-996
【Authors】: Nikolaj Tatti
【Abstract】: Interactions in many real-world phenomena can be explained by a strong hierarchical structure. Typically, this structure or ranking is not known; instead we only observe outcomes of the interactions, and the goal is to infer the hierarchy from these observations. Discovering a hierarchy in the context of directed networks can be formulated as follows: given a graph, partition vertices into levels such that, ideally, there are only edges from upper levels to lower levels. The ideal case can only happen if the graph is acyclic. Consequently, in practice we have to introduce a penalty function that penalizes edges violating the hierarchy. A practical variant of such a penalty is agony, where each violating edge is penalized based on the severity of the violation. The hierarchy minimizing agony can be discovered in O(m^2) time, and much faster in practice. In this paper we introduce several extensions to agony. We extend the definition to weighted graphs and allow a cardinality constraint that limits the number of levels. While these are conceptually trivial extensions, current algorithms cannot handle them, nor can they be easily extended. We provide an exact algorithm of O(m^2 log n) time by showing the connection of agony to the capacitated circulation problem. We also show that this bound is in fact pessimistic and that we can compute agony for large datasets. In addition, we show that we can compute agony in polynomial time for any convex penalty and, to complete the picture, we show that minimizing hierarchy with any concave penalty is an NP-hard problem.
【Keywords】: graph theory; hierarchical systems; optimisation; NP-hard problem; acyclic graph; directed networks; hierarchical structure; partition vertices; Conferences; Context; Data mining; NP-hard problem; Optimization; Partitioning algorithms; Transforms; Hierarchy discovery; agony; capacitated circulation
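【Code Sketch】: Evaluating the agony of a given leveling is straightforward; finding the leveling that minimizes it is the hard part addressed by the paper's circulation-based algorithm. The sketch below uses one common convention (orientation conventions vary across papers): edges from higher to lower levels are free, and a violating edge is penalized by the severity of its violation.

def agony(edges, rank):
    # edges: iterable of (u, v) arcs; rank: dict mapping vertex -> level.
    # Edges are expected to point from higher rank to lower; a violating
    # edge u -> v with rank[u] <= rank[v] costs rank[v] - rank[u] + 1.
    return sum(max(0, rank[v] - rank[u] + 1) for u, v in edges)

# a 3-level hierarchy with one back-edge of severity 2
edges = [("ceo", "vp"), ("vp", "dev"), ("dev", "vp")]
rank = {"ceo": 2, "vp": 1, "dev": 0}
print(agony(edges, rank))   # 0 + 0 + (1 - 0 + 1) = 2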
【Paper Link】 【Pages】:997-1002
【Authors】: Leonardo Vilela Teixeira ; Renato Martins Assunção ; Rosangela Helena Loschi
【Abstract】: When performing analysis of spatial data, there is often the need to aggregate geographical areas into larger regions, a process called regionalization or spatially constrained clustering. These algorithms assume that the items to be clustered are non-stochastic, an assumption that does not hold in many applications. In this work, we present a new probabilistic regionalization algorithm that allows spatially varying random variables as features. Hence, an area highly different from its neighbors can still be considered a member of their cluster if it has a large variance. Our proposal is based on a Bayesian generative spatial product partition model. We build an effective Markov Chain Monte Carlo algorithm to carry out a random walk on the space of all trees and their induced spatial partitions by edge deletion. We evaluate our algorithm on synthetic data and on a problem of municipality regionalization based on cancer incidence rates. We are able to better accommodate the natural variation of the data and to diminish the effect of outliers, producing better results than state-of-the-art approaches.
【Keywords】: Bayes methods; Markov processes; Monte Carlo methods; cancer; data analysis; geography; pattern clustering; random processes; trees (mathematics); Bayesian generative spatial product partition model; Markov chain Monte Carlo algorithm; cancer incidence rates; edge deletion; generative spatial clustering model; geographical areas; municipalities regionalization; probabilistic regionalization algorithm; random data; random walk; spanning trees; spatial data analysis; spatially constrained clustering; synthetic data; Clustering algorithms; Data models; Partitioning algorithms; Probability distribution; Space exploration; Stochastic processes; Yttrium; Bayesian spatial model; spatial graphical model; spatially constrained clustering
【Paper Link】 【Pages】:1003-1008
【Authors】: Jan N. van Rijn ; Geoffrey Holmes ; Bernhard Pfahringer ; Joaquin Vanschoren
【Abstract】: Ensembles of classifiers are among the best performing classifiers available in many data mining applications. However, most ensembles developed specifically for the dynamic data stream setting rely on only one type of base-level classifier, most often Hoeffding Trees. In this paper, we study the use of heterogeneous ensembles, comprised of fundamentally different model types. Heterogeneous ensembles have proven successful in the classical batch data setting; however, they do not easily transfer to the data stream setting. We therefore introduce the Online Performance Estimation framework, which can be used in data stream ensembles to weight the votes of (heterogeneous) ensemble members differently across the stream. Experiments over a wide range of data streams show performance that is competitive with state-of-the-art ensemble techniques, including Online Bagging and Leveraging Bagging. All experimental results from this work are easily reproducible and publicly available on OpenML for further analysis.
【Keywords】: data mining; learning (artificial intelligence); pattern classification; trees (mathematics); Hoeffding trees; OpenML; base-level classifiers; classical batch data setting; data mining applications; data stream ensembles; dynamic data stream setting; ensemble members; heterogeneous ensembles; leveraging bagging; meta-learning; online bagging; online performance estimation framework; Bagging; Data mining; Data models; Estimation; Predictive models; Stacking; Training; Data Streams; Ensembles; Meta-Learning
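【Code Sketch】: A minimal sketch in the spirit of the Online Performance Estimation idea, though much simpler than the paper's framework: each heterogeneous member's vote is weighted by its accuracy over a sliding window of recent stream examples, evaluated test-then-train. The member interface (predict/learn) and window size are assumptions.

from collections import deque, defaultdict

class WindowedWeightedEnsemble:
    def __init__(self, members, w=200):
        # members: heterogeneous stream learners exposing predict(x), learn(x, y)
        self.members = members
        self.hits = [deque(maxlen=w) for _ in members]

    def predict(self, x):
        votes = defaultdict(float)
        for m, h in zip(self.members, self.hits):
            weight = sum(h) / len(h) if h else 1.0   # recent accuracy
            votes[m.predict(x)] += weight
        return max(votes, key=votes.get)

    def learn(self, x, y):
        # test-then-train: score each member on x before updating it
        for m, h in zip(self.members, self.hits):
            h.append(1.0 if m.predict(x) == y else 0.0)
            m.learn(x, y)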
【Paper Link】 【Pages】:1009-1014
【Authors】: Hemanth Venkateswara ; Prasanth Lade ; Binbin Lin ; Jieping Ye ; Sethuraman Panchanathan
【Abstract】: Mutual Information (MI) is often used for feature selection when developing classifier models. Estimating the MI for a subset of features is often intractable. We demonstrate that, under assumptions of conditional independence, the MI between a subset of features and the class variable can be expressed as the Conditional Mutual Information (CMI) between pairs of features. But selecting features with the highest CMI turns out to be a hard combinatorial problem. In this work, we apply two unique global methods, the Truncated Power Method (TPower) and Low Rank Bilinear Approximation (LowRank), to solve the feature selection problem. These algorithms provide very good approximations to the NP-hard CMI-based feature selection problem. We experimentally demonstrate the effectiveness of these procedures across multiple datasets and compare them with existing MI-based global and iterative feature selection procedures.
【Keywords】: approximation theory; combinatorial mathematics; computational complexity; feature selection; LowRank; NP-hard CMI based feature selection problem; TPower; classifier models; conditional mutual information based global feature selection; hard combinatorial problem; low rank bilinear approximation; truncated power method; Approximation methods; Computational modeling; Entropy; Estimation; Mutual information; Random variables; Yttrium
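【Code Sketch】: The truncated power method referenced above is a generic tool: it approximately maximizes x^T Q x over k-sparse unit vectors by interleaving a matrix-vector product with hard truncation. Here Q would hold the pairwise (conditional) mutual-information scores; constructing Q is the paper-specific part and is stubbed out below with a random symmetric PSD matrix.

import numpy as np

def truncated_power_method(Q, k, iters=100, seed=0):
    # Approximately maximize x^T Q x s.t. ||x|| = 1, ||x||_0 <= k: each
    # step multiplies by Q, keeps the k largest-magnitude entries, and
    # renormalizes. The support of x is the selected feature subset.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=Q.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = Q @ x
        y[np.argsort(np.abs(y))[:-k]] = 0.0      # hard truncation to k entries
        x = y / (np.linalg.norm(y) + 1e-12)
    return np.flatnonzero(x)                     # selected feature indices

A = np.random.default_rng(1).normal(size=(30, 30))
Q = A @ A.T                                      # stand-in for a pairwise MI matrix
print(truncated_power_method(Q, k=5))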
【Paper Link】 【Pages】:1015-1020
【Authors】: Chenguang Wang ; Yangqiu Song ; Haoran Li ; Ming Zhang ; Jiawei Han
【Abstract】: As a fundamental task, document similarity measurement has broad impact on document-based classification, clustering, and ranking. Traditional approaches represent documents as bags-of-words and compute document similarities using measures like cosine, Jaccard, and Dice. However, entity phrases rather than single words in documents can be critical for evaluating document relatedness. Moreover, the types of entities and the links between entities/words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), where the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem to a graph distance problem. Intuitively, there could be multiple paths between a pair of documents. We propose to use meta-paths defined in the HIN to compute distances between documents. Instead of burdening the user with defining meaningful meta-paths, an automatic method is proposed to rank the meta-paths. Given the meta-paths with their ranking scores, an HIN-based similarity measure, KnowSim, is proposed to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct the HIN for documents, our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim generates impressively high-quality document clustering.
【Keywords】: deductive databases; document handling; pattern classification; pattern clustering; 20Newsgroups dataset; Freebase; HIN-based similarity measure; KnowSim; RCV1 dataset; document relatedness evaluation; document represent; document similarity measure; document-based classification; document-based clustering; document-based ranking; graph distance problem; heterogeneous information network; meta-paths; ranking scores; semantic parsing; structured heterogeneous information networks; world knowledge base; Approximation algorithms; Approximation methods; Feature extraction; Knowledge based systems; Laplace equations; Organizations; Semantics; Document similarity; heterogeneous information network; knowledge base; knowledge graph; structured text similarity
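【Code Sketch】: A PathSim-style illustration of the general shape of an HIN meta-path similarity such as KnowSim: count paths between two documents under each meta-path via a product of relation adjacency matrices, normalize by the self-path counts, and aggregate over weighted meta-paths. In KnowSim the weights come from automatic meta-path ranking; here they are stubbed as given constants.

import numpy as np

def commuting_matrix(relations):
    # Product of relation adjacency matrices along one meta-path,
    # e.g. Document-Entity-Document.
    M = relations[0]
    for A in relations[1:]:
        M = M @ A
    return M

def meta_path_similarity(meta_paths, weights, i, j):
    # 2 * weighted path count between i and j, normalized by the
    # weighted self-path counts of i and j.
    num = sum(w * commuting_matrix(p)[i, j] for p, w in zip(meta_paths, weights))
    den = sum(w * (commuting_matrix(p)[i, i] + commuting_matrix(p)[j, j])
              for p, w in zip(meta_paths, weights))
    return 2.0 * num / den if den else 0.0

# toy HIN: 3 documents x 4 entities; meta-path Doc-Entity-Doc
D_E = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]], float)
path_DED = [D_E, D_E.T]
print(meta_path_similarity([path_DED], [1.0], 0, 1))   # 2*1/(2+2) = 0.5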
【Paper Link】 【Pages】:1021-1026
【Authors】: Dawei Wang ; Wei Ding
【Abstract】: Extreme weather events, like extreme rainfalls, are severe weather hazards and also triggers for other natural disasters like floods and tornadoes. Accurate forecasting of such events relies on understanding the spatiotemporal evolution processes in the climate system. Learning from climate science data has been a challenging task, because the variations among spatial, temporal, and multivariate spaces create a huge number of features and complex regularities within the data. In this study we developed a framework for learning patterns from the spatiotemporal system and forecasting extreme weather events. In this framework, we learn patterns in a hierarchical manner: in each level, new features are learned from data and used as the input for the next level. First, we summarize the temporal evolution process of individual variables by learning location-based patterns. Second, we develop an optimization algorithm, SCOT, for summarizing the spatial regularities by growing spatial clusters from the location-based patterns. Finally, we develop an instance-based algorithm, SPC, to forecast the extreme events through classification. We applied this framework to forecasting extreme rainfall events in the eastern Central Andes area. Our experiments show that this method is able to find climatic process patterns similar to those found in domain studies, and our forecasting results outperform the state-of-the-art model.
【Keywords】: atmospheric techniques; climatology; data analysis; feature extraction; geophysics computing; learning (artificial intelligence); pattern classification; pattern clustering; rain; weather forecasting; SCOT; SPC; classification; climate science data learning; climate system; climatic process pattern; data feature learning; eastern Central Andes area; extreme rainfall event forecasting; extreme weather event forecasting; floods; hierarchical pattern learning framework; instance-based algorithm; location-based pattern learning; natural disaster; optimization algorithm; severe weather hazard; spatial clusters; spatial regularity summarization; spatiotemporal evolution process; spatiotemporal system; tornadoes; Clustering algorithms; Data mining; Forecasting; Meteorology; Optimization; Predictive models; Spatiotemporal phenomena; Forecasting; Hierarchical Learning; Spatiotemporal
【Paper Link】 【Pages】:1027-1032
【Authors】: Deqing Wang ; Hui Zhang ; Rui Liu
【Abstract】: Feature selection plays an important role in machine learning applications. Especially for text data, the high-dimensional and sparse characteristics affect the performance of feature selection. In this paper, an unsupervised feature selection algorithm based on Random Projection and Gram-Schmidt Orthogonalization (RP-GSO) over the word co-occurrence matrix is proposed. RP-GSO has three advantages: (1) it takes as input the dense word co-occurrence matrix, avoiding the sparseness of the original document-term matrix; (2) it selects "basis features" by the Gram-Schmidt process, guaranteeing the orthogonalization of the feature space; and (3) it adopts random projection to speed up the Gram-Schmidt process. We performed extensive experiments on two real-world text corpora and observed that RP-GSO achieves better performance than supervised and unsupervised baselines in text classification and clustering tasks.
【Keywords】: feature selection; matrix algebra; pattern classification; pattern clustering; random processes; text analysis; unsupervised learning; GS-orthogonalization; Gram-Schmidt orthogonalization; RP-GSO; basis feature selection; feature space orthogonalization; machine learning applications; random projection; text classification; text clustering; text corpora; unsupervised feature selection algorithm; word cooccurrence matrix; Clustering algorithms; Computer science; Feature extraction; MATLAB; Matrix decomposition; Sparse matrices; Training; GS-Orthogonalization; basis feature; random projection
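【Code Sketch】: The general recipe described in the abstract, reconstructed as pivoted Gram-Schmidt on randomly projected columns of the word co-occurrence matrix: repeatedly pick the feature whose residual (the part not explained by already-selected features) has the largest norm, then orthogonalize the rest against it. The Gaussian projection, projection dimension, and greedy pivoting rule are assumptions, not necessarily the authors' exact procedure.

import numpy as np

def rp_gso_select(C, n_features, proj_dim=256, seed=0):
    # C: dense word co-occurrence matrix; columns are candidate features.
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(proj_dim, C.shape[0])) / np.sqrt(proj_dim)
    X = R @ C                                   # random projection of columns
    X = X / (np.linalg.norm(X, axis=0) + 1e-12)
    selected, residual = [], X.copy()
    for _ in range(n_features):
        j = int(np.argmax(np.linalg.norm(residual, axis=0)))
        selected.append(j)
        q = residual[:, j] / (np.linalg.norm(residual[:, j]) + 1e-12)
        residual -= np.outer(q, q @ residual)   # Gram-Schmidt deflation
        residual[:, j] = 0.0                    # never reselect j
    return selected

C = np.abs(np.random.default_rng(1).normal(size=(500, 500)))  # toy co-occurrences
print(rp_gso_select(C, n_features=10))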
【Paper Link】 【Pages】:1033-1038
【Authors】: Martin Wistuba ; Nicolas Schilling ; Lars Schmidt-Thieme
【Abstract】: Hyperparameter tuning is often done manually, but current research has shown that automatic tuning yields effective hyperparameter configurations faster and does not require any expertise. To further improve the search, recent publications propose transferring knowledge from previous experiments to new experiments. We adapt sequential model-based optimization by replacing its surrogate model and acquisition function with one policy that is optimized for the task of hyperparameter tuning. This policy generalizes over previous experiments but uses neither a model nor meta-features; nevertheless, it outperforms the state of the art. We show that a static ranking of hyperparameter combinations yields competitive results and substantially outperforms a random hyperparameter search. Thus, it is a fast and easy alternative to complex hyperparameter tuning strategies and allows practitioners to tune their hyperparameters by simply using a look-up table. We have made look-up tables for two classifiers publicly available: SVM and AdaBoost. Furthermore, we propose a similarity measure for data sets that yields more comprehensible results than those using meta-features. We show how this similarity measure can be applied to the surrogate models in the SMBO framework and empirically show that this change leads to better hyperparameter configurations in fewer trials.
【Keywords】: pattern classification; support vector machines; AdaBoost; SMBO framework; SVM; acquisition function; automatic tuning; classifiers; complex hyperparameter tuning strategies; data sets; hyperparameter combinations; hyperparameter configurations; look-up tables; meta-features; random hyperparameter search; sequential model-based optimization; sequential model-free hyperparameter tuning; static ranking; surrogate models; Adaptation models; Data models; Loss measurement; Machine learning algorithms; Optimization; Tuning; hyperparameter optimization; meta-learning; transfer learning
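【Code Sketch】: The look-up-table idea is easy to make concrete: order hyperparameter configurations by their average rank over historical datasets, then evaluate them in that fixed order on a new dataset. The loss-table layout, budget, and toy evaluation function are illustrative.

import numpy as np

def static_ranking(loss_table):
    # loss_table[d][c]: validation loss of configuration c on past dataset d.
    # Rank configurations within each dataset, then order by mean rank.
    ranks = np.argsort(np.argsort(loss_table, axis=1), axis=1)
    return list(np.argsort(ranks.mean(axis=0)))

def tune(order, evaluate, budget=10):
    # Try configurations in the precomputed order; keep the best one.
    return min((evaluate(c), c) for c in order[:budget])

losses = np.random.default_rng(0).random((5, 8))   # 5 past datasets x 8 configs
order = static_ranking(losses)
best_loss, best_cfg = tune(order, evaluate=lambda c: losses[:, c].mean())
print(order, best_cfg)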
【Paper Link】 【Pages】:1039-1044
【Authors】: Zhiang Wu ; Youquan Wang ; Yaqiong Wang ; Junjie Wu ; Jie Cao ; Lu Zhang
【Abstract】: Driven by profits, spam reviews for product promotion or suppression have become increasingly rampant on online shopping platforms. This paper focuses on detecting hidden spam users based on product reviews. In the literature, numerous studies have suggested diverse methods for spammer detection, but whether these methods can be combined effectively for higher performance remains unclear. Along this line, a hybrid PU-learning-based Spammer Detection (hPSD) model is proposed in this paper. On one hand, hPSD can detect multi-type spammers by injecting or recognizing only a small portion of positive samples, which suits real-world application scenarios particularly well. More importantly, hPSD can leverage both user features and user relations to build a spammer classifier via a semi-supervised hybrid learning framework. Experimental results on movie data sets with shilling injection show that hPSD outperforms several state-of-the-art baseline methods. In particular, hPSD shows great potential in detecting hidden spammers as well as their underlying employers in a real-life Amazon data set. These results demonstrate the effectiveness and practical value of hPSD for real-life applications.
【Keywords】: Internet; retail data processing; security of data; unsolicited e-mail; hPSD; hidden spammer detection; hybrid PU-learning-based spammer detection model; hybrid model; movie data sets; multitype spammers; online shopping platforms; product promotion; product reviews; product suppression; profits; real-life Amazon data set; semi-supervised hybrid learning framework; shilling injection; spam reviews; spammers detection; Couplings; Data models; Economics; Feature extraction; Labeling; Motion pictures; Reliability
【Paper Link】 【Pages】:1045-1050
【Authors】: Hong Xie ; John C. S. Lui
【Abstract】: Online reputation systems serve as core building blocks in various Internet services such as e-commerce (e.g., eBay) and crowdsourcing (e.g., oDesk). The flaws of real-world online reputation systems have been reported extensively; users who are frustrated with such a system will eventually abandon the service. However, no formal studies have explored such flaws. This paper presents the first attempt, developing a novel data-analytical framework to uncover online reputation system deficiencies from data. We develop a novel measure to quantify the efficiency of online reputation systems: the ramp-up time of a new service provider. We first show that inherent preferences or personal biases in assigning feedbacks (or ratings) cause computational infeasibility in evaluating online reputation systems from data. We develop a computationally efficient randomized algorithm with theoretical performance guarantees to address this challenge. Applying our methodology to real-life datasets (from eBay and Google Helpouts), we discover that the ramp-up times in eBay and Google Helpouts are around 791 and 1,327 days respectively. Around 78.7% of sellers have ramped up in eBay, while only 1.5% of workers have ramped up in Google Helpouts. This small fraction and the long ramp-up time (1,327 days) explain why Google Helpouts was eventually shut down in April 2015.
【Keywords】: Internet; data analysis; Google Helpouts; Internet services; data analytical framework; data driven approach; eBay; online reputation systems; service provider; Companies; Crowdsourcing; Delays; Google; Web and internet services; Algorithms; Deficiencies; Online reputation
【Paper Link】 【Pages】:1051-1056
【Authors】: Chang Xu ; Jie Zhang
【Abstract】: Online review fraud has evolved in sophistication by launching intelligent campaigns where a group of coordinated participants work together to deliver deceptive reviews for the designated targets. Such collusive fraud is considered much harder to defend against as these campaign participants are capable of evading detection by shaping their behaviors collectively so as not to appear suspicious. The present work complements existing studies by exploring more subtle behavioral trails connected with collusive review fraud. A novel statistical model is proposed to further characterize, recognize, and forecast collusive fraud in online reviews. The proposed model is completely unsupervised, which bypasses the difficulty of manual annotation required for supervised modeling. It is also highly flexible to incorporate collusion characteristics available for better modeling and prediction. Experiments on two real-world datasets demonstrate the effectiveness of the proposed method and the improvements in learning and predictive abilities.
【Keywords】: fraud; reviews; security of data; statistical analysis; collusive fraud detection; collusive review fraud; online reviews; statistical model; Business; Computational modeling; Computers; Data models; Predictive models; Probabilistic logic; Synchronization; Collusive Review Fraud; Opinion Spam
【Paper Link】 【Pages】:1057-1062
【Authors】: Huang Xu ; Zhiwen Yu ; Hui Xiong ; Bin Guo ; Hengshu Zhu
【Abstract】: Discovering the determinants of job change and predicting an individual's job change occasions are essential for understanding professional careers. However, with the evolution of labor division and globalization, modern careers have become more self-directed and dynamic, which makes job change occasions difficult to predict. Fortunately, emerging online professional networks and location-based social networks provide a large amount of work experience and daily activity records of individuals around the world, which opens a venue for accurate job change analysis. In this paper, we aggregate the work experiences and check-in records of individuals to model job change motivations and the correlations between professional and daily life. Specifically, we attempt to reveal to what extent job change occasions can be predicted from career mobility and daily activity patterns at the individual level. Following the classical theory of job mobility determinants, we extract and quantify the environmental conditions and personal preferences of careers from the perspective of industrial/regional constraints and personal interests/demands. Besides, we investigate the factors of activity patterns that may be correlated with job change as cause-and-effect results. First, we quantify consumption diversity, sentiment fluctuation, and geographic movement from the check-in records as indicators. Then, we leverage a center-bias level assignment and a multi-point snapshot mechanism to capture historical and parallel migration. Finally, experimental results based on a large real-world dataset show that job change occasions can be accurately predicted with the aggregated factors.
【Keywords】: employment; globalisation; professional aspects; sentiment analysis; social networking (online); career mobility learning; center-bias level assignment; check-in records; consumption diversity; dynamic careers; environmental conditions; geographic movement; globalization; human activity pattern learning; human professional careers; job change analysis; job change determinant discovery; job change motivations; job change occasion prediction; job mobility determinants; labor division; location-based social networks; multipoint snapshot mechanism; online professional networks; personal interests; regional constraints; self-directed careers; sentiment fluctuation; Correlation; Economics; Engineering profession; Industries; LinkedIn; Organizations
【Paper Link】 【Pages】:1063-1068
【Authors】: Linli Xu ; Qi Zhou ; Aiqing Huang ; Wenjun Ouyang ; Enhong Chen
【Abstract】: The task of feature selection is to select a subset of the original features according to a predefined criterion, with the goal of removing irrelevant and redundant features, improving prediction performance, and reducing the computational costs of data mining algorithms. In this paper, we integrate feature relevance and redundancy explicitly in the feature selection criterion. Spectral feature analysis is applied here, which fits both supervised and unsupervised learning problems. Specifically, we formulate the problem as a combinatorial problem that maximizes the relevance and minimizes the redundancy of the selected subset of features at the same time. The problem can be relaxed and solved with an efficient extended power method with guaranteed global convergence. Extensive experiments demonstrate the advantages of the proposed technique in terms of improving prediction performance and reducing redundancy in data.
【Keywords】: convergence; data mining; feature selection; minimisation; redundancy; set theory; unsupervised learning; combinatorial problem; data mining algorithms; data redundancy reduction; extended power method; feature redundancy; feature relevance; feature selection criterion; feature selection subset; feature selection task; global convergence; predefined criterion; prediction performance; spectral feature analysis; supervised learning problems; unsupervised learning problems; Convergence; Correlation; Data mining; Laplace equations; Optimization; Prediction algorithms; Redundancy; eigen-optimization; redundancy; relevance; spectral feature selection; supervised/unsupervised learning
【Paper Link】 【Pages】:1069-1074
【Authors】: Pranjul Yadav ; Michael Steinbach ; Lisiane Pruinelli ; Bonnie L. Westra ; Connie Delaney ; Vipin Kumar ; György J. Simon
【Abstract】: Electronic Health Records (EHRs) consist of patient information such as demographics, medications, laboratory test results, diagnosis codes, and procedures. Mining EHRs could lead to improvements in patient healthcare management, as EHRs contain detailed information related to disease prognosis for large patient populations. We hypothesize that a patient's condition does not deteriorate at random: the trajectories, i.e., the sequences in which diseases appear in a patient, are determined by a finite number of underlying disease mechanisms. In this work, we exploit this idea by predicting a patient's risk of mortality in the context of the metabolic syndrome by assessing which of many available trajectories a patient is following and the progression along this trajectory. Implementing this idea required innovative enhancements both to the study design and to the fitting algorithm. We propose a forensic-style study design, which aligns patients on last follow-up and measures time backwards. We modify the time-dependent covariate Cox proportional hazards model to better capture coefficients of covariates that follow a particular temporal sequence, such as trajectories. Knowledge extracted from such analysis can lead to personalized treatments, thereby forming the basis for future trajectory-centered guidelines.
【Keywords】: data mining; digital forensics; electronic health records; EHR mining; covariate coefficients; disease mechanisms; disease prognosis; electronic health records; fitting algorithm; forensic-style analysis; knowledge extraction; metabolic syndrome; particular temporal sequence; patient condition; patient healthcare management improvement; patient information; patient mortality risk prediction; survival trajectories; time-dependent covariate Cox proportional hazard model; Diabetes; Diseases; Hazards; Medical diagnostic imaging; Time measurement; Trajectory
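【Code Sketch】: The "forensic-style" design can be illustrated as a simple data transformation: align each patient on their last follow-up and index records backwards from it, so trajectories are compared by distance from the end of observation rather than from enrollment. The schema below (patient_id, t, dx) is hypothetical.

import pandas as pd

def align_on_last_followup(df):
    # For each patient, t_back = (time of last follow-up) - t, so that
    # t_back == 0 is the most recent record and time runs backwards.
    last = df.groupby("patient_id")["t"].transform("max")
    return df.assign(t_back=last - df["t"]).sort_values(["patient_id", "t_back"])

records = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "t":          [0, 30, 90, 0, 45],     # days since enrollment
    "dx":         ["HTN", "DM", "CKD", "DM", "CKD"],
})
print(align_on_last_followup(records))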
【Paper Link】 【Pages】:1075-1080
【Authors】: Deqing Yang ; Yanghua Xiao ; Yangqiu Song ; Wei Wang
【Abstract】: Cross-domain recommendation has attracted wide research interest, generally aiming to improve recommendation performance by alleviating the cold-start problem in collaborative-filtering-based recommendation or by generating more comprehensive user profiles from multiple domains. In most previous cross-domain recommendation settings, explicit or implicit relationships can be easily established across different domains. However, many real applications belong to a more challenging setting: recommendation across heterogeneous domains without explicit relationships, where neither explicit user-item relations nor overlapping features exist between the domains. In this new setting, we need to (1) enrich the sparse data to characterize users or items and (2) bridge the gap caused by the heterogeneous features in different domains. To overcome the first challenge, we propose an optimized local tag propagation algorithm to generate descriptive tags for user profiling. For the second challenge, we propose a semantic relatedness metric that maps the heterogeneous features onto their concept space derived from online encyclopedias. We conducted extensive experiments on two real datasets to justify the effectiveness of our solution.
【Keywords】: collaborative filtering; recommender systems; cold start problem; collaborative filtering based recommendation; concept space; cross-domain recommendation; descriptive tag generation; heterogeneous feature mapping; item characterization; online encyclopedias; optimized local tag propagation algorithm; semantic relatedness metric; semantic-based recommendation; sparse data; user characterization; user profiles; user profiling; Bridges; Computer science; Electronic mail; Media; Motion pictures; Semantics; Twitter; cross-domain recommendation; heterogenous domains; semantic matching; user profiling
【Paper Link】 【Pages】:1081-1086
【Authors】: Pei Yang ; Jingrui He
【Abstract】: Data heterogeneity is an intrinsic property of many high-impact applications, such as insider threat detection, traffic prediction, brain image analysis, and quality control in manufacturing processes. Furthermore, multiple types of heterogeneity (e.g., task/view/instance heterogeneity) often co-exist in these applications, posing new challenges to existing techniques, most of which are tailored to a single or dual type of heterogeneity. To address this problem, in this paper we propose a novel graph-based hybrid approach to simultaneously model multiple types of heterogeneity in a principled framework. The objective is to maximize the smoothness consistency of neighboring nodes, the bag-instance correlation, and the task relatedness on the hybrid graphs, while simultaneously minimizing the empirical classification loss. Furthermore, we analyze its performance based on Rademacher complexity, which sheds light on the benefits of jointly modeling multiple types of heterogeneity. To solve the resulting non-convex non-smooth problem, we propose an iterative algorithm named M3 Learning, which combines block coordinate descent and the bundle method for optimization. Experimental results on various data sets show the effectiveness of the proposed algorithm.
【Keywords】: concave programming; data handling; data mining; graph theory; iterative methods; learning (artificial intelligence); minimisation; M3 learning; Rademacher complexity; bag-instance correlation; block coordinate descent; brain image analysis; bundle method; complex heterogeneity modeling; data heterogeneity; empirical classification loss minimization; graph-based hybrid framework; insider threat detection; iterative algorithm; manufacturing processes; neighboring nodes; quality control; resulting nonconvex nonsmooth problem; smoothness consistency maximization; task relatedness; traffic prediction; Analytical models; Bipartite graph; Correlation; Loss measurement; Optimization; Quality control; TV; heterogeneous learning; multi-instance learning; multi-task learning; multi-view learning
【Paper Link】 【Pages】:1087-1092
【Authors】: Lina Yao ; Quan Z. Sheng ; Xue Li ; Sen Wang ; Tao Gu ; Wenjie Ruan ; Wan Zou
【Abstract】: Understanding and recognizing the activities performed by people is a fundamental research topic for a wide range of important applications such as fall detection for elderly people. In this paper, we present the technical details behind Freedom, a low-cost, unobtrusive system that supports the independent living of older people. The Freedom system interprets what a person is doing by leveraging machine learning algorithms and radio-frequency identification (RFID) technology. To deal with noisy, streaming, unstable RFID signals, we develop a dictionary-based approach that learns dictionaries for activities using an unsupervised sparse coding algorithm. Our approach achieves efficient and robust activity recognition via a more compact representation of the activities. Extensive experiments conducted in a real-life residential environment demonstrate that our proposed system offers good overall performance (e.g., over 96% accuracy in recognizing 23 activities) and has the potential to be further developed to support the independent living of elderly people.
【Keywords】: assisted living; data analysis; data structures; geriatrics; learning (artificial intelligence); pattern recognition; radiofrequency identification; Freedom system; RFID sensing data; dictionary-based sparse representation; elderly people independent living; machine learning algorithms; online activity recognition; radio-frequency identification; unsupervised sparse coding algorithm; Correlation; Dictionaries; Feature extraction; Legged locomotion; Radiofrequency identification; Senior citizens; Silicon; Activity recognition; RFID; dictionary; feature selection; sensing data; sparse coding
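【Code Sketch】: A sketch of the dictionary-based recognition idea using scikit-learn's off-the-shelf sparse coder rather than the authors' algorithm: learn one dictionary per activity from feature windows of the RFID signal stream, then label a new window by the dictionary that reconstructs it with the smallest error under a sparse code. All sizes and the synthetic data are illustrative.

import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_class_dictionaries(X_by_class, n_atoms=8):
    # One sparse-coding dictionary per activity class.
    return {c: DictionaryLearning(n_components=n_atoms,
                                  transform_algorithm="lasso_lars",
                                  random_state=0).fit(X)
            for c, X in X_by_class.items()}

def recognize(dicts, x):
    # Assign the class whose dictionary reconstructs x with the
    # smallest error under its sparse code.
    errs = {}
    for c, dl in dicts.items():
        code = dl.transform(x[None, :])
        errs[c] = np.linalg.norm(x - code @ dl.components_)
    return min(errs, key=errs.get)

rng = np.random.default_rng(0)
X_by_class = {"sit": rng.normal(0, 1, (40, 20)),    # toy feature windows
              "walk": rng.normal(2, 1, (40, 20))}
dicts = learn_class_dictionaries(X_by_class)
print(recognize(dicts, rng.normal(2, 1, 20)))        # expected: "walk"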
【Paper Link】 【Pages】:1093-1098
【Authors】: Shaodan Zhai ; Chenyang Zhao ; Tian Xia ; Shaojun Wang
【Abstract】: Multi-label classification is a learning task of predicting a set of target labels for a given example. In this paper, we propose an ensemble method for multi-label classification, which is designed to optimize a novel minimum ranking margin objective function. Moreover, a boosting-type strategy is adopted to construct an accurate multi-label ensemble from multiple weak base classifiers. Experiments on different real-world multi-label classification tasks show that better performance can be achieved compared to other well-established methods.
【Keywords】: learning (artificial intelligence); optimisation; pattern classification; boosting-type strategy; learning task; minimum ranking margin maximization; minimum ranking margin objective function optimization; multilabel ensemble method; multiple-weak-base classifiers; real-world multilabel classification tasks; target label prediction; Boosting; Correlation; Linear programming; Prediction algorithms; Training; Turning; Yttrium; ensemble learning; multi-label classification
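【Code Sketch】: One plausible reading of the minimum ranking margin the ensemble optimizes: for each example, the lowest score assigned to a relevant label minus the highest score assigned to an irrelevant one, minimized over examples; boosting then tries to push this quantity up.

import numpy as np

def min_ranking_margin(scores, Y):
    # scores: (n, L) real-valued label scores; Y: boolean (n, L) relevance.
    margins = [scores[i][Y[i]].min() - scores[i][~Y[i]].max()
               for i in range(len(Y))]
    return min(margins)

scores = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.7]])
Y = np.array([[True, False, False], [False, True, True]])
print(min_ranking_margin(scores, Y))   # min(0.9 - 0.4, 0.7 - 0.1) = 0.5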
【Paper Link】 【Pages】:1099-1104
【Authors】: Jie Zhang ; Zhi Wei ; Zhenyu Yan ; Abhishek Pani
【Abstract】: Online advertising delivers promotional marketing messages to consumers through online media. Advertisers often have the desire to optimize their advertising spending and strategies in order to maximize their KPI (Key performance indicator). To build accurate ad performance predictive models, it is crucial to detect the change-points in historical data and therefore apply appropriate strategies to address the data pattern shift. However, with sparse data, which is common in online advertising, online change-point detection often becomes challenging. We propose a novel collaborated online change-point detection method in this paper. Through efficiently leveraging and coordinating with auxiliary time series, it can quickly and accurately identify the change-points in sparse and noisy time series. Simulation studies as well as real data applications have demonstrated its effectiveness in detecting change-point in sparse time series and therefore improving the accuracy of predictive models.
【Keywords】: Internet; advertising data processing; feature extraction; time series; KPI; change-point detection; data pattern shift; key performance indicator; online advertising; promotional marketing message; time series; Advertising; Data models; Media; Noise measurement; Predictive models; Time series analysis; online advertising; online change-point detection; sparse time series
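【Code Sketch】: Not the authors' collaborated model, but a compact illustration of the underlying trick: regress the sparse target series on a correlated auxiliary series over a calibration window and run a CUSUM-style detector on the standardized residuals, so fluctuations shared with the auxiliary series do not trigger false alarms. Threshold, drift, and calibration length are assumptions.

import numpy as np

def cusum_changepoint(residuals, threshold=5.0, drift=0.0):
    # Two-sided CUSUM; returns the first alarm index or -1.
    s_hi = s_lo = 0.0
    for t, r in enumerate(residuals):
        s_hi = max(0.0, s_hi + r - drift)
        s_lo = max(0.0, s_lo - r - drift)
        if s_hi > threshold or s_lo > threshold:
            return t
    return -1

def detect_with_auxiliary(y, aux, calib=100):
    # Fit y ~ beta * aux on the calibration window, then monitor the
    # standardized residuals of the full series.
    beta = np.dot(aux[:calib], y[:calib]) / np.dot(aux[:calib], aux[:calib])
    resid = y - beta * aux
    mu, sd = resid[:calib].mean(), resid[:calib].std() + 1e-12
    return cusum_changepoint((resid - mu) / sd)

rng = np.random.default_rng(0)
aux = rng.normal(0, 1, 300)
y = 0.8 * aux + rng.normal(0, 0.3, 300)
y[180:] += 2.0                                 # injected level shift
print(detect_with_auxiliary(y, aux))           # alarms shortly after 180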
【Paper Link】 【Pages】:1105-1110
【Authors】: Qian Zhang ; Lefei Zhang ; Bo Du ; Wei Zheng ; Wei Bian ; Dacheng Tao
【Abstract】: In data mining and pattern recognition, learned objects are often represented by multiple features from various views. How can one learn an efficient and effective feature embedding for subsequent learning tasks? In this paper, we address this issue by providing a novel multi-task multi-view feature embedding (MMFE) framework. The MMFE algorithm is based on the idea of low-rank approximation, which suggests that the observed multi-view feature matrix is approximately represented by a low-dimensional feature embedding multiplied by a projection matrix. In order to fully consider the particular role of each view in the multi-view feature embedding, we incorporate a multi-task learning scheme and ensemble manifold regularization into the MMFE algorithm to seek the optimal projection. Since the objective function of MMFE is multi-variable and non-convex, we further provide an iterative optimization procedure to find an available solution. Two real-world experiments show that the proposed method outperforms single-task-based as well as state-of-the-art multi-view feature embedding methods for the classification problem.
【Keywords】: approximation theory; data mining; iterative methods; matrix algebra; optimisation; MMFE algorithm; classification problem; data mining; ensemble manifold regularization; iterative optimization procedure; low-dimensional feature embedding; low-rank approximation; multitask learning scheme; multitask multiview feature embedding; multiview feature matrix; objection function; pattern recognition; projection matrix; Algorithm design and analysis; Approximation methods; Linear programming; Manifolds; Optimization; Principal component analysis; Yttrium; Classification; Dimension reduction; Multitask; Multiview
【Paper Link】 【Pages】:1111-1116
【Authors】: Qin Zhang ; Peng Zhang ; Guodong Long ; Wei Ding ; Chengqi Zhang ; Xindong Wu
【Abstract】: We study a new problem of learning from doubly-streaming data, where both the data volume and the feature space increase over time. We refer to this problem as mining trapezoidal data streams. The problem is challenging because both the data volume and the feature space are increasing, and existing online learning, online feature selection, and streaming feature selection algorithms are inapplicable. We propose a new Sparse Trapezoidal Streaming Data mining algorithm (STSD) and two variants, which combine online learning and online feature selection to enable learning from trapezoidal data streams with infinite training instances and features. Specifically, when new training instances carrying new features arrive, the classifier updates the existing features by following the passive-aggressive update rule used in online learning and updates the new features with the structural risk minimization principle. Feature sparsity is also introduced using projected truncation techniques. Extensive experiments on UCI data sets demonstrate the performance of the proposed algorithms.
【Keywords】: data mining; feature selection; learning (artificial intelligence); STSD; UCI data; data feature space; data volume space; doubly-streaming data; feature sparsity; online feature selection; online learning; sparse trapezoidal streaming data mining algorithm; streaming feature selection algorithm; structural risk minimization principle; truncation technique; Algorithm design and analysis; Data mining; Electronic mail; Heuristic algorithms; Machine learning algorithms; Prediction algorithms; Training; Feature Selection; Online Learning; Sparsity; Trapezoidal Data Streams
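【Code Sketch】: The online-learning half of STSD can be sketched as a passive-aggressive classifier over a growing feature space: the weight vector is zero-padded whenever an instance arrives carrying previously unseen features, and the weights follow the standard PA-I update. The paper's additional regularization of new features and sparsity truncation are omitted here.

import numpy as np

class TrapezoidalPA:
    def __init__(self, C=1.0):
        self.w = np.zeros(0)
        self.C = C

    def _grow(self, d):
        # New features appeared: extend the weight vector with zeros.
        if d > self.w.size:
            self.w = np.pad(self.w, (0, d - self.w.size))

    def predict(self, x):
        self._grow(x.size)
        return 1 if self.w[:x.size] @ x >= 0 else -1

    def learn(self, x, y):
        self._grow(x.size)
        w = self.w[:x.size]
        loss = max(0.0, 1.0 - y * (w @ x))            # hinge loss
        tau = min(self.C, loss / (x @ x + 1e-12))     # PA-I step size
        self.w[:x.size] += tau * y * x

clf = TrapezoidalPA()
clf.learn(np.array([1.0, -1.0]), +1)                  # 2 features
clf.learn(np.array([1.0, -1.0, 0.5]), -1)             # a 3rd feature appears
print(clf.predict(np.array([1.0, -1.0, 0.5])))        # -1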
【Paper Link】 【Pages】:1117-1122
【Authors】: Yuanyang Zhang ; Bernie J. Daigle Jr. ; Mitchell Cohen ; Linda Petzold
【Abstract】: The Cox model has been widely used in time-to-outcome predictions, particularly in studies of medical patients where prediction of the time of death is desired. In addition, the cure model has been proposed to model times of death for discharged patients. However, neither the Cox model nor the cure model allows explicit cure information or prediction of patient cure times (discharge times). In this paper we propose a new model, the "cure time model", which jointly models the static data for dying patients, surviving patients, and their death/cure times. It models (1) mortality via logistic regression and (2) death and discharge times via Cox models. We extend the cure time model to situations with censored data, where neither the time of death nor the discharge time is known, as well as to multiple (>2) outcomes. In addition, we propose a joint log-odds ratio which can predict the mortality of patients using information from both the logistic regression and Cox models. We compare our model with the Cox and cure models on a trauma patient dataset from UCSF/San Francisco General Hospital. Our results show that the cure time model more accurately predicts both mortality and time-to-mortality for patients from these datasets.
【Keywords】: medical information systems; regression analysis; Cox model; UCSF/San Francisco General Hospital; censored data; cure information; cure time model; death/cure time; discharged patient; dying patient; joint prediction; logistic regression; medical patient; patient cure time; static data; surviving patient; time-to-mortality; time-to-outcome prediction; trauma patient dataset; Analytical models; Conferences; Data models; Hazards; Hospitals; Logistics; Predictive models; clinical data; prediction; survival analysis; trauma
【Paper Link】 【Pages】:1123-1128
【Authors】: Handong Zhao ; Zhengming Ding ; Ming Shao ; Yun Fu
【Abstract】: Graph-based semi-supervised learning methods have been influential in the data mining and machine learning fields. The key is to construct an effective graph that captures the intrinsic data structure, which further benefits propagating the unlabeled data over the graph. Existing methods have shown the effectiveness of a graph regularization term in measuring the similarities among samples, which further uncovers the data structure. However, all the existing graph-based methods operate at the sample level, i.e., they calculate similarity based on sample-level representation coefficients, inevitably overlooking the underlying part-level structure within each sample. Inspired by the strong interpretability of the Non-negative Matrix Factorization (NMF) method, we design a more robust and discriminative graph by integrating low-rank factorization and a graph regularizer into a unified framework. Specifically, a novel low-rank factorization through Semi-Non-negative Matrix Factorization (SNMF) is proposed to extract the semantically meaningful part-level representation. Moreover, instead of a graph regularization at the sample level, we propose a sparse graph regularization term built on the decomposed part-level representation. This practice results in a more accurate measurement among samples, generating a more discriminative graph for semi-supervised learning. As a non-trivial contribution, we also provide an optimization solution to the proposed method. Comprehensive experimental evaluations show that our proposed method achieves superior performance compared with state-of-the-art semi-supervised classification baselines in both transductive and inductive scenarios.
【Keywords】: encoding; graph theory; learning (artificial intelligence); matrix decomposition; SNMF; discriminative graph; graph regularizer; low-rank factorization; part-level regularized seminonnegative coding; semantic part-level representation extraction; seminonnegative matrix factorization; semisupervised learning; sparse graph regularization term; Data mining; Encoding; Face; Linear programming; Matrix decomposition; Semisupervised learning; Sparse matrices
【Paper Link】 【Pages】:1129-1134
【Authors】: Yijun Zhao ; Tanuja Chitnis ; Brian C. Healy ; Jennifer G. Dy ; Carla E. Brodley
【Abstract】: Predicting disease course is critical for determining treatment in chronic progressive diseases such as multiple sclerosis (MS). Forming an accurate predictive model based on clinical data is particularly challenging when data is gathered from multiple clinics/physicians: the labels vary with physicians' subjective judgments about clinical tests, and we have no a priori knowledge of the various types of physician subjectivity. At the same time, we often have some (limited) domain knowledge on how to group patients into disease progression subgroups. In this paper, we first present our rationale for choosing a Dirichlet process mixture of Gaussian processes (DPMGP) model to address the subjectivity in our data. We then introduce a new approach to incorporating domain knowledge into this non-parametric mixture model. We demonstrate the efficacy of our model by applying it to two medical datasets, predicting disease progression in MS patients and disability levels in early Parkinson's patients.
【Keywords】: Gaussian processes; diseases; medical computing; patient treatment; DPMGP model; Gaussian processes; MS patients; Parkinson patients; chronic progressive diseases; clinical data; clinical tests; disability levels; disease course; disease progression subgroups; domain induced dirichlet mixture; domain knowledge; medical datasets; multiple sclerosis patients; nonparametric mixture model; physician subjectivity; predictive model; Clustering algorithms; Data models; Diseases; Gaussian processes; Mathematical model; Predictive models; Baysian non-parametric; Gaussian process; domain knowledge; mixture models; predictive medicine; semi-supervised learning and application
【Paper Link】 【Pages】:1135-1140
【Authors】: Dawei Zhou ; Kangyang Wang ; Nan Cao ; Jingrui He
【Abstract】: Rare category detection (RCD) is an important topic in data mining, focusing on identifying the initial examples from rare classes in imbalanced data sets. This problem becomes more challenging when the data is presented as time-evolving graphs, as in synthetic ID detection and insider threat detection. Most existing techniques for RCD are designed for static data sets and are thus not suitable for time-evolving RCD applications. To address this challenge, in this paper we first propose two incremental RCD algorithms, SIRD and BIRD. They are built upon existing density-based techniques for RCD and incrementally update the detection models, which provides 'time-flexible' RCD. Furthermore, based on BIRD, we propose a modified version named BIRD-LI to deal with the cases where the exact priors of the minority classes are not available. We also identify a critical task in RCD named query distribution. It aims to allocate a limited budget across multiple time steps, such that the initial examples from the rare classes are detected as early as possible with minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data.
【Keywords】: data mining; graph theory; learning (artificial intelligence); query processing; BIRD-LI; SIRD; critical task; data mining; density-based techniques; empirical evaluation; imbalanced data sets; incremental RCD algorithms; incremental detection model update; insider threat detection; minimum labeling cost; minority classes; query distribution; rare category detection; rare classes; real data; static data sets; synthetic ID detection; synthetic data; time-evolving graphs; time-flexible RCD; Birds; Data mining; Distribution strategy; Heuristic algorithms; Labeling; Mathematical model; Matrix decomposition; Incremental Learning; Rare Category Detection; Time-evolving Graph Mining
【Paper Link】 【Pages】:1141-1146
【Authors】: Fuzhen Zhuang ; Dan Luo ; Xin Jin ; Hui Xiong ; Ping Luo ; Qing He
【Abstract】: Multi-task learning aims at learning multiple related but different tasks. In general, there are two approaches to multi-task learning. One is to exploit the small set of labeled data from all tasks to learn a shared feature space for knowledge sharing; here the focus is on the labeled training samples, while the large amount of unlabeled data is not sufficiently considered. The other focuses on how to share model parameters among multiple tasks based on the original feature space. The question, then, is whether it is possible to combine the advantages of both approaches and develop a method that simultaneously learns a shared subspace for multiple tasks and the prediction models in this subspace. To this end, in this paper we propose a feature representation learning framework that combines autoencoders, an effective way to learn good representations from a large amount of unlabeled data, with model parameter regularization methods in a unified model for multi-task learning. Specifically, all the tasks share the same encoding and decoding weights to find their latent feature representations, based on which a regularized multi-task softmax regression method is used to find a distinct prediction model for each task. Commonalities are also enforced among the prediction models according to the relatedness of the tasks. The proposed model has several advantages: (1) it can make full use of the large amount of unlabeled data from all the tasks to learn satisfying representations; (2) the learning of distinct prediction models benefits from the success of the autoencoder; and (3) since we incorporate the labeled information into the softmax regression method, the learning of the feature representation is indeed semi-supervised. Our model is therefore a Semi-supervised Autoencoder for Multi-task Learning (SAML for short). Finally, extensive experiments on three real-world data sets demonstrate the effectiveness of the proposed framework. Moreover, the feature representation obtained in this model can be used by other methods to obtain improved results.
【Keywords】: knowledge representation; learning (artificial intelligence); regression analysis; SAML; feature representation learning framework; knowledge sharing; model parameter regularization methods; multitask learning; regularized multitask softmax regression method; semisupervised autoencoder; Data models; Decoding; Encoding; Logistics; Organizations; Predictive models; Training; Autoencoder; Multi-task learning; Representation Learning; Semi-supervised Learning
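【Code Sketch】: The broad architecture suggested by the abstract, sketched in PyTorch: an encoder/decoder shared by all tasks plus one softmax head per task, trained on a reconstruction loss over (possibly unlabeled) data plus a classification loss where labels exist. Layer sizes, the loss weighting, and the training loop are illustrative, not the paper's exact model or regularizers.

import torch
import torch.nn as nn

class SharedAutoencoderMTL(nn.Module):
    def __init__(self, d_in, d_hid, n_classes_per_task):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hid)      # weights shared by all tasks
        self.dec = nn.Linear(d_hid, d_in)
        self.heads = nn.ModuleList(
            [nn.Linear(d_hid, c) for c in n_classes_per_task])

    def forward(self, x, task):
        h = torch.tanh(self.enc(x))
        return self.dec(h), self.heads[task](h)

def loss_fn(model, x, y, task, alpha=0.5):
    # Reconstruction on (possibly unlabeled) x plus, when labels are
    # present, softmax classification: the semi-supervised combination.
    recon, logits = model(x, task)
    loss = alpha * nn.functional.mse_loss(recon, x)
    if y is not None:
        loss = loss + nn.functional.cross_entropy(logits, y)
    return loss

model = SharedAutoencoderMTL(d_in=20, d_hid=8, n_classes_per_task=[3, 2])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_unlab = torch.randn(64, 20)                      # unlabeled pool, task 0
x_lab, y_lab = torch.randn(16, 20), torch.randint(0, 3, (16,))
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model, x_unlab, None, task=0) \
         + loss_fn(model, x_lab, y_lab, task=0)
    loss.backward()
    opt.step()
print(float(loss))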