15. SDM 2015: Vancouver, BC, Canada

Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, April 30 - May 2, 2015. SIAM 【DBLP Link】

Paper Num: 110 || Session Num: 0

1. Front Matter.

【Paper Link】 【Pages】:

【Authors】:

【Abstract】: Front matter includes the welcome letter and table of contents.

【Keywords】:

2. Functional Node Detection on Linked Data.

【Paper Link】 【Pages】:1-9

【Authors】: Kang Li ; Jing Gao ; Suxin Guo ; Nan Du ; Aidong Zhang

【Abstract】: Networks, which characterize object relationships, are ubiquitous in various domains. One very important problem is to detect the nodes of a specific function in these networks. For example, is a user normal or anomalous in an email network? Does a protein play a key role in a protein-protein interaction network? In many applications, the information we have about the networks usually includes both node characteristics and network structures. Both types of information can contribute to the task of learning functional nodes, and we refer to the collection of node and link information as linked data. However, existing methods only use a few subjectively selected topological features from network structures to detect functional nodes, and thus fail to capture highly discriminative and meaningful patterns hidden in linked data. To address this problem, a novel Feature Integration based Functional Node Detection (FIND) algorithm is presented. Specifically, FIND extracts the most discriminative information from both node characteristics and network structures in the form of a unified latent feature representation with the guidance of several labeled nodes. Experiments on two real world data sets validate that the proposed method significantly outperforms the baselines on the detection of three different types of functional nodes.

【Keywords】:

3. Where Graph Topology Matters: The Robust Subgraph Problem.

【Paper Link】 【Pages】:10-18

【Authors】: Hau Chan ; Shuchu Han ; Leman Akoglu

【Abstract】: Robustness is a critical measure of the resilience of large networked systems, such as transportation and communication networks. Most prior works focus on the global robustness of a given graph at large, e.g., by measuring its overall vulnerability to external attacks or random failures. In this paper, we turn attention to local robustness and pose a novel problem along the lines of subgraph mining: given a large graph, how can we find its most robust local subgraph (RLS)? We define a robust subgraph as a subset of nodes with high communicability [15] among them, and formulate the RLS-PROBLEM of finding a subgraph of given size with maximum robustness in the host graph. Our formulation is related to the recently proposed general framework [39] for the densest subgraph problem, but differs from it substantially in that, besides the number of edges in the subgraph, robustness is also concerned with the placement of edges, i.e., the subgraph topology. We show that the RLS-PROBLEM is NP-hard and propose two heuristic algorithms based on top-down and bottom-up search strategies. Further, we present modifications of our algorithms to handle three practical variants of the RLS-PROBLEM. Experiments on synthetic and real-world graphs demonstrate that we find subgraphs with larger robustness than the densest subgraphs [9, 39] even at lower densities, suggesting that the existing approaches are not suitable for the new problem setting.
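
For concreteness, here is a minimal sketch of a communicability-based objective and a top-down greedy search of the kind the abstract describes. The natural-connectivity measure and the greedy pruning rule are standard choices assumed for illustration; this is not the authors' actual algorithm.

```python
import numpy as np

def natural_connectivity(A):
    """Communicability-based robustness of a graph with symmetric
    adjacency matrix A: log of the average of exp(eigenvalues)."""
    lam = np.linalg.eigvalsh(A)
    return np.log(np.mean(np.exp(lam)))

def greedy_robust_subgraph(A, k):
    """Top-down heuristic: repeatedly drop the node whose removal
    hurts robustness least, until k nodes remain."""
    nodes = list(range(A.shape[0]))
    while len(nodes) > k:
        scores = []
        for v in nodes:
            rest = [u for u in nodes if u != v]
            scores.append((natural_connectivity(A[np.ix_(rest, rest)]), v))
        _, drop = max(scores)  # dropping this node keeps robustness highest
        nodes.remove(drop)
    return nodes
```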

【Keywords】:

4. Same bang, fewer bucks: efficient discovery of the cost-influence skyline.

【Paper Link】 【Pages】:19-27

【Authors】: Matthijs van Leeuwen ; Antti Ukkonen

【Abstract】: Influence maximization aims to find a set of persons in a social network that can be used as seeds for a viral marketing campaign, such that the expected spread of influence is maximized. Standard approaches to this problem produce a single seed set that either maximizes influence, or the “bang for the buck” if the vertices are associated with a cost. In this paper we consider the problem of finding the cost-influence skyline, i.e., the collection of all seed sets that are Pareto optimal w.r.t. seeding cost and expected influence. Computing the cost-influence skyline has a number of advantages over finding a single solution only. First, it provides a better understanding of the trade-off between cost and influence, which enables the user to make an informed choice regarding the budget. Second, by computing the cost-influence skyline we obtain the optimal seed set for any given seeding budget, not only the one that corresponds to singleton solutions found by existing algorithms. In practice, the problem is to discover the skyline w.r.t. two functions spanned by all subsets of size k of a set of vertices. Due to the extremely large number of such subsets, this is a very hard problem. We present an efficient heuristic algorithm for computing the skyline when one of the functions is linear (e.g., the seeding cost) and the other submodular (e.g., expected influence). The experiments show that the cost-influence skyline can be computed in reasonable time for networks with up to a million vertices.
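
The Pareto criterion itself is easy to state in code. Assuming every candidate seed set has already been evaluated for cost and expected influence (the expensive part that the paper's heuristic addresses), a sketch of the skyline filter:

```python
def skyline(candidates):
    """candidates: (cost, influence, seed_set) triples. Keep those not
    dominated, i.e. no other set has lower-or-equal cost and
    greater-or-equal influence with at least one strict inequality."""
    front, best = [], float("-inf")
    for cost, influence, seeds in sorted(candidates,
                                         key=lambda c: (c[0], -c[1])):
        if influence > best:   # strictly beats every cheaper set
            front.append((cost, influence, seeds))
            best = influence
    return front

print(skyline([(1, 10, {"a"}), (2, 9, {"b"}), (2, 15, {"a", "b"})]))
# -> [(1, 10, {'a'}), (2, 15, {'a', 'b'})]
```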

【Keywords】:

5. Selecting Shortcuts for a Smaller World.

【Paper Link】 【Pages】:28-36

【Authors】: Nikos Parotsidis ; Evaggelia Pitoura ; Panayiotis Tsaparas

【Abstract】: The small world phenomenon is a desirable property of social networks, since it guarantees short paths between the nodes of the social graph and thus efficient information spread on the network. It is thus in the benefit of both network users and network owners to enforce and maintain this property. In this work, we study the problem of finding a subset of k edges from a set of candidate edges whose addition to a network leads to the greatest reduction in its average shortest path length. We formulate the problem as a combinatorial optimization problem, and show that it is NP-hard and that known approximation techniques are not applicable. We describe an efficient method for computing the exact effect of a single edge insertion on the average shortest path length, as well as several heuristics for efficiently estimating this effect. We perform experiments on real data to study the performance of our algorithms in practice.
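
For intuition, a brute-force greedy baseline is easy to write with networkx; it recomputes the average shortest path length once per candidate edge, which is exactly the cost the paper's exact single-insertion method and heuristics aim to avoid. The sketch assumes a connected graph and candidate pairs that are not already edges.

```python
import networkx as nx

def aspl_if_added(G, e):
    """Average shortest path length of G with edge e temporarily added."""
    G.add_edge(*e)
    val = nx.average_shortest_path_length(G)
    G.remove_edge(*e)
    return val

def greedy_shortcuts(G, candidates, k):
    """Add k shortcuts, each time the edge with the largest reduction."""
    chosen = []
    for _ in range(k):
        best = min(candidates, key=lambda e: aspl_if_added(G, e))
        G.add_edge(*best)
        candidates = [e for e in candidates if e != best]
        chosen.append(best)
    return chosen
```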

【Keywords】:

6. Significant Subgraph Mining with Multiple Testing Correction.

【Paper Link】 【Pages】:37-45

【Authors】: Mahito Sugiyama ; Felipe Llinares-López ; Niklas Kasenburg ; Karsten M. Borgwardt

【Abstract】: The problem of finding itemsets that are statistically significantly enriched in a class of transactions is complicated by the need to correct for multiple hypothesis testing. Pruning untestable hypotheses was recently proposed as a strategy for this task of significant itemset mining. It was shown to lead to greater statistical power, i.e., the discovery of more truly significant itemsets, than the standard Bonferroni correction on real-world datasets. An open question, however, is whether this strategy of excluding untestable hypotheses also leads to greater statistical power in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. Here we answer this question by an empirical investigation on eight popular graph benchmark datasets. We propose a new efficient search strategy, which always returns the same solution as the state-of-the-art approach and is approximately two orders of magnitude faster. Moreover, we exploit the dependence between subgraphs by considering the effective number of tests and thereby further increase the statistical power.
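
The pruning idea rests on the minimum p-value attainable at a given support: a pattern occurring in x of n transactions can never reach significance if even its most extreme contingency table fails the corrected threshold. Below is a sketch of this testability check using the one-sided Fisher/hypergeometric bound; it is my simplification of the Tarone-style criterion the paper builds on, not the paper's search procedure.

```python
from scipy.stats import hypergeom

def min_pvalue(x, n, n1):
    """Smallest one-sided Fisher p-value attainable at support x when
    n1 of the n transactions are positive: the table where every
    occurrence falls into the positive class."""
    return hypergeom.pmf(min(x, n1), n, n1, x)

def is_testable(x, n, n1, alpha, n_testable):
    """Prune a pattern whose best case cannot pass a Bonferroni-style
    threshold computed over the testable hypotheses only."""
    return min_pvalue(x, n, n1) <= alpha / max(n_testable, 1)
```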

【Keywords】:

7. From Categorical to Numerical: Multiple Transitive Distance Learning and Embedding.

【Paper Link】 【Pages】:46-54

【Authors】: Kai Zhang ; Qiaojun Wang ; Zhengzhang Chen ; Ivan Marsic ; Vipin Kumar ; Guofei Jiang ; Jie Zhang

【Abstract】: Categorical data are ubiquitous in real-world databases. However, due to the lack of an intrinsic proximity measure, many powerful algorithms for numerical data analysis may not work well on their categorical counterparts, making this a bottleneck in practical applications. In this paper, we propose a novel method to transform categorical data to numerical representations, so that abundant numerical learning methods can be exploited in categorical data mining. Our key idea is to learn a pairwise dissimilarity among categorical symbols, and hence a continuous embedding, which can then be used for subsequent numerical treatment. There are two important criteria for learning the dissimilarities. First, they should capture the important “transitivity”, which has been shown to be particularly useful in measuring the proximity relation in categorical data. Second, the pairwise sample geometry arising from the learned symbol distances should be maximally consistent with prior knowledge (e.g., class labels) to obtain good generalization performance. We achieve both through multiple transitive distance learning and embedding. Encouraging results are observed on a number of benchmark classification tasks against state-of-the-art methods.

【Keywords】:

8. Spectral Embedding of Signed Networks.

【Paper Link】 【Pages】:55-63

【Authors】: Quan Zheng ; David B. Skillicorn

【Abstract】: Social networks are often modelled by graphs, with nodes representing individuals, and positively weighted edges representing the strength of the relationships between them. Working directly with such graphs is difficult, and it has become common to use spectral techniques that embed graphs in a geometry, and then work with the geometry instead. In a good embedding, edges that are heavily (positively) weighted, and so represent strong interactions, cause the vertices they connect to be embedded close to one another. However, in some social networks there are also antagonistic relationships that are naturally represented by negatively weighted edges. The resulting graphs are called signed graphs. Clearly an embedding of such a signed graph should place nodes connected by positively weighted edges close together, but nodes connected by negatively weighted edges far apart. Existing spectral techniques to embed signed graphs have serious drawbacks. We derive two normalized spectral analysis methods for signed graphs and show, using real-world data, that they produce robust embeddings.
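
One standard construction for such embeddings, shown here for context (the paper derives and compares its own normalized variants), uses the signed Laplacian, where degrees sum absolute edge weights so that negative edges push their endpoints apart:

```python
import numpy as np

def signed_spectral_embedding(A, dim=2):
    """Embed nodes of a signed, symmetric weight matrix A using the
    bottom eigenvectors of a normalized signed Laplacian."""
    d = np.abs(A).sum(axis=1)                    # degrees over |weights|
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.diag(d) - A                           # signed Laplacian (PSD)
    L_norm = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_norm)
    return vecs[:, :dim]                         # one coordinate row per node
```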

【Keywords】:

9. An LLE based Heterogeneous Metric Learning for Cross-media Retrieval.

【Paper Link】 【Pages】:64-72

【Authors】: Peng Zhou ; Liang Du ; Mingyu Fan ; Yi-Dong Shen

【Abstract】: With unstructured heterogeneous multimedia data, such as text and images, being used more and more widely on the web, cross-media retrieval has become an increasingly important task. One of the key techniques in cross-media retrieval is how to compute distances or similarities among different types of media data. In this paper, we propose a novel heterogeneous metric learning method to compute distances between images and texts. We extend Locally Linear Embedding (LLE) to deal with heterogeneous data, so that we can not only preserve homogeneous local information but also capture heterogeneous constraints. In order to handle the out-of-sample problem, we learn two mapping functions from the embedding, and use them to transform heterogeneous data into a homogeneous space and perform retrieval in the new space. The experimental results on two real-world datasets show the effectiveness of our approach.

【Keywords】:

10. Feature Selection for Nonlinear Regression and its Application to Cancer Research.

【Paper Link】 【Pages】:73-81

【Authors】: Yijun Sun ; Jin Yao ; Steve Goodison

【Abstract】: Feature selection is a fundamental problem in machine learning. With the advent of high-throughput technologies, it becomes increasingly important in a wide range of scientific disciplines. In this paper, we consider the problem of feature selection for high-dimensional nonlinear regression. This problem has not yet been well addressed in the community, and existing methods suffer from issues such as local minima, simplified model assumptions, high computational complexity and selected features not directly related to learning accuracy. We propose a new wrapper method that addresses some of these issues. We start by developing a new approach to estimating sample responses and prediction errors, and then deploy a feature weighting strategy to find a feature subspace where a prediction error function is minimized. We formulate it as an optimization problem within the SVM framework and solve it using an iterative approach. In each iteration, a gradient descent based approach is derived to efficiently find a solution. A large-scale simulation study performed on four synthetic and nine cancer microarray datasets demonstrates the effectiveness of the proposed method.

【Keywords】:

11. Efficient Partial Order Preserving Unsupervised Feature Selection on Networks.

【Paper Link】 【Pages】:82-90

【Authors】: Xiaokai Wei ; Sihong Xie ; Philip S. Yu

【Abstract】: In the past decade, research on network data has attracted much attention and many interesting phenomena have been discovered. Such data are often characterized by high dimensionality, but how to select meaningful and more succinct features for network data has received relatively little attention. In this paper, we investigate the unsupervised feature selection problem on networks. To effectively incorporate linkage information, we propose a Partial Order Preserving (POP) principle for evaluating features. We show the advantage of this novel formulation in several respects: effectiveness, efficiency and its connection to optimizing AUC. We propose three instantiations derived from the POP principle and evaluate them using three real-world datasets. Experimental results show that our approach has significantly better performance than state-of-the-art methods under several different metrics.

【Keywords】:

12. NetCodec: Community Detection from Individual Activities.

【Paper Link】 【Pages】:91-99

【Authors】: Long Q. Tran ; Mehrdad Farajtabar ; Le Song ; Hongyuan Zha

【Abstract】: The real social network and associated communities are often hidden under the declared friend or group lists in social networks. We usually observe the manifestation of these hidden networks and communities in the form of recurrent, time-stamped individual activities in the social network. Inferring the underlying network and finding coherent communities are therefore two key challenges in social network analysis. In this paper, we address the following question: Can we simultaneously detect community structure and network infectivity among individuals from their activities? Based on the fact that the two characteristics intertwine and that knowing one helps reveal the other, we propose a multidimensional Hawkes process that can address them simultaneously. To this end, we parametrize the network infectivity in terms of individuals' participation in communities and the popularity of each individual. We show that this modeling approach has many benefits, both conceptually and experimentally. We utilize Bayesian variational inference to design NetCodec, an efficient inference algorithm which is verified with both synthetic and real world data sets. The experiments show that NetCodec can discover the underlying network infectivity and community structure more accurately than baseline methods.
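
To make the parametrization concrete, here is a sketch of a multivariate Hawkes intensity in which the infectivity between two users is built from community-participation vectors and a per-user popularity score. The exponential kernel and the exact form of the factorization are my assumptions for illustration, not necessarily NetCodec's.

```python
import numpy as np

def intensity(u, t, events, mu, community, popularity, omega=1.0):
    """Event rate of user u at time t given past events [(t_i, v), ...].
    community: (n_users, n_communities) participation matrix."""
    lam = mu[u]                                   # base rate
    for t_i, v in events:
        if t_i < t:
            infectivity = popularity[v] * (community[u] @ community[v])
            lam += infectivity * omega * np.exp(-omega * (t - t_i))
    return lam
```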

【Keywords】:

13. Efficient Algorithms for a Robust Modularity-Driven Clustering of Attributed Graphs.

【Paper Link】 【Pages】:100-108

【Authors】: Patricia Iglesias Sánchez ; Emmanuel Müller ; Uwe Leo Korn ; Klemens Böhm ; Andrea Kappes ; Tanja Hartmann ; Dorothea Wagner

【Abstract】: Clustering methods based on modularity are well-established and widely used for graph data. However, today's applications store additional attribute information for each node in the graph. This attribute information may even contradict the graph structure, which raises a major challenge for the simultaneous mining of both information sources. For attributed graphs it is essential to be aware of such contradicting effects caused by irrelevant attributes and highly deviating attribute values of outlier nodes. In this work, we focus on the robustness of graph clustering w.r.t. irrelevant attributes and outliers. We propose a modularity-driven approach for parameter-free clustering of attributed graphs and several efficient algorithms for its computation. The efficiency is achieved by our incremental calculation of attribute information within these modularity-driven algorithms. In our experiments, we evaluate our modularity-driven algorithms w.r.t. the new challenges in attributed graphs and show that they outperform existing approaches on large attributed graphs.

【Keywords】:

14. Vertex Clustering of Augmented Graph Streams.

【Paper Link】 【Pages】:109-117

【Authors】: Ryan McConville ; Weiru Liu ; Paul C. Miller

【Abstract】: In this paper we propose a graph stream clustering algorithm with a unified similarity measure on both structural and attribute properties of vertices, with each attribute being treated as a vertex. Unlike others, our approach does not require an input parameter for the number of clusters; instead, it dynamically creates new sketch-based clusters and periodically merges existing similar clusters. Experiments on two publicly available datasets reveal the advantages of our approach in detecting vertex clusters in the graph stream. We provide a detailed investigation into how parameters affect the algorithm performance. We also provide a quantitative evaluation and comparison with a well-known offline community detection algorithm which shows that our streaming algorithm can achieve comparable or better average cluster purity.

【Keywords】:

15. Tensor Spectral Clustering for Partitioning Higher-order Network Structures.

【Paper Link】 【Pages】:118-126

【Authors】: Austin R. Benson ; David F. Gleich ; Jure Leskovec

【Abstract】: Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms.

【Keywords】:

16. Community Detection for Emerging Networks.

【Paper Link】 【Pages】:127-135

【Authors】: Jiawei Zhang ; Philip S. Yu

【Abstract】: Nowadays, many new social networks offering specific services spring up overnight. In this paper, we want to detect communities for emerging networks. Community detection for emerging networks is very challenging, as information in emerging networks is usually too sparse for traditional methods to calculate effective closeness scores among users and achieve good community detection results. Meanwhile, users nowadays usually join multiple social networks simultaneously, some of which are developed and can share common information with the emerging networks. Based on both link and attribute information across multiple networks, a new general closeness measure, intimacy, is introduced in this paper. With both micro and macro controls, an effective and efficient method, CAD (Cold stArt community Detector), is proposed to propagate information from developed networks to calculate effective intimacy scores among users in emerging networks. Extensive experiments conducted on real-world social networks demonstrate that CAD can perform very well in addressing the emerging network community detection problem.

【Keywords】:

17. Labeling Educational Content with Academic Learning Standards.

【Paper Link】 【Pages】:136-144

【Authors】: Danish Contractor ; Kashyap Popat ; Shajith Ikbal ; Sumit Negi ; Bikram Sengupta ; Mukesh K. Mohania

【Abstract】: Learning standards (frequently referred to as academic standards, course curricula, etc.) define the specific structure of an educational program. Learning standards contain a list of instructions specifying various skills that students should learn at different points during their learning progression. For example, “calculate the area of a triangle” is one such instruction in a 6th grade geometry curriculum. Currently these instructions are imparted using prescribed textbooks or lesson plans which have been labeled with learning standard instructions. Teachers and students use this labeled learning content to identify relevant material for teaching and studying. However, with the recent growth of user- as well as publisher-generated content, teachers and students may want to refer to additional content beyond prescribed textbooks for their teaching/learning needs, content which is not labeled with learning standard instructions. Manually identifying the appropriate learning standard instruction for each piece of learning content is time consuming and not scalable, especially since learning standards frequently contain thousands of instructions and are subject to periodic revision. In this paper, we address the problem of automatically labeling digital learning content with learning standards. Towards this goal, we first build semantic representations of the learning standard instructions using external knowledge sources such as Wikipedia and domain textbooks. These semantic representations are then used in a framework which utilizes structural constraints imposed by the hierarchy of the learning standards to assign labels to the learning materials. We demonstrate the usefulness of our approach on a collection of high school learning materials that were labeled by curriculum experts from a US school district according to a publicly available learning standard. The system developed has been deployed and is in use by the school district. To the best of our knowledge we are the first to attempt this novel task and develop such a system.

【Keywords】:

18. Data mining for real mining: A robust algorithm for prospectivity mapping with uncertainties.

【Paper Link】 【Pages】:145-153

【Authors】: Justin Granek ; Eldad Haber

【Abstract】: Mineral prospectivity mapping is an emerging application for machine learning algorithms which presents a series of practical difficulties. The goal is to learn the mapping function which can predict the existence or absence of economic mineralization from a compilation of geoscience datasets (i.e., bedrock type, magnetic signature, geochemical response, etc.). The challenges include sparse, imbalanced labels (mineralization occurrences), varied label reliability, and a wide range in data quality and uncertainty. In order to address these issues an algorithm was developed based on total least squares and support vector machine regression which incorporates both data and label uncertainty into the objective function. This was done without losing sparsity in the residuals, thus maintaining minimal support vectors.

【Keywords】:

19. Product Adoption Rate Prediction: A Multi-factor View.

【Paper Link】 【Pages】:154-162

【Authors】: Le Wu ; Qi Liu ; Enhong Chen ; Xing Xie ; Chang Tan

【Abstract】: As the worlds of commerce and Internet technology become more inextricably linked, a large number of user consumption series become available for creative use. A critical demand along this line is to predict future product adoption for merchants, which enables a wide range of applications such as targeted marketing. However, previous works aimed only at predicting whether a user will adopt a product or not; the problem of adoption rate (or percentage of use) prediction for each user is still underexplored due to the complexity of the user decision-making process. To that end, in this paper we present a comprehensive study of this product adoption rate prediction problem. Specifically, we first introduce a decision function to capture the change of users' product adoption rate, where various factors that may influence the decision can be generally leveraged. Then, we propose two models to solve this function: the Generalized Adoption Model (GAM), which assumes all users are influenced equally by these factors, and the Personalized Adoption Model (PAM), which argues each factor contributes differently among people. Furthermore, we extend the PAM to a fully Bayesian model (BPAM) that can automatically learn all parameters. Finally, extensive experiments on two real-world datasets not only show the improvement of our proposed three models, but also give insights into the effects of the various factors on product adoption decisions.

【Keywords】:

20. PatentCom: A Comparative View of Patent Document Retrieval.

【Paper Link】 【Pages】:163-171

【Authors】: Longhui Zhang ; Lei Li ; Chao Shen ; Tao Li

【Abstract】: Patent document retrieval, as a recall-oriented search task, does not allow missing relevant patent documents due to the great commercial value of patents and the significant costs of processing a patent application or patent infringement case. Thus, it is important to retrieve all possible relevant documents rather than only a small subset of patents from the top ranked results. However, patents are often lengthy and rich in technical terms, and it often requires enormous human effort to compare a given document with retrieved results. In this paper, we formulate the problem of comparing patent documents as a comparative summarization problem, and explore automatic strategies that generate comparative summaries to assist patent analysts in quickly reviewing any given patent document pairs. To this end, we present a novel approach, named PatentCom, which first extracts discriminative terms from each patent document, and then connects the dots on a term co-occurrence graph. In this way, we are able to comprehensively extract the gists of the two patent documents being compared, and meanwhile highlight their relationship in terms of commonalities and differences. Extensive quantitative analysis and case studies on real world patent documents demonstrate the effectiveness of our proposed approach.

【Keywords】:

21. Combating Product Review Spam Campaigns via Multiple Heterogeneous Pairwise Features.

【Paper Link】 【Pages】:172-180

【Authors】: Chang Xu ; Jie Zhang

【Abstract】: Spam campaigns spotted in popular product review websites (e.g., amazon.com) have attracted mounting attention from both industry and academia, where a group of online posters are hired to collaboratively craft deceptive reviews for some target products. The goal is to manipulate perceived reputations of the targets for their best interests. Many efforts have been made to detect such colluders by extracting pointwise features from individual reviewers/reviewer-groups; however, pairwise features, which can potentially capture the underlying correlations among colluders, are either ignored or explored insufficiently in the literature. We observe that pairwise features can be more robust for modeling the relationships among colluders since they, as the ingredients of spam campaigns, are correlated in nature. In this paper, we explore multiple heterogeneous pairwise features by virtue of collusion signals found in reviewers' rating behaviors and linguistic patterns. In addition, an unsupervised and intuitive colluder detection framework is proposed which can benefit from these pairwise features. Extensive experiments on a real dataset show the effectiveness of our method and its clear superiority over several competitors.

【Keywords】:

22. A Bayesian Framework for Modeling Human Evaluations.

【Paper Link】 【Pages】:181-189

【Authors】: Himabindu Lakkaraju ; Jure Leskovec ; Jon M. Kleinberg ; Sendhil Mullainathan

【Abstract】: Several situations that we come across in our daily lives involve some form of evaluation: a process where an evaluator chooses a correct label for a given item. Examples of such situations include a crowd-worker labeling an image or a student answering a multiple-choice question. Gaining insights into human evaluations is important for determining the quality of individual evaluators as well as identifying true labels of items. Here, we generalize the question of estimating the quality of individual evaluators, extending it to obtain diagnostic insights into how various evaluators label different kinds of items. We propose a series of increasingly powerful hierarchical Bayesian models which infer latent groups of evaluators and items with the goal of obtaining insights into the underlying evaluation process. We apply our framework to a wide range of real-world domains, and demonstrate that our approach can accurately predict evaluator decisions, diagnose types of mistakes evaluators tend to make, and infer true labels of items.

【Keywords】:

23. Feature-based factorized Bilinear Similarity Model for Cold-Start Top-n Item Recommendation.

【Paper Link】 【Pages】:190-198

【Authors】: Mohit Sharma ; Jiayu Zhou ; Junling Hu ; George Karypis

【Abstract】: Recommending new items to existing users has remained a challenging problem due to the absence of users' past preferences for these items. User-personalized non-collaborative methods based on item features can be used to address this item cold-start problem. These methods rely on similarities between the target item and the user's previously preferred items. While computing similarities based on item features, these methods overlook the interactions among the features of the items and consider them independently. Modeling interactions among features can be helpful, as some features, when considered together, provide a stronger signal on the relevance of an item compared to the case where features are considered independently. To address this important issue, in this work we introduce the Feature-based factorized Bilinear Similarity Model (FBSM), which learns a factorized bilinear similarity model for Top-n recommendation of new items, given the information about items preferred by users in the past as well as the features of these items. We carry out extensive empirical evaluations on benchmark datasets, and we find that the proposed FBSM approach improves upon traditional non-collaborative methods in terms of recommendation performance. Moreover, the proposed approach also learns insightful interactions among item features from data, which leads to a deeper understanding of how these interactions contribute to personalized recommendation.
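
The core object is a bilinear similarity whose matrix is a diagonal term (features acting independently) plus a low-rank term (feature interactions). A sketch under that assumption, with shapes chosen for illustration:

```python
import numpy as np

def fbsm_similarity(f_i, f_j, w, V):
    """Bilinear similarity f_i' (diag(w) + V'V) f_j between item
    feature vectors; V has shape (rank, n_features)."""
    diagonal_part = np.sum(w * f_i * f_j)      # independent feature matches
    low_rank_part = (V @ f_i) @ (V @ f_j)      # learned feature interactions
    return diagonal_part + low_rank_part
```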

【Keywords】:

24. Cross-Modal Retrieval: A Pairwise Classification Approach.

【Paper Link】 【Pages】:199-207

【Authors】: Aditya Krishna Menon ; Didi Surian ; Sanjay Chawla

【Abstract】: Content is increasingly available in multiple modalities (such as images, text, and video), each of which provides a different representation of some entity. The cross-modal retrieval problem is: given the representation of an entity in one modality, find its best representation in all other modalities. We propose a novel approach to this problem based on pairwise classification. The approach seamlessly applies to both the settings where ground-truth annotations for the entities are absent and present. In the former case, the approach considers both positive and unlabelled links that arise in standard cross-modal retrieval datasets. Empirical comparisons show improvements over state-of-the-art methods for cross-modal retrieval.

【Keywords】:

25. Binary Classifier Calibration Using a Bayesian Non-Parametric Approach.

【Paper Link】 【Pages】:208-216

【Authors】: Mahdi Pakdaman Naeini ; Gregory F. Cooper ; Milos Hauskrecht

【Abstract】: Learning probabilistic predictive models that are well calibrated is critical for many prediction and decision-making tasks in data mining. This paper presents two new non-parametric methods for calibrating outputs of binary classification models: a method based on the Bayes optimal selection and a method based on the Bayesian model averaging. The advantage of these methods is that they are independent of the algorithm used to learn a predictive model, and they can be applied in a post-processing step, after the model is learned. This makes them applicable to a wide variety of machine learning models and methods. These calibration methods, as well as other methods, are tested on a variety of datasets in terms of both discrimination and calibration performance. The results show the methods either outperform or are comparable in performance to the state-of-the-art calibration methods.
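
For context, the classic non-parametric baseline in this space is histogram binning, which maps a classifier score to the empirical positive rate of its bin; the paper's contribution can be viewed as Bayesian selection over, or averaging of, many such binning models. A minimal sketch of the baseline only, not of the paper's methods:

```python
import numpy as np

class HistogramBinning:
    """Map raw scores in [0, 1] to empirical positive rates per bin."""

    def __init__(self, n_bins=10):
        self.edges = np.linspace(0.0, 1.0, n_bins + 1)

    def _bin(self, scores):
        return np.clip(np.digitize(scores, self.edges) - 1,
                       0, len(self.edges) - 2)

    def fit(self, scores, labels):
        idx = self._bin(np.asarray(scores))
        y = np.asarray(labels, dtype=float)
        centers = (self.edges[:-1] + self.edges[1:]) / 2.0
        self.rate = np.array([y[idx == b].mean() if np.any(idx == b)
                              else centers[b] for b in range(len(centers))])
        return self

    def predict(self, scores):
        return self.rate[self._bin(np.asarray(scores))]
```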

【Keywords】:

26. Semi-supervised learning for structured regression on partially observed attributed graphs.

【Paper Link】 【Pages】:217-225

【Authors】: Jelena Stojanovic ; Milos Jovanovic ; Djordje Gligorijevic ; Zoran Obradovic

【Abstract】: Conditional probabilistic graphical models provide a powerful framework for structured regression in spatio-temporal datasets with complex correlation patterns. However, in real-life applications a large fraction of observations is often missing, which can severely limit the representational power of these models. In this paper we propose a Marginalized Gaussian Conditional Random Fields (m-GCRF) structured regression model for dealing with missing labels in partially observed temporal attributed graphs. This method is aimed at learning with both labeled and unlabeled parts and effectively predicting future values in a graph. The method is even capable of learning from nodes for which the response variable is never observed in history, which poses problems for many state-of-the-art models that can handle missing data. The proposed model is characterized for various missingness mechanisms on 500 synthetic graphs. The benefits of the new method are also demonstrated on a challenging application for predicting precipitation based on partial observations of climate variables in a temporal graph that spans the entire continental US. We also show that the method can be useful for optimizing the costs of data collection in climate applications via active reduction of the number of weather stations to consider. In experiments on these real-world and synthetic datasets we show that the proposed model is consistently more accurate than alternative semi-supervised structured models, as well as models that either use imputation to deal with missing values or simply ignore them altogether.

【Keywords】:

27. Health Insurance Market Risk Assessment: Covariate Shift and k-Anonymity.

【Paper Link】 【Pages】:226-234

【Authors】: Dennis Wei ; Karthikeyan Natesan Ramamurthy ; Kush R. Varshney

【Abstract】: Health insurance companies prefer to enter new markets in which individuals likely to enroll in their plans have a low annual cost. When deciding which new markets to enter, health cost data for the new markets is unavailable to them, but health cost data for their own enrolled members is available. To address the problem of assessing risk in new markets, i.e., estimating the cost of likely enrollees, we pose a regression problem with demographic data as predictors combined with a novel three-population covariate shift. Since this application deals with health data that is protected by privacy laws, we cannot use the raw data of the insurance company's members directly for training the regression and covariate shift. Therefore, to construct a full solution, we also develop a novel method to achieve k-anonymity with the workload-driven quality of data distribution preservation achieved through dithered quantization and Rosenblatt's transformation. We illustrate the efficacy of the solution using real-world, publicly available data.

【Keywords】:

28. Attacking DBSCAN for Fun and Profit.

【Paper Link】 【Pages】:235-243

【Authors】: Jonathan Crussell ; W. Philip Kegelmeyer

【Abstract】: Many security applications depend critically on clustering. However, we do not know of any clustering algorithms that were designed with an adversary in mind. An intelligent adversary may be able to use this to her advantage to subvert the security of the application. Already, adversaries use obfuscation and other techniques to alter the representation of their inputs in feature space to avoid detection. As one example, spam email often mimics normal email. In this work, we investigate a more active attack, in which an adversary attempts to subvert clustering analysis by feeding in carefully crafted data points. Specifically, in this work we explore how an attacker can subvert DBSCAN, a popular density-based clustering algorithm. We explore a “confidence attack,” where an adversary seeks to poison the clusters to the point that the defender loses confidence in the utility of the system. This may result in the system being abandoned, or worse, waste the defender's time investigating false alarms. While our attacks generalize to all DBSCAN-based tools, we focus our evaluation on AnDarwin, a tool designed to detect plagiarized Android apps. We show that an adversary can merge arbitrary clusters by connecting them with “bridges”, that even a small number of merges can greatly degrade clustering performance, and that the defender has limited recourse when relying solely on DBSCAN. Finally, we propose a remediation process that uses machine learning and features based on outlier measures that are orthogonal to the underlying clustering problem to detect and remove injected points.
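
The bridge attack is easy to reproduce on toy data: points spaced closer than eps between two well-separated clusters let DBSCAN's density expansion chain straight through. A small demonstration with scikit-learn; all parameters are arbitrary choices for illustration, and AnDarwin's feature space is of course higher-dimensional.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=0.3, size=(60, 2))   # cluster near (0, 0)
b = rng.normal(loc=5.0, scale=0.3, size=(60, 2))   # cluster near (5, 5)

def n_clusters(X):
    labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_
    return len(set(labels) - {-1})                 # ignore noise points

print(n_clusters(np.vstack([a, b])))               # 2: well separated

# the attack: 50 injected points spaced ~0.14 apart (< eps) along the
# segment between the cluster centers act as a density "bridge"
t = np.linspace(0.0, 1.0, 50)[:, None]
bridge = (1 - t) * a.mean(axis=0) + t * b.mean(axis=0)
print(n_clusters(np.vstack([a, b, bridge])))       # 1: clusters merged
```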

【Keywords】:

29. Result Integrity Verification of Outsourced Privacy-preserving Frequent Itemset Mining.

【Paper Link】 【Pages】:244-252

【Authors】: Ruilin Liu ; Wendy Hui Wang

【Abstract】: In the recently-emerged Data-Mining-as-a-Service (DMaS) paradigm, a client outsources her data and her data mining needs to a third-party service provider. This raises a few security issues, including privacy protection and result integrity verification. Most recent work has studied these two issues separately. In this paper, we focus on the problem of result integrity verification of outsourced privacy-preserving frequent itemset mining. It is challenging to discover incorrect results caused by the service provider's misbehavior, since the mining output is intentionally made inaccurate by privacy protection techniques. We design efficient approaches that can provide a high probabilistic guarantee for both the correctness and the completeness of the frequent itemset mining results. Our experimental results show the efficiency and effectiveness of our approaches.

【Keywords】:

30. Modeling Users' Adoption Behaviors with Social Selection and Influence.

【Paper Link】 【Pages】:253-261

【Authors】: Ziqi Liu ; Fei Wang ; Qinghua Zheng

【Abstract】: Massive numbers of users' online adoption behaviors are recorded thanks to various emerging web services such as Facebook, Twitter, G+, Netflix and so on. Two key factors that affect users' adoption behaviors are social selection and social influence. Understanding the factors underlying each behavior can potentially help web service providers gain much more insight into their users and improve predictive power. In this paper, we try to answer: (1) What roles do selection and influence play in a user-level adoption? (2) Can capturing those factors benefit the modeling and prediction of users' adoption behaviors? Quantitatively capturing the two factors is challenging because of the well-known “ballot box communication” problem. Moreover, though social selection and influence are well studied in collaborative filtering and information diffusion respectively, it is still non-trivial to jointly model them. We propose a probabilistic Latent Factors with Diffusion Model (LFDM) which explicitly considers both social selection and influence by projecting cascading processes into latent factor spaces. We also develop an effective EM-styled algorithm for estimating the proposed model. Finally we validate our methodology on three kinds of real-world data sets.

【Keywords】:

31. Exploring the Impact of Dynamic Mutual Influence on Social Event Participation.

【Paper Link】 【Pages】:262-270

【Authors】: Tong Xu ; Hao Zhong ; Hengshu Zhu ; Hui Xiong ; Enhong Chen ; Guannan Liu

【Abstract】: Nowadays, offline social events are commonly organized through online social network services (SNS); in this way, strangers in cyberspace can be connected in the physical world. While there are some preliminary studies on social event participation through SNS, they usually focus more on mining event profiles and less on the social relationships among target users. In particular, the importance of dynamic mutual influence among potential event participants has been largely ignored. In this paper, we develop a novel discriminant framework, which allows the dynamic mutual dependence of potential event participants to be integrated into the discrimination process. Specifically, we formulate the group-oriented event participation problem as a two-stage discriminant framework to capture the users' preferences as well as their latent social connections. The experimental results on real-world data show that our method can effectively predict event participation with a significant margin compared with several state-of-the-art baselines, which validates the hypothesis that dynamic mutual influence plays an important role in the decision-making process of social event participation.

【Keywords】:

32. Efficient Online Relative Comparison Kernel Learning.

【Paper Link】 【Pages】:271-279

【Authors】: Eric Heim ; Matthew Berger ; Lee M. Seversky ; Milos Hauskrecht

【Abstract】: Learning a kernel matrix from relative comparison human feedback is an important problem with applications in collaborative filtering, object retrieval, and search. For learning a kernel over a large number of objects, existing methods face significant scalability issues inhibiting the application of these methods to settings where a kernel is learned in an online and timely fashion. In this paper we propose a novel framework called Efficient online Relative comparison Kernel LEarning (ERKLE), for efficiently learning the similarity of a large set of objects in an online manner. We learn a kernel from relative comparisons via stochastic gradient descent, one query response at a time, by taking advantage of the sparse and low-rank properties of the gradient to efficiently restrict the kernel to lie in the space of positive semidefinite matrices. In addition, we derive a passive-aggressive online update for minimally satisfying new relative comparisons as to not disrupt the influence of previously obtained comparisons. Experimentally, we demonstrate a considerable improvement in speed while obtaining improved or comparable accuracy compared to current methods in the online learning setting.
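
To see why relative comparisons give sparse, low-rank gradients, consider the hinge loss on one triplet "i is closer to j than to k" under the kernel-induced distance. The sketch below uses a full eigendecomposition to project back onto the PSD cone, which is precisely the expensive step ERKLE avoids; it illustrates the update, not the paper's efficient algorithm.

```python
import numpy as np

def kdist(K, a, b):
    """Squared distance induced by kernel matrix K."""
    return K[a, a] + K[b, b] - 2 * K[a, b]

def triplet_step(K, i, j, k, lr=0.1, margin=1.0):
    """One stochastic step for the comparison d(i, j) + margin < d(i, k)."""
    if kdist(K, i, j) + margin > kdist(K, i, k):   # comparison violated
        G = np.zeros_like(K)                       # sparse hinge gradient
        G[j, j] += 1.0
        G[i, j] -= 1.0; G[j, i] -= 1.0
        G[k, k] -= 1.0
        G[i, k] += 1.0; G[k, i] += 1.0
        K = K - lr * G
        vals, vecs = np.linalg.eigh(K)             # project onto the PSD cone
        K = (vecs * np.maximum(vals, 0.0)) @ vecs.T
    return K
```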

【Keywords】:

33. Cheetah: Fast Graph Kernel Tracking on Dynamic Graphs.

【Paper Link】 【Pages】:280-288

【Authors】: Liangyue Li ; Hanghang Tong ; Yanghua Xiao ; Wei Fan

【Abstract】: Graph kernels provide an expressive approach to measuring the similarity of two graphs, and are key building blocks behind many real-world applications, such as bioinformatics, brain science and social networks. However, current methods for computing graph kernels assume the input graphs are static, which is often not the case in reality. It is highly desirable to track the graph kernels on dynamic graphs evolving over time in a timely manner. In this paper, we propose a family of Cheetah algorithms to deal with the challenge. Cheetah leverages the low rank structure of graph updates and incrementally updates the eigen-decomposition or SVD of the adjacency matrices of graphs. Experimental evaluations on real world graphs validate that our algorithms (1) are significantly faster than alternatives while maintaining high accuracy and (2) scale sub-linearly.

【Keywords】:

34. On the Non-Trivial Generalization of Dynamic Time Warping to the Multi-Dimensional Case.

【Paper Link】 【Pages】:289-297

【Authors】: Mohammad Shokoohi-Yekta ; Jun Wang ; Eamonn J. Keogh

【Abstract】: In the last decade, Dynamic Time Warping (DTW) has emerged as the distance measure of choice for virtually all time series data mining applications. This is the result of significant progress in improving DTW's efficiency, and multiple empirical studies showing that DTW-based classifiers at least equal the accuracy of all their rivals across dozens of datasets. Thus far, most of the research has considered only the one-dimensional case, with practitioners generalizing to the multi-dimensional case in one of two ways. In general, it appears the community believes either that the two ways are equivalent, or that the choice is irrelevant. In this work, we show that this is not the case. The two most commonly used multidimensional DTW methods can produce different classifications, and neither one dominates over the other. This seems to suggest that one should learn the best method for a particular application. However, we will show that this is not necessary; a simple, principled rule can be used on a case-by-case basis to predict which of the two methods we should give credence to. Our method allows us to ensure that classification results are at least as accurate as the better of the two rival methods, and in many cases, our method is strictly more accurate. We demonstrate our ideas with the most extensive set of multi-dimensional time series classification experiments ever attempted.
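
The two generalizations in question are usually written DTW_I (align each dimension independently and sum the distances) and DTW_D (one alignment over vector-valued points). A minimal sketch of both, using squared Euclidean ground distances:

```python
import numpy as np

def dtw(x, y, dist):
    """Classic O(nm) dynamic-programming DTW with a pluggable
    point-to-point distance."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist(x[i - 1], y[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_i(X, Y):
    """Independent: one warping path per dimension, distances summed."""
    return sum(dtw(X[:, d], Y[:, d], lambda a, b: (a - b) ** 2)
               for d in range(X.shape[1]))

def dtw_d(X, Y):
    """Dependent: a single warping path over (length, n_dims) arrays."""
    return dtw(X, Y, lambda a, b: float(np.sum((a - b) ** 2)))
```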

【Keywords】:

35. Fast Mining of a Network of Coevolving Time Series.

【Paper Link】 【Pages】:298-306

【Authors】: Yongjie Cai ; Hanghang Tong ; Wei Fan ; Ping Ji

【Abstract】: Coevolving multiple time series are ubiquitous and naturally appear in a variety of high-impact applications, ranging from environmental monitoring, computer network traffic monitoring, motion capture, to physiological signal in health care and many more. In many scenarios, the multiple time series data is often accompanied by some contextual information in the form of networks. In this paper, we refer to such multiple time series, together with its embedded network as a network of coevolving time series. In order to unveil the underlying patterns of a network of coevolving time series, we propose DCMF, a dynamic contextual matrix factorization algorithm. The key idea is to find the latent factor representation of the input time series and that of its embedded network simultaneously. Our experimental results on several real datasets demonstrate that our method (1) outperforms its competitors, especially when there are lots of missing values; and (2) enjoys a linear scalability w.r.t. the length of time series.

【Keywords】:

36. Shapelet Ensemble for Multi-dimensional Time Series.

【Paper Link】 【Pages】:307-315

【Authors】: Mustafa S. Çetin ; Abdullah Mueen ; Vince D. Calhoun

【Abstract】: Time series shapelets are small subsequences that maximally differentiate classes of time series. Since the inception of shapelets, researchers have used shapelets for various data domains including anthropology and health care, and in the process suggested many efficient techniques for shapelet discovery. However, multi-dimensional time series data poses unique challenges to shapelet discovery that are yet to be solved. We show that an ensemble of shapelet-based decision trees on individual dimensions works better than shapelets defined over multiple dimensions. Generating a shapelet ensemble for multidimensional time series is computationally expensive. Most of the existing techniques prune shapelet candidates for speed. In this paper, we propose a novel technique for shapelet discovery that evaluates remaining candidates efficiently. Our algorithm uses a multi-length approximate index for time series data to efficiently find the nearest neighbors of the candidate shapelets. We employ a simple skipping technique for additional candidate pruning and a voting based technique to improve accuracy while retaining interpretability. Not only do we find a significant speed increase, our techniques enable us to efficiently discover shapelets on datasets with multi-dimensional and long time series such as hours of brain activity recordings. We demonstrate our approach on a biomedical dataset and find significant differences between patients with schizophrenia and healthy controls.
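
The primitive underlying all shapelet methods is the subsequence distance: a candidate's distance to a series is its best match over all offsets, and a good shapelet separates the classes' distance distributions. A minimal sketch; the separation score here is a simple stand-in for the information-gain criterion used in practice.

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Best-match squared distance of the shapelet to any subsequence."""
    L = len(shapelet)
    return min(float(np.sum((series[i:i + L] - shapelet) ** 2))
               for i in range(len(series) - L + 1))

def separation_score(shapelet, series_list, labels):
    """How well the candidate splits two classes' distance distributions."""
    d = np.array([shapelet_distance(s, shapelet) for s in series_list])
    y = np.asarray(labels)
    return abs(d[y == 0].mean() - d[y == 1].mean())
```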

【Keywords】:

37. Low Rank Representation on Riemannian Manifold of Symmetric Positive Definite Matrices.

【Paper Link】 【Pages】:316-324

【Authors】: Yifan Fu ; Junbin Gao ; Xia Hong ; David Tien

【Abstract】: Sparse coding aims to find a more compact representation based on a set of dictionary atoms. A well-known technique looking at 2D sparsity is the low rank representation (LRR). However, in many computer vision applications, data often originate from a manifold, which is equipped with some Riemannian geometry. In this case, the existing LRR becomes inappropriate for modeling and incorporating the intrinsic geometry of the manifold, which is potentially important and critical to applications. In this paper, we generalize the LRR over the Euclidean space to an LRR model over a specific Riemannian manifold, the manifold of symmetric positive definite (SPD) matrices. Experiments on several computer vision datasets showcase its noise robustness and superior performance on classification and segmentation compared with state-of-the-art approaches.

【Keywords】:

38. Getting to Know the Unknown Unknowns: Destructive-Noise Resistant Boolean Matrix Factorization.

【Paper Link】 【Pages】:325-333

【Authors】: Sanjar Karaev ; Pauli Miettinen ; Jilles Vreeken

【Abstract】: Finding patterns in binary data is a classical problem in data mining, dating back to at least frequent itemset mining. More recently, approaches such as tiling and Boolean matrix factorization (BMF), have been proposed to find sets of patterns that aim to explain the full data well. These methods, however, are not robust against non-trivial destructive noise, i.e. when relatively many 1s are removed from the data: tiling can only model additive noise while BMF assumes approximately equal amounts of additive and destructive noise. Most real-world binary datasets, however, exhibit mostly destructive noise. In presence/absence data, for instance, it is much more common to fail to observe something than it is to observe a spurious presence. To address this problem, we take the recent approach of employing the Minimum Description Length (MDL) principle for BMF and introduce a new algorithm, Nassau, that directly optimizes the description length of the factorization instead of the reconstruction error. In addition, unlike the previous algorithms, it can adjust the factors it has discovered during its search. Empirical evaluation on synthetic data shows that Nassau excels at datasets with high destructive noise levels and its performance on real-world datasets confirms our hypothesis of the high numbers of missing observations in the real-world data.

【Keywords】:

39. Convex Matrix Completion: A Trace-Ball Optimization Perspective.

【Paper Link】 【Pages】:334-342

【Authors】: Guangxiang Zeng ; Ping Luo ; Enhong Chen ; Hui Xiong ; Hengshu Zhu ; Qi Liu

【Abstract】: The problem of Matrix Completion (MC) refers to the process of adding entries for unknown or missing values in a matrix. In this paper, we study the convex matrix completion problem in the form of trace norm bounding. Specifically, we propose a robust solution for this problem based on trace-ball optimization, which creatively turns the original trace norm constraint into a problem of low-rank matrix factorization. Therefore, by searching in a ball space defined by the new trace constraint, the rank of the new matrix can be self-determined such that the local minimum for matrix factorization is the global minimum for the original matrix completion task. Meanwhile, we define a free parameter γ to control the model complexity of our approach in terms of how well it fits the training data. In particular, we identify a value γb of γ, the minimal value of the trace norm such that the model can exactly fit the known entries in the matrix. Furthermore, we also empirically reveal an important property of our approach: a variable η* generated by γ remains stable as the amount of training data increases. This can help to speed up the tuning of optimal parameters for large matrices. Finally, extensive experiments on several real-world datasets clearly validate the effectiveness of the proposed approach.
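
The factorization view can be sketched briefly: a trace-norm ball constraint ||X||_* <= gamma becomes tractable once X is written as U V', since the trace norm equals the minimum of (||U||_F^2 + ||V||_F^2) / 2 over all such factorizations. Below is a toy gradient-descent version under that identity; the step size, rank, and projection schedule are arbitrary choices, not the paper's algorithm.

```python
import numpy as np

def complete(M, mask, rank=10, gamma=100.0, lr=0.01, iters=500):
    """Fit U V' to the observed entries of M (mask == 1), keeping
    (||U||_F^2 + ||V||_F^2) / 2 inside the gamma-ball."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = 0.1 * rng.normal(size=(m, rank))
    V = 0.1 * rng.normal(size=(n, rank))
    for _ in range(iters):
        R = mask * (U @ V.T - M)          # residual on known entries only
        U, V = U - lr * (R @ V), V - lr * (R.T @ U)
        norm = (np.sum(U * U) + np.sum(V * V)) / 2.0
        if norm > gamma:                  # project back onto the ball
            s = np.sqrt(gamma / norm)
            U, V = s * U, s * V
    return U @ V.T
```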

【Keywords】:

40. Near-separable Non-negative Matrix Factorization with ℓ1 and Bregman Loss Functions.

【Paper Link】 【Pages】:343-351

【Authors】: Abhishek Kumar ; Vikas Sindhwani

【Abstract】: Recently, a family of tractable NMF algorithms have been proposed under the assumption that the data matrix satisfies a separability condition (Donoho & Stodden, 2003; Arora et al., 2012). Geometrically, this condition reformulates the NMF problem as that of finding the extreme rays of the conical hull of a finite set of vectors. In this paper, we develop separable NMF algorithms with ℓ1 loss and Bregman divergences, by extending the conical hull procedures proposed in our earlier work (Kumar et al., 2013). Our methods inherit all the advantages of (Kumar et al., 2013) including scalability and noise-tolerance. We show that on foreground-background separation problems in computer vision, robust near-separable NMFs match the performance of Robust PCA, considered state of the art on these problems, with an order of magnitude faster training time. We also demonstrate applications in exemplar selection settings.
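
Under separability, the anchors can be recovered by the classic successive projection algorithm (SPA), which the conical-hull line of work generalizes; a sketch for context (the textbook version, not the paper's ℓ1/Bregman extension):

```python
import numpy as np

def spa(X, r):
    """Pick r 'anchor' columns of X: repeatedly take the column with
    the largest residual norm and project out its direction."""
    R, anchors = X.astype(float).copy(), []
    for _ in range(r):
        j = int(np.argmax(np.sum(R * R, axis=0)))
        anchors.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)           # remove the chosen direction
    return anchors
```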

【Keywords】:

41. Personalized TV Recommendation with Mixture Probabilistic Matrix Factorization.

【Paper Link】 【Pages】:352-360

【Authors】: Huayu Li ; Hengshu Zhu ; Yong Ge ; Yanjie Fu ; Yuan Ge

【Abstract】: With the rapid development of the smart TV industry, a large number of TV programs have become available for meeting various user interests, which consequently raises a great demand for building personalized TV recommender systems. Indeed, a personalized TV recommender system can greatly help users to obtain their preferred programs and assist TV and channel providers to attract more audiences. While different methods have been proposed for TV recommendations, most of them neglect the mixture of watching groups behind an individual TV. In other words, there may be different groups of audiences at different times in front of a TV. For instance, in many US households, the watching groups of a TV may consist of children, wife and husband together, husband alone, wife alone, etc. To this end, in this paper, we propose a Mixture Probabilistic Matrix Factorization (mPMF) model to learn the program preferences of televisions, which assumes that the preference of a given television can be regarded as the mixed preference of different watching groups. Specifically, the latent vector of a television is drawn from a mixture of Gaussians, and the mixture number is the estimated number of watching groups behind the television. To evaluate the proposed mPMF model, we conduct extensive experiments with many state-of-the-art baseline methods and evaluation metrics on a real-world data set. The experimental results clearly demonstrate the effectiveness of our model.

【Keywords】:

42. Legislative Prediction with Dual Uncertainty Minimization from Heterogeneous Information.

【Paper Link】 【Pages】:361-369

【Authors】: Yu Cheng ; Ankit Agrawal ; Huan Liu ; Alok N. Choudhary

【Abstract】: Voting on legislative bills to form new laws is a key function of most legislatures. Predicting the votes of such deliberative bodies leads to a better understanding of government policies and generates actionable strategies for social good. In this paper, we present a novel prediction model that maximizes the usage of publicly accessible heterogeneous data, i.e., bill text and lawmakers' profile data, to carry out effective legislative prediction. In particular, we propose to design a probabilistic prediction model which achieves high consistency with past vote records while ensuring minimum uncertainty of the vote prediction, reflecting the firm legal ground often held by lawmakers. In addition, the proposed legislative prediction model enjoys the following properties: an inductive and analytical solution, the ability to handle prediction for new bills and new legislators, and robustness to the missing vote issue. We conduct an extensive empirical study using real legislative data and compare with other representative methods from both the quantitative political science and data mining communities. The experimental results clearly corroborate that the proposed method provides superior prediction accuracy with a visible performance gain.

【Keywords】:

Paper Link】 【Pages】:370-378

【Authors】: Karthik Subbian ; Arindam Banerjee ; Sugato Basu

【Abstract】: Link prediction is an important problem in online social and collaboration networks, for recommending friends and future collaborators. Most of the existing approaches for link prediction focus on building unsupervised or supervised classification models based on the availability of accepts and rejects of past recommendations. Several of these methods are feature-based: they construct a large number of network-level features to make the prediction more effective. A more flexible approach is to allow the model to learn the required features from the network for a specific task, rather than relying on explicit feature engineering. In addition, most social and collaboration relationships do not happen instantly, but rather build slowly over time through several low-cost interactions, such as email and chat. The existing approaches often ignore the availability of such auxiliary networks, which could make link prediction more robust and effective. The main focus of this work is to build a robust and effective classifier for link prediction using multiple auxiliary networks. We develop a supervised random walk model that does not require any explicit feature construction and can be personalized to each user based on past accept and reject behavior. Our approach consistently outperforms several popular baselines in terms of precision and recall on multiple real-life data sets. Also, our approach is robust to noise and sparsity in the auxiliary networks, while several popular baselines, specifically feature-based ones, are inconsistent in their performance under such conditions.

【Keywords】:

44. SourceSeer: Forecasting Rare Disease Outbreaks Using Multiple Data Sources.

Paper Link】 【Pages】:379-387

【Authors】: Theodoros Rekatsinas ; Saurav Ghosh ; Sumiko R. Mekaru ; Elaine O. Nsoesie ; John S. Brownstein ; Lise Getoor ; Naren Ramakrishnan

【Abstract】: Rapidly increasing volumes of news feeds from diverse data sources, such as online newspapers, Twitter and online blogs, are proving to be extremely valuable resources in helping anticipate, detect, and forecast outbreaks of rare diseases. This paper presents SourceSeer, a novel algorithmic framework that combines spatio-temporal topic models with source-based anomaly detection techniques to effectively forecast the emergence and progression of infectious rare diseases. SourceSeer is capable of discovering the location focus of each source, allowing sources to be used as experts with varying degrees of authoritativeness. To fuse the individual source predictions into a final outbreak prediction we employ a multiplicative weights algorithm that takes into account the accuracy of each source. We evaluate the performance of SourceSeer using incidence data for hantavirus syndromes in multiple countries of Latin America provided by HealthMap over a timespan of fifteen months. We demonstrate that SourceSeer makes predictions of increased accuracy compared to several baselines and is capable of forecasting disease outbreaks in a timely manner even when no outbreaks were previously reported.

【Keywords】:
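
The fusion step described above is a multiplicative weights scheme over sources. A generic update of that flavor is sketched below; the loss function and learning rate are assumptions, and the paper's exact update may differ.

import numpy as np

def mw_fuse(preds, outcomes, eta=0.5):
    """preds: (T, S) per-source forecasts in [0, 1]; outcomes: (T,) truth."""
    T, S = preds.shape
    w = np.ones(S)
    fused = np.empty(T)
    for t in range(T):
        p = w / w.sum()
        fused[t] = p @ preds[t]                # accuracy-weighted forecast
        loss = np.abs(preds[t] - outcomes[t])  # per-source loss in [0, 1]
        w *= np.exp(-eta * loss)               # downweight inaccurate sources
    return fused, w / w.sum()

rng = np.random.default_rng(2)
truth = rng.integers(0, 2, size=50).astype(float)
noise = rng.normal(0.0, [0.1, 0.4, 0.8], size=(50, 3))  # varying source skill
preds = np.clip(truth[:, None] + noise, 0, 1)
fused, weights = mw_fuse(preds, truth)
print(weights)  # the most accurate source ends up dominating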

45. GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data.

Paper Link】 【Pages】:388-396

【Authors】: Jialu Liu ; Chi Wang ; Jing Gao ; Quanquan Gu ; Charu C. Aggarwal ; Lance M. Kaplan ; Jiawei Han

【Abstract】: Networked data often consists of interconnected multi-typed nodes and links. A common assumption behind such heterogeneity is the shared clustering structure. However, existing network clustering approaches over-simplify the heterogeneity by either treating nodes or links in a homogeneous fashion, resulting in massive loss of information. In addition, these studies are more or less restricted to specific network schemas or applications, losing generality. In this paper, we introduce a flexible model to explain the process of forming heterogeneous links based on shared clustering information of heterogeneous nodes. Specifically, we categorize the link generation process into binary and weighted cases and model them respectively. We show these two cases can be seamlessly integrated into a unified model. We propose to maximize a joint log-likelihood function to infer the model efficiently with Expectation Maximization (EM) algorithms. Experiments on real-world networked data sets demonstrate the effectiveness and flexibility of the proposed method in fully capturing the dual heterogeneity of both nodes and links.

【Keywords】:

46. Believe It Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data.

Paper Link】 【Pages】:397-405

【Authors】: Houping Xiao ; Yaliang Li ; Jing Gao ; Fei Wang ; Liang Ge ; Wei Fan ; Long H. Vu ; Deepak S. Turaga

【Abstract】: A vast ocean of data is collected every day, and numerous applications call for the extraction of actionable insights from data. One important task is to detect untrustworthy information because such information usually indicates critical, unusual, or suspicious activities. In this paper, we study the important problem of detecting untrustworthy information from a novel perspective of correlating and comparing multiple sources that describe the same set of items. Different from existing work, we recognize the importance of time dimension in modeling the commonalities among multiple sources. We represent dynamic multi-source data as tensors and develop a joint non-negative tensor factorization approach to capture the common patterns across sources. We then conduct a comparison between source input and common patterns to identify inconsistencies as an indicator of untrustworthiness. An incremental factorization approach is developed to improve the computational efficiency on dynamically arriving data. We also propose a method to handle data sparseness. Experiments are conducted on hotel rating, network traffic flow, and weather forecast data that are collected from multiple sources. Results demonstrate the advantages of the proposed approach in detecting inconsistent and untrustworthy information.

【Keywords】:

47. Rare Class Detection in Networks.

Paper Link】 【Pages】:406-414

【Authors】: Karthik Subbian ; Charu C. Aggarwal ; Jaideep Srivastava ; Vipin Kumar

【Abstract】: The problem of node classification in networks is an important one in a wide variety of social networking domains. In many real applications such as product recommendation, the class of interest may be very rare. In such scenarios, it is often very difficult to learn the most relevant node classification characteristics, both because of the paucity of training data, and because of poor connectivity among rare-class nodes in the network structure. Node classification methods depend crucially upon structural homophily, and a lack of connectivity among rare-class nodes can create significant challenges. However, many such social networks are content-rich, and this content-rich nature can be leveraged to compensate for the lack of structural connectivity among rare-class nodes. While content-centric and semi-supervised methods have been used earlier in the context of a paucity of labeled data, the rare-class scenario has not been investigated in this context. In fact, we are not aware of any known classification method tailored towards rare-class detection in networks. This paper presents a spectral approach for rare-class detection, which uses a distance-preserving transform in order to combine the structural information in the network with the available content. We show the advantage of this approach over traditional methods for collective classification.

【Keywords】:

48. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics.

Paper Link】 【Pages】:415-423

【Authors】: Shashidhar Sundareisan ; Jilles Vreeken ; B. Aditya Prakash

【Abstract】: Given a noisy or sampled snapshot of an infection in a large graph, can we automatically and reliably recover the truly infected yet somehow missed nodes? And what about the seeds, the nodes from which the infection started to spread? These are important questions in diverse contexts, ranging from epidemiology to social media. In this paper, we address the problem of simultaneously recovering the missing infections and the source nodes of the epidemic given noisy data. We formulate the problem using the Minimum Description Length principle, and propose NETFILL, an efficient algorithm that automatically and highly accurately identifies the number and identities of both missing nodes and infection seed nodes. Experimental evaluation on synthetic and real datasets, including data from information cascades over 96 million blog posts and news articles, shows that our method outperforms other baselines, scales near-linearly, and is highly effective in recovering missing nodes and sources.

【Keywords】:

49. Clustering and Ranking in Heterogeneous Information Networks via Gamma-Poisson Model.

Paper Link】 【Pages】:424-432

【Authors】: Junxiang Chen ; Wei Dai ; Yizhou Sun ; Jennifer Dy

【Abstract】: Clustering and ranking have been successfully applied independently to homogeneous information networks, containing only one type of objects. However, real-world information networks are oftentimes heterogeneous, containing multiple types of objects and links. Recent research has shown that clustering and ranking can actually mutually enhance each other, and several techniques have been developed to integrate clustering and ranking together on a heterogeneous information network. To the best of our knowledge, however, all such techniques assume the network follows a certain schema. In this paper, we propose a probabilistic generative model that simultaneously achieves clustering and ranking on a heterogeneous network that can follow an arbitrary schema, where the edges of different types are sampled from a Poisson distribution with parameters determined by the ranking scores of the nodes in each cluster. A variational Bayesian inference method is proposed to learn these parameters, which can be used to output rankings and clusters simultaneously. Our method is evaluated on both synthetic and real-world networks extracted from the DBLP and YELP data. Experimental results show that our method outperforms the state-of-the-art baselines.

【Keywords】:

50. A Divide-and-Conquer Algorithm for Betweenness Centrality.

Paper Link】 【Pages】:433-441

【Authors】: Dóra Erdös ; Vatche Ishakian ; Azer Bestavros ; Evimaria Terzi

【Abstract】: Given a set of target nodes S in a graph G, we define the betweenness centrality of a node v with respect to S as the fraction of shortest paths among nodes in S that contain v. For this setting we describe Brandes++, a divide-and-conquer algorithm that can efficiently compute the exact values of betweenness scores. Brandes++ uses Brandes, the most widely-used algorithm for betweenness computation, as its subroutine. It achieves notably faster running times by applying Brandes to significantly smaller networks than the input graph, and many of its computations can be done in parallel. The degree of speedup achieved by Brandes++ depends on the community structure of the input network as well as the size of S. Our experiments with real-life networks reveal that Brandes++ achieves an average 10-fold speedup over Brandes, and on some networks this speedup reaches 75-fold. We have made our code public to benefit the research community.

【Keywords】:
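
For reference, the quantity Brandes++ targets (betweenness restricted to a target set S) can be computed directly, though slowly, with networkx's subset variant; the paper's contribution is obtaining the same exact scores much faster on large graphs.

import networkx as nx

G = nx.karate_club_graph()
S = [0, 5, 16, 33]  # arbitrary target nodes, for illustration only
bc = nx.betweenness_centrality_subset(G, sources=S, targets=S)
top = sorted(bc, key=bc.get, reverse=True)[:5]
print([(v, round(bc[v], 3)) for v in top])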

51. Frameworks to Encode User Preferences for Inferring Topic-sensitive Information Networks.

Paper Link】 【Pages】:442-450

【Authors】: Qingbo Hu ; Sihong Xie ; Shuyang Lin ; Wei Fan ; Philip S. Yu

【Abstract】: The connection between online users is the key to the success of many important applications, such as viral marketing. In reality, we often easily observe the time when each user in the network receives a message, yet the users' connections that empower the message diffusion remain hidden. Therefore, given the traces of disseminated messages, recent research has extensively studied approaches to uncover the underlying diffusion network. Since topic related information could assist the network inference, previous methods incorporated either users' preferences over topics or the topic distributions of cascading messages. However, methods combining both of them may lead to more accurate results, because they consider a more comprehensive range of available information. In this paper, we investigate this possibility by exploring two principled methods: Weighted Topic Cascade (WTC) and Preference-enhanced Topic Cascade (PTC). WTC and PTC formulate the network inference task as non-smooth convex optimization problems and adopt coordinate proximal gradient descent to solve them. Based on synthetic and real datasets, substantial experiments demonstrate that although WTC is better than several previous approaches in most cases, it is less stable than PTC, which constantly outperforms other baselines with an improvement of 4%∼10% in terms of the F-measure of inferred networks.

【Keywords】:

Paper Link】 【Pages】:451-459

【Authors】: Shuangfei Zhai ; Zhongfei (Mark) Zhang

【Abstract】: Matrix factorization (MF) and Autoencoder (AE) are among the most successful approaches of unsupervised learning. While MF-based models have been extensively exploited in the graph modeling and link prediction literature, the AE family has not gained much attention. In this paper we investigate the application of both MF and AE to the link prediction problem in sparse graphs. We show the connection between AE and MF from the perspective of multiview learning, and further propose MF+AE: a model that trains MF and AE jointly with shared parameters. We apply dropout to training both the MF and AE parts, and show that it can significantly prevent overfitting by acting as an adaptive regularization. We conduct experiments on six real-world sparse graph datasets, and show that MF+AE consistently outperforms the competing methods, especially on datasets that exhibit strong non-cohesive structures.

【Keywords】:

53. An ADMM Algorithm for Clustering Partially Observed Networks.

Paper Link】 【Pages】:460-468

【Authors】: Necdet Serhat Aybat ; Sahar Zarmehri ; Soundar Kumara

【Abstract】: Community detection has attracted increasing attention during the past decade, and many algorithms have been proposed to find the underlying community structure in a given network. Many of these algorithms are based on modularity maximization, and these methods suffer from the resolution limit. In order to detect the underlying cluster structure, we propose a new convex formulation to decompose a partially observed adjacency matrix of a network into low-rank and sparse components. In such a decomposition, the low-rank component encodes the cluster structure under certain assumptions. We also devise an alternating direction method of multipliers with an increasing penalty sequence to solve this problem, and compare it with the Louvain method, which maximizes the modularity, on synthetic randomly generated networks. Numerical results show that our method outperforms the Louvain method on the randomly generated networks when the variance among cluster sizes increases. Moreover, empirical results also demonstrate that our formulation is indeed tighter than the robust PCA formulation, and is able to find the true clustering when the robust PCA formulation fails.

【Keywords】:
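
A minimal sketch of the generic low-rank-plus-sparse ADMM splitting (RPCA-style singular value thresholding plus entrywise soft thresholding, with an increasing penalty sequence), assuming a fully observed matrix. The paper's formulation is a tighter variant and handles partial observation; this is only the textbook version of the idea.

import numpy as np

def svt(M, tau):                     # singular value thresholding
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):                    # entrywise soft thresholding
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def low_rank_plus_sparse(A, lam, rho=1.0, iters=100):
    L = np.zeros_like(A); S = np.zeros_like(A); Y = np.zeros_like(A)
    for _ in range(iters):
        L = svt(A - S + Y / rho, 1.0 / rho)    # low-rank update
        S = soft(A - L + Y / rho, lam / rho)   # sparse update
        Y += rho * (A - L - S)                 # dual update
        rho *= 1.05                            # increasing penalty sequence
    return L, S

A = np.kron(np.eye(2), np.ones((5, 5)))        # two 5-node cliques
A[0, 9] = A[9, 0] = 1.0                        # one spurious cross edge
L, S = low_rank_plus_sparse(A, lam=1.0 / np.sqrt(10))
print(np.linalg.matrix_rank(np.round(L, 2)), round(S[0, 9], 2))  # 2, ~1.0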

54. Scaling log-linear analysis to datasets with thousands of variables.

Paper Link】 【Pages】:469-477

【Authors】: François Petitjean ; Geoffrey I. Webb

【Abstract】: Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges; and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about four orders of magnitude in speed, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.

【Keywords】:

55. A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning.

Paper Link】 【Pages】:478-486

【Authors】: Aurélien Bellet ; Yingyu Liang ; Alireza Bagheri Garakani ; Maria-Florina Balcan ; Fei Sha

【Abstract】: Learning sparse combinations is a frequent theme in machine learning. In this paper, we study its associated optimization problem in the distributed setting where the elements to be combined are not centrally located but spread over a network. We address the key challenges of balancing communication costs and optimization errors. To this end, we propose a distributed Frank-Wolfe (dFW) algorithm. We obtain theoretical guarantees on the optimization error ∊ and communication cost that do not depend on the total number of combining elements. We further show that the communication cost of dFW is optimal by deriving a lower bound on the communication cost required to construct an ∊-approximate solution. We validate our theoretical analysis with empirical studies on synthetic and real-world data, which demonstrate that dFW outperforms both baselines and competing methods. We also study the performance of dFW when the conditions of our analysis are relaxed, and show that dFW is fairly robust.

【Keywords】:
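
The building block that dFW distributes is the standard Frank-Wolfe step over an ℓ1 ball: each iteration picks a single atom, so in the distributed version only that atom's index and a scalar need to be communicated per round. Below is a centralized sketch on a toy sparse regression problem, purely for intuition.

import numpy as np

def frank_wolfe_l1(A, b, radius=1.0, iters=200):
    n = A.shape[1]
    x = np.zeros(n)
    for t in range(iters):
        grad = A.T @ (A @ x - b)           # gradient of 0.5 * ||Ax - b||^2
        j = int(np.argmax(np.abs(grad)))   # linear oracle: best l1-ball vertex
        s = np.zeros(n)
        s[j] = -radius * np.sign(grad[j])
        gamma = 2.0 / (t + 2.0)            # standard step size
        x = (1.0 - gamma) * x + gamma * s  # iterate is a sparse combination
    return x

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 50))
x_true = np.zeros(50); x_true[[3, 17, 40]] = [0.5, -0.3, 0.2]
b = A @ x_true                             # note ||x_true||_1 = 1 = radius
x_hat = frank_wolfe_l1(A, b)
print(np.nonzero(np.abs(x_hat) > 0.05)[0])  # should recover {3, 17, 40}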

56. Exceptional Model Mining with Tree-Constrained Gradient Ascent.

Paper Link】 【Pages】:487-495

【Authors】: Thomas E. Krak ; Ad Feelders

【Abstract】: Exceptional Model Mining (EMM) generalizes the well-known data mining task of Subgroup Discovery (SD). Given a model class of interest, the goal of EMM is to find subgroups of the data for which a model fitted to the subgroup deviates substantially from the global model. In both SD and EMM, heuristic search is often employed because exhaustive search of the subgroup description space is in general not feasible. We present a new heuristic search strategy for EMM called tree-constrained gradient ascent (TCGA). It is designed to exploit information about the influence of individual records on the quality of a subgroup, while guaranteeing that the subgroups can be described in the pattern language. Using the notion of soft subgroups, numerical optimization can be applied to find subgroups of high quality. We introduce a form of constrained gradient ascent that constructs the constraint simultaneously with the numerical optimization, so as to guarantee that the subgroups can be described in the pattern language while limiting the optimization as little as possible. We show how TCGA can be applied to EMM with linear regression models. The proposed algorithm is evaluated on both synthetic and real-world data sets. We compare the results to those obtained with beam search, a heuristic search algorithm commonly employed in SD and EMM.

【Keywords】:

57. FORMULA: FactORized MUlti-task LeArning for task discovery in personalized medical models.

Paper Link】 【Pages】:496-504

【Authors】: Jianpeng Xu ; Jiayu Zhou ; Pang-Ning Tan

【Abstract】: Medical predictive modeling is a challenging problem due to the heterogeneous nature of the patients. In order to build effective medical predictive models, we need to address such heterogeneity during modeling and allow patients to have their own personalized models instead of using a one-size-fits-all model. However, building a personalized model for each patient is computationally expensive, and the over-parametrization of the model makes it susceptible to overfitting. To address these challenges, we propose a novel approach called FactORized MUlti-task LeArning model (FORMULA), which learns the personalized model of each patient via a sparse multi-task learning method. The personalized models are assumed to share a low-rank representation, known as the base models. FORMULA is designed to simultaneously learn the base models as well as the personalized model of each patient, where the latter is a linear combination of the base models. We have performed extensive experiments to evaluate the proposed approach on a real medical data set. The proposed approach delivered superior predictive performance while the personalized models offered many useful medical insights.

【Keywords】:

58. Active Multi-task Learning via Bandits.

Paper Link】 【Pages】:505-513

【Authors】: Meng Fang ; Dacheng Tao

【Abstract】: In multi-task learning, the multiple related tasks allow each one to benefit from the learning of the others, and labeling instances for one task can also affect the other tasks, especially when a task has only a small amount of labeled data. Thus labeling effective instances across different learning tasks is important for improving the generalization error of all tasks. In this paper, we propose a new active multi-task learning paradigm, which selectively samples effective instances for multi-task learning. Inspired by multi-armed bandits, which can balance the trade-off between exploitation and exploration, we introduce a new active learning strategy and cast the selection procedure as a bandit framework. We consider both the risk of the multi-task learner and the corresponding confidence bounds, and our selection tries to balance this trade-off. Our proposed method is a sequential algorithm, which at each round maintains a sampling distribution on the pool of data, queries the label for an instance according to this distribution, and updates the distribution based on the newly trained multi-task learner. We provide an implementation of our algorithm based on a popular multi-task learning algorithm, namely the trace-norm regularization method. Theoretical guarantees are developed by exploiting Rademacher complexities. Comprehensive experiments show the effectiveness and efficiency of the proposed approach.

【Keywords】:

59. Hierarchical Active Transfer Learning.

Paper Link】 【Pages】:514-522

【Authors】: David C. Kale ; Marjan Ghazvininejad ; Anil Ramakrishna ; Jingrui He ; Yan Liu

【Abstract】: We describe a unified active transfer learning framework called Hierarchical Active Transfer Learning (HATL). HATL exploits cluster structure shared between different data domains to perform transfer learning by imputing labels for unlabeled target data and to generate effective label queries during active learning. The resulting framework is flexible enough to perform not only adaptive transfer learning and accelerated active learning but also unsupervised and semi-supervised transfer learning. We derive an intuitive and useful upper bound on HATL's error when used to infer labels for unlabeled target points. We also present results on synthetic data that confirm both intuition and our analysis. Finally, we demonstrate HATL's empirical effectiveness on a benchmark data set for sentiment classification.

【Keywords】:

60. Learning Complex Rare Categories with Dual Heterogeneity.

Paper Link】 【Pages】:523-531

【Authors】: Pei Yang ; Jingrui He ; Jia-Yu Pan

【Abstract】: In the era of big data, it is often the case that the self-similar rare categories in a large data set are of great importance, such as malicious insiders in big organizations and defective IC devices in semiconductor manufacturing. Furthermore, such rare categories often exhibit multiple types of heterogeneity, such as task heterogeneity, which originates from data collected in multiple domains, and view heterogeneity, which originates from multiple information sources. Existing methods for learning rare categories mainly focus on homogeneous settings, i.e., a single task and a single view. In this paper, for the first time, we study complex rare categories with both task and view heterogeneity, and propose a novel optimization framework named M2LID. It introduces a boundary characterization metric to capture the sharp changes in density near the boundary of the rare categories in the feature space, and constructs a graph-based model to leverage both task and view heterogeneity. Furthermore, M2LID integrates them in a mutually beneficial way. We also present an effective algorithm to solve this framework, analyze its performance from various aspects, and demonstrate its effectiveness on both synthetic and real datasets.

【Keywords】:

61. Faster Jobs in Distributed Data Processing using Multi-Task Learning.

Paper Link】 【Pages】:532-540

【Authors】: Neeraja J. Yadwadkar ; Bharath Hariharan ; Joseph Gonzalez ; Randy H. Katz

【Abstract】: Slow-running or straggler tasks in distributed processing frameworks [1, 2] can be 6 to 8 times slower than the median task in a job on a production cluster [3], despite existing mitigation techniques. This leads to extended job completion times, inefficient use of resources, and increased costs. Recently, proactive straggler avoidance techniques [4] have explored the use of predictive models to improve task scheduling. However, to capture node and workload variability, separate models are built for every node and workload, requiring the time-consuming collection of training data and limiting the applicability to new nodes and workloads. In this work, we observe that predictors for similar nodes or workloads are likely to be similar and can share information, suggesting a multi-task learning (MTL) based approach. We generalize the MTL formulation of [5] to capture commonalities in arbitrary groups. Using our formulation to predict stragglers allows us to reduce job completion times by up to 59% over Wrangler [4]. This large reduction arises from a 7-point increase in prediction accuracy. Further, we can get equal or better accuracy than [4] using a sixth of the training data, thus bringing the training time down from 4 hours to about 40 minutes. In addition, our formulation reduces the number of parameters by grouping them into node- and workload-dependent factors. This helps us generalize to tasks with insufficient data and achieve significant gains over a naive MTL formulation [5].

【Keywords】:

62. Selecting Social Media Responses to News: A Convex Framework Based On Data Reconstruction.

Paper Link】 【Pages】:541-549

【Authors】: Zaiyi Chen ; Linli Xu ; Enhong Chen ; Biao Chang ; Zhefeng Wang ; Yitan Li

【Abstract】: With its explosive growth, social media has gained significantly increasing attention from both journalists and their readership in recent years by enhancing the reading experience with timeliness, high participation, interactivity, etc. On the other hand, the popularity of social media services such as Twitter also leads to the challenge of information overload, generating thousands of responses (tweets) for each hot news article, which can be overwhelming for readers. In this paper, we address the problem of selecting a representative subset of responses to news in order to deliver the most important information. We consider different criteria regarding the importance of the selected subset, and treat the problem from the data reconstruction perspective with concern for both the quality and the generalizability of the selection. The intuition behind our work is that a good selection should be relevant at two levels: i) at the message level, it brings readers as much new information as possible or generalizes other people's opinions comprehensively; ii) at the text level, it is able to reconstruct the corpus. Specifically, the task of selecting responses to news can be formulated as a convex optimization problem where sparse non-negative weights are introduced for all the responses, indicating whether they are selected or not. Several gradient-based optimization and step-size selection methods are also investigated in this paper to achieve a faster rate of convergence. More importantly, the proposed framework evaluates the utility of a set of responses jointly and is therefore able to reduce redundancy of the selected responses. We evaluate our approach on real-world data obtained from Twitter, and the results demonstrate superior performance over the state of the art in both accuracy and generalizability.

【Keywords】:

63. Tracking Events Using Time-dependent Hierarchical Dirichlet Tree Model.

Paper Link】 【Pages】:550-558

【Authors】: Rumeng Li ; Tao Wang ; Xun Wang

【Abstract】: Timeline generation, which produces news timelines from a massive news corpus, aims at providing readers with summaries of the evolution of an event. It is a new challenge of summarization that combines salience ranking with novelty detection. For a long-term public event, the main topic usually includes many different sub-topics at varying epochs, each of which has its own evolving patterns. Existing approaches fail to utilize such hierarchical topic structure in the news corpus for timeline generation. In this paper, we develop a novel time-dependent Hierarchical Dirichlet Tree Model (tHDT) for timeline generation. Our model can aptly detect different levels of topic information in a corpus, and this structure is further used for sentence selection. Based on the topic distribution mined from tHDT, sentences are selected through an overall consideration of relevance, coherence and coverage. We develop experimental systems to compare different rival algorithms on 8 long-term events of public concern. The performance comparison demonstrates the effectiveness of our proposed model in terms of ROUGE metrics.

【Keywords】:

64. Fast Eigen-Functions Tracking on Dynamic Graphs.

Paper Link】 【Pages】:559-567

【Authors】: Chen Chen ; Hanghang Tong

【Abstract】: Many important graph parameters can be expressed as eigen-functions of the adjacency matrix. Examples include the epidemic threshold, graph robustness, etc. It is often of key importance to accurately monitor these parameters. For example, knowing that the Ebola virus has already been brought to the US continent, it is important to know which emerging connections among related people would cause a great reduction in the epidemic threshold of the network, so that the virus can be prevented from spreading. However, most, if not all, of the existing algorithms computing these measures assume that the input graph is static, despite the fact that almost all real graphs evolve over time. In this paper, we propose two online algorithms to track the eigen-functions of a dynamic graph with complexity linear in the number of nodes and the number of changed edges in the graph. The key idea is to leverage matrix perturbation theory to efficiently update the top eigen-pairs of the underlying graph without recomputing them from scratch at each time stamp. Experimental results demonstrate that our methods can reach up to 20× speedup with precision of more than 80% over fairly long periods of time.

【Keywords】:
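
The matrix-perturbation trick underlying such tracking fits in a few lines: given the current leading eigenpair, a first-order update approximates the new leading eigenvalue after an edge change without recomputing from scratch. This generic sketch is not the paper's full algorithm (which also tracks eigenvectors and higher eigen-pairs).

import numpy as np

rng = np.random.default_rng(4)
n = 200
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.triu(A, 1); A = A + A.T            # random symmetric adjacency

vals, vecs = np.linalg.eigh(A)
lam, u = vals[-1], vecs[:, -1]            # current leading eigenpair

i, j = 0, 1                               # insert edge (i, j)
dA = np.zeros_like(A); dA[i, j] = dA[j, i] = 1.0
lam_approx = lam + u @ dA @ u             # first-order update: 2 * u[i] * u[j]

lam_exact = np.linalg.eigvalsh(A + dA)[-1]
print(round(lam_approx, 4), round(lam_exact, 4))  # nearly identical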

65. Approximation Algorithms for Reducing the Spectral Radius to Control Epidemic Spread.

Paper Link】 【Pages】:568-576

【Authors】: Sudip Saha ; Abhijin Adiga ; B. Aditya Prakash ; Anil Kumar S. Vullikanti

【Abstract】: The largest eigenvalue of the adjacency matrix of a network (referred to as the spectral radius) is an important metric in its own right. Further, for several models of epidemic spread on networks (e.g., the ‘flu-like’ SIS model), it has been shown that an epidemic dies out quickly if the spectral radius of the graph is below a certain threshold that depends on the model parameters. This motivates a strategy to control epidemic spread by reducing the spectral radius of the underlying network. In this paper, we develop a suite of provable approximation algorithms for reducing the spectral radius by removing the minimum cost set of edges (modeling quarantining) or nodes (modeling vaccinations), with different time and quality tradeoffs. Our main algorithm, GREEDYWALK, is based on the idea of hitting closed walks of a given length, and gives an O(log² n)-approximation, where n denotes the number of nodes; it also performs much better in practice compared to all prior heuristics proposed for this problem. We further present a novel sparsification method to improve its running time. In addition, we give a new primal-dual based algorithm with an even better approximation guarantee (O(log n)), albeit with slower running time. We also give lower bounds on the worst-case performance of some of the popular heuristics. Finally we demonstrate the applicability of our algorithms and the properties of our solutions via extensive experiments on multiple synthetic and real networks.

【Keywords】:
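
For context, a common prior heuristic for this problem scores each edge (i, j) by the product u[i] * u[j] of the leading eigenvector's entries and deletes the top-k. The sketch below implements that eigen-score baseline, one of the heuristics the paper compares against, not GREEDYWALK itself.

import numpy as np
import networkx as nx

G = nx.barabasi_albert_graph(300, 3, seed=5)
vals, vecs = np.linalg.eigh(nx.to_numpy_array(G))
lam, u = vals[-1], np.abs(vecs[:, -1])    # spectral radius and Perron vector

k = 10
ranked = sorted(G.edges(), key=lambda e: u[e[0]] * u[e[1]], reverse=True)
G.remove_edges_from(ranked[:k])           # drop the k highest-scoring edges
lam_new = np.linalg.eigvalsh(nx.to_numpy_array(G))[-1]
print(round(lam, 3), "->", round(lam_new, 3))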

66. Propagation-based Sentiment Analysis for Microblogging Data.

Paper Link】 【Pages】:577-585

【Authors】: Jiliang Tang ; Chikashi Nobata ; Anlei Dong ; Yi Chang ; Huan Liu

【Abstract】: The explosive popularity of microblogging services encourages more and more online users to share their opinions, and sentiment analysis on such opinion-rich resources has been proven to be an effective way to understand public opinions. On the one hand, the brevity and informality of microblogging data plus its wide variety and rapid evolution of language in microblogging pose new challenges to the vast majority of existing methods. On the other hand, microblogging texts contain various types of emotional signals strongly associated with their sentiment polarity, which brings about new opportunities for sentiment analysis. In this paper, we investigate propagation-based sentiment analysis for microblogging data. In particular, we provide a propagating process to incorporate various types of emotional signals in microblogging data into a coherent model, and propose a novel sentiment analysis framework PSA which learns from both labeled and unlabeled data by iteratively alternating a propagating process and a fitting process. We conduct experiments on real-world microblogging datasets, and the results demonstrate the effectiveness of the proposed framework. Further experiments are conducted to probe the working of the key components of the proposed framework.

【Keywords】:

67. POLYGLOT-NER: Massive Multilingual Named Entity Recognition.

Paper Link】 【Pages】:586-594

【Authors】: Rami Al-Rfou' ; Vivek Kulkarni ; Bryan Perozzi ; Steven Skiena

【Abstract】: The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named Entity Recognition (NER) annotators for 40 major languages using Wikipedia and Freebase. Our approach does not require NER human-annotated datasets or language-specific resources like treebanks, parallel corpora, and orthographic rules. The novelty of our approach lies in using only language-agnostic techniques while achieving competitive performance. Our method learns distributed word representations (word embeddings) which encode semantic and syntactic features of words in each language. Then, we automatically generate datasets from Wikipedia link structure and Freebase attributes. Finally, we apply two preprocessing stages (oversampling and exact surface form matching) which do not require any linguistic expertise. Our evaluation is twofold: First, we demonstrate the system's performance on human-annotated datasets. Second, for languages where no gold-standard benchmarks are available, we propose a new method, distant evaluation, based on statistical machine translation.

【Keywords】:

68. Online Resource Allocation with Structured Diversification.

Paper Link】 【Pages】:595-603

【Authors】: Nicholas Johnson ; Arindam Banerjee

【Abstract】: A variety of modern data analysis problems, ranging from finance to job scheduling, can be considered as online resource allocation (ORA) problems. A key consideration in such ORA problems is some notion of risk, and suitable ways to alleviate risk. In several settings, the risk is structured so that groups of assets, such as stocks, are exposed to similar risks. In this paper, we present a formulation for online resource allocation with structured diversification (ORASD) as a constrained online convex optimization problem. The key novel component of our formulation is a constraint on the L(∞,1) group norm of the resource allocation vector, which ensures that no single group gets a large share of the resource and, unlike L(1,p) norms used for overlapping group Lasso, does not impose sparsity structures over groups. We instantiate the problem in the context of portfolio selection, propose an efficient ADMM algorithm, and illustrate the effectiveness of the formulation through extensive experiments on two benchmark datasets.

【Keywords】:

69. Towards Permission Request Prediction on Mobile Apps via Structure Feature Learning.

Paper Link】 【Pages】:604-612

【Authors】: Deguang Kong ; Hongxia Jin

【Abstract】: The popularity of mobile apps has posed severe privacy risks to users because many permissions are over-claimed. In this work, we explore the techniques that can automatically predict the permission requests of a new mobile app based on its functionality and textual description information, which can help users to be aware of the privacy risks of mobile apps. Our framework formalizes the permission prediction problem as a multi-label learning problem, where a regularized structure feature learning framework is utilized to automatically capture the relations among textual descriptions, permissions, and app category. The permission prediction result can be automatically learned using our approach. We evaluate our approach on 173 permission requests from 11,067 mobile apps across 30 categories. Extensive experiment results indicate that our method consistently provides better performance (3%-5% performance improvement in terms of F1 score), when compared to the other state-of-the-art methods.

【Keywords】:

70. On Influential Nodes Tracking in Dynamic Social Networks.

Paper Link】 【Pages】:613-621

【Authors】: Xiaodong Chen ; Guojie Song ; Xinran He ; Kunqing Xie

【Abstract】: Real-world marketing campaigns utilizing the word-of-mouth effect usually last a long time, so multiple sets of influential users need to be mined and targeted at different times to fully exploit the power of viral marketing. As both the social network structure and the strength of influence between individuals evolve constantly, influential nodes must be tracked under a dynamic setting. To address this problem, we explore the Influential Node Tracking (INT) problem as an extension to the traditional Influence Maximization problem under dynamic social networks. While the Influence Maximization problem aims at identifying a set of k nodes to maximize the joint influence under one static network, the INT problem focuses on tracking a set of influential nodes that keeps maximizing the influence as the network evolves. Utilizing the smoothness of the evolution of the network structure, we propose an efficient algorithm, Upper Bound Interchange Greedy (UBI), to solve the INT problem. Instead of constructing the seed set from the ground up, we start from the influential seed set found previously and perform node replacement to improve the influence coverage. Furthermore, by using a fast update method to maintain an upper bound on the node-replacement gain, our algorithm can scale to dynamic social networks with millions of nodes. Empirical experiments on three real large-scale dynamic social networks show that our UBI algorithm achieves better performance in terms of both influence coverage and running time.

【Keywords】:

71. Less is More: Building Selective Anomaly Ensembles with Application to Event Detection in Temporal Graphs.

Paper Link】 【Pages】:622-630

【Authors】: Shebuti Rayana ; Leman Akoglu

【Abstract】: Ensemble techniques for classification and clustering have long proven effective, yet anomaly ensembles have been barely studied. In this work, we tap into this gap and propose a new ensemble approach for anomaly mining, with application to event detection in temporal graphs. Our method aims to combine results from heterogeneous detectors with varying outputs, and leverage the evidence from multiple sources to yield better performance. However, trusting all the results may deteriorate the overall ensemble accuracy, as some detectors may fall short and provide inaccurate results depending on the nature of the data in hand. This suggests that being selective in which results to combine is vital in building effective ensembles, hence “less is more”. In this paper we propose SELECT, an ensemble approach for anomaly mining that employs novel techniques to automatically and systematically select the results to assemble in a fully unsupervised fashion. We apply our method to event detection in temporal graphs, where SELECT successfully utilizes five base detectors and seven consensus methods under a unified ensemble framework. We provide extensive quantitative evaluation of our approach on five real-world datasets (four with ground truth), including Enron email communications, the New York Times news corpus, and the World Cup 2014 Twitter news feed. Thanks to its selection mechanism, SELECT yields superior performance compared to individual detectors alone, the full ensemble (naively combining all results), and an existing diversity-based ensemble.

【Keywords】:
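
One toy selective-ensemble step in this spirit (an illustrative simplification; SELECT's actual selection strategies differ): rank-normalize every detector's scores, build a pseudo ground truth by averaging them, and keep only detectors that agree with it before forming the consensus.

import numpy as np
from scipy.stats import rankdata, pearsonr

rng = np.random.default_rng(6)
n = 500
truth = np.zeros(n); truth[:25] = 1                   # 5% true anomalies
good = truth + rng.normal(0, 0.3, size=(3, n))        # three decent detectors
bad = rng.normal(0, 1, size=(2, n))                   # two uninformative ones
scores = np.vstack([good, bad])

ranks = np.apply_along_axis(rankdata, 1, scores) / n  # rank-normalize
pseudo = ranks.mean(axis=0)                           # pseudo ground truth
keep = [i for i in range(len(ranks))
        if pearsonr(ranks[i], pseudo)[0] > 0.5]       # drop disagreeing ones
consensus = ranks[keep].mean(axis=0)
print("kept detectors:", keep)                        # expect [0, 1, 2]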

72. Principled Neuro-Functional Connectivity Discovery.

Paper Link】 【Pages】:631-639

【Authors】: Kejun Huang ; Nicholas D. Sidiropoulos ; Evangelos E. Papalexakis ; Christos Faloutsos ; Partha Pratim Talukdar ; Tom M. Mitchell

【Abstract】: How can we reverse-engineer the brain connectivity, given the input stimulus, and the corresponding brain-activity measurements, for several experiments? We show how to solve the problem in a principled way, modeling the brain as a linear dynamical system (LDS), and solving the resulting “system identification” problem after imposing sparsity and non-negativity constraints on the appropriate matrices. These are reasonable assumptions in some applications, including magnetoencephalography (MEG). There are three contributions: (a) Proof: We prove that this simple condition resolves the ambiguity of similarity transformation in the LDS identification problem; (b) Algorithm: we propose an effective algorithm which further induces sparse connectivity in a principled way; and (c) Validation: our experiments on semi-synthetic (C. elegans), as well as real MEG data, show that our method recovers the neural connectivity, and it leads to interpretable results.

【Keywords】:

73. Estimating Ad Impact on Clicker Conversions for Causal Attribution: A Potential Outcomes Approach.

Paper Link】 【Pages】:640-648

【Authors】: Joel Barajas ; Ram Akella ; Aaron Flores ; Marius Holtan

【Abstract】: We analyze the causal effect of online ads on the conversion probability of the users who click on the ad (clickers). We show that designing a randomized experiment to find this effect is infeasible, and propose a method to find the local effect on clicker conversions. This method is developed in the Potential Outcomes causal model, via Principal Stratification to model non-ignorable post-treatment (or endogenous) variables such as user clicks, and is validated with simulated data. Based on two large-scale randomized experiments, performed on 7.16 million users and 22.7 million users to evaluate ad exposures, a pessimistic analysis shows a minimum increase of 75% in the campaign's effect on the clicker conversion probability with respect to non-clickers. This finding contradicts a recent belief that clicks are not indicative of campaign success, and provides guidance in the user targeting task. In addition, we find a larger number of converting users attributed to the overall campaign than those attributed by the standard click-to-conversion (C2C) business model. This evidence challenges the well-accepted belief that the C2C attribution model over-estimates the value of the campaign.

【Keywords】:

74. Towards Classification of Social Streams.

Paper Link】 【Pages】:649-657

【Authors】: Min-Hsuan Tsai ; Charu C. Aggarwal ; Thomas S. Huang

【Abstract】: Social streams have become very popular in recent years because of the increasing popularity of social media sites such as Twitter and Facebook. Such social media sites create huge streams of data, which can be leveraged for a wide variety of applications. In this paper, we focus on the classification problem for social streams. Unfortunately, such streams are extremely noisy and contain large volumes of information, along with information about the network linkages between the participants exchanging messages. This additional social information, associated with the text stream, can be very helpful for classification. We combine an LSH method with an incremental SVM model in order to design an effective and efficient social context-sensitive streaming classifier for this scenario. The LSH model is used for learning the social context, and the SVM model is used for more effective classification within this context. We present experimental results which show the effectiveness of our techniques over a wide variety of other methods.

【Keywords】:
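
The incremental-SVM half of such a pipeline can be sketched with a hinge-loss linear model updated via partial_fit as batches of the stream arrive (prequential test-then-train evaluation); the LSH-based social-context component is omitted in this sketch.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(7)
clf = SGDClassifier(loss="hinge", random_state=0)  # linear SVM trained by SGD
classes = np.array([0, 1])

for batch in range(20):                            # simulated message stream
    X = rng.normal(size=(100, 30))
    y = (X[:, 0] + 0.3 * rng.normal(size=100) > 0).astype(int)
    if batch > 0 and batch % 5 == 0:
        print(f"batch {batch:02d} acc:", round(clf.score(X, y), 3))
    clf.partial_fit(X, y, classes=classes)         # incremental update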

75. Mobile App Security Risk Assessment: A Crowdsourcing Ranking Approach from User Comments.

Paper Link】 【Pages】:658-666

【Authors】: Lei Cen ; Deguang Kong ; Hongxia Jin ; Luo Si

【Abstract】: Along with the exponential growth of mobile application (app) markets comes serious public concern about security and privacy issues. Automatic app risk assessment therefore becomes increasingly important for supporting users with useful evidence for their decisions. User comments provide a unique perspective from actual user experience, and should be considered a valuable information source for the risk assessment of mobile apps. In this paper, we provide a novel perspective that views the risk assessment of an app from its user comments as a crowdsourcing problem, and adopt a ranking model as the evaluation method. We develop a co-training scheme to amalgamate feature learning and learning-to-rank models. Experiments conducted on two different real-world datasets show substantial performance improvements (i.e., 6%–7%) over the state-of-the-art methods.

【Keywords】:

76. Learning Compressive Sensing Models for Big Spatio-Temporal Data.

Paper Link】 【Pages】:667-675

【Authors】: Dongeun Lee ; Jaesik Choi

【Abstract】: Sensing devices including mobile phones and biomedical sensors generate massive amounts of spatio-temporal data. Compressive sensing (CS) can significantly reduce energy and resource consumption by shifting the complexity burden of the encoding process to the decoder. CS reconstructs the compressed signals exactly with overwhelming probability when the incoming data can be sparsely represented with a fixed number of components. This is one of the drawbacks of CS frameworks, because a real-world signal in general cannot be represented with a fixed number of components. We present the first CS framework that handles signals without the fixed sparsity assumption by incorporating the distribution of the number of principal components included in the signal recovery, which we show is naturally represented by the gamma distribution. This allows an analytic derivation of the total error in our spatio-temporal Low Complexity Sampling (LCS). We show that LCS requires shorter compressed signals than existing CS frameworks to bound the same amount of error. Experiments with real-world sensor data also demonstrate that LCS outperforms existing CS frameworks.

【Keywords】:

77. Learning Stroke Treatment Progression Models for an MDP Clinical Decision Support System.

Paper Link】 【Pages】:676-684

【Authors】: Dan C. Coroian ; Kris Hauser

【Abstract】: This paper describes a clinical decision support framework in multi-step health care domains that can dynamically recommend optimal treatment plans with respect to both patient outcomes and expected treatment cost. Our system uses a modified POMDP framework in which hidden states are not explicitly modeled, but rather, probabilistic models for predicting future observables given observation and action histories are learned directly from electronic health record (EHR) data. High quality treatment recommendations are found using a sampling-based tree growing approach which produces good results despite only exploring a fraction of the observation and action spaces. We describe the application of the approach to an ischemic stroke domain with clinical trial data (International Stroke Trial Dataset, 1993–1996). The dataset is of moderate size (N=19,435) and exhibits many characteristics of real EHR data, including noise, missing values, and idiosyncratic coding. The system's predictive model was chosen using cross-validated model selection from a set of several candidate learning methods, including logistic regression, Naïve Bayes, Bayes nets, and random forests. Simulations suggest that the optimized decisions improve patient outcomes, such as 6-month survival rate, compared to the decisions of human doctors during the study.

【Keywords】:

78. OnlineCM: Real-time Consensus Classification with Missing Values.

Paper Link】 【Pages】:685-693

【Authors】: Bowen Dong ; Sihong Xie ; Jing Gao ; Wei Fan ; Philip S. Yu

【Abstract】: Combining predictions from multiple sources or models has been shown to be a useful technique in data mining. For example, in network anomaly detection, multiple detectors' outputs have to be combined to obtain the diagnostic decisions. Unfortunately, as data are generated at an increasingly high speed, existing prediction aggregation methods are facing new challenges. First, the high velocity and huge volume of the data render existing batch-mode prediction aggregation algorithms infeasible. Second, due to heterogeneity, predictions from multiple models or data sources might not be perfectly synchronized, leading to abundant missing values in the prediction stream. We propose OnlineCM, short for Online Consensus Maximization, to address the above challenges. OnlineCM keeps only a minimal yet sufficient footprint for both consensus prediction and missing value imputation over the prediction stream. In particular, we show that the correlations among base models or data sources are sufficient for effective consensus prediction, require small storage, and can be updated in an online fashion. Further, we identify a reinforcing relationship between missing value imputation and the consensus predictions, leading to a novel consensus-based missing value imputation method, which in turn makes model correlation estimation more accurate. Experiments demonstrate that OnlineCM achieves aggregated predictions whose performance is close to that of the batch-mode consensus maximization algorithm, and outperforms baseline methods significantly on 4 large real-world datasets.

【Keywords】:

79. MET: A Fast Algorithm for Minimizing Propagation in Large Graphs with Small Eigen-Gaps.

Paper Link】 【Pages】:694-702

【Authors】: Long T. Le ; Tina Eliassi-Rad ; Hanghang Tong

【Abstract】: Given the topology of a graph G and a budget k, how can we quickly find the best k edges to delete that minimize dissemination in G? Stopping dissemination in a graph is important in a variety of fields from epidemiology to cyber security. The spread of an entity (e.g., a virus) on an arbitrary graph depends on two properties: (1) the topology of the graph and (2) the characteristics of the entity. In many settings, we cannot manipulate the latter, such as the entity's strength. That leaves us with modifying the former (e.g., by removing nodes and/or edges from the graph in order to reduce the graph's connectivity). In this work, we address the problem of removing edges. We know that the largest eigenvalue of the graph's adjacency matrix is a good indicator for its connectivity (a.k.a. path capacity). Thus, algorithms that are able to quickly reduce the largest eigenvalue of a graph often minimize dissemination on that graph. However, a problem arises when the differences between the largest eigenvalues of a graph are small. This problem, known as the small eigen-gap problem, occurs often in social graphs such as Facebook postings or instant messaging (IM) networks. We introduce a scalable algorithm called MET (short for Multiple Eigenvalues Tracking), which efficiently and effectively solves the small eigen-gap problem. Our extensive experiments on different graphs from various domains show the efficacy and efficiency of our approach.

【Keywords】:

80. What shall I share and with Whom? - A Multi-Task Learning Formulation using Multi-Faceted Task Relationships.

Paper Link】 【Pages】:703-711

【Authors】: Sunil Kumar Gupta ; Santu Rana ; Dinh Q. Phung ; Svetha Venkatesh

【Abstract】: Multi-task learning is a learning paradigm that improves the performance of “related” tasks through their joint learning. To do this, each task answers the question “Which other tasks should I share with?” This task relatedness can be complex: a task may be related to one set of tasks based on one subset of features and to other tasks based on other subsets. Existing multi-task learning methods do not explicitly model this reality, learning a single-faceted task relationship over all the features. This degrades performance by forcing a task to become similar to other tasks even on their unrelated features. Addressing this gap, we propose a novel multi-task learning model that learns multi-faceted task relationships, allowing tasks to collaborate differentially on different feature subsets. This is achieved by simultaneously learning a low-dimensional subspace for task parameters and inducing task groups over each latent subspace basis using a novel combination of L1 and pairwise L∞ norms. Further, our model can induce grouping across both positively and negatively related tasks, which helps towards exploiting knowledge from all types of related tasks. We validate our model on two synthetic and five real datasets, and show significant performance improvements over several state-of-the-art multi-task learning techniques. Thus our model effectively answers, for each task: What shall I share and with whom?

【Keywords】:

81. A Generalized Mixture Framework for Multi-label Classification.

Paper Link】 【Pages】:712-720

【Authors】: Charmgil Hong ; Iyad Batal ; Milos Hauskrecht

【Abstract】: We develop a novel probabilistic ensemble framework for multi-label classification that is based on the mixtures-of-experts architecture. In this framework, we combine multi-label classification models in the classifier chains family that decompose the class posterior distribution P(Y1, …, Yd|X) using a product of posterior distributions over components of the output space. Our approach captures different input–output and output–output relations that tend to change across data. As a result, we can recover a rich set of dependency relations among inputs and outputs that a single multi-label classification model cannot capture due to its modeling simplifications. We develop and present algorithms for learning the mixtures-of-experts models from data and for performing multi-label predictions on unseen data instances. Experiments on multiple benchmark datasets demonstrate that our approach achieves highly competitive results and outperforms the existing state-of-the-art multi-label classification methods.

【Keywords】:
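
The chain decomposition P(Y1, ..., Yd | X) = prod_j P(Yj | X, Y1, ..., Yj-1) that the framework mixes over can be illustrated with scikit-learn's ClassifierChain. The uniform average below is a stand-in for the learned mixture-of-experts weights, so this sketch conveys only the flavor of the approach.

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=300, n_classes=5,
                                      n_labels=3, random_state=0)
chains = [ClassifierChain(LogisticRegression(max_iter=1000),
                          order="random", random_state=s).fit(X, Y)
          for s in range(10)]                      # one expert per chain order
proba = np.mean([c.predict_proba(X) for c in chains], axis=0)
print((proba > 0.5).astype(int)[:3])               # averaged multi-label output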

82. Domain-Knowledge Driven Cognitive Degradation Modeling for Alzheimer's Disease.

Paper Link】 【Pages】:721-729

【Authors】: Ying Lin ; Kaibo Liu ; Eunshin Byon ; Xiaoning Qian ; Shuai Huang

【Abstract】: Cognitive monitoring and screening hold great promise for early detection and intervention of Alzheimer's Disease (AD). A critical enabler is a personalized degradation model to predict an individual's cognitive status over time. However, estimating such a model from an individual's data faces challenges due to the sparsity and fragmented nature of each individual's cognitive data. To mitigate this problem, we propose novel methods, the collaborative degradation model (CDM) and its extended network-regularized version (NCDM), which can incorporate useful domain knowledge into the degradation modeling. While NCDM results in a difficult optimization problem, we are inspired by existing non-negative matrix factorization methods and develop an efficient algorithm to solve this problem, and further provide theoretical results ensuring that the proposed algorithm guarantees the non-increasing property of the objective. Both simulation studies and a real-world application to AD are conducted across different degradation models and sampling schemes, which demonstrate the superiority of the proposed methods over existing methods.

【Keywords】:

83. Ensemble Learning Methods for Binary Classification with Multi-modality within the Classes.

Paper Link】 【Pages】:730-738

【Authors】: Anuj Karpatne ; Ankush Khandelwal ; Vipin Kumar

【Abstract】: We consider binary classification problems where each of the two classes shows a multi-modal distribution in the feature space. Inspired by existing ensemble learning methods for multi-class classification, we develop ensemble learning methods for binary classification that make use of the bipartite nature of the positive and negative modes in the data. By constructing ensembles that exploit the multi-modal structure within the two classes, as opposed to using random samples, we ensure sufficient diversity among the classifiers and adequate representation of the modes during learning. We demonstrate the effectiveness of the proposed ensemble learning methods in comparison with existing approaches on a synthetic dataset and a real-world application involving global lake monitoring, across a broad range of base classifiers.
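
A hedged sketch of the bipartite-mode idea: find modes within each class by clustering, train one base classifier per (positive-mode, negative-mode) pair, and average. The paper's actual construction and combination scheme may differ; the base learner and cluster counts here are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def fit_mode_ensemble(X, y, k_pos=3, k_neg=3, seed=0):
    """Train one classifier per (positive-mode, negative-mode) pair."""
    Xp, Xn = X[y == 1], X[y == 0]
    mp = KMeans(n_clusters=k_pos, random_state=seed).fit_predict(Xp)
    mn = KMeans(n_clusters=k_neg, random_state=seed).fit_predict(Xn)
    models = []
    for i in range(k_pos):
        for j in range(k_neg):
            # each member sees one positive mode vs. one negative mode
            Xi = np.vstack([Xp[mp == i], Xn[mn == j]])
            yi = np.r_[np.ones((mp == i).sum()), np.zeros((mn == j).sum())]
            models.append(LogisticRegression(max_iter=1000).fit(Xi, yi))
    return models

def predict_mode_ensemble(models, X):
    # simple probability averaging across all mode-pair classifiers
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0) >= 0.5
```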

【Keywords】:

84. A Framework for Simplifying Trip Data into Networks via Coupled Matrix Factorization.

Paper Link】 【Pages】:739-747

【Authors】: Chia-Tung Kuo ; James Bailey ; Ian Davidson

【Abstract】: Portable devices such as GPS-equipped smartphones and cameras can provide detailed spatio-temporal trip event data for each user. Such data can be aggregated over many users to provide large amounts of behavioral data of very fine granularity. Simplifying these data into meaningful higher-level insights is challenging for a variety of reasons. In this paper we study the problem of simplifying spatio-temporal trip data and summarizing them into an easily interpretable graph/network. We propose several constrained coupled nonnegative matrix factorization formulations that simultaneously cluster locations and times based on the associated trips, and develop a (block) coordinate descent algorithm to solve them. We empirically evaluate our approach on a real-world dataset of taxis' GPS traces and show the advantages of our approach over traditional clustering algorithms.
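
A toy sketch of the coupling idea: two nonnegative matrices share one factor (e.g., a location factor), fit by standard multiplicative updates for a Frobenius loss. The paper's formulations add constraints and use a (block) coordinate descent solver; none of that is reproduced here.

```python
import numpy as np

def coupled_nmf(X1, X2, r=5, iters=200, eps=1e-9):
    """min ||X1 - W H1||^2 + ||X2 - W H2||^2 with all factors >= 0.

    W is the shared factor linking two trip-derived matrices; H1, H2 are
    view-specific. Plain multiplicative updates, for illustration only.
    """
    rng = np.random.default_rng(0)
    W = rng.random((X1.shape[0], r))
    H1 = rng.random((r, X1.shape[1]))
    H2 = rng.random((r, X2.shape[1]))
    for _ in range(iters):
        H1 *= (W.T @ X1) / (W.T @ W @ H1 + eps)
        H2 *= (W.T @ X2) / (W.T @ W @ H2 + eps)
        W *= (X1 @ H1.T + X2 @ H2.T) / (W @ (H1 @ H1.T + H2 @ H2.T) + eps)
    return W, H1, H2

rng = np.random.default_rng(1)
W, H1, H2 = coupled_nmf(rng.random((30, 12)), rng.random((30, 24)))
```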

【Keywords】:

85. Multi-View Low-Rank Analysis for Outlier Detection.

Paper Link】 【Pages】:748-756

【Authors】: Sheng Li ; Ming Shao ; Yun Fu

【Abstract】: Outlier detection is a fundamental problem in data mining. Unlike most existing methods that are designed for single-view data, we propose a multi-view outlier detection approach in this paper. Multi-view data can provide plentiful information about samples; however, detecting outliers from multi-view data is still a challenging problem due to the complicated distribution and inconsistent behavior of samples across different views. We address this problem through robust data representation, by building a Multi-view Low-Rank Analysis (MLRA) framework. Our framework contains two major components. First, it performs cross-view low-rank analysis for revealing the intrinsic structures of data. Second, it identifies outliers by estimating the outlier score for each test sample. Specifically, we formulate the cross-view low-rank analysis as a constrained rank-minimization problem, and present an efficient optimization algorithm to solve it. Unlike existing multi-view outlier detection methods, our framework is able to detect two different types of outliers from multiple views simultaneously. To this end, we design a criterion to estimate the outlier scores by analyzing the obtained representation coefficients. Experimental results on seven UCI datasets and the USPS-MNIST dataset demonstrate that our approach outperforms several state-of-the-art single-view and multi-view outlier detection methods in most cases.

【Keywords】:

86. REAFUM: Representative Approximate Frequent Subgraph Mining.

Paper Link】 【Pages】:757-765

【Authors】: Ruirui Li ; Wei Wang

【Abstract】: Noisy graph data and pattern variations are two thorny problems in frequent subgraph mining. Traditional exact-matching based methods, however, only generate patterns that have enough perfect matches in the graph database. As a result, a pattern may either remain undetected or be reported as multiple (almost identical) patterns if it manifests slightly different instances in different graphs. In this paper, we investigate the problem of approximate frequent pattern mining, with a focus on finding non-redundant representative frequent patterns that summarize the frequent patterns allowing approximate matches in a graph database. To achieve this goal, we propose the REAFUM framework, which (1) first extracts a list of diverse representative graphs from the database, which together are likely to contain most of the approximate frequent patterns exhibited in the entire graph database; (2) then uses distinct patterns in the representative graphs as seed patterns to retrieve approximate matches in the entire graph database; (3) finally employs a consensus refinement model to derive representative approximate frequent patterns. Through a comprehensive evaluation of REAFUM on both synthetic and real datasets, we show that REAFUM is effective and efficient at finding representative approximate frequent patterns; in the presence of noise and errors, it finds patterns that resemble the ground truth much more closely, and with less redundancy, than those found by exact-matching based methods.

【Keywords】:

87. DIAS: A Disassemble-Assemble Framework for Highly Sparse Text Clustering.

Paper Link】 【Pages】:766-774

【Authors】: Hongfu Liu ; Junjie Wu ; Dacheng Tao ; Yuchao Zhang ; Yun Fu

【Abstract】: Despite extensive study, text clustering remains a critical challenge for the data mining community. Even with the various techniques proposed to overcome some of these challenges, problems persist when dealing with weakly related or even noisy features. In response, we propose a DIssemble-ASsemble (DIAS) framework for text clustering. DIAS employs simple random feature sampling to disassemble high-dimensional text data and gain diverse structural knowledge, which also helps avoid the bulk of noisy features. The multi-view knowledge is then assembled by weighted Information-theoretic Consensus Clustering (ICC) to obtain a high-quality consensus partitioning. Extensive experiments on eight real-world text datasets demonstrate the advantages of DIAS over other widely used methods. In particular, DIAS shows strength in learning from very weak basic partitionings. In addition, its natural suitability for distributed computing makes DIAS a promising candidate for big text clustering.
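
A disassemble/assemble sketch under stated assumptions: many k-means runs on random feature subsets produce base partitionings, which are combined through a plain co-association matrix plus agglomerative clustering. DIAS's weighted information-theoretic consensus (ICC) is replaced here by that simpler consensus step; the sampling fraction and view count are arbitrary (the `metric="precomputed"` argument assumes scikit-learn >= 1.2).

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def dias_sketch(X, k, n_views=20, frac=0.1, seed=0):
    """Cluster many random feature subsets ('disassemble'), then merge the
    base partitionings via co-association ('assemble')."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    co = np.zeros((n, n))
    for _ in range(n_views):
        feats = rng.choice(d, size=max(2, int(frac * d)), replace=False)
        lab = KMeans(n_clusters=k, n_init=5,
                     random_state=seed).fit_predict(X[:, feats])
        co += (lab[:, None] == lab[None, :])  # co-clustered this view?
    co /= n_views
    return AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                   linkage="average").fit_predict(1 - co)
```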

【Keywords】:

88. Optimal event sequence sanitization.

Paper Link】 【Pages】:775-783

【Authors】: Grigorios Loukides ; Robert Gwadera

【Abstract】: Frequent event mining is a fundamental task to extract insight from an event sequence (a long sequence of events associated with time points). However, it may expose sensitive events that leak confidential business knowledge or lead to intrusive inferences about groups of individuals. In this work, we aim to prevent this threat by deleting occurrences of sensitive events, while preserving the utility of the event sequence. To quantify utility, we propose a model that captures changes, caused by deletion, to the probability distribution of events across the sequence. Based on the model, we define the problem of sanitizing an event sequence as an optimization problem. Solving the problem is important to preserve the output of many mining tasks, including frequent pattern mining and sequence segmentation. However, this is also challenging, due to the exponential number of ways to apply deletion to the sequence. To optimally solve the problem when there is one sensitive event, we develop an efficient algorithm based on dynamic programming. The algorithm also forms the basis of a simple, iterative method that optimally sanitizes an event sequence when there are multiple sensitive events. Experiments on real and synthetic datasets show the effectiveness and efficiency of our method.

【Keywords】:

89. Predicting Neighbor Distribution in Heterogeneous Information Networks.

Paper Link】 【Pages】:784-791

【Authors】: Yuchi Ma ; Ning Yang ; Chuan Li ; Lei Zhang ; Philip S. Yu

【Abstract】: Recently, considerable attention has been devoted to the prediction problems arising from heterogeneous information networks. In this paper, we present a new prediction task, Neighbor Distribution Prediction (NDP), which aims at predicting the distribution of the labels on neighbors of a given node and is valuable for many different applications in heterogeneous information networks. The challenges of NDP mainly come from three aspects: the infinity of the state space of a neighbor distribution, the sparsity of available data, and how to fairly evaluate the predictions. To address these challenges, we first propose an Evolution Factor Model (EFM) for NDP, which utilizes two new structures proposed in this paper: the Neighbor Distribution Vector (NDV), which represents the state of a given node's neighbors, and the Neighbor Label Evolution Matrix (NLEM), which captures the dynamics of a neighbor distribution. We further propose a learning algorithm for the Evolution Factor Model. To overcome the problem of data sparsity, the learning algorithm first clusters all the nodes and learns an NLEM for each cluster instead of for each node. To evaluate prediction results fairly, we propose a new metric, Virtual Accuracy (VA), which takes into consideration both the absolute accuracy and the predictability of a node. Extensive experiments conducted on three real datasets from different domains validate the effectiveness of our proposed model EFM and metric VA.
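
A minimal numpy sketch of the NDV/NLEM mechanics: an NDV is the normalized distribution of labels over a node's neighbors at one time step, and an NLEM maps one NDV to the next. Fitting the NLEM by least squares over a short history is an assumption for illustration; the paper's learning algorithm and its cluster-level sharing may differ.

```python
import numpy as np

# NDVs of one node (or cluster) over four time steps, three label types
history = np.array([[0.6, 0.3, 0.1],
                    [0.5, 0.4, 0.1],
                    [0.4, 0.4, 0.2],
                    [0.3, 0.5, 0.2]])

A, B = history[:-1], history[1:]
M, *_ = np.linalg.lstsq(A, B, rcond=None)   # NLEM estimate: NDV_t @ M ~ NDV_{t+1}

pred = history[-1] @ M
pred = np.clip(pred, 0, None)
pred /= pred.sum()                           # renormalize to a distribution
print(pred)
```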

【Keywords】:

90. SimplePPT: A Simple Principal Tree Algorithm.

Paper Link】 【Pages】:792-800

【Authors】: Qi Mao ; Le Yang ; Li Wang ; Steve Goodison ; Yijun Sun

【Abstract】: Many scientific datasets are of high dimension, and their analysis usually requires visualization that retains the most important structures of the data. The principal curve is a widely used approach for this purpose. However, many existing methods work only for data whose structures are not self-intersecting, which is quite restrictive for real applications. To address this issue, we develop a new model, which captures the local information of the underlying graph structure based on reversed graph embedding. A generalization bound is derived that shows the model is consistent if the number of data points is sufficiently large. As a special case, a principal tree model is proposed and a new algorithm is developed that learns a tree structure automatically from data. The new algorithm is simple and parameter-free with guaranteed convergence. Experimental results on synthetic and breast cancer datasets show that the proposed method compares favorably with baselines and can discover a breast cancer progression path with multiple branches.

【Keywords】:

91. Temporally Coherent CRP: A Bayesian Non-Parametric Approach for Clustering Tracklets with applications to Person Discovery in Videos.

Paper Link】 【Pages】:801-809

【Authors】: Adway Mitra ; Soma Biswas ; Chiranjib Bhattacharyya

【Abstract】: Tracklet clustering is central to several computer vision tasks [17][20]. A video can be represented as a sequence of tracklets, each spanning 10–20 successive video frames, and each tracklet is associated with one entity (e.g., a person in the case of TV-serial videos). Tracklets are instances of a data type exhibiting rich spatio-temporal structure. Existing approaches model tracklets by deploying detailed parametric models with a large number of parameters, making the inference unwieldy. The task of person discovery in long TV-series videos (40–45 minutes) with many persons can be naturally posed as tracklet clustering, yet existing approaches give unsatisfactory performance on it. In this paper we attempt to leverage the Temporal Coherence (TC) of videos to improve tracklet clustering. TC is the fundamental property of videos that each tracklet is likely to be associated with the same entity as its predecessor or successor. We propose the first Bayesian nonparametric approach for modeling TC, which can automatically infer the number of clusters to be formed. The major contribution of this paper is the Temporally Coherent Chinese Restaurant Process (TC-CRP), which extends the CRP by using TC. On the task of discovering persons in TV serials via tracklet clustering, without meta-data such as scripts, TC-CRP shows up to 25% improvement in cluster purity compared to state-of-the-art parametric models, and up to 36% improvement in the number of persons discovered. We use a simple representation of tracklets, a vector of very generic features (like pixel intensity) which can correspond to any type of entity (not necessarily a person), and empirically demonstrate the utility of TC-CRP for discovering entities like cars and planes. Moreover, unlike existing approaches, TC-CRP can perform online tracklet clustering on streaming videos with very little performance deterioration, and can also automatically reject outliers (tracklets resulting from false detections).

【Keywords】:

92. Correlating Surgical Vital Sign Quality with 30-Day Outcomes using Regression on Time Series Segment Features.

Paper Link】 【Pages】:810-818

【Authors】: Risa B. Myers ; John C. Frenzel ; Joseph R. Ruiz ; Christopher M. Jermaine

【Abstract】: Anesthesiologists are taught to carefully manage patient vital signs during surgery. Unfortunately, there is little empirical evidence that vital sign management, as currently practiced, is correlated with patient outcomes. We seek to validate or repudiate current clinical practice. Using a database of over 90,000 cases, we attempt to determine whether those cases that an anesthesiologist would subjectively decide are “low quality” are more likely to result in negative outcomes. The problem reduces to one of multidimensional time series classification. Our approach is to have an expert anesthesiologist label a small number of training cases, from which we can train a classifier to label all 90,000 cases. We then use the labeling to search for correlation with outcomes. We consider several standard classification methods, such as dynamic time warping in conjunction with a kNN classifier, as well as the recently proposed complexity invariant distance, and a regression based upon the feature extraction methods outlined by Mao et al. (using features such as time series mean, standard deviation, skew, approximate entropy, etc.). We also propose a feature selection mechanism that learns a hidden Markov model to segment the time series; the fraction of time that each series spends in each state is used to label the series using a regression-based classifier. In the end, we are able to obtain strong, empirical evidence that current best practice is correlated with reduced negative patient outcomes.
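
A sketch of the HMM state-occupancy idea under stated assumptions: one HMM is fit over all series, each series is then represented by the fraction of time it spends in each hidden state, and those fractions feed a regression-based classifier. The `hmmlearn` dependency, the Gaussian emission model, and the state count are assumptions, not the paper's setup.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM            # assumed third-party dependency
from sklearn.linear_model import LogisticRegression

def state_occupancy_features(series_list, n_states=4):
    """Fit one HMM on all series, then represent each series by the
    fraction of time it spends in each hidden state."""
    X = np.vstack(series_list)
    lengths = [len(s) for s in series_list]
    hmm = GaussianHMM(n_components=n_states, random_state=0).fit(X, lengths)
    feats = []
    for s in series_list:
        states = hmm.predict(s)
        feats.append(np.bincount(states, minlength=n_states) / len(s))
    return np.array(feats)

# series_list: list of (T_i, d) vital-sign arrays; labels: expert quality labels
# F = state_occupancy_features(series_list)
# clf = LogisticRegression(max_iter=1000).fit(F, labels)
```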

【Keywords】:

93. Multi-Layered Framework for Modeling Relationships between Biased Objects.

Paper Link】 【Pages】:819-827

【Authors】: Iku Ohama ; Takuya Kida ; Hiroki Arimura

【Abstract】: Latent variable models for relational data enable us to extract a co-cluster structure underlying observed relational data. The Infinite Relational Model (IRM) is a well-known relational model for discovering co-cluster structures with an unknown number of clusters. The IRM and several related models commonly assume that link probability between two objects depends only on their cluster assignment. However, relational models based on this assumption often lead us to extract many non-informative and unexpected clusters. This is because the cluster structures underlying real-world relationships are often blurred by biases that are inherent to individual objects. To overcome this problem, we propose a multilayered framework that extracts a clear co-cluster structure in the presence of objects' biases. Then, we propose a new model which is a special instance of the proposed framework that incorporates the IRM. Furthermore, we reveal that some relational models can be regarded as special cases of the proposed model. We present an efficient Gibbs sampler for posterior inference. Experiments conducted using real-world datasets confirm that the proposed model successfully extracts clear and interpretable cluster structures from blurred relational data.

【Keywords】:

94. Optimizing Hashing Functions for Similarity Indexing in Arbitrary Metric and Nonmetric Spaces.

Paper Link】 【Pages】:828-836

【Authors】: Pat Jangyodsuk ; Panagiotis Papapetrou ; Vassilis Athitsos

【Abstract】: A large number of methods have been proposed for similarity indexing in Euclidean spaces, and several such methods can also be used in arbitrary metric spaces. Such methods exploit specific properties of Euclidean spaces or general metric spaces. Designing general-purpose similarity indexing methods for arbitrary metric and non-metric distance measures is a more difficult problem, due to the vast heterogeneity of such spaces and the lack of common properties that can be exploited. In this paper, we propose a generally applicable method for similarity-based indexing in arbitrary metric and nonmetric spaces, based on hashing. We build upon the technique of Distance-Based Hashing (DBH), which organizes database objects in multiple hash tables, so that two similar objects tend to fall in the same bucket in at least one of those hash tables. The main contribution is in showing how to optimize the hashing functions for accuracy and efficiency, using training data. The proposed optimizations significantly improve performance in experiments on three public datasets.
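
For context, one distance-based hash bit in the DBH style: a FastMap-style line projection defined purely through pairwise distances, then binarized. Thresholding at the median is a simplification of DBH's interval-based binarization, and the pivot choice here is arbitrary; the paper's contribution, optimizing these functions from training data, is not shown.

```python
import numpy as np

def line_projection(x, x1, x2, dist):
    """Project x onto the 'line' through pivots x1, x2 using only distances."""
    d12 = dist(x1, x2)
    return (dist(x, x1) ** 2 + d12 ** 2 - dist(x, x2) ** 2) / (2.0 * d12)

euclid = lambda a, b: np.linalg.norm(a - b)

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8))
x1, x2 = data[0], data[1]                    # pivot objects (arbitrary here)
proj = np.array([line_projection(x, x1, x2, euclid) for x in data])
t = np.median(proj)                          # balanced split of the database
bits = (proj >= t).astype(np.uint8)          # one hash bit per object
```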

【Keywords】:

95. SpecLDA: Modeling Product Reviews and Specifications to Generate Augmented Specifications.

Paper Link】 【Pages】:837-845

【Authors】: Dae Hoon Park ; ChengXiang Zhai ; Lifan Guo

【Abstract】: Product specifications are often available for a product on E-commerce websites. However, novice customers often do not have enough knowledge to understand all features of a product, especially advanced features. In order to provide useful knowledge to the customers, we propose to automatically generate augmented product specifications, which contain relevant opinions about product feature values, feature importance, and product-specific words. Specifically, we propose a novel Specification Latent Dirichlet Allocation (SpecLDA) that enables us to effectively model product reviews and specifications at the same time. It mines review texts relevant to a feature value in order to inform customers what other customers have said about the feature value in reviews of the same product and also of different products. SpecLDA can also infer the importance of each feature and identify which words are specific to each product, helping customers understand products quickly. Experimental results show that SpecLDA can effectively model product reviews with specifications. The model can be used for any text collection with specification-style (key-value) prior knowledge.

【Keywords】:

96. Mining Multi-Relational Gradual Patterns.

Paper Link】 【Pages】:846-854

【Authors】: NhatHai Phan ; Dino Ienco ; Donato Malerba ; Pascal Poncelet ; Maguelonne Teisseire

【Abstract】: Gradual patterns highlight covariations of attributes of the form “The more/less X, the more/less Y”. Their usefulness in several applications has recently stimulated the development of several algorithms for their automated discovery from large datasets. However, existing techniques require all the interesting data to be in a single database relation or table. This paper extends the notion of gradual pattern to the case in which the co-variations are possibly expressed between attributes of different database relations. The interestingness measure for this class of “relational gradual patterns” is defined on the basis of both Kendall's τ and gradual supports. Moreover, this paper proposes two algorithms, named τRGP Miner and gRGP Miner, for the discovery of relational gradual rules. Three pruning strategies to reduce the search space are proposed. The efficiency of the algorithms is empirically validated, and the usefulness of relational gradual patterns is demonstrated on several real-world databases.
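
A toy single-table check of the intuition behind Kendall's τ as a gradualness measure: it counts how often pairs of rows co-vary in the same direction. The data below are made up, and the paper's relational, multi-table setting and gradual supports are considerably more involved.

```python
import numpy as np
from scipy.stats import kendalltau

# Does "the more age, the more salary" hold in this toy relation?
age    = np.array([23, 31, 28, 45, 52, 39])
salary = np.array([30, 42, 35, 58, 61, 50])

tau, p = kendalltau(age, salary)
print(f"tau={tau:.2f}")   # close to 1 => strong support for the gradual pattern
```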

【Keywords】:

97. Modeling User Arguments, Interactions, and Attributes for Stance Prediction in Online Debate Forums.

Paper Link】 【Pages】:855-863

【Authors】: Minghui Qiu ; Yanchuan Sim ; Noah A. Smith ; Jing Jiang

【Abstract】: Online debate forums are important social media for people to voice their opinions and debate with each other. Mining user stances or viewpoints from these forums has been a popular research topic. However, most current work does not address an important problem: for a specific issue, there may not be many users participating and expressing their opinions. Despite the sparsity of user stances, users may provide rich side information; for example, users may write arguments to back up their stances, interact with each other, and provide biographical information. In this work, we propose an integrated model to leverage side information. Our proposed method is a regression-based latent factor model which jointly models user arguments, interactions, and attributes. Our method can perform stance prediction for both warm-start and cold-start users. We demonstrate in experiments that our method has promising results on both micro-level and macro-level stance prediction.

【Keywords】:

98. Predicting Preference Tags to Improve Item Recommendation.

Paper Link】 【Pages】:864-872

【Authors】: Tanwistha Saha ; Huzefa Rangwala ; Carlotta Domeniconi

【Abstract】: Collaborative filtering (CF) based recommender systems identify and recommend interesting items to a given user based on the user's past rating activity. These systems improve their recommendations by identifying user preferences and item-related information from external sources, like reviews written by users, or concept tags shared by users about these items. These preferences are often reflected through multi-criteria ratings. In this study, we seek to improve recommender systems by integrating user preferences as side information within standard neighborhood-based and matrix factorization based methods. We assume that a user's choice of tags for an item provides additional information about the user's personal preference and additional features about the item. Since querying users for tags and multi-criteria ratings imposes an additional burden on the user base, we propose using collective classification to predict tags for both the users and items. We also investigate the use of active learning approaches integrated within the collective classification framework when tag information (for users or items) is limited. Our experimental results on several real world datasets show the advantages of using tag-based information within recommender systems. We are also able to show the effectiveness of collective classification algorithms in estimating user preferences and item features.

【Keywords】:

99. Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency.

Paper Link】 【Pages】:873-881

【Authors】: Vinícius M. A. de Souza ; Diego Furtado Silva ; João Gama ; Gustavo E. A. P. A. Batista

【Abstract】: Data stream classification algorithms for nonstationary environments frequently assume the availability of class labels, instantly or with some lag after the classification. However, certain applications, mainly those related to sensors and robotics, involve high costs to obtain new labels during the classification phase. Such a scenario, in which the actual labels of processed data are never available, is called extreme verification latency. Extreme verification latency requires new classification methods capable of adapting to possible changes over time without external supervision. This paper presents a fast, simple, intuitive and accurate algorithm to classify nonstationary data streams in an extreme verification latency scenario, namely the Stream Classification Algorithm Guided by Clustering (SCARGC). Our method consists of a clustering step followed by a classification step, applied repeatedly in a closed-loop fashion. We show, in several classification tasks evaluated on synthetic and real data, that our method is faster and more accurate than the state-of-the-art.
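
A closed-loop cluster-then-classify sketch in the spirit of SCARGC: each unlabeled batch is clustered, each new centroid inherits the label of the nearest previous centroid, and points take their centroid's label. This follows the abstract loosely (one cluster per class, no centroid pooling), not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def scargc_sketch(X_init, y_init, batches, k):
    """Track drifting classes through unlabeled batches via centroid matching."""
    centroids = np.array([X_init[y_init == c].mean(axis=0) for c in range(k)])
    labels = np.arange(k)
    out = []
    for Xb in batches:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xb)
        # label each new centroid by its nearest old centroid
        d = np.linalg.norm(km.cluster_centers_[:, None] - centroids[None], axis=2)
        new_labels = labels[d.argmin(axis=1)]
        out.append(new_labels[km.labels_])
        centroids, labels = km.cluster_centers_, new_labels  # adapt to drift
    return out
```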

【Keywords】:

100. Mining Block I/O Traces for Cache Preloading with Sparse Temporal Non-parametric Mixture of Multivariate Poisson.

Paper Link】 【Pages】:882-890

【Authors】: Lavanya Sita Tekumalla ; Chiranjib Bhattacharyya

【Abstract】: Existing caching strategies in the storage domain, though well suited to exploit short-range spatio-temporal patterns, are unable to leverage long-range motifs for improving hit rates. Motivated by this, we investigate novel Bayesian non-parametric modeling (BNP) techniques for count vectors, to capture long-range correlations for cache preloading by mining block I/O traces. Such traces comprise a sequence of memory accesses that can be aggregated into high-dimensional sparse correlated count vector sequences. While there are several state-of-the-art BNP algorithms for clustering and their temporal extensions for prediction, there has been no work on exploring these for correlated count vectors. Our first contribution addresses this gap by proposing a DP based mixture model of Multivariate Poisson (DP-MMVP) and its temporal extension (HMM-DP-MMVP) that captures the full covariance structure of multivariate count data. However, modeling the full covariance structure for count vectors is computationally expensive, particularly for high dimensional data. Hence, we exploit sparsity in our count vectors, and as our main contribution, introduce the Sparse DP mixture of multivariate Poisson (Sparse-DP-MMVP), generalizing our DP-MMVP mixture model, which also leads to more efficient inference. We then discuss a temporal extension to our model for cache preloading. We take the first step towards mining historical data to capture long-range patterns in storage traces for cache preloading. Experimentally, we show a dramatic improvement in hit rates on benchmark traces and lay the groundwork for further research in the storage domain to reduce latencies using data mining techniques that capture long-range motifs.

【Keywords】:

101. Taming the Empirical Hubness Risk in Many Dimensions.

Paper Link】 【Pages】:891-899

【Authors】: Nenad Tomasev

【Abstract】: The hubness phenomenon has recently come into focus as an important aspect of the curse of dimensionality that affects many instance-based learning systems. It has to do with the long-tailed distribution of instance relevance within the models, where a small number of hub points dominates the analysis and influences many predictions. High data hubness is therefore often linked to poor system performance. In this paper, we re-examine several hubness-aware metric learning strategies that propose to improve system performance by reducing the expected data hubness. Instead of observing only the expected hubness degrees, our comparisons are aimed at evaluating the shape of the induced hubness degree distribution, in order to better estimate the associated hubness risk. It is revealed that the distribution of hubness degree tends to become highly skewed in many dimensions, so that many samples exhibit substantially higher hubness than expected. We argue that the long-tailed high-hubness-degree-variance metrics can be susceptible to highly detrimental high-hubness events, even if they reduce the expected hubness of the data. The experiments indicate significant differences between the compared hubness-aware metric learning approaches and show that simhubs entails the lowest overall hubness risk with increasing dimensionality. This is further shown to improve classifier stability for k-nearest neighbor classification methods.
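
To make the quantity concrete: the standard empirical hubness indicator is the skewness of the k-occurrence distribution, where N_k(x) counts how often x appears among other points' k nearest neighbors. A short sketch of that computation; the synthetic high-dimensional data below are only for illustration, and the paper's risk analysis goes further by examining the shape and tail of this distribution.

```python
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

def hubness_skewness(X, k=10):
    """Return the skewness of the k-occurrence counts N_k, plus the counts."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    counts = np.bincount(idx[:, 1:].ravel(), minlength=len(X))  # drop self
    return skew(counts), counts

rng = np.random.default_rng(0)
s, Nk = hubness_skewness(rng.normal(size=(500, 100)))  # high-dim: high skew
print(s)
```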

【Keywords】:

102. Scalable Clustering of Time Series with U-Shapelets.

Paper Link】 【Pages】:900-908

【Authors】: Liudmila Ulanova ; Nurjahan Begum ; Eamonn J. Keogh

【Abstract】: A recently introduced primitive for time series data mining, unsupervised shapelets (u-shapelets), has demonstrated significant potential for time series clustering. In contrast to approaches that consider the entire time series to compute pairwise similarities, the u-shapelets technique allows considering only relevant subsequences of time series. Moreover, u-shapelets allow us to bypass the apparent chicken-and-egg paradox of defining relevant with reference to the clustering itself. U-shapelets have several advantages over rival methods. First, they are defined even when the time series are of different lengths; for example, they allow clustering datasets containing a mixture of single heartbeats and multi-beat ECG recordings. Second, u-shapelets mitigate sensitivity to irrelevant data such as noise, spikes, dropouts, etc. Finally, u-shapelets have demonstrated the ability to provide additional insights into the data. Unfortunately, the state-of-the-art algorithms for u-shapelet search are intractable, and so their advantages have only been demonstrated on tiny datasets. We propose a simple approach to speed up u-shapelet discovery by two orders of magnitude, without any significant loss in clustering quality.
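
For background, a brute-force sketch of the u-shapelet "gap" score from the original u-shapelet line of work: sorted minimum distances between a candidate and every series are split into a near group D_A and a far group D_B, and gap = (μ_B − s_B) − (μ_A + s_A). This is exactly the expensive computation the paper speeds up; z-normalization, used in the original work, is omitted here for brevity.

```python
import numpy as np

def min_dist(shapelet, series):
    """Length-normalized minimum distance of a candidate to one series."""
    m = len(shapelet)
    return min(np.linalg.norm(series[i:i + m] - shapelet) / np.sqrt(m)
               for i in range(len(series) - m + 1))

def gap_score(shapelet, dataset):
    """Larger gap => the candidate separates the dataset more cleanly."""
    d = np.sort([min_dist(shapelet, s) for s in dataset])
    best = 0.0
    for split in range(1, len(d)):
        A, B = d[:split], d[split:]
        best = max(best, (B.mean() - B.std()) - (A.mean() + A.std()))
    return best

rng = np.random.default_rng(0)
data = [rng.normal(size=60) for _ in range(20)]
print(gap_score(data[0][10:25], data))   # score one candidate subsequence
```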

【Keywords】:

103. Causal Inference by Direction of Information.

Paper Link】 【Pages】:909-917

【Authors】: Jilles Vreeken

【Abstract】: We focus on data-driven causal inference. In particular, we propose a new principle for causal inference based on algorithmic information theory, i.e. Kolmogorov complexity. In a nutshell, we determine how much information one data object gives about the other, and vice versa, and identify the most likely causal direction by the strongest direction of information. To apply this principle in practice, we propose ERGO, an efficient instantiation for inferring the causal direction between multivariate real-valued data pairs. ERGO is based on cumulative and Shannon entropy. As a result, we neither have to assume particular distributions nor restrict the type of correlation. Extensive empirical evaluation on synthetic, benchmark, and real-world data shows that ERGO is robust against both noise and dimensionality, efficient, and outperforms the state of the art by a wide margin.

【Keywords】:

104. Graph Regularized Meta-path Based Transductive Regression in Heterogeneous Information Network.

Paper Link】 【Pages】:918-926

【Authors】: Mengting Wan ; Yunbo Ouyang ; Lance M. Kaplan ; Jiawei Han

【Abstract】: A number of real-world networks are heterogeneous information networks, which are composed of different types of nodes and links. Numerical prediction in heterogeneous information networks is a significant but challenging problem, because the network-based information available for unlabeled objects is usually too limited to make precise estimations. In this paper, we consider a graph regularized meta-path based transductive regression model (Grempt), which combines the principal philosophies of typical graph-based transductive classification methods and transductive regression models designed for homogeneous networks. The computation of our method is time- and space-efficient, and the precision of our model is verified by numerical experiments.
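
A minimal sketch of the generic graph-regularized transductive regression that Grempt builds on: minimize the squared error on labeled nodes plus λ·fᵀLf, which has a closed form. Here W is a plain homogeneous similarity matrix for illustration; Grempt instead works with meta-path based similarities in heterogeneous networks and a more elaborate objective.

```python
import numpy as np

def transductive_regression(W, y, labeled, lam=0.5):
    """Solve min_f sum_{i labeled} (f_i - y_i)^2 + lam * f' L f in closed form,
    where L = D - W is the graph Laplacian of similarity matrix W."""
    L = np.diag(W.sum(axis=1)) - W
    S = np.diag(labeled.astype(float))            # selects labeled nodes
    # small ridge keeps the system well-conditioned
    return np.linalg.solve(S + lam * L + 1e-9 * np.eye(len(W)), S @ y)

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([2.0, 0.0, 0.0, 5.0])               # unlabeled entries ignored
labeled = np.array([True, False, False, True])
print(transductive_regression(W, y, labeled))     # unlabeled values are smoothed
```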

【Keywords】:

105. Localizing Temporal Anomalies in Large Evolving Graphs.

Paper Link】 【Pages】:927-935

【Authors】: Teng Wang ; Chunsheng Victor Fang ; Derek Lin ; Shyhtsun Felix Wu

【Abstract】: Mining for anomalies in graph structured datasets is an important and challenging problem for many applications, including security, health care, and social media. In this paper, we propose a novel framework to localize temporal anomalies in large evolving graphs with a reduced false alarm rate. Specifically, we first introduce a node-centric model based on Vector Autoregression to analyze node behavior history in dynamic graphs. Then we develop two community-centric models to reduce the amount of false positive results by tracking the structural change and dynamics of graph communities. We analyze the performance of our proposed anomaly localization framework on several synthetic and real-world data sets, including Enron email network data, enterprise network traffic data, and the public CNN Facebook page. All experimental results show the effectiveness and consistency of our framework in localizing temporal anomalies with a reduced false alarm rate.
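
A hedged sketch of the node-centric step only: fit a Vector Autoregression (via statsmodels) to one node's multivariate behavior history and flag time points with unusually large residuals. The feature columns, the injected anomaly, and the 3-sigma threshold are all illustrative assumptions; the paper's community-centric false-positive filtering is not shown.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
# hypothetical per-timestep behavior features for one node
hist = pd.DataFrame(rng.normal(size=(200, 3)),
                    columns=["msgs_sent", "msgs_recv", "new_contacts"])
hist.iloc[150] += 8.0                        # inject a behavioral anomaly

res = VAR(hist).fit(maxlags=2)               # node-centric autoregressive model
resid_norm = np.linalg.norm(res.resid, axis=1)
thresh = resid_norm.mean() + 3 * resid_norm.std()
print(res.resid.index[resid_norm > thresh])  # candidate anomalous time points
```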

【Keywords】:

106. Non-exhaustive, Overlapping k-means.

Paper Link】 【Pages】:936-944

【Authors】: Joyce Jiyoung Whang ; Inderjit S. Dhillon ; David F. Gleich

【Abstract】: Traditional clustering algorithms, such as k-means, output a clustering that is disjoint and exhaustive, that is, every single data point is assigned to exactly one cluster. However, in real datasets, clusters can overlap and there are often outliers that do not belong to any cluster. This is a well-recognized problem that has received much attention in the past, and several algorithms, such as fuzzy k-means, have been proposed for overlapping clustering. However, most existing algorithms address either overlap or outlier detection and do not tackle the problem in a unified way. In this paper, we propose a simple and intuitive objective function that captures the issues of overlap and non-exhaustiveness in a unified manner. Our objective function can be viewed as a reformulation of the traditional k-means objective, with easy-to-understand parameters that capture the degrees of overlap and non-exhaustiveness. By studying the objective, we are able to obtain a simple iterative algorithm which we call NEO-K-Means (Non-Exhaustive Overlapping K-Means). Furthermore, by considering an extension to weighted kernel k-means, we can tackle the case of non-exhaustive and overlapping graph clustering. This extension allows us to apply our NEO-K-Means algorithm to the community detection problem, which is an important task in network analysis. Our experimental results show that the new objective and algorithm are effective in finding ground-truth clusterings that have varied overlap and non-exhaustiveness; for the case of graphs, we show that our algorithm outperforms state-of-the-art overlapping community detection methods.
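
To illustrate the two parameters, a greedy assignment-step sketch: make (1+α)n assignments in total while allowing βn points to remain unassigned, so some points land in several clusters and some in none. This loosely follows the paper's iterative algorithm; the mean-update loop and the exact tie-breaking are omitted.

```python
import numpy as np

def neo_assign(X, C, alpha=0.2, beta=0.05):
    """One NEO-K-Means-style assignment step (sketch).

    Returns a boolean n x k assignment matrix U with (1+alpha)*n total
    assignments, where up to beta*n points may stay unassigned."""
    n, k = len(X), len(C)
    D = np.linalg.norm(X[:, None] - C[None], axis=2)   # n x k distances
    U = np.zeros((n, k), dtype=bool)
    # cover the n - beta*n points closest to some cluster first
    for i in D.min(axis=1).argsort()[: n - int(beta * n)]:
        U[i, D[i].argmin()] = True
    # spend the remaining assignments greedily on the smallest distances
    extra = int(alpha * n) + int(beta * n)
    for idx in np.argsort(D, axis=None):
        if extra == 0:
            break
        i, j = divmod(idx, k)
        if not U[i, j]:
            U[i, j] = True
            extra -= 1
    return U
```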

【Keywords】:

107. Festival, Date and Limit Line: Predicting Vehicle Accident Rate in Beijing.

Paper Link】 【Pages】:945-953

【Authors】: Xinyu Wu ; Ping Luo ; Qing He ; Tianshu Feng ; Fuzhen Zhuang

【Abstract】: Thousands of vehicle accidents happen every day in Beijing, leading to huge losses. The government traffic management bureau, hospitals, and insurance companies devote massive manpower and material resources to dealing with accidents. For more reasonable resource assignment, in this study we focus on the prediction of the daily Vehicle Accident Rate (VAR), namely the percentage of vehicles with accidents. Specifically, we analyze how the variation of VAR correlates with macroscopic features, such as Chinese festivals, the date, and the tail-number limit line, and develop a prediction model for VAR based on these features. Our analysis is based on two years of accident records for vehicles insured by a local insurance giant in Beijing. Experiments show that the proposed model can predict the long-term VAR at least three months in advance, with satisfactory results. Note also that our study is based on local conditions in Beijing with Chinese characteristics. It not only helps government bureaus and insurance companies operate more efficiently, but also helps reveal underlying characteristics of the Chinese capital from a macroscopic perspective.

【Keywords】:

108. A Multi-label Least-Squares Hashing For Scalable Image Search.

Paper Link】 【Pages】:954-962

【Authors】: Shengsheng Wang ; Zi Huang ; Xin-Shun Xu

【Abstract】: Recently, hashing methods have attracted more and more attention for their effectiveness in large-scale data search, e.g., on image and video data. For different scenarios, unsupervised, supervised and semi-supervised hashing methods have been proposed. In particular, when semantic information is available, supervised hashing methods show better performance than unsupervised ones. In many practical applications, one sample usually has more than one label, which has been considered by multi-label learning; however, few supervised hashing methods consider such a scenario. In this paper, we propose a Multi-label Least-Squares Hashing (MLSH) method for multi-label data hashing. It works directly on multi-label data; moreover, unlike other hashing methods, which learn hashing functions directly on the original data, MLSH first utilizes the equivalent form of CCA and least squares to project the original multi-label data into a lower-dimensional space; then, in the lower-dimensional space, it learns the projection matrix and obtains the final binary codes of the data. MLSH is tested on NUS-WIDE and CIFAR-100, which are widely used for search tasks. The results show that MLSH outperforms several state-of-the-art hashing methods, both supervised and unsupervised.
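
A crude least-squares hashing stand-in, only to convey the flavor of the pipeline: learn a linear map from features to the multi-label matrix by least squares, then binarize the projection per dimension. MLSH's CCA-equivalent formulation and code learning are more elaborate than this sketch, and all data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                    # features
Y = (rng.random((500, 10)) < 0.2).astype(float)   # multi-label matrix

P, *_ = np.linalg.lstsq(X, Y, rcond=None)         # 64 x 10 projection
Z = X @ P                                         # low-dimensional embedding
codes = (Z > np.median(Z, axis=0)).astype(np.uint8)   # 10-bit binary codes
```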

【Keywords】:

109. Spatiotemporal Event Forecasting in Social Media.

Paper Link】 【Pages】:963-971

【Authors】: Liang Zhao ; Feng Chen ; Chang-Tien Lu ; Naren Ramakrishnan

【Abstract】: Event forecasting in Twitter is an important and challenging problem. Most existing approaches focus on forecasting temporal events (such as elections and sports) and do not consider spatial features and their underlying correlations. In this paper, we propose a generative model for spatiotemporal event forecasting in Twitter. Our model characterizes the underlying development of future events by jointly modeling the structural contexts and spatiotemporal burstiness. An effective inference algorithm is developed to train the model parameters. Utilizing the trained model, the alignment likelihood of tweet sequences is calculated by dynamic programming. Extensive experimental evaluations on two different domains demonstrated the effectiveness of our proposed approach.

【Keywords】:

110. Back Matter.

Paper Link】 【Pages】:972-976

【Authors】:

【Abstract】: Backmatter includes the author index.

【Keywords】: