Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015. ACM 【DBLP Link】
【Paper Link】 【Pages】:1
【Authors】: Ron Kohavi
【Abstract】: The Internet provides developers of connected software, including web sites, applications, and devices, an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using trustworthy controlled experiments (e.g., A/B tests and their generalizations). From front-end user-interface changes to backend recommendation systems and relevance algorithms, from search engines (e.g., Google, Microsoft's Bing, Yahoo) to retailers (e.g., Amazon, eBay, Netflix, Etsy) to social networking services (e.g., Facebook, LinkedIn, Twitter) to travel services (e.g., Expedia, Airbnb, Booking.com) to many startups, online controlled experiments are now utilized to make data-driven decisions at a wide range of companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher's experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale (e.g., hundreds of experiments run every day at Bing) and deployment of online controlled experiments across dozens of web sites and applications has taught us many lessons. We provide an introduction, share real examples, key lessons, and cultural challenges.
【Keywords】: a/b testing; controlled experiments; online controlled experiments
【Paper Link】 【Pages】:3
【Authors】: Daphne Koller
【Abstract】: It has been nearly four years since the first MOOCs (massive open online courses) were offered by Stanford University. MOOCs are now offered to tens of millions of learners worldwide, by hundreds of top universities. MOOCs are no longer an experiment - the learning, reach, and value they offer are now a reality. I will show how MOOCs provide opportunities for open-ended projects, intercultural learner interactions, and collaborative learning. I will discuss some of the data that we are collecting from MOOCs, and what we are learning from these data about both courses and learners. Finally, I will discuss both data and examples of the kind of transformative impact that can be derived from providing millions of people with access to the world's best education.
【Keywords】: massive open online courses; teaching and learning
【Paper Link】 【Pages】:5-6
【Authors】: Susan Athey
【Abstract】: A large literature on causal inference in statistics, econometrics, biostatistics, and epidemiology (see, e.g., Imbens and Rubin [2015] for a recent survey) has focused on methods for statistical estimation and inference in a setting where the researcher wishes to answer a question about the (counterfactual) impact of a change in a policy, or "treatment" in the terminology of the literature. The policy change has not necessarily been observed before, or may have been observed only for a subset of the population; examples include a change in minimum wage law or a change in a firm's price. The goal is then to estimate the impact of a small set of "treatments" using data from randomized experiments or, more commonly, "observational" studies (that is, non-experimental data). The literature identifies a variety of assumptions that, when satisfied, allow the researcher to draw the same types of conclusions that would be available from a randomized experiment. To estimate causal effects given non-random assignment of individuals to alternative policies in observational studies, popular techniques include propensity score weighting, matching, and regression analysis; all of these methods adjust for differences in observed attributes of individuals. Another strand of literature in econometrics, referred to as "structural modeling," fully specifies the preferences of actors as well as a behavioral model, and estimates those parameters from data (for applications to auction-based electronic commerce, see Athey and Haile [2007] and Athey and Nekipelov [2012]). In both cases, parameter estimates are interpreted as "causal," and they are used to make predictions about the effect of policy changes. In contrast, the supervised machine learning literature has traditionally focused on prediction, providing data-driven approaches to building rich models and relying on cross-validation as a powerful tool for model selection. These methods have been highly successful in practice. This talk will review several recent papers that attempt to bring the tools of supervised machine learning to bear on the problem of policy evaluation, where the papers are connected by three themes. The first theme is that it is important for both estimation and inference to distinguish between parts of the model that relate to the causal question of interest, and "attributes," that is, features or variables that describe attributes of individual units that are held fixed when policies change. Specifically, we propose to divide the features of a model into causal features, whose values may be manipulated in a counterfactual policy environment, and attributes. A second theme is that relative to conventional tools from the policy evaluation literature, tools from supervised machine learning can be particularly effective at modeling the association of outcomes with attributes, as well as in modeling how causal effects vary with attributes. A final theme is that modifications of existing methods may be required to deal with the "fundamental problem of causal inference," namely, that no unit is observed in multiple counterfactual worlds at the same time: we do not see a patient at the same time with and without medication, and we do not see a consumer at the same moment exposed to two different prices. This creates a substantial challenge for cross-validation, as the ground truth for the causal effect is not observed for any individual.
【Keywords】: a/b tests; causal inference; counterfactual prediction; cross-validation; model robustness; policy evaluation; randomized experiments; supervised machine learning; treatment effects
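【Code sketch】: The talk abstract above mentions propensity score weighting as one standard adjustment for non-random treatment assignment. The following is a minimal, self-contained illustration of an inverse-propensity-weighted estimate of an average treatment effect; the synthetic data-generating process and all constants are assumptions made for the example, not anything from the talk.

    # Minimal inverse-propensity-weighting (IPW) sketch on synthetic data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 5000
    x = rng.normal(size=(n, 2))                       # observed attributes
    p_treat = 1.0 / (1.0 + np.exp(-(x[:, 0] - 0.5)))  # treatment depends on x -> confounding
    t = rng.binomial(1, p_treat)                      # observed, non-random treatment
    y = 2.0 * t + x[:, 0] + rng.normal(size=n)        # true treatment effect is 2.0

    # The naive difference in means is biased because treated units differ in x.
    naive = y[t == 1].mean() - y[t == 0].mean()

    # Estimate propensity scores from the observed attributes, then reweight.
    ps = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
    ipw = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
    print(f"naive estimate: {naive:.3f}, IPW estimate: {ipw:.3f} (truth: 2.0)")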
【Paper Link】 【Pages】:7
【Authors】: Hugh Durrant-Whyte
【Abstract】: Increasingly it is data, vast amounts of data, that drives scientific discovery. At the heart of this so-called "fourth paradigm of science" is the rapid development of large scale statistical data fusion and machine learning methods. While these developments in "big data" methods are largely driven by commercial applications such as internet search or customer modelling, the opportunity for applying these to scientific discovery is huge. This talk will describe a number of applied machine learning projects addressing real-world inference problems in physical, life and social science areas. In particular, I will describe a major Science and Industry Endowment Fund (SIEF) project, in collaboration with the NICTA and Macquarie University, looking to apply machine learning techniques to discovery in the natural sciences. This talk will look at the key methods in machine learning that are being applied to the discovery process, especially in areas like geology, ecology and biological discovery.
【Keywords】: data science
【Paper Link】 【Pages】:9-18
【Authors】: Sungjin Ahn ; Anoop Korattikara ; Nathan Liu ; Suju Rajan ; Max Welling
【Abstract】: Despite having various attractive qualities such as high prediction accuracy and the ability to quantify uncertainty and avoid over-fitting, Bayesian Matrix Factorization has not been widely adopted because of the prohibitive cost of inference. In this paper, we propose a scalable distributed Bayesian matrix factorization algorithm using stochastic gradient MCMC. Our algorithm, based on Distributed Stochastic Gradient Langevin Dynamics, can not only match the prediction accuracy of standard MCMC methods like Gibbs sampling, but at the same time is as fast and simple as stochastic gradient descent. In our experiments, we show that our algorithm can achieve the same level of prediction accuracy as Gibbs sampling an order of magnitude faster. We also show that our method reduces the prediction error as fast as distributed stochastic gradient descent, achieving a 4.1% improvement in RMSE for the Netflix dataset and a 1.8% improvement for the Yahoo music dataset.
【Keywords】: bayesian inference; distributed; large-scale; matrix factorization; mcmc; stochastic gradient
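【Code sketch】: As a rough single-machine illustration of the stochastic gradient Langevin dynamics (SGLD) update that the paper's distributed algorithm builds on, the sketch below applies noisy stochastic-gradient steps to a matrix factorization with Gaussian priors. It is not the paper's distributed algorithm; the synthetic ratings and all hyperparameters are illustrative assumptions.

    # Single-machine SGLD sketch for Bayesian-flavoured matrix factorization.
    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, rank = 100, 50, 5
    U_true = rng.normal(size=(n_users, rank))
    V_true = rng.normal(size=(n_items, rank))
    ratings = []                                        # (user, item, rating) triples
    for _ in range(2000):
        u, i = rng.integers(n_users), rng.integers(n_items)
        ratings.append((u, i, U_true[u] @ V_true[i] + 0.1 * rng.normal()))
    N = len(ratings)

    U = 0.1 * rng.normal(size=(n_users, rank))          # current factor sample
    V = 0.1 * rng.normal(size=(n_items, rank))
    lam, eps = 1.0, 5e-6                                # prior precision, step size

    for _ in range(200_000):
        u, i, r = ratings[rng.integers(N)]
        err = r - U[u] @ V[i]
        # SGLD with a minibatch of one rating: scale the single-example likelihood
        # gradient by N, add the Gaussian-prior gradient, and inject N(0, eps) noise.
        grad_u = N * err * V[i] - lam * U[u]
        grad_v = N * err * U[u] - lam * V[i]
        U[u] = U[u] + 0.5 * eps * grad_u + np.sqrt(eps) * rng.normal(size=rank)
        V[i] = V[i] + 0.5 * eps * grad_v + np.sqrt(eps) * rng.normal(size=rank)

    rmse = np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in ratings]))
    print(f"training RMSE of the current factor sample: {rmse:.3f}")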
【Paper Link】 【Pages】:19-28
【Authors】: Tim Althoff ; Xin Luna Dong ; Kevin Murphy ; Safa Alai ; Van Dang ; Wei Zhang
【Abstract】: We present a method called TIMEMACHINE to generate a timeline of events and relations for entities in a knowledge base. For example for an actor, such a timeline should show the most important professional and personal milestones and relationships such as works, awards, collaborations, and family relationships. We develop three orthogonal timeline quality criteria that an ideal timeline should satisfy: (1) it shows events that are relevant to the entity; (2) it shows events that are temporally diverse, so they distribute along the time axis, avoiding visual crowding and allowing for easy user interaction, such as zooming in and out; and (3) it shows events that are content diverse, so they contain many different types of events (e.g., for an actor, it should show movies and marriages and awards, not just movies). We present an algorithm to generate such timelines for a given time period and screen size, based on submodular optimization and web-co-occurrence statistics with provable performance guarantees. A series of user studies using Mechanical Turk shows that all three quality criteria are crucial to produce quality timelines and that our algorithm significantly outperforms various baseline and state-of-the-art methods.
【Keywords】: knowledge base; submodular optimization; summarization; timeline
【Paper Link】 【Pages】:29-38
【Authors】: Laurent Amsaleg ; Oussama Chelly ; Teddy Furon ; Stéphane Girard ; Michael E. Houle ; Ken-ichi Kawarabayashi ; Michael Nett
【Abstract】: This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl's expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation (MLE), the method of moments (MoM), probability weighted moments (PWM), and regularly varying functions (RV). An experimental evaluation is also provided, using both real and artificial data.
【Keywords】: indiscriminability; intrinsic dimension; manifold learning
【Paper Link】 【Pages】:39-48
【Authors】: Émilien Antoine ; Adam Jatowt ; Shoko Wakamiya ; Yukiko Kawai ; Toyokazu Akiyama
【Abstract】: Microblogging platforms such as Twitter have recently been frequently used for detecting real-time events. The spatial component, as reflected by user location, usually plays a key role in such systems. However, an often neglected source of spatial information is location mentions expressed in tweet contents. In this paper we demonstrate a novel visualization system for analyzing how Twitter users collectively talk about space and for uncovering correlations between geographical locations of Twitter users and the locations they tweet about. Our exploratory analysis is based on the development of a model of spatial information extraction and representation that allows building an effective visual analytics framework for large scale datasets. We show visualization results based on a half-year-long dataset of Japanese tweets and a four-month-long collection of tweets from the USA. The proposed system allows observing many space related aspects of tweet messages including the average scope of spatial attention of social media users and variances in spatial interest over time. The analytical framework we provide and the findings we outline can be valuable for scientists from diverse research areas and for any users interested in geographical and social aspects of shared online data.
【Keywords】: location mention; social network; spatial analysis; twitter; visualization
【Paper Link】 【Pages】:49-58
【Authors】: Nurjahan Begum ; Liudmila Ulanova ; Jun Wang ; Eamonn J. Keogh
【Abstract】: Clustering time series is a useful operation in its own right, and an important subroutine in many higher-level data mining analyses, including data editing for classifiers, summarization, and outlier detection. While it has been noted that the general superiority of Dynamic Time Warping (DTW) over Euclidean Distance for similarity search diminishes as we consider ever larger datasets, as we shall show, the same is not true for clustering. Thus, clustering time series under DTW remains a computationally challenging task. In this work, we address this lethargy in two ways. We propose a novel pruning strategy that exploits both upper and lower bounds to prune off a large fraction of the expensive distance calculations. This pruning strategy is admissible; giving us provably identical results to the brute force algorithm, but is at least an order of magnitude faster. For datasets where even this level of speedup is inadequate, we show that we can use a simple heuristic to order the unavoidable calculations in a most-useful-first ordering, thus casting the clustering as an anytime algorithm. We demonstrate the utility of our ideas with both single and multidimensional case studies in the domains of astronomy, speech physiology, medicine and entomology.
【Keywords】: anytime algorithms; clustering; time series
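【Code sketch】: The following is an illustrative sketch of lower-bound pruning for DTW nearest-neighbour search, the general mechanism this paper exploits (with both upper and lower bounds and an anytime ordering) to speed up clustering under DTW. The window width and random-walk data are assumptions for the example only.

    # Lower-bound pruning: compute the cheap LB_Keogh bound first and run the
    # expensive DTW only when the bound does not already exceed the best-so-far.
    import numpy as np

    def dtw(a, b, w):
        """Squared-error DTW with a Sakoe-Chiba band of half-width w."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(max(1, i - w), min(m, i + w) + 1):
                cost = (a[i - 1] - b[j - 1]) ** 2
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def lb_keogh(q, c, w):
        """LB_Keogh: cheap lower bound on the banded DTW distance between q and c."""
        lb = 0.0
        for i, qi in enumerate(q):
            window = c[max(0, i - w): i + w + 1]
            lo, hi = window.min(), window.max()
            if qi > hi:
                lb += (qi - hi) ** 2
            elif qi < lo:
                lb += (qi - lo) ** 2
        return lb

    def nn_with_pruning(query, candidates, w=5):
        best, best_d, dtw_calls = None, np.inf, 0
        for k, c in enumerate(candidates):
            if lb_keogh(query, c, w) >= best_d:      # pruned without computing DTW
                continue
            d = dtw(query, c, w)                     # expensive exact distance
            dtw_calls += 1
            if d < best_d:
                best, best_d = k, d
        return best, best_d, dtw_calls

    rng = np.random.default_rng(0)
    series = [np.cumsum(rng.normal(size=128)) for _ in range(200)]
    q = np.cumsum(rng.normal(size=128))
    idx, dist, calls = nn_with_pruning(q, series)
    print(f"nearest series: {idx}, DTW distance: {dist:.2f}, DTW computed {calls}/200 times")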
【Paper Link】 【Pages】:59-68
【Authors】: Albert Bifet ; Gianmarco De Francisci Morales ; Jesse Read ; Geoff Holmes ; Bernhard Pfahringer
【Abstract】: The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
【Keywords】: classification; data streams; evaluation; online learning
【Paper Link】 【Pages】:69-78
【Authors】: Karla L. Caballero Barajas ; Ram Akella
【Abstract】: In this paper, we present a method to dynamically estimate the probability of mortality inside the Intensive Care Unit (ICU) by combining heterogeneous data. We propose a method based on Generalized Linear Dynamic Models that models the probability of mortality as a latent state that evolves over time. This framework allows us to combine different types of features (lab results, vital signs readings, doctor and nurse notes, etc.) into a single state, which is updated each time new patient data is observed. In addition, we include the use of text features, based on medical noun phrase extraction and Statistical Topic Models. These features provide context about the patient that cannot be captured when only numerical features are used. We fill in the missing values using a Regularized Expectation Maximization based method assuming temporal data. We test our proposed approach using 15,000 Electronic Medical Records (EMRs) obtained from the MIMIC II public dataset. Experimental results show that the proposed model allows us to detect an increase in the probability of mortality before it occurs. We report an AUC of 0.8657. Our proposed model clearly outperforms other methods in the literature in terms of sensitivity, with 0.7885 compared to 0.6559 for Naive Bayes, and F-score, with 0.5929 compared to 0.4662 for the Apache III score after 24 hours.
【Keywords】: dynamic linear models; mortality prediction; statistical topic models; text mining
【Paper Link】 【Pages】:79-88
【Authors】: Yongjie Cai ; Hanghang Tong ; Wei Fan ; Ping Ji ; Qing He
【Abstract】: Mining time series data has been a very active research area in the past decade, exactly because of its prevalence in many high-impact applications, ranging from environmental monitoring, intelligent transportation systems, computer network forensics, to smart buildings and many more. It has posed many fascinating research questions. Among others, three prominent challenges shared by a variety of real applications are (a) high-order; (b) contextual constraints and (c) temporal smoothness. The state-of-the-art mining algorithms are rich in addressing each of these challenges, but relatively short of comprehensiveness in attacking the coexistence of multiple or even all of these three challenges. In this paper, we propose a comprehensive method, FACETS, to simultaneously model all these three challenges. We formulate it as an optimization problem from a dynamic graphical model perspective. The key idea is to use tensor factorization to address multi-aspect challenges, and perform careful regularizations to attack both contextual and temporal challenges. Based on that, we propose an effective and scalable algorithm to solve the problem. Our experimental evaluations on three real datasets demonstrate that our method (1) outperforms its competitors in two common data mining tasks (imputation and prediction); and (2) enjoys a linear scalability w.r.t. the length of time series.
【Keywords】: a network of time series; tensor factorization
【Paper Link】 【Pages】:89-98
【Authors】: Lei Cao ; Mingrui Wei ; Di Yang ; Elke A. Rundensteiner
【Abstract】: Traditional outlier detection systems process each individual outlier detection request instantiated with a particular parameter setting one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to hone in on the appropriate parameter setting and desired results. In this work, we present the first online outlier exploration platform, called ONION, that enables analysts to effectively explore anomalies even in large datasets. First, ONION features an innovative interactive anomaly exploration model that offers an "outlier-centric panorama" into big datasets along with rich classes of exploration operations. Second, to achieve this model ONION employs an online processing framework composed of a one-time offline preprocessing phase followed by an online exploration phase that enables users to interactively explore the data. The preprocessing phase compresses raw big data into a knowledge-rich ONION abstraction that encodes critical interrelationships of outlier candidates so as to support subsequent interactive outlier exploration. For the interactive exploration phase, our ONION framework provides several processing strategies that efficiently support the outlier exploration operations. Our user study with real data confirms the effectiveness of ONION in recognizing "true" outliers. Furthermore, as demonstrated by our extensive experiments with large datasets, ONION supports all exploration operations with millisecond response times.
【Keywords】: online exploration; outlier; parameter setting
【Paper Link】 【Pages】:99-108
【Authors】: Shayok Chakraborty ; Vineeth Nallure Balasubramanian ; Adepu Ravi Sankar ; Sethuraman Panchanathan ; Jieping Ye
【Abstract】: Active learning algorithms automatically identify the salient and exemplar instances from large amounts of unlabeled data and thus reduce human annotation effort in inducing a classification model. More recently, Batch Mode Active Learning (BMAL) techniques have been proposed, where a batch of data samples is selected simultaneously from an unlabeled set. Most active learning algorithms assume a flat label space, that is, they consider the class labels to be independent. However, in many applications, the set of class labels are organized in a hierarchical tree structure, with the leaf nodes as outputs and the internal nodes as clusters of outputs at multiple levels of granularity. In this paper, we propose a novel BMAL algorithm (BatchRank) for hierarchical classification. The sample selection is posed as an NP-hard integer quadratic programming problem and a convex relaxation (based on linear programming) is derived, whose solution is further improved by an iterative truncated power method. Finally, a deterministic bound is established on the quality of the solution. Our empirical results on several challenging, real-world datasets from multiple domains, corroborate the potential of the proposed framework for real-world hierarchical classification applications.
【Keywords】: active learning; hierarchical classification; optimization
【Paper Link】 【Pages】:109-118
【Authors】: Tanmoy Chakraborty ; Sikhar Patranabis ; Pawan Goyal ; Animesh Mukherjee
【Abstract】: The availability of an overwhelmingly large amount of bibliographic information including citation and co-authorship data makes it imperative to have a systematic approach that will enable an author to organize her own personal academic network profitably. An effective method could be to have one's co-authorship network arranged into a set of ``circles'', which has been a recent practice for organizing relationships (e.g., friendship) in many online social networks. In this paper, we propose an unsupervised approach to automatically detect circles in an ego network such that each circle represents a densely knit community of researchers. Our model is an unsupervised method which combines a variety of node features and node similarity measures. The model is built from rich co-authorship network data of more than 800,000 authors. In the first level of evaluation, our model achieves a 13.33% improvement in terms of overlapping modularity compared to the best among four state-of-the-art community detection methods. Further, we conduct a task-based evaluation -- two basic frameworks for collaboration prediction are considered with the circle information (obtained from our model) included in the feature set. Experimental results show that including the circle information detected by our model improves the prediction performance by 9.87% and 15.25% on average in terms of AUC (Area under the ROC) and Prec@20 (Precision at Top 20) respectively, compared to the case where the circle information is not present.
【Keywords】: circles; ego networks; prediction
【Paper Link】 【Pages】:119-128
【Authors】: Shiyu Chang ; Wei Han ; Jiliang Tang ; Guo-Jun Qi ; Charu C. Aggarwal ; Thomas S. Huang
【Abstract】: Data embedding is used in many machine learning applications to create low-dimensional feature representations, which preserves the structure of data points in their original space. In this paper, we examine the scenario of a heterogeneous network with nodes and content of various types. Such networks are notoriously difficult to mine because of the bewildering combination of heterogeneous contents and structures. The creation of a multidimensional embedding of such data opens the door to the use of a wide variety of off-the-shelf mining techniques for multidimensional data. Despite the importance of this problem, limited efforts have been made on embedding a network of scalable, dynamic and heterogeneous data. In such cases, both the content and linkage structure provide important cues for creating a unified feature representation of the underlying network. In this paper, we design a deep embedding algorithm for networked data. A highly nonlinear multi-layered embedding function is used to capture the complex interactions between the heterogeneous data in a network. Our goal is to create a multi-resolution deep embedding function, that reflects both the local and global network structures, and makes the resulting embedding useful for a variety of data mining tasks. In particular, we demonstrate that the rich content and linkage information in a heterogeneous network can be captured by such an approach, so that similarities among cross-modal data can be measured directly in a common embedding space. Once this goal has been achieved, a wide variety of data mining problems can be solved by applying off-the-shelf algorithms designed for handling vector representations. Our experiments on real-world network datasets show the effectiveness and scalability of the proposed algorithm as compared to the state-of-the-art embedding methods.
【Keywords】: cross-domain knowledge propagation; deep learning; dimensionality reduction; feature learning; heterogeneous embedding; network embedding
【Paper Link】 【Pages】:129-138
【Authors】: Rui Chen ; Qian Xiao ; Yu Zhang ; Jianliang Xu
【Abstract】: Releasing high-dimensional data enables a wide spectrum of data mining tasks. Yet, individual privacy has been a major obstacle to data sharing. In this paper, we consider the problem of releasing high-dimensional data with differential privacy guarantees. We propose a novel solution to preserve the joint distribution of a high-dimensional dataset. We first develop a robust sampling-based framework to systematically explore the dependencies among all attributes and subsequently build a dependency graph. This framework is coupled with a generic threshold mechanism to significantly improve accuracy. We then identify a set of marginal tables from the dependency graph to approximate the joint distribution based on the solid inference foundation of the junction tree algorithm while minimizing the resultant error. We prove that selecting the optimal marginals with the goal of minimizing error is NP-hard and, thus, design an approximation algorithm using an integer programming relaxation and the constrained concave-convex procedure. Extensive experiments on real datasets demonstrate that our solution substantially outperforms the state-of-the-art competitors.
【Keywords】: dependency graph; differential privacy; high-dimensional data; joint distribution; junction tree algorithm
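【Code sketch】: The release of any single marginal table in such a scheme can rest on the basic Laplace mechanism; the sketch below shows only that building block on one two-way marginal, not the paper's dependency-graph construction or junction-tree selection of marginals. Attribute domains, data and the privacy budget are illustrative.

    # Basic Laplace mechanism on one two-way marginal (contingency) table.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    age_bucket = rng.integers(0, 8, size=n)      # 8 age buckets (synthetic records)
    income_bucket = rng.integers(0, 5, size=n)   # 5 income buckets

    # True marginal: an 8 x 5 table of counts.
    true_marginal = np.zeros((8, 5))
    np.add.at(true_marginal, (age_bucket, income_bucket), 1)

    # Adding or removing one record changes one cell count by 1, so the
    # L1 sensitivity of this marginal is 1.
    epsilon = 0.5
    noisy = true_marginal + rng.laplace(0.0, 1.0 / epsilon, size=true_marginal.shape)

    max_err = np.abs(noisy - true_marginal).max()
    print(f"largest absolute cell error at epsilon={epsilon}: {max_err:.1f}")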
【Paper Link】 【Pages】:139-148
【Authors】: Flavio Chierichetti ; Alessandro Epasto ; Ravi Kumar ; Silvio Lattanzi ; Vahab S. Mirrokni
【Abstract】: We introduce the public-private model of graphs. In this model, we have a public graph and each node in the public graph has an associated private graph. The motivation for studying this model stems from social networks, where the nodes are the users, the public graph is visible to everyone, and the private graph at each node is visible only to the user at the node. From each node's viewpoint, the graph is just a union of its private graph and the public graph. We consider the problem of efficiently computing various properties of the graphs from each node's point of view, with minimal amount of recomputation on the public graph. To illustrate the richness of our model, we explore two powerful computational paradigms for studying large graphs, namely, sketching and sampling, and focus on some key problems in social networks and show efficient algorithms in the public-private graph model. In the sketching model, we show how to efficiently approximate the neighborhood function, which in turn can be used to approximate various notions of centrality. In the sampling model, we focus on all-pair shortest path distances, node similarities, and correlation clustering.
【Keywords】: graph algorithms; privacy; social networks
【Paper Link】 【Pages】:149-158
【Authors】: Bo-Yu Chu ; Chia-Hua Ho ; Cheng-Hao Tsai ; Chieh-Yen Lin ; Chih-Jen Lin
【Abstract】: In linear classification, a regularization term effectively remedies the overfitting problem, but selecting a good regularization parameter is usually time consuming. We consider cross validation for the selection process, so several optimization problems under different parameters must be solved. Our aim is to devise effective warm-start strategies to efficiently solve this sequence of optimization problems. We investigate in detail the relationship between optimal solutions of logistic regression/linear SVM and regularization parameters. Based on the analysis, we develop an efficient tool to automatically find a suitable parameter for users with no related background knowledge.
【Keywords】: linear classification; regularization parameter; warm start
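【Code sketch】: A toy version of the warm-start idea: solve L2-regularized logistic regression for a grid of regularization parameters, initializing each solve at the previous optimum instead of from scratch. The plain gradient-descent solver, synthetic data and parameter grid are illustrative and far simpler than the analysis and tool described in the paper.

    # Warm-started regularization path for L2-regularized logistic regression.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 2000, 20
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = np.where(X @ w_true + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)

    def solve(X, y, lam, w0, lr=0.1, tol=1e-6, max_iter=10_000):
        """Gradient descent on the regularized logistic loss, started from w0."""
        w = w0.copy()
        for it in range(max_iter):
            z = -y * (X @ w)
            g = X.T @ (-y / (1.0 + np.exp(-z))) / len(y) + lam * w
            if np.linalg.norm(g) < tol:
                return w, it
            w -= lr * g
        return w, max_iter

    lambdas = [1.0, 0.3, 0.1, 0.03, 0.01]
    w = np.zeros(d)
    for lam in lambdas:
        w_cold, it_cold = solve(X, y, lam, np.zeros(d))   # cold start from zero
        w, it_warm = solve(X, y, lam, w)                  # warm start from previous optimum
        print(f"lambda={lam:g}: cold iterations={it_cold}, warm iterations={it_warm}")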
【Paper Link】 【Pages】:159-168
【Authors】: Edith Cohen
【Abstract】: Unaggregated data, in a streamed or distributed form, is prevalent and comes from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave. Analytics on such data typically utilizes statistics expressed as a sum over keys in a specified segment of a function f applied to the frequency (the total number of occurrences) of the key. In particular, Distinct is the number of active keys in the segment, Sum is the sum of their frequencies, and both are special cases of frequency cap statistics, which cap the frequency by a parameter T. One important application of cap statistics is staging advertisement campaigns, where the cap parameter is the limit of the maximum number of impressions per user and we estimate the total number of qualifying impressions. The number of distinct active keys in the data can be very large, making exact computation of queries costly. Instead, we can estimate these statistics from a sample. An optimal sample for a given function f would include a key with frequency w with probability roughly proportional to f(w). But while such a "gold-standard" sample can be easily computed over the aggregated data (the set of key-frequency pairs), exact aggregation itself is costly and slow. Ideally, we would like to compute and maintain a sample without aggregation. We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our l-capped samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistics, and close to gold-standard estimates for frequency cap statistics with T=Θ(l). Furthermore, our design facilitates multi-objective samples, which provide tight estimates for a specified set of statistics using a single smaller sample.
【Keywords】: frequency capping; sampling; streaming; unaggregated data
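【Code sketch】: To make the target quantity concrete, the sketch below computes the "gold-standard" frequency-cap statistic over fully aggregated data; the paper's contribution is estimating exactly this kind of statistic from a small sample in a single pass over the unaggregated stream, which this sketch does not attempt. The stream and the cap value are illustrative.

    # Exact ("gold-standard") frequency-cap statistics over aggregated data.
    from collections import Counter
    import random

    random.seed(0)
    # Unaggregated stream of elements keyed by user id; keys interleave freely.
    stream = [f"user{random.randint(1, 1000)}" for _ in range(50_000)]

    T = 5                                       # frequency cap, e.g. impressions per user
    freq = Counter(stream)                      # the costly exact aggregation: key -> frequency

    distinct = len(freq)                             # the cap statistic with T = 1
    total = sum(freq.values())                       # the cap statistic with T = infinity
    capped = sum(min(f, T) for f in freq.values())   # general frequency-cap statistic

    print(f"distinct keys: {distinct}, total frequency: {total}, cap-{T} statistic: {capped}")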
【Paper Link】 【Pages】:169-178
【Authors】: Corinna Cortes ; Mehryar Mohri ; Andres Muñoz Medina
【Abstract】: We present a new algorithm for domain adaptation improving upon the discrepancy minimization algorithm (DM), which was previously shown to outperform a number of popular algorithms designed for this task. Unlike most previous approaches adopted for domain adaptation, our algorithm does not consist of a fixed reweighting of the losses over the training sample. Instead, it uses a reweighting that depends on the hypothesis considered and is based on the minimization of a new measure of generalized discrepancy. We give a detailed description of our algorithm and show that it can be formulated as a convex optimization problem. We also present a detailed theoretical analysis of its learning guarantees, which helps us select its parameters. Finally, we report the results of experiments demonstrating that it improves upon the DM algorithm in several tasks.
【Keywords】: domain adaptation; learning theory
【Paper Link】 【Pages】:179-188
【Authors】: Zhicheng Cui ; Wenlin Chen ; Yujie He ; Yixin Chen
【Abstract】: Additive tree models (ATMs) are widely used for data mining and machine learning. Important examples of ATMs include random forest, adaboost (with decision trees as weak learners), and gradient boosted trees, and they are often referred to as the best off-the-shelf classifiers. Though capable of attaining high accuracy, ATMs are not well interpretable in the sense that they do not provide actionable knowledge for a given instance. This greatly limits the potential of ATMs on many applications such as medical prediction and business intelligence, where practitioners need suggestions on actions that can lead to desirable outcomes with minimum costs. To address this problem, we present a novel framework to post-process any ATM classifier to extract an optimal actionable plan that can change a given input to a desired class with a minimum cost. In particular, we prove the NP-hardness of the optimal action extraction problem for ATMs and formulate this problem in an integer linear programming formulation which can be efficiently solved by existing packages. We also empirically demonstrate the effectiveness of the proposed framework by conducting comprehensive experiments on challenging real-world datasets.
【Keywords】: actionable knowledge; additive tree model; decision tree; integer linear programming; random forest
【Paper Link】 【Pages】:189-198
【Authors】: Robin Devooght ; Nicolas Kourtellis ; Amin Mantrach
【Abstract】: Advanced and effective collaborative filtering methods based on explicit feedback assume that unknown ratings do not follow the same model as the observed ones (not missing at random). In this work, we build on this assumption, and introduce a novel dynamic matrix factorization framework that allows us to set an explicit prior on unknown values. When new ratings, users, or items enter the system, we can update the factorization in time independent of the size of the data (number of users, items and ratings). Hence, we can quickly recommend items even to very recent users. We test our methods on three large datasets, including two very sparse ones, in static and dynamic conditions. In each case, we outperform state-of-the-art matrix factorization methods that do not use a prior on unknown ratings.
【Keywords】: collaborative filtering; matrix factorization; recommender systems
【Paper Link】 【Pages】:199-208
【Authors】: Yuxiao Dong ; Jing Zhang ; Jie Tang ; Nitesh V. Chawla ; Bai Wang
【Abstract】: We study the problem of link prediction in coupled networks, where we have the structure information of one (source) network and the interactions between this network and another (target) network. The goal is to predict the missing links in the target network. The problem is extremely challenging as we do not have any information of the target network. Moreover, the source and target networks are usually heterogeneous and have different types of nodes and links. How to utilize the structure information in the source network for predicting links in the target network? How to leverage the heterogeneous interactions between the two networks for the prediction task? We propose a unified framework, CoupledLP, to solve the problem. Given two coupled networks, we first leverage atomic propagation rules to automatically construct implicit links in the target network for addressing the challenge of target network incompleteness, and then propose a coupled factor graph model to incorporate the meta-paths extracted from the coupled part of the two networks for transferring heterogeneous knowledge. We evaluate the proposed framework on two different genres of datasets: disease-gene (DG) and mobile social networks. In the DG networks, we aim to use the disease network to predict the associations between genes. In the mobile networks, we aim to use the mobile communication network of one mobile operator to infer the network structure of its competitors. On both datasets, the proposed CoupledLP framework outperforms several alternative methods. The proposed problem of coupled link prediction and the corresponding framework demonstrate both the scientific and business applications in biology and social networks.
【Keywords】: coupled networks; healthcare; link prediction; mobile communication networks; social networks
【Paper Link】 【Pages】:209-218
【Authors】: Liang Du ; Yi-Dong Shen
【Abstract】: The problem of feature selection has attracted considerable interest in the past decade. Traditional unsupervised methods select the features which can faithfully preserve the intrinsic structures of data, where the intrinsic structures are estimated using all the input features of data. However, the estimated intrinsic structures are unreliable/inaccurate when the redundant and noisy features are not removed. Therefore, we face a dilemma here: one needs the true structures of data to identify the informative features, and one needs the informative features to accurately estimate the true structures of data. To address this, we propose a unified learning framework which performs structure learning and feature selection simultaneously. The structures are adaptively learned from the results of feature selection, and the informative features are reselected to preserve the refined structures of data. By leveraging the interactions between these two essential tasks, we are able to capture accurate structures and select more informative features. Experimental results on many benchmark data sets demonstrate that the proposed method outperforms many state-of-the-art unsupervised feature selection methods.
【Keywords】: adaptive structure learning; unsupervised feature selection
【Paper Link】 【Pages】:219-228
【Authors】: Nan Du ; Mehrdad Farajtabar ; Amr Ahmed ; Alexander J. Smola ; Le Song
【Abstract】: Clusters in document streams, such as online news articles, can be induced by their textual contents, as well as by the temporal dynamics of their arriving patterns. Can we leverage both sources of information to obtain a better clustering of the documents, and distill information that is not possible to extract using contents only? In this paper, we propose a novel random process, referred to as the Dirichlet-Hawkes process, to take into account both information in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters according to cluster sizes, present in Dirichlet processes, is now driven according to the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian Nonparametrics and temporal Point Processes, which makes the number of clusters grow to accommodate the increasing complexity of online streaming contents, while at the same time adapts to the ever changing dynamics of the respective continuous arrival time. We conducted large-scale experiments on both synthetic and real world news articles, and show that Dirichlet-Hawkes processes can recover both meaningful topics and temporal dynamics, which leads to better predictive performance in terms of content perplexity and arrival time of future documents.
【Keywords】: dirichlet process; document modeling; hawkes process
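【Code sketch】: A minimal sketch of the temporal ingredient only: an exponential-kernel Hawkes process, with intensity evaluation and simulation by thinning. The Dirichlet-Hawkes process couples many such cluster-wise intensities with a Dirichlet-process prior and with document content, none of which appears here; all parameters are illustrative.

    # Exponential-kernel Hawkes process: intensity and simulation by thinning.
    import numpy as np

    def intensity(t, events, mu, alpha, beta):
        """lambda(t) = mu + alpha * sum_i exp(-beta * (t - t_i)) over past events t_i < t."""
        return mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in events if ti < t)

    def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
        """Ogata-style thinning: between events the intensity only decays, so the
        intensity just after the current time upper-bounds it until the next event."""
        rng = np.random.default_rng(seed)
        events, t = [], 0.0
        while True:
            lam_bar = intensity(t, events, mu, alpha, beta) + alpha  # valid upper bound
            t += rng.exponential(1.0 / lam_bar)
            if t > horizon:
                return events
            if rng.uniform() <= intensity(t, events, mu, alpha, beta) / lam_bar:
                events.append(t)                                     # candidate accepted

    events = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.0, horizon=100.0)
    print(f"simulated {len(events)} events; intensity at the horizon: "
          f"{intensity(100.0, events, 0.5, 0.8, 1.0):.3f}")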
【Paper Link】 【Pages】:229-238
【Authors】: Ethan R. Elenberg ; Karthikeyan Shanmugam ; Michael Borokhovich ; Alexandros G. Dimakis
【Abstract】: We study the problem of approximating the 3-profile of a large graph. 3-profiles are generalizations of triangle counts that specify the number of times a small graph appears as an induced subgraph of a large graph. Our algorithm uses the novel concept of 3-profile sparsifiers: sparse graphs that can be used to approximate the full 3-profile counts for a given large graph. Further, we study the problem of estimating local and ego 3-profiles, two graph quantities that characterize the local neighborhood of each vertex of a graph. Our algorithm is distributed and operates as a vertex program over the GraphLab PowerGraph framework. We introduce the concept of edge pivoting which allows us to collect 2-hop information without maintaining an explicit 2-hop neighborhood list at each vertex. This enables the computation of all the local 3-profiles in parallel with minimal communication. We test our implementation in several experiments scaling up to 640 cores on Amazon EC2. We find that our algorithm can estimate the 3-profile of a graph in approximately the same time as triangle counting. For the harder problem of ego 3-profiles, we introduce an algorithm that can estimate profiles of hundreds of thousands of vertices in parallel, in the timescale of minutes.
【Keywords】: 3-profiles; graph analytics; graph engines; graph sparsifiers; graphlab; motifs
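【Code sketch】: To pin down the quantity being approximated, the sketch below computes the exact global 3-profile of a small random graph by brute force over all vertex triples; the paper's point is estimating these counts (and their local and ego variants) at scale via sparsifiers on GraphLab, which this baseline does not attempt. Graph size and edge probability are illustrative.

    # Exact brute-force 3-profile of a small graph: for every 3-vertex subset,
    # count how many of its three possible edges are present (0, 1, 2 or 3).
    from itertools import combinations
    import random

    random.seed(0)
    n, p = 60, 0.1
    edges = {(i, j) for i, j in combinations(range(n), 2) if random.random() < p}

    def has_edge(u, v):
        return (min(u, v), max(u, v)) in edges

    # profile[k] = number of induced 3-vertex subgraphs with exactly k edges:
    # k=0 independent set, k=1 single edge, k=2 wedge/path, k=3 triangle.
    profile = [0, 0, 0, 0]
    for a, b, c in combinations(range(n), 3):
        profile[has_edge(a, b) + has_edge(a, c) + has_edge(b, c)] += 1

    print("3-profile (empty, edge, wedge, triangle):", profile)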
【Paper Link】 【Pages】:239-248
【Authors】: Kai Fan ; Marisa Eisenberg ; Alison Walsh ; Allison Aiello ; Katherine A. Heller
【Abstract】: The purpose of this study is to leverage modern technology (mobile or web apps) to enrich epidemiology data and infer the transmission of disease. We develop hierarchical Graph-Coupled Hidden Markov Models (hGCHMMs) to simultaneously track the spread of infection in a small cell phone community and capture person-specific infection parameters by leveraging a link prior that incorporates additional covariates. In this paper we investigate two link functions, the beta-exponential link and sigmoid link, both of which allow the development of a principled Bayesian hierarchical framework for disease transmission. The results of our model allow us to predict the probability of infection for each person on each day, and also to infer personal physical vulnerability and the relevant association with covariates. We demonstrate our approach theoretically and experimentally on both simulated data and real epidemiological records.
【Keywords】: burn-in gibbs em; dynamic bayesian modeling; heterogeneous infection; social networks
【Paper Link】 【Pages】:249-258
【Authors】: Dan Feldman ; Tamir Tassa
【Abstract】: We suggest a generic data reduction technique with provable guarantees for computing the low rank approximation of a matrix under some $\ell_z$ error, and constrained factorizations, such as the Non-negative Matrix Factorization (NMF). Our main algorithm reduces a given n x d matrix into a small, ε-dependent, weighted subset C of its rows (known as a coreset), whose size is independent of both n and d. We then prove that applying existing algorithms on the resulting coreset can be turned into (1+ε)-approximations for the original (large) input matrix. In particular, we provide the first linear time approximation scheme (LTAS) for the rank-one NMF. The coreset C can be computed in parallel and using only one pass over a possibly unbounded stream of row vectors. In this sense we improve the result in [4] (best paper of STOC 2013). Moreover, since C is a subset of these rows, its construction time, as well as its sparsity (number of non-zero entries) and the sparsity of the resulting low rank approximation, depend on the maximum sparsity of an input row, and not on the actual dimension d. In this sense, we improve the result of Liberty [21] and answer affirmatively, and in a more general setting, his open question of computing such a coreset. Source code is provided for reproducing the experiments and integration with existing and future algorithms.
【Keywords】: nnmf; svd; coresets; streaming; distributed; parallel
【Paper Link】 【Pages】:259-268
【Authors】: Michael Feldman ; Sorelle A. Friedler ; John Moeller ; Carlos Scheidegger ; Suresh Venkatasubramanian
【Abstract】: What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender) and an explicit description of the process. When computers are involved, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the process, we propose making inferences based on the data it uses. We present four contributions. First, we link disparate impact to a measure of classification accuracy that while known, has received relatively little attention. Second, we propose a test for disparate impact based on how well the protected class can be predicted from the other attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.
【Keywords】: disparate impact; fairness; machine learning
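【Code sketch】: Two simple checks in the spirit of the paper, run on synthetic data: the selection-rate ("four-fifths") disparate impact ratio, and how well the protected class can be predicted from the remaining attributes, which the paper connects to disparate impact via a balanced-error-rate argument. The data-generating process, selection rule and classifier are illustrative assumptions, not the paper's certification procedure or repair method.

    # Disparate impact ratio and predictability of the protected class.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 5000
    protected = rng.binomial(1, 0.5, size=n)          # protected class indicator
    x1 = rng.normal(loc=protected, scale=1.0)         # feature correlated with the class
    x2 = rng.normal(size=n)                           # feature independent of the class
    X = np.column_stack([x1, x2])
    selected = (x1 + 0.2 * x2 > 0.8).astype(int)      # an apparently "neutral" selection rule

    rate0 = selected[protected == 0].mean()
    rate1 = selected[protected == 1].mean()
    print(f"selection rates: {rate0:.2f} vs {rate1:.2f}, "
          f"disparate impact ratio: {min(rate0, rate1) / max(rate0, rate1):.2f}")

    # How well can the protected class be predicted from the other attributes?
    X_tr, X_te, z_tr, z_te = train_test_split(X, protected, test_size=0.3, random_state=0)
    clf = LogisticRegression().fit(X_tr, z_tr)
    print(f"balanced accuracy of predicting the protected class: "
          f"{balanced_accuracy_score(z_te, clf.predict(X_te)):.2f}")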
【Paper Link】 【Pages】:269-278
【Authors】: Alceu Ferraz Costa ; Yuto Yamaguchi ; Agma Juci Machado Traina ; Caetano Traina Jr. ; Christos Faloutsos
【Abstract】: Can we identify patterns of temporal activities caused by human communications in social media? Is it possible to model these patterns and tell if a user is a human or a bot based only on the timing of their postings? Social media services allow users to make postings, generating large datasets of human activity time-stamps. In this paper we analyze time-stamp data from social media services and find that the distribution of postings inter-arrival times (IAT) is characterized by four patterns: (i) positive correlation between consecutive IATs, (ii) heavy tails, (iii) periodic spikes and (iv) bimodal distribution. Based on our findings, we propose Rest-Sleep-and-Comment (RSC), a generative model that is able to match all four discovered patterns. We demonstrate the utility of RSC by showing that it can accurately fit real time-stamp data from Reddit and Twitter. We also show that RSC can be used to spot outliers and detect users with non-human behavior, such as bots. We validate RSC using real data consisting of over 35 million postings from Twitter and Reddit. RSC consistently provides a better fit to real data and clearly outperforms existing models for human dynamics. RSC was also able to detect bots with a precision higher than 94%.
【Keywords】: generative model; social media; time-series; user behavior
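【Code sketch】: The sketch below computes simple inter-arrival-time diagnostics that probe the kinds of patterns listed in the abstract (correlation of consecutive IATs, heavy tails, bimodality); it is not the RSC generative model, and the session-structured synthetic timestamps are only a stand-in for real Reddit or Twitter posting times.

    # Inter-arrival-time (IAT) diagnostics of posting timestamps.
    import numpy as np

    def iat_diagnostics(timestamps):
        """Summary statistics of the inter-arrival times of a posting history."""
        iat = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
        return {
            "lag1_correlation": float(np.corrcoef(iat[:-1], iat[1:])[0, 1]),
            "p99_over_median": float(np.quantile(iat, 0.99) / np.median(iat)),
            "share_under_5min": float(np.mean(iat < 300)),
            "share_over_1h": float(np.mean(iat > 3600)),
        }

    # Session-structured synthetic posting times: bursts of closely spaced posts
    # separated by long rests, standing in for a real user's timestamp history.
    rng = np.random.default_rng(0)
    gaps = []
    for _ in range(500):
        gaps.extend(rng.exponential(30.0, rng.integers(2, 15)))  # in-session gaps (seconds)
        gaps.append(rng.exponential(8 * 3600.0))                 # long rest between sessions
    print(iat_diagnostics(np.cumsum(gaps)))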
【Paper Link】 【Pages】:279-288
【Authors】: Jeffrey Fisher ; Peter Christen ; Qing Wang ; Erhard Rahm
【Abstract】: Entity resolution (ER) is a common data cleaning task that involves determining which records from one or more data sets refer to the same real-world entities. Because a pairwise comparison of all records scales quadratically with the number of records in the data sets to be matched, it is common to use blocking or indexing techniques to reduce the number of comparisons required. These techniques split the data sets into blocks and only records within blocks are compared with each other. Most existing blocking techniques do not provide control over the size of the generated blocks, despite this control being important in many practical applications of ER, such as privacy-preserving record linkage and real-time ER. We propose two novel hierarchical clustering approaches which can generate blocks within a specified size range, and we present a penalty function which allows control of the trade-off between block quality and block size in the clustering process. We evaluate our techniques on three real-world data sets and compare them against three baseline approaches. The results show our proposed techniques perform well on the measures of pairs completeness and reduction ratio compared to the baseline approaches, while also satisfying the block size restrictions.
【Keywords】: blocking; data cleaning; indexing; record linkage
【Paper Link】 【Pages】:289-298
【Authors】: Seth R. Flaxman ; Yu-Xiang Wang ; Alexander J. Smola
【Abstract】: We present a new solution to the ``ecological inference'' problem, of learning individual-level associations from aggregate data. This problem has a long history and has attracted much attention, debate, claims that it is unsolvable, and purported solutions. Unlike other ecological inference techniques, our method makes use of unlabeled individual-level data by embedding the distribution over these predictors into a vector in Hilbert space. Our approach relies on recent learning theory results for distribution regression, using kernel embeddings of distributions. Our novel approach to distribution regression exploits the connection between Gaussian process regression and kernel ridge regression, giving us a coherent, Bayesian approach to learning and inference and a convenient way to include prior information in the form of a spatial covariance function. Our approach is highly scalable as it relies on FastFood, a randomized explicit feature representation for kernel embeddings. We apply our approach to the challenging political science problem of modeling the voting behavior of demographic groups based on aggregate voting data. We consider the 2012 US Presidential election, and ask: what was the probability that members of various demographic groups supported Barack Obama, and how did this vary spatially across the country? Our results match standard survey-based exit polling data for the small number of states for which it is available, and serve to fill in the large gaps in this data, at a much higher degree of granularity.
【Keywords】: distribution regression; gaussian processes; kernel methods; machine learning; supervised learning
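【Code sketch】: A bare-bones version of distribution regression with random Fourier features: embed each group's individual-level covariates as an (approximate) kernel mean embedding and ridge-regress the group-level outcome on these embeddings. The Gaussian-process treatment, spatial covariance and FastFood features of the paper are all omitted, and the data are synthetic.

    # Distribution regression via random-Fourier-feature mean embeddings + ridge.
    import numpy as np

    rng = np.random.default_rng(0)
    d, D, gamma = 3, 200, 1.0                 # input dim, number of random features, RBF width
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)

    def mean_embedding(X):
        """Random-feature approximation of the kernel mean embedding of X's rows."""
        return np.sqrt(2.0 / D) * np.cos(X @ W + b).mean(axis=0)

    # Synthetic groups: each group's aggregate outcome depends on its covariate center.
    groups, outcomes = [], []
    for _ in range(300):
        center = rng.normal(size=d)
        X = center + 0.5 * rng.normal(size=(rng.integers(50, 200), d))
        groups.append(mean_embedding(X))
        outcomes.append(1.0 / (1.0 + np.exp(-center[0])) + 0.05 * rng.normal())
    Phi, y = np.array(groups), np.array(outcomes)

    # Closed-form ridge regression from embeddings to aggregate outcomes.
    lam = 1e-3
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
    print(f"in-sample RMSE: {np.sqrt(np.mean((Phi @ beta - y) ** 2)):.3f}")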
【Paper Link】 【Pages】:299-308
【Authors】: Yanjie Fu ; Guannan Liu ; Spiros Papadimitriou ; Hui Xiong ; Yong Ge ; Hengshu Zhu ; Chen Zhu
【Abstract】: Mixed land use refers to the effort of putting residential, commercial and recreational uses in close proximity to one another. This can contribute economic benefits, support viable public transit, and enhance the perceived security of an area. It is naturally promising to investigate how to rank real estate from the viewpoint of diverse mixed land use, which can be reflected by the portfolio of community functions in the observed area. To that end, in this paper, we develop a geographical function ranking method, named FuncDivRank, by incorporating the functional diversity of communities into real estate appraisal. Specifically, we first design a geographic function learning model to jointly capture the correlations among estate neighborhoods, urban functions, temporal effects, and user mobility patterns. In this way we can learn latent community functions and the corresponding portfolios of estates from human mobility data and Point of Interest (POI) data. Then, we learn the estate ranking indicator by simultaneously maximizing ranking consistency and functional diversity, in a unified probabilistic optimization framework. Finally, we conduct a comprehensive evaluation with real-world data. The experimental results demonstrate the enhanced performance of the proposed method for real estate appraisal.
【Keywords】: functional diversity; human mobility; mixed-land using; real estate ranking; urban geography
【Paper Link】 【Pages】:309-318
【Authors】: Yasuhiro Fujiwara ; Makoto Nakatsuji ; Hiroaki Shiokawa ; Yasutoshi Ida ; Machiko Toyoda
【Abstract】: Affinity Propagation is a clustering algorithm used in many applications. It iteratively updates messages between data points until convergence. The message updating process enables Affinity Propagation to have higher clustering quality compared with other approaches. However, its computation cost is high; it is quadratic in the number of data points. This is because it updates the messages of all data point pairs. This paper proposes an efficient algorithm that guarantees the same clustering results as the original algorithm. Our approach, F-AP, is based on two ideas: (1) it computes upper and lower estimates to limit the messages to be updated in each iteration, and (2) it dynamically detects converged messages to efficiently skip unneeded updates. Experiments show that F-AP is much faster than previous approaches with no loss in clustering performance.
【Keywords】: affinity propagation; clustering; efficient
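【Code sketch】: For reference, the standard (unaccelerated) Affinity Propagation message updates whose results F-AP reproduces while skipping most updates; none of the paper's upper/lower estimates or convergence detection appears here, and the similarity matrix, damping and iteration budget are illustrative.

    # Plain Affinity Propagation (responsibility/availability updates).
    import numpy as np

    def affinity_propagation(S, damping=0.5, iters=200):
        n = S.shape[0]
        R = np.zeros((n, n))   # responsibilities r(i, k)
        A = np.zeros((n, n))   # availabilities  a(i, k)
        for _ in range(iters):
            # r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
            AS = A + S
            top = np.argmax(AS, axis=1)
            first = AS[np.arange(n), top]
            AS[np.arange(n), top] = -np.inf
            second = AS.max(axis=1)
            R_new = S - first[:, None]
            R_new[np.arange(n), top] = S[np.arange(n), top] - second
            R = damping * R + (1 - damping) * R_new
            # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
            # a(k,k) = sum_{i' != k} max(0, r(i',k))
            Rp = np.maximum(R, 0)
            np.fill_diagonal(Rp, np.diag(R))
            A_new = Rp.sum(axis=0)[None, :] - Rp
            diag = np.diag(A_new).copy()
            A_new = np.minimum(A_new, 0)
            np.fill_diagonal(A_new, diag)
            A = damping * A + (1 - damping) * A_new
        return np.argmax(A + R, axis=1)          # exemplar chosen by each point

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # negative squared distances
    np.fill_diagonal(S, np.median(S))                          # self-similarity = preference
    labels = affinity_propagation(S)
    print(f"number of exemplars found: {len(np.unique(labels))}")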
【Paper Link】 【Pages】:319-328
【Authors】: Moshe Gabel ; Daniel Keren ; Assaf Schuster
【Abstract】: Least squares regression is widely used to understand and predict data behavior in many fields. As data evolves, regression models must be recomputed, and indeed much work has focused on quick, efficient and accurate computation of linear regression models. In distributed streaming settings, however, periodically recomputing the global model is wasteful: communicating new observations or model updates is required even when the model is, in practice, unchanged. This is prohibitive in many settings, such as in wireless sensor networks, or when the number of nodes is very large. The alternative, monitoring prediction accuracy, is not always sufficient: in some settings, for example, we are interested in the model's coefficients, rather than its predictions. We propose the first monitoring algorithm for multivariate regression models of distributed data streams that guarantees a bounded model error. It maintains an accurate estimate using a fraction of the communication by recomputing only when the precomputed model is sufficiently far from the (hypothetical) current global model. When the global model is stable, no communication is needed. Experiments on real and synthetic datasets show that our approach reduces communication by up to two orders of magnitude while providing an accurate estimate of the current global model in all nodes.
【Keywords】: data mining; distributed streams; least squares; regression
【Paper Link】 【Pages】:329-338
【Authors】: Matthias Gallé ; Matías Tealdi
【Abstract】: We analyze the problem of reconstructing documents when we only have access to the n-grams (for n fixed) and their counts from the original documents. Formally, we are interested in recovering the longest contiguous substrings whose presence in the original documents we can be certain of. We map this problem onto a de Bruijn graph, where the n-grams form the edges and where every Eulerian cycle gives a plausible reconstruction. We define two rules that reduce this graph, preserving all possible reconstructions while at the same time increasing the length of the edge labels. From a theoretical perspective we prove that the iterative application of these rules gives an irreducible graph equivalent to the original one. We then apply this to the data from the Gutenberg project to measure the number and size of the obtained longest substrings. Moreover, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low-frequency n-grams has little impact. Instead, we propose another method consisting of strategically adding fictitious n-grams, and show that a corpus noised in this way is much harder to reconstruct while only slightly increasing the perplexity of a language model obtained from it.
【Keywords】: de bruijn graph; eulerian cycles; privacy-preserving data mining
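【Code sketch】: A small sketch of the basic construction behind the paper: build a de Bruijn graph whose edges are the observed n-grams and read off maximal non-branching paths as the substrings whose presence is unambiguous. The sketch ignores n-gram counts and isolated cycles, both of which matter in the paper's full reduction rules; the example sentence is made up.

    # De Bruijn graph over word n-grams, reduced to maximal non-branching paths.
    from collections import defaultdict

    def debruijn_paths(ngrams):
        """ngrams: iterable of word tuples.  Returns maximal non-branching paths."""
        out_edges, indeg, outdeg = defaultdict(list), defaultdict(int), defaultdict(int)
        for g in ngrams:
            u, v = g[:-1], g[1:]                   # (n-1)-gram prefix and suffix
            out_edges[u].append(v)
            outdeg[u] += 1
            indeg[v] += 1

        def one_in_one_out(x):
            return indeg[x] == 1 and outdeg[x] == 1

        paths = []
        for u in set(indeg) | set(outdeg):
            if one_in_one_out(u):
                continue                           # start only at branching/terminal nodes
            for v in out_edges[u]:
                path = [u, v]
                while one_in_one_out(path[-1]):
                    path.append(out_edges[path[-1]][0])
                # Spell the path: first node plus the last word of each following node.
                paths.append(" ".join(path[0] + tuple(p[-1] for p in path[1:])))
        return paths

    sentence = "the cat sat on the mat because the cat was tired".split()
    n = 3
    ngrams = {tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)}
    for p in sorted(debruijn_paths(ngrams), key=len, reverse=True):
        print(p)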
【Paper Link】 【Pages】:339-348
【Authors】: Hongchang Gao ; Lin Yan ; Weidong Cai ; Heng Huang
【Abstract】: In Drosophila gene expression pattern research, the in situ hybridization (ISH) image has become the standard technique to visualize and study the spatial distribution of RNA. To facilitate the search and comparison of Drosophila gene expression patterns during Drosophila embryogenesis, it is highly desirable to annotate the tissue-level anatomical ontology terms for ISH images. In ISH image annotation, the image content representation is crucial to achieve satisfactory results. However, existing methods mainly focus on improving the classification algorithms and only use simple visual descriptors. If we integrate effective local and holistic visual descriptors via a proper learning method, we can achieve more accurate image annotation results than using individual visual descriptors. We propose a novel structured sparsity-inducing norms based feature learning model to integrate the multi-dimensional visual descriptors for Drosophila gene expression pattern annotation. The new mixed norms are designed to learn the importance of different features from both local and global points of view. We successfully integrate six widely used visual descriptors to annotate the Drosophila gene expression patterns from the lateral, dorsal, and ventral views. The empirical results show that the proposed new method can effectively integrate different visual descriptors, and consistently outperforms related methods using the concatenated visual descriptors.
【Keywords】: drosophila gene expression pattern annotation; multi-dimensional feature learning; multi-view data integration
【Paper Link】 【Pages】:349-358
【Authors】: Jinyang Gao ; H. V. Jagadish ; Beng Chin Ooi ; Sheng Wang
【Abstract】: Locality Sensitive Hashing (LSH) and its variants are generally believed to be the most effective radius search methods in high-dimensional spaces. However, many applications involve finding the k nearest neighbors (k-NN), where the k-NN distances of different query points may differ greatly and the performance of LSH suffers. We propose a novel indexing scheme called Selective Hashing, where a disjoint set of indices is built with different granularities and each point is only stored in the most effective index. Theoretically, we show that k-NN search using selective hashing can achieve the same recall as a fixed-radius LSH search using a radius equal to the distance of the (c1·k)-th nearest neighbor, with at most c2 times overhead, where c1 and c2 are small constants. Selective hashing is also easy to build and update, and outperforms state-of-the-art algorithms such as DSH and IsoHash.
【Keywords】: adaptive hashing; data sensitive hashing; high-dimensional indexing; locality sensitive hashing; lsh; selective hashing
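For readers unfamiliar with the baseline being generalized, a minimal sketch of standard p-stable (Gaussian) LSH bucketing follows; it is the fixed-radius building block the abstract refers to, not the Selective Hashing index itself, and all parameter values are illustrative:

import numpy as np

def lsh_keys(X, n_hashes=8, w=4.0, seed=0):
    # Classic p-stable LSH: h(x) = floor((a . x + b) / w); points sharing the
    # full key tuple fall into the same bucket, and smaller w means finer buckets.
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_hashes, X.shape[1]))
    b = rng.uniform(0.0, w, size=n_hashes)
    return np.floor((X @ A.T + b) / w).astype(int)

X = np.random.default_rng(1).normal(size=(1000, 32))
keys = lsh_keys(X)
candidates = np.where((keys == keys[0]).all(axis=1))[0]   # bucket of query point 0
print(len(candidates), "candidate neighbours share the bucket of point 0")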
【Paper Link】 【Pages】:359-368
【Authors】: David F. Gleich ; Michael W. Mahoney
【Abstract】: Graph-based learning methods have a variety of names, including semi-supervised and transductive learning. They typically use a diffusion to propagate labels from a small set of nodes with known class labels to the remaining nodes of the graph. While popular, these algorithms, when implemented in a straightforward fashion, are extremely sensitive to the details of the graph construction. Here, we provide four procedures to help make them more robust: recognizing implicit regularization in the diffusion, using a scalable push method to evaluate the diffusion, using rank-based rounding, and densifying the graph through a matrix polynomial. We study robustness with respect to the details of graph construction, errors in node labeling, degree variability, and a variety of other real-world heterogeneities, and analyze these methods through a precise relationship with mincut problems. For instance, the densification strategy explicitly adds new weighted edges to a sparse graph. We find that this simple densification creates a graph where multiple diffusion methods are robust to several types of errors. This is demonstrated by a study predicting product categories from an Amazon co-purchasing network.
【Keywords】: clustering; diffusions; robustifying; semi-supervised learning
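A minimal diffusion-based label propagation baseline of the kind the abstract makes robust (generic label spreading with simple argmax rounding; the push method, rank-based rounding, and densification from the paper are not shown):

import numpy as np

def diffuse_labels(W, Y, alpha=0.85, iters=200):
    # W: symmetric adjacency matrix (n x n); Y: one-hot seed labels (n x c),
    # zero rows for unlabeled nodes. Iterates F <- alpha*S*F + (1-alpha)*Y.
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    S = W / d[:, None]                     # row-stochastic transition matrix
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)                # simple rounding by largest score

W = np.array([[0,1,1,0,0,0],[1,0,1,0,0,0],[1,1,0,1,0,0],
              [0,0,1,0,1,1],[0,0,0,1,0,1],[0,0,0,1,1,0]], dtype=float)
Y = np.zeros((6, 2)); Y[0, 0] = 1; Y[5, 1] = 1   # one labeled node per triangle
print(diffuse_labels(W, Y))                       # labels the two triangles: [0 0 0 1 1 1]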
【Paper Link】 【Pages】:369-378
【Authors】: Jen J. Gong ; Thoralf M. Sundt ; James D. Rawn ; John V. Guttag
【Abstract】: Accurate risk models for adverse outcomes can provide important input to clinical decision-making. Surprisingly, one of the main challenges when using machine learning to build clinically useful risk models is the small amount of data available. Risk models need to be developed for specific patient populations, specific institutions, specific procedures, and specific outcomes. With each exclusion criterion, the amount of relevant training data decreases, until there is often an insufficient amount to learn an accurate model. This difficulty is compounded by the large class imbalance that is often present in medical applications. In this paper, we present an approach to address the problem of small data using transfer learning methods in the context of developing risk models for cardiac surgeries. We explore ways to build surgery-specific and hospital-specific models (the target task) using information from other kinds of surgeries and other hospitals (source tasks). We propose a novel method to weight examples based on their similarity to the target task training examples to take advantage of the useful examples while discounting less relevant ones. We show that incorporating appropriate source data in training can lead to improved performance over using only target task training data, and that our method of instance weighting can lead to further improvements. Applied to a surgical risk stratification task, our method, which used data from two institutions, performed comparably to the risk model published by the Society for Thoracic Surgeons, which was developed and tested on over one hundred thousand surgeries from hundreds of institutions.
【Keywords】: instance-weighting; risk stratification models; transfer learning
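The instance-weighting idea can be sketched generically with scikit-learn's sample_weight (a simplified stand-in that weights each source example by an RBF similarity to its nearest target example, not the paper's weighting scheme; all names and parameters are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

def fit_with_weighted_source(X_tgt, y_tgt, X_src, y_src, gamma=0.5):
    # Source examples get weights in (0, 1] according to how close they are
    # to the target training set; target examples keep weight 1.
    src_w = rbf_kernel(X_src, X_tgt, gamma=gamma).max(axis=1)
    X = np.vstack([X_tgt, X_src])
    y = np.concatenate([y_tgt, y_src])
    w = np.concatenate([np.ones(len(y_tgt)), src_w])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

rng = np.random.default_rng(0)
X_tgt = rng.normal(size=(40, 5));            y_tgt = (X_tgt[:, 0] > 0).astype(int)
X_src = rng.normal(0.5, 1.5, size=(400, 5)); y_src = (X_src[:, 0] > 0).astype(int)
model = fit_with_weighted_source(X_tgt, y_tgt, X_src, y_src)
print(model.score(X_tgt, y_tgt))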
【Paper Link】 【Pages】:379-386
【Authors】: Aditya Grover ; Ashish Kapoor ; Eric Horvitz
【Abstract】: Weather forecasting is a canonical predictive challenge that has depended primarily on model-based methods. We explore new directions with forecasting weather as a data-intensive challenge that involves inferences across space and time. We study specifically the power of making predictions via a hybrid approach that combines discriminatively trained predictive models with a deep neural network that models the joint statistics of a set of weather-related variables. We show how the base model can be enhanced with spatial interpolation that uses learned long-range spatial dependencies. We also derive an efficient learning and inference procedure that allows for large scale optimization of the model parameters. We evaluate the methods with experiments on real-world meteorological data that highlight the promise of the approach.
【Keywords】: deep learning; gaussian processes; graphical models; machine learning; weather forecasting
【Paper Link】 【Pages】:387-396
【Authors】: David Hallac ; Jure Leskovec ; Stephen Boyd
【Abstract】: Convex optimization is an essential tool for modern data analysis, as it provides a framework to formulate and solve many problems in machine learning and data mining. However, general convex optimization solvers do not scale well, and scalable solvers are often specialized to only work on a narrow class of problems. Therefore, there is a need for simple, scalable algorithms that can solve many common optimization problems. In this paper, we introduce the network lasso, a generalization of the group lasso to a network setting that allows for simultaneous clustering and optimization on graphs. We develop an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in a distributed and scalable manner, which allows for guaranteed global convergence even on large graphs. We also examine a non-convex extension of this approach. We then demonstrate that many types of problems can be expressed in our framework. We focus on three in particular --- binary classification, predicting housing prices, and event detection in time series data --- comparing the network lasso to baseline approaches and showing that it is both a fast and accurate method of solving large optimization problems.
【Keywords】: admm; convex optimization; network lasso
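The network lasso objective described above can be written down directly; a tiny cvxpy sketch of the convex problem is shown below (the paper's contribution is a distributed ADMM solver for it, which is not reproduced here; the data and λ are illustrative):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3                                    # nodes, per-node parameter dimension
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
A = rng.normal(size=(n, 10, d))                # per-node design matrices
b = rng.normal(size=(n, 10))                   # per-node targets
lam = 1.0

X = cp.Variable((n, d))                        # one parameter vector per node
loss = sum(cp.sum_squares(A[i] @ X[i] - b[i]) for i in range(n))
penalty = sum(cp.norm(X[j] - X[k], 2) for j, k in edges)   # unsquared l2 encourages fusion
cp.Problem(cp.Minimize(loss + lam * penalty)).solve()
print(np.round(X.value, 2))                    # adjacent nodes tend to share parameters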
【Paper Link】 【Pages】:397-406
【Abstract】: In multi-task learning (MTL), multiple related tasks are learned jointly by sharing information according to task relations. One promising approach is to utilize a given tree structure, which describes the hierarchical relations among tasks, to learn model parameters under the regularization framework. However, such a priori information is rarely available in most applications. To the best of our knowledge, no existing work learns the tree structure among tasks and the model parameters simultaneously under the regularization framework, and in this paper we develop a TAsk Tree (TAT) model for MTL to achieve this. By specifying the number of layers in the tree as H, the TAT method decomposes the parameter matrix into H component matrices, each of which corresponds to the model parameters in one layer of the tree. In order to learn the tree structure, we devise sequential constraints that make the distance between the parameters in the component matrices corresponding to each pair of tasks decrease over layers; hence, once the component parameters become fused in a layer, they stay fused up to the topmost layer. Moreover, to give the component parameters a chance to fuse in different layers, we develop a structural sparsity regularizer, the sum of the l2 norms of the pairwise differences among the component parameters, to learn a layer-specific task structure. In order to solve the resulting non-convex objective function, we use the general iterative shrinkage and thresholding (GIST) method. By using the alternating direction method of multipliers (ADMM), we decompose the proximal problem in the GIST method into three independent subproblems, where the key subproblem with the sequential constraints admits an efficient solution, as do the other two subproblems. We also provide some theoretical analysis for the TAT model. Experiments on both synthetic and real-world datasets show the effectiveness of the TAT model.
【Keywords】: learning tree structure; multi-task learning
【Paper Link】 【Pages】:407-416
【Abstract】: Numerous models have been proposed for modeling social networks to explore their structure or to address application problems, such as community detection and behavior prediction. However, the results are still far from satisfactory. One of the biggest challenges is how to capture all the information of a social network in a unified manner, such as links, communities, user attributes, roles and behaviors. In this paper, we propose a unified probabilistic framework, the Community Role Model (CRM), to model a social network. CRM incorporates all the information of nodes and edges that form a social network. We propose methods based on Gibbs sampling and an EM algorithm to estimate the model's parameters and fit our model to real social networks. Real data experiments show that CRM can be used not only to represent a social network, but also to handle various application problems with better performance than a baseline model, without any modification to the model, showing its great advantages.
【Keywords】: behavior prediction; community; social network
【Paper Link】 【Pages】:417-426
【Authors】: Kohei Hayashi ; Takanori Maehara ; Masashi Toyoda ; Ken-ichi Kawarabayashi
【Abstract】: Twitter is a "what's-happening-right-now" tool that enables interested parties to follow thoughts and commentary of individual users in nearly real-time. While it is a valuable source of information for real-time topic detection and tracking, Twitter data are not clean because of noisy messages and users, which significantly diminish the reliability of obtained results. In this paper, we integrate both the extraction of meaningful topics and the filtering of messages over the Twitter stream. We develop a streaming algorithm for a sequence of document-frequency tables; our algorithm enables real-time monitoring of the top-10 topics from approximately 25% of all Twitter messages, while automatically filtering noisy and meaningless topics. We apply our proposed streaming algorithm to the Japanese Twitter stream and successfully demonstrate that, compared with other online nonnegative matrix factorization methods, our framework both tracks real-world events with high accuracy in terms of the perplexity and simultaneously eliminates irrelevant topics.
【Keywords】: noise filtering; nonnegative matrix factorization; streaming algorithm; topic detection; twitter
【Paper Link】 【Pages】:427-436
【Authors】: Yangyang Hou ; Joyce Jiyoung Whang ; David F. Gleich ; Inderjit S. Dhillon
【Abstract】: Clustering is one of the most fundamental tasks in data mining. To analyze complex real-world data emerging in many data-centric applications, the problem of non-exhaustive, overlapping clustering has been studied where the goal is to find overlapping clusters and also detect outliers simultaneously. We propose a novel convex semidefinite program (SDP) as a relaxation of the non-exhaustive, overlapping clustering problem. Although the SDP formulation enjoys attractive theoretical properties with respect to global optimization, it is computationally intractable for large problem sizes. As an alternative, we optimize a low-rank factorization of the solution. The resulting problem is non-convex, but has a smaller number of solution variables. We construct an optimization solver using an augmented Lagrangian methodology that enables us to deal with problems with tens of thousands of data points. The new solver provides more accurate and reliable answers than other approaches. By exploiting the connection between graph clustering objective functions and a kernel k-means objective, our new low-rank solver can also compute overlapping communities of social networks with state-of-the-art accuracy.
【Keywords】: community detection; overlapping clustering; semidefinite programming
【Paper Link】 【Pages】:437-446
【Authors】: Hsun-Ping Hsieh ; Shou-De Lin ; Yu Zheng
【Abstract】: This paper tries to answer two questions. First, how to infer the real-time air quality of any arbitrary location given environmental data and historical air quality data from very sparse monitoring locations. Second, if one needs to establish a few new monitoring stations to improve the inference quality, how to determine the best locations for this purpose? The problems are challenging since for most of the locations (>99%) in a city we do not have any air quality data to train a model from. We design a semi-supervised inference model utilizing existing monitoring data together with heterogeneous city dynamics, including meteorology, human mobility, the structure of road networks, and points of interest (POIs). We also propose an entropy-minimization model to suggest the best locations to establish new monitoring stations. We evaluate the proposed approach using Beijing air quality data, showing clear advantages over a series of state-of-the-art and commonly used methods.
【Keywords】: air quality; city dynamics; location recommendation; monitoring station; semi-supervised inference; sensor placement
【Paper Link】 【Pages】:447-456
【Authors】: Shuhei Iitsuka ; Yutaka Matsuo
【Abstract】: Online controlled experiments are widely used to improve the performance of websites by comparing user behavior across different variations of the given website. Although such experiments might have an important effect on the key metrics to be maximized, small-scale websites have difficulty applying this methodology because they have few users. Furthermore, the candidate variations increase exponentially with the number of elements to be optimized. A testing method that finds a high-performing variation with few samples must be devised to address these problems. As described herein, we formalize this problem as a website optimization problem and provide a basis for applying existing search algorithms to it. We further organize existing testing methods and extract devices to make the experiments more effective. By combining the organized algorithms and devices, we propose a rapid testing method that detects high-performing variations with few users. We evaluated our proposed method using simulation experiments. Results show that it outperforms existing methods at any website scale. Moreover, we implemented our proposed method as an optimizer program and used it on an actual small-scale website. Results show that our proposed method can find a variation with 57% higher performance than that found by the generally used A/B testing method. Therefore, our proposed method can optimize a website with fewer samples. The website optimization problem has broad application possibilities, applicable not only to websites but also to manufactured goods.
【Keywords】: a/b testing; controlled experiments; website optimization
【Paper Link】 【Pages】:457-466
【Authors】: Bo Jiang ; Zhi-Li Zhang ; Don Towsley
【Abstract】: Directed links -- representing asymmetric social ties or interactions (e.g., "follower-followee") -- arise naturally in many social networks and other complex networks, giving rise to directed graphs (or digraphs) as basic topological models for these networks. Reciprocity, defined for a digraph as the percentage of edges with a reciprocal edge, is a key metric that has been used in the literature to compare different directed networks and provide "hints" about their structural properties: for example, are reciprocal edges generated randomly by chance or are there other processes driving their generation? In this paper we study the problem of maximizing achievable reciprocity for an ensemble of digraphs with the same prescribed in- and out-degree sequences. We show that the maximum reciprocity hinges crucially on the in- and out-degree sequences, which may be intuitively interpreted as constraints on some "social capacities" of nodes and impose fundamental limits on achievable reciprocity. We show that it is NP-complete to decide the achievability of a simple upper bound on maximum reciprocity, and provide conditions for achieving it. We demonstrate that many real networks exhibit reciprocities surprisingly close to the upper bound, which implies that users in these social networks are in a sense more "social" than suggested by the empirical reciprocity alone in that they are more willing to reciprocate, subject to their "social capacity" constraints. We find some surprising linear relationships between empirical reciprocity and the bound. We also show that a particular type of small network motifs that we call 3-paths are the major source of loss in reciprocity for real networks.
【Keywords】: degree sequence; directed graph; reciprocity; social network
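Empirical reciprocity, the quantity being bounded in the abstract above, is straightforward to compute; a minimal sketch:

def reciprocity(edges):
    # Fraction of directed edges (u, v) whose reverse (v, u) is also present;
    # self-loops and duplicate edges are ignored.
    E = {(u, v) for u, v in edges if u != v}
    if not E:
        return 0.0
    return sum((v, u) in E for (u, v) in E) / len(E)

# The 3-path 2 -> 3 -> 4 contributes unreciprocated edges: 2 of 4 edges reciprocated.
print(reciprocity([(1, 2), (2, 1), (2, 3), (3, 4)]))   # 0.5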
【Paper Link】 【Pages】:467-476
【Authors】: Fredrik D. Johansson ; Devdatt P. Dubhashi
【Abstract】: We develop and apply the Balcan-Blum-Srebro (BBS) theory of classification via similarity functions (which are not necessarily kernels) to the problem of graph classification. First we place the BBS theory into the unifying framework of optimal transport theory. This also opens the way to exploit coupling methods for establishing properties required of a good similarity function as per their definition. Next, we apply the approach to the problem of graph classification via geometric embeddings such as the Laplacian, pseudo-inverse Laplacian and the Lovász orthogonal labellings. We consider the similarity function given by optimal and near-optimal matchings with respect to Euclidean distance of the corresponding embeddings of the graphs in high dimensions. We use optimal couplings to rigorously establish that this yields a "good" similarity measure in the BBS sense for two well known families of graphs. Further, we show that the similarity yields better classification accuracy in practice, on these families, than matchings of other well-known graph embeddings. Finally we perform an extensive empirical evaluation on benchmark data sets where we show that classifying graphs using matchings of geometric embeddings outperforms the previous state-of-the-art methods.
【Keywords】: classification; geometric embeddings; graphs; matchings; similarity functions
【Paper Link】 【Pages】:477-486
【Authors】: Nicholas Johnson ; Arindam Banerjee
【Abstract】: Data mining algorithms for computing solutions to online resource allocation (ORA) problems have focused on budgeting resources currently in possession, e.g., investing in the stock market with cash on hand or assigning current employees to projects. In several settings, one can leverage borrowed resources with which tasks can be accomplished more efficiently and cheaply. Additionally, a variety of opposing allocation types or positions may be available with which one can hedge the allocation to alleviate risk from external changes. In this paper, we present a formulation for hedging online resource allocations with leverage and propose an efficient data mining algorithm (SHERAL). We pose the problem as a constrained online convex optimization problem. The key novel components of our formulation are (1) a loss function for general leveraging and opposing allocation positions and (2) a penalty function which hedges between structurally dependent allocation positions to control risk. We instantiate the problem in the context of portfolio selection and evaluate the effectiveness of the formulation through extensive experiments on five datasets in comparison with existing algorithms and several variants.
【Keywords】: finance; online learning; structured learning
【Paper Link】 【Pages】:487-496
【Authors】: Ata Kaban
【Abstract】: Dot product is a key building block in a number of data mining algorithms, from classification, regression, and correlation clustering to information retrieval and many others. When data is high dimensional, the use of random projections may serve as a universal dimensionality reduction method that provides both low distortion guarantees and computational savings. Yet, contrary to the optimal guarantees that are known for the preservation of the Euclidean distance (cf. the Johnson-Lindenstrauss lemma), the existing guarantees on the dot product under random projection are loose and incomplete in the current data mining and machine learning literature. Some recent literature even suggested that the dot product may not be preserved when the angle between the original vectors is obtuse. In this paper we provide improved bounds on the dot product under random projection that match the optimal bounds on the Euclidean distance. As a corollary, we elucidate the impact of the angle between the original vectors on the relative distortion of the dot product under random projection, and we show that obtuse and acute angles behave symmetrically. In a further corollary we make a link to sign random projection, where we generalise earlier results. Numerical simulations confirm our theoretical results. Finally, we give an application of our results to bounding the generalisation error of compressive linear classifiers under the margin loss.
【Keywords】: compressive classification; dot product preservation; margin preservation; random projection
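A quick numerical illustration of the claim (a sanity-check simulation, not the paper's bounds): an i.i.d. Gaussian random projection preserves the dot product in expectation even when the angle between the vectors is obtuse.

import numpy as np

rng = np.random.default_rng(0)
d, k = 5000, 500                            # original and projected dimensions
x = rng.normal(size=d)
y = -x + 0.3 * rng.normal(size=d)           # nearly opposite vectors -> obtuse angle

R = rng.normal(size=(k, d)) / np.sqrt(k)    # i.i.d. Gaussian random projection
print("original dot product :", round(float(x @ y), 1))
print("projected dot product:", round(float((R @ x) @ (R @ y)), 1))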
【Paper Link】 【Pages】:497-506
【Authors】: Mojtaba Kadkhodaie ; Konstantina Christakopoulou ; Maziar Sanjabi ; Arindam Banerjee
【Abstract】: Recent years have seen a revival of interest in the Alternating Direction Method of Multipliers (ADMM), due to its simplicity, versatility, and scalability. As a first-order method for general convex problems, the rate of convergence of ADMM is O(1/k) [4, 25]. Given the scale of modern data mining problems, an algorithm with similar properties to ADMM but a faster convergence rate can make a big difference in real world applications. In this paper, we introduce the Accelerated Alternating Direction Method of Multipliers (A2DM2), which solves problems with the same structure as ADMM. When the objective function is strongly convex, we show that A2DM2 has an O(1/k^2) convergence rate. Unlike related existing literature on trying to accelerate ADMM, our analysis does not need any additional restrictive assumptions. Through experiments, we show that A2DM2 converges faster than ADMM on a variety of problems. Further, we illustrate the versatility of the general A2DM2 on the problem of learning to rank, where it is shown to be competitive with state-of-the-art specialized algorithms for the problem in both scalability and accuracy.
【Keywords】: alternating direction method of multipliers; ranking on top of the list
【Paper Link】 【Pages】:507-516
【Authors】: Zhengping Che ; David C. Kale ; Wenzhe Li ; Mohammad Taha Bahadori ; Yan Liu
【Abstract】: We apply deep learning to the problem of discovery and detection of characteristic patterns of physiology in clinical time series data. We propose two novel modifications to standard neural net training that address challenges and exploit properties that are peculiar, if not exclusive, to medical data. First, we examine a general framework for using prior knowledge to regularize parameters in the topmost layers. This framework can leverage priors of any form, ranging from formal ontologies (e.g., ICD9 codes) to data-derived similarity. Second, we describe a scalable procedure for training a collection of neural networks of different sizes but with partially shared architectures. Both of these innovations are well-suited to medical applications, where available data are not yet Internet scale and have many sparse outputs (e.g., rare diagnoses) but which have exploitable structure (e.g., temporal order and relationships between labels). However, both techniques are sufficiently general to be applied to other problems and domains. We demonstrate the empirical efficacy of both techniques on two real-world hospital data sets and show that the resulting neural nets learn interpretable and clinically relevant features.
【Keywords】: deep learning; healthcare; medical informatics; multi-label classification; multivariate time series; phenotyping
【Paper Link】 【Pages】:517-526
【Authors】: Janani Kalyanam ; Amin Mantrach ; Diego Sáez-Trumper ; Hossein Vahabi ; Gert R. G. Lanckriet
【Abstract】: Topic discovery and evolution (TDE) is a problem that has gained long-standing interest in the research community. The goal in topic discovery is to identify groups of keywords from large corpora so that the information in those corpora is summarized succinctly. The nature of text corpora has changed dramatically in the past few years with the advent of social media. Social media services allow users to constantly share, follow and comment on posts from other users. Hence, such services have given a new dimension to the traditional text corpus: today's corpora have a social context embedded in them, in terms of the community of users interested in a particular post, their profiles, etc. We wish to harness this social context that comes along with the textual content for TDE. In particular, our goal is to analyze both qualitatively and quantitatively when social context actually helps with TDE. Methodologically, we approach the problem of TDE by proposing a non-negative matrix factorization (NMF) based model that incorporates both the textual information and the social context information. We perform experiments on a large-scale real-world dataset of news articles, and use Twitter as the platform providing information about the social context of these news articles. We compare with and outperform several state-of-the-art baselines. Our conclusion is that using the social context information is most useful when faced with topics that are particularly difficult to detect.
【Keywords】: collective factorization; social networks; topic discovery; topic monitoring; topic tracking
【Paper Link】 【Pages】:527-536
【Authors】: Alexandros Karakasidis ; Georgia Koloniari ; Vassilios S. Verykios
【Abstract】: When dealing with sensitive and personal user data, the process of record linkage raises privacy issues. Thus, privacy preserving record linkage has emerged with the goal of identifying matching records across multiple data sources while preserving the privacy of the individuals they describe. The task is very resource demanding, considering the abundance of available data, which, in addition, are often dirty. Blocking techniques are deployed prior to matching to prune out unlikely to match candidate records so as to reduce processing time. However, when scaling to large datasets, such methods often result in quality loss. To this end, we propose Multi-Sampling Transitive Closure for Encrypted Fields (MS-TCEF), a novel privacy preserving blocking technique based on the use of reference sets. Our new method effectively prunes records based on redundant assignments to blocks, providing better fault-tolerance and maintaining result quality while scaling linearly with respect to the dataset size. We provide a theoretical analysis on the method's complexity and show how it outperforms state-of-the-art privacy preserving blocking techniques with respect to both recall and processing cost.
【Keywords】: performance; private blocking; reference sets
【Paper Link】 【Pages】:537-546
【Authors】: Noriaki Kawamae
【Abstract】: The information overload problem remains serious for both consumers and service/content providers, leading to heightened demand for personalized recommendations. For recommender systems, updating user models is one of the most important tasks for keeping up with changing preferences and trends. Especially since new consumers and items emerge every day, and are promptly rated or reviewed, updating lists of items and rankings is crucial. In this paper, we set the goal of real-time recommendation, to present these items instantly. Unlike standard collaborative filtering algorithms, our offline approach focuses only on innovative consumers for these predictions, and then uses as few consumers as possible while keeping the same precision. Since innovators exist in many communities, and their opinions will spread and then stimulate their followers to adopt the same behavior, our approach is based on the hypothesis that a set of innovative consumers is sufficient to represent the most representative opinions in each community. Following this hypothesis, we derive a scalable method to detect both communities and the innovative consumers in each community from web-scale behavior log data. Our evaluation shows that our proposed weighting method can accurately sample the given logs, and is compatible with previous algorithms for real-time recommendations.
【Keywords】: nonparametric models; personalization; real time recommendations; recommendations; serendipitous; topic models
【Paper Link】 【Pages】:547-556
【Authors】: Emre Kiciman ; Matthew Richardson
【Abstract】: Every day, people take actions, trying to achieve their personal, high-order goals. People decide what actions to take based on their personal experience, knowledge and gut instinct. While this leads to positive outcomes for some people, many others do not have the necessary experience, knowledge and instinct to make good decisions. What if, rather than making decisions based solely on their own personal experience, people could take advantage of the reported experiences of hundreds of millions of other people? In this paper, we investigate the feasibility of mining the relationship between actions and their outcomes from the aggregated timelines of individuals posting experiential microblog reports. Our contributions include an architecture for extracting action-outcome relationships from social media data, techniques for identifying experiential social media messages and converting them to event timelines, and an analysis and evaluation of action-outcome extraction in case studies.
【Keywords】: actions; knowledge base; knowledge base of actions; outcomes; social media
【Paper Link】 【Pages】:557-566
【Authors】: Daniel Kifer
【Abstract】: When analyzing data, it is important to account for all sources of noise. Public use datasets, such as those provided by the Census Bureau, often undergo additional perturbations designed to protect confidentiality. This source of noise is generally ignored in data analysis because crucial parameters and details about its implementation are withheld. In this paper, we consider the problem of inferring such parameters from the data. Specifically, we target data swapping, a perturbation technique commonly used by the U.S. Census Bureau and which, barring practical breakthroughs in disclosure control, will be used in the foreseeable future. The vanilla version of data swapping selects pairs of records and exchanges some of their attribute values. The number of swapped records is kept secret even though it is needed for data analysis and investigations into the confidentiality protection of individual records. We propose algorithms for estimating the number of swapped records in categorical data, even when the true data distribution is unknown.
【Keywords】: data swapping
【Paper Link】 【Pages】:567-576
【Authors】: Hannah Kim ; Jaegul Choo ; Jingu Kim ; Chandan K. Reddy ; Haesun Park
【Abstract】: Understanding large-scale document collections in an efficient manner is an important problem. Usually, document data are associated with other information (e.g., an author's gender, age, and location) and their links to other entities (e.g., co-authorship and citation networks). For the analysis of such data, we often have to reveal common as well as discriminative characteristics of documents with respect to their associated information, e.g., male- vs. female-authored documents, old vs. new documents, etc. To address such needs, this paper presents a novel topic modeling method based on joint nonnegative matrix factorization, which simultaneously discovers common as well as discriminative topics given multiple document sets. Our approach is based on a block-coordinate descent framework and is capable of utilizing only the most representative, thus meaningful, keywords in each topic through a novel pseudo-deflation approach. We perform both quantitative and qualitative evaluations using synthetic as well as real-world document data sets such as research paper collections and nonprofit micro-finance data. We show our method has a great potential for providing in-depth analyses by clearly identifying common and discriminative topics among multiple document sets.
【Keywords】: discriminative pattern mining; nonnegative matrix factorization; topic modeling
【Paper Link】 【Pages】:577-586
【Authors】: Taehwan Kim ; Yisong Yue ; Sarah L. Taylor ; Iain A. Matthews
【Abstract】: We study the problem of learning to predict a spatiotemporal output sequence given an input sequence. In contrast to conventional sequence prediction problems such as part-of-speech tagging (where output sequences are selected using a relatively small set of discrete labels), our goal is to predict sequences that lie within a high-dimensional continuous output space. We present a decision tree framework for learning an accurate non-parametric spatiotemporal sequence predictor. Our approach enjoys several attractive properties, including ease of training, fast performance at test time, and the ability to robustly tolerate corrupted training data using a novel latent variable approach. We evaluate on several datasets, and demonstrate substantial improvements over existing decision tree based sequence learning frameworks such as SEARN and DAgger.
【Keywords】: decision trees; sequence prediction
【Paper Link】 【Pages】:587-596
【Authors】: Younghoon Kim ; Jiawei Han ; Cangzhou Yuan
【Abstract】: With the increasing use of GPS-enabled mobile phones, geo-tagging, which refers to adding GPS information to media such as micro-blogging messages or photos, has seen a surge in popularity recently. This enables us to not only browse information based on locations, but also discover patterns in the location-based behaviors of users. Many techniques have been developed to find the patterns of people's movements using GPS data, but latent topics in text messages posted with local contexts have not been utilized effectively. In this paper, we present a latent topic-based clustering algorithm to discover patterns in the trajectories of geo-tagged text messages. We propose a novel probabilistic model to capture the semantic regions where people post messages with a coherent topic as well as the patterns of movement between the semantic regions. Based on the model, we develop an efficient inference algorithm to calculate model parameters. By exploiting the estimated model, we next devise a clustering algorithm to find the significant movement patterns that appear frequently in data. Our experiments on real-life data sets show that the proposed algorithm finds diverse and interesting trajectory patterns and identifies the semantic regions in a finer granularity than the traditional geographical clustering methods.
【Keywords】: modeling geo-tagged messages; topical trajectory pattern
【Paper Link】 【Pages】:597-606
【Authors】: Dimitrios Kotzias ; Misha Denil ; Nando de Freitas ; Padhraic Smyth
【Abstract】: In many classification problems labels are relatively scarce. One context in which this occurs is where we have labels for groups of instances but not for the instances themselves, as in multi-instance learning. Past work on this problem has typically focused on learning classifiers to make predictions at the group level. In this paper we focus on the problem of learning classifiers to make predictions at the instance level. To achieve this we propose a new objective function that encourages smoothness of inferred instance-level labels based on instance-level similarity, while at the same time respecting group-level label constraints. We apply this approach to the problem of predicting labels for sentences given labels for reviews, using a convolutional neural network to infer sentence similarity. The approach is evaluated using three large review data sets from IMDB, Yelp, and Amazon, and we demonstrate the proposed approach is both accurate and scalable compared to various alternatives.
【Keywords】: deep learning; multi-instance learning; sentiment analysis; unsupervised learning
【Paper Link】 【Pages】:607-616
【Authors】: Srijan Kumar ; Francesca Spezzano ; V. S. Subrahmanian
【Abstract】: We study the problem of detecting vandals on Wikipedia before any human or known vandalism detection system reports them, flagging potential vandals so that such users can be presented early to Wikipedia administrators. We leverage multiple classical ML approaches, but develop three novel sets of features. Our Wikipedia Vandal Behavior (WVB) approach uses a novel set of user editing patterns as features to classify some users as vandals. Our Wikipedia Transition Probability Matrix (WTPM) approach uses a set of features derived from a transition probability matrix and then reduces it via a neural net auto-encoder to classify some users as vandals. The VEWS approach merges the previous two approaches. Without using any information (e.g., reverts) provided by other users, these algorithms each have over 85% classification accuracy. Moreover, when temporal recency is considered, accuracy rises to almost 90%. We carry out detailed experiments on a new data set we have created consisting of about 33K Wikipedia users (including both a black list and a white list of editors) and containing 770K edits. We describe specific behaviors that distinguish between vandals and non-vandals. We show that VEWS beats ClueBot NG and STiki, the best known algorithms today for vandalism detection. Moreover, VEWS detects far more vandals than ClueBot NG and, on average, detects them 2.39 edits before ClueBot NG when both detect the vandal. However, we show that the combination of VEWS and ClueBot NG can give a fully automated vandal early warning system with even higher accuracy.
【Keywords】: behavior modeling; early detection; vandal detection; wikipedia
【Paper Link】 【Pages】:617-626
【Authors】: Chia-Tung Kuo ; Xiang Wang ; Peter B. Walker ; Owen T. Carmichael ; Jieping Ye ; Ian Davidson
【Abstract】: The analysis of data represented as graphs is common, with wide-ranging applications from social networks to medical imaging. A popular analysis is to cut the graph so that the disjoint subgraphs can represent communities (for social networks) or background and foreground cognitive activity (for medical imaging). An emerging setting is when multiple data sets (graphs) exist, which opens up the opportunity for many new questions. In this paper we study two such questions: i) for a collection of graphs, find a single cut that is good for all the graphs; and ii) for two collections of graphs, find a single cut that is good for one collection but poor for the other. We show that existing formulations of multiview, consensus and alternative clustering cannot address these questions, and instead we provide novel formulations in the spectral clustering framework. We evaluate our approaches on functional magnetic resonance imaging (fMRI) data to address questions such as: "What common cognitive network does this group of individuals have?" and "What are the differences in the cognitive networks of these two groups?" We obtain useful results without the need for strong domain knowledge.
【Keywords】: application; fmri; graph cuts
【Paper Link】 【Pages】:627-634
【Authors】: Chao Lan ; Jun Huan
【Abstract】: In semi-supervised multi-view learning, unlabeled sample complexity (u.s.c.) specifies the size of the unlabeled training sample that guarantees a desired learning error. In this paper, we improve the state-of-the-art u.s.c. from O(1/ε) to O(log 1/ε) for small error ε, under mild conditions. To obtain the improved result, as a primary step we prove a connection between the generalization error of a classifier and its incompatibility, which measures the fitness between the classifier and the sample distribution. We then prove that with a sufficiently large unlabeled sample, one is able to find classifiers with low incompatibility. Combining the two observations, we prove a probably approximately correct (PAC) style learning bound for semi-supervised multi-view learning. We empirically verified our theory by designing two proof-of-concept multi-view learning algorithms, one based on active view sensing and the other based on online co-regularization, using real-world data sets.
【Keywords】: multi-view learning; sample complexity; semi-supervised learning
【Paper Link】 【Pages】:635-644
【Authors】: Jaewoo Lee ; Yue Wang ; Daniel Kifer
【Abstract】: When analyzing data that has been perturbed for privacy reasons, one is often concerned about its usefulness. Recent research on differential privacy has shown that the accuracy of many data queries can be improved by post-processing the perturbed data to ensure consistency constraints that are known to hold for the original data. Most prior work converted this post-processing step into a least squares minimization problem with customized efficient solutions. While improving accuracy, this approach ignored the noise distribution in the perturbed data. In this paper, to further improve accuracy, we formulate this post-processing step as a constrained maximum likelihood estimation problem, which is equivalent to constrained L1 minimization. Instead of relying on slow linear program solvers, we present a faster generic recipe (based on ADMM) that is suitable for a wide variety of applications including differentially private contingency tables, histograms, and the matrix mechanism (linear queries). An added benefit of our formulation is that it can often take direct advantage of algorithmic tricks used by the prior work on least-squares post-processing. An extensive set of experiments on various datasets demonstrates that this approach significantly improves accuracy over prior work.
【Keywords】: admm; differential privacy; post-processing; ppdm
【Paper Link】 【Pages】:645-654
【Authors】: Siyu Lei ; Silviu Maniu ; Luyi Mo ; Reynold Cheng ; Pierre Senellart
【Abstract】: Social networks are commonly used for marketing purposes. For example, free samples of a product can be given to a few influential social network users (or seed nodes), with the hope that they will convince their friends to buy it. One way to formalize this objective is through the problem of influence maximization (or IM), whose goal is to find the best seed nodes to activate under a fixed budget, so that the number of people who get influenced in the end is maximized. Solutions to IM rely on the influence probability with which a user influences another. However, this probability information may be unavailable or incomplete. In this paper, we study IM in the absence of complete information on influence probability. We call this problem Online Influence Maximization (OIM), since we learn influence probabilities at the same time we run influence campaigns. To solve OIM, we propose a multiple-trial approach, where (1) some seed nodes are selected based on existing influence information; (2) an influence campaign is started with these seed nodes; and (3) user feedback is used to update influence information. We adopt Explore-Exploit strategies, which can select seed nodes using either the current influence probability estimation (exploit), or the confidence bound on the estimation (explore). Any existing IM algorithm can be used in this framework. We also develop an incremental algorithm that can significantly reduce the overhead of handling user feedback information. Our experiments show that our solution is more effective than traditional IM methods when only partial influence information is available.
【Keywords】: influence maximization; multi-armed bandits; reinforcement learning; uncertain databases
【Paper Link】 【Pages】:655-664
【Authors】: Liangyue Li ; Hanghang Tong
【Abstract】: Understanding the dynamic mechanisms that drive the high-impact scientific work (e.g., research papers, patents) is a long-debated research topic and has many important implications, ranging from personal career development and recruitment search, to the jurisdiction of research resources. Recent advances in characterizing and modeling scientific success have made it possible to forecast the long-term impact of scientific work, where data mining techniques, supervised learning in particular, play an essential role. Despite much progress, several key algorithmic challenges in relation to predicting long-term scientific impact have largely remained open. In this paper, we propose a joint predictive model to forecast the long-term scientific impact at the early stage, which simultaneously addresses a number of these open challenges, including the scholarly feature design, the non-linearity, the domain-heterogeneity and dynamics. In particular, we formulate it as a regularized optimization problem and propose effective and scalable algorithms to solve it. We perform extensive empirical evaluations on large, real scholarly data sets to validate the effectiveness and the efficiency of our method.
【Keywords】: joint predictive model; long term impact prediction
【Paper Link】 【Pages】:665-674
【Authors】: Ping Li
【Abstract】: We develop 0-bit consistent weighted sampling (CWS) for efficiently estimating the min-max kernel, which is a generalization of the resemblance kernel originally designed for binary data. Because the estimator of 0-bit CWS constitutes a positive definite kernel, this method can be naturally applied to large-scale data mining problems. Basically, if we feed the sampled data from 0-bit CWS to a highly efficient linear classifier (e.g., linear SVM), we effectively (and approximately) train a nonlinear classifier based on the min-max kernel. The accuracy improves as we increase the sample size. In this paper, we first demonstrate, through an extensive classification study using kernel machines, that the min-max kernel often provides an effective measure of similarity for nonnegative data. This helps justify the use of the min-max kernel. However, as the min-max kernel is nonlinear and might be difficult to use for industrial applications with massive data, we propose to linearize the min-max kernel via 0-bit CWS, a simplification of the original CWS method. The previous remarkable work on consistent weighted sampling (CWS) produces samples in the form of (i, t), where i records the location (and in fact also the weight) information, analogous to the samples produced by classical minwise hashing on binary data. Because t is theoretically unbounded, it was not immediately clear how to effectively implement CWS for building large-scale linear classifiers. We provide a simple solution by discarding t* (which we refer to as the "0-bit" scheme). Via an extensive empirical study, we show that this 0-bit scheme does not lose essential information. We then apply 0-bit CWS to build linear classifiers that approximate min-max kernel classifiers, as extensively validated on a wide range of public datasets. We expect this work will generate interest among data mining practitioners who would like to efficiently utilize the nonlinear information of non-binary and nonnegative data.
【Keywords】: big data; hashing; learning
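The min-max kernel being linearized is simple to state for nonnegative vectors; the sketch below computes it exactly (0-bit CWS itself, which keeps only the argmin index of each consistent weighted sample, is not shown):

import numpy as np

def min_max_kernel(x, y):
    # K(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i), for nonnegative x, y.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

x = np.array([0.0, 2.0, 1.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 1.0])
print(min_max_kernel(x, y))     # 3 / 7 = 0.4285...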
【Paper Link】 【Pages】:675-684
【Authors】: Yaliang Li ; Qi Li ; Jing Gao ; Lu Su ; Bo Zhao ; Wei Fan ; Jiawei Han
【Abstract】: In the era of big data, information regarding the same objects can be collected from increasingly more sources. Unfortunately, there usually exist conflicts among the information coming from different sources. To tackle this challenge, truth discovery, i.e., to integrate multi-source noisy information by estimating the reliability of each source, has emerged as a hot topic. In many real world applications, however, the information may come sequentially, and as a consequence, the truth of objects as well as the reliability of sources may be dynamically evolving. Existing truth discovery methods, unfortunately, cannot handle such scenarios. To address this problem, we investigate the temporal relations among both object truths and source reliability, and propose an incremental truth discovery framework that can dynamically update object truths and source weights upon the arrival of new data. Theoretical analysis is provided to show that the proposed method is guaranteed to converge at a fast rate. The experiments on three real world applications and a set of synthetic data demonstrate the advantages of the proposed method over state-of-the-art truth discovery methods.
【Keywords】: dynamic data; source reliability; truth discovery
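For context, the classical batch iteration that incremental truth discovery builds on alternates between estimating truths and re-weighting sources (a generic sketch for continuous claims, not the paper's incremental update rule; the weighting formula is an illustrative choice):

import numpy as np

def truth_discovery(claims, iters=20):
    # claims: (n_sources, n_objects) matrix of reported values.
    # Alternate (1) truths = reliability-weighted average of claims and
    # (2) source weights that shrink as a source's total error grows.
    n_sources = claims.shape[0]
    w = np.ones(n_sources) / n_sources
    for _ in range(iters):
        truths = (w[:, None] * claims).sum(axis=0) / w.sum()
        err = ((claims - truths) ** 2).sum(axis=1) + 1e-12
        w = np.maximum(-np.log(err / err.sum()), 1e-12)
    return truths, w / w.sum()

rng = np.random.default_rng(0)
truth = rng.normal(size=10)
claims = truth + rng.normal(scale=[[0.1], [0.1], [2.0]], size=(3, 10))  # third source is noisy
_, weights = truth_discovery(claims)
print(np.round(weights, 3))     # the noisy source ends up with the smallest weight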
【Paper Link】 【Pages】:685-694
【Authors】: Yongsub Lim ; U. Kang
【Abstract】: How can we estimate local triangle counts accurately in a graph stream without storing the whole graph? The local triangle counting which counts triangles for each node in a graph is a very important problem with wide applications in social network analysis, anomaly detection, web mining, etc. In this paper, we propose MASCOT, a memory-efficient and accurate method for local triangle estimation in a graph stream based on edge sampling. To develop MASCOT, we first present two naive local triangle counting algorithms in a graph stream: MASCOT-C and MASCOT-A. MASCOT-C is based on constant edge sampling, and MASCOT-A improves its accuracy by utilizing more memory spaces. MASCOT achieves both accuracy and memory-efficiency of the two algorithms by an unconditional triangle counting for a new edge, regardless of whether it is sampled or not. In contrast to the existing algorithm which requires prior knowledge on the target graph and appropriately set parameters, MASCOT requires only one simple parameter, the edge sampling probability. Through extensive experiments, we show that for the same number of edges sampled, MASCOT provides the best accuracy compared to the existing algorithm as well as MASCOT-C and MASCOT-A. Thanks to MASCOT, we also discover interesting anomalous patterns in real graphs, like core-peripheries in the web and ambiguous author names in DBLP.
【Keywords】: anomaly detection; edge sampling; graph stream mining; local triangle counting
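A sketch of the unconditional-counting idea described in the abstract (a simplified global-count version with a single parameter, the edge sampling probability p; not the MASCOT code itself):

import random
from collections import defaultdict

def stream_triangle_estimate(edge_stream, p=0.3, seed=0):
    # For each arriving edge (u, v), first count the triangles it closes with
    # already-sampled edges and credit 1/p^2 per closed wedge (each wedge was
    # retained with probability p^2); only then sample the edge itself.
    random.seed(seed)
    nbrs = defaultdict(set)                    # adjacency of the sampled subgraph
    estimate = 0.0
    for u, v in edge_stream:
        estimate += len(nbrs[u] & nbrs[v]) / (p * p)
        if random.random() < p:
            nbrs[u].add(v)
            nbrs[v].add(u)
    return estimate

clique = [(i, j) for i in range(30) for j in range(i + 1, 30)]  # 4060 triangles
print(round(stream_triangle_estimate(clique)))                  # roughly 4060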
【Paper Link】 【Pages】:695-704
【Authors】: Su-Chen Lin ; Shou-De Lin ; Ming-Syan Chen
【Abstract】: Considering that nowadays companies providing similar products or services compete with each other for resources and customers, this work proposes a learning-based framework to tackle the multi-round competitive influence maximization problem on a social network. We propose a data-driven model leveraging the concept of meta-learning to maximize the expected influence in the long run. Our model considers not only the network information but also the opponent's strategy while making a decision. It maximizes the total influence at the end of the process instead of myopically pursuing short-term gain. We propose solutions for scenarios in which the opponent's strategy is known or unknown, and available or unavailable for training. We also show how an effective framework can be trained without manually labeled data, and conduct several experiments to verify the effectiveness of the whole process.
【Keywords】: influence maximization; reinforcement learning; social network
【Paper Link】 【Pages】:705-714
【Authors】: Chuanren Liu ; Fei Wang ; Jianying Hu ; Hui Xiong
【Abstract】: The rapid growth in the development of healthcare information systems has led to an increased interest in utilizing the patient Electronic Health Records (EHR) for assisting disease diagnosis and phenotyping. The patient EHRs are generally longitudinal and naturally represented as medical event sequences, where the events include clinical notes, problems, medications, vital signs, laboratory reports, etc. The longitudinal and heterogeneous properties make EHR analysis an inherently difficult challenge. To address this challenge, in this paper, we develop a novel representation, namely the temporal graph, for such event sequences. The temporal graph is informative for a variety of challenging analytic tasks, such as predictive modeling, since it can capture temporal relationships of the medical events in each event sequence. By summarizing the longitudinal data, the temporal graphs are also robust and resistant to noisy and irregular observations. Based on the temporal graph representation, we further develop an approach for temporal phenotyping to identify the most significant and interpretable graph basis as phenotypes. This helps us better understand the disease evolving patterns. Moreover, by expressing the temporal graphs with the phenotypes, the expressing coefficients can be used for applications such as personalized medicine, disease diagnosis, and patient segmentation. Our temporal phenotyping framework is also flexible to incorporate semi-supervised/supervised information. Finally, we validate our framework on two real-world tasks. One is predicting the onset risk of heart failure. Another is predicting the risk of heart failure related hospitalization for patients with COPD pre-condition. Our results show that the diagnosis performance in both tasks can be improved significantly by the proposed approaches. Also, we illustrate some interesting phenotypes derived from the data.
【Keywords】: electronic health records; regularization; temporal graph; temporal phenotyping
【Paper Link】 【Pages】:715-724
【Authors】: Hongfu Liu ; Tongliang Liu ; Junjie Wu ; Dacheng Tao ; Yun Fu
【Abstract】: Ensemble clustering, also known as consensus clustering, is emerging as a promising solution for multi-source and/or heterogeneous data clustering. The co-association matrix based method, which redefines the ensemble clustering problem as a classical graph partition problem, is a landmark method in this area. Nevertheless, its relatively high time and space complexity preclude it from real-life large-scale data clustering. We therefore propose SEC, an efficient Spectral Ensemble Clustering method based on the co-association matrix. We show that SEC is theoretically equivalent to weighted K-means clustering, which results in vastly reduced algorithmic complexity. We then derive the latent consensus function of SEC, which to the best of our knowledge is among the first results to bridge co-association matrix based methods to methods with explicit objective functions. The robustness and generalizability of SEC are then investigated to prove the superiority of SEC in theory. We finally extend SEC to meet the challenge arising from incomplete basic partitions, based on which a scheme for big data clustering can be formed. Experimental results on various real-world data sets demonstrate that SEC is an effective and efficient competitor to some state-of-the-art ensemble clustering methods and is also suitable for big data clustering.
【Keywords】: co-association matrix; ensemble clustering; k-means
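The co-association matrix that SEC starts from is easy to construct (a minimal sketch; SEC's point is precisely that this n x n matrix need not be materialized, thanks to the equivalence with weighted K-means):

import numpy as np

def co_association(partitions):
    # partitions: list of cluster-label arrays, each of length n.
    # C[i, j] = fraction of base partitions placing i and j in the same cluster.
    P = np.asarray(partitions)                       # (n_partitions, n)
    return (P[:, :, None] == P[:, None, :]).mean(axis=0)

base = [np.array([0, 0, 1, 1, 2, 2]),
        np.array([0, 0, 0, 1, 1, 1]),
        np.array([1, 1, 2, 2, 2, 0])]
print(np.round(co_association(base), 2))             # feed this to a graph-partitioning step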
【Paper Link】 【Pages】:725-734
【Authors】: Felipe Llinares-López ; Mahito Sugiyama ; Laetitia Papaxanthos ; Karsten M. Borgwardt
【Abstract】: We present a novel algorithm for significant pattern mining, Westfall-Young light. The target patterns are statistically significantly enriched in one of two classes of objects. Our method corrects for multiple hypothesis testing and correlations between patterns via the Westfall-Young permutation procedure, which empirically estimates the null distribution of pattern frequencies in each class via permutations. In our experiments, Westfall-Young light dramatically outperforms the current state-of-the-art approach, both in terms of runtime and memory efficiency on popular real-world benchmark datasets for pattern mining. The key to this efficiency is that, unlike all existing methods, our algorithm does not need to solve the underlying frequent pattern mining problem anew for each permutation and does not need to store the occurrence list of all frequent patterns. Westfall-Young light opens the door to significant pattern mining on large datasets that previously involved prohibitive runtime or memory costs. Our code is available from http://www.bsse.ethz.ch/mlcb/research/machine-learning/wylight.html
【Keywords】: multiple hypothesis testing; p-value; significant pattern mining; westfall-young permutation
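A minimal sketch of the underlying Westfall-Young permutation idea, assuming single-item patterns, Fisher's exact test, and random toy data; the authors' Westfall-Young light avoids exactly the repeated per-permutation work this naive version performs.
```python
# Westfall-Young permutation idea in miniature: the FWER-corrected significance
# threshold is the alpha-quantile of the minimum p-value over all candidate
# patterns under random label permutations. Here "patterns" are single items
# and the data is random, so nothing should come out significant.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 20))        # transactions x items
y = rng.integers(0, 2, size=300)              # class labels

def item_pvalue(x, y):
    a = int(np.sum((x == 1) & (y == 1))); b = int(np.sum((x == 1) & (y == 0)))
    c = int(np.sum((x == 0) & (y == 1))); d = int(np.sum((x == 0) & (y == 0)))
    return fisher_exact([[a, b], [c, d]])[1]

def min_pvalue(X, y):
    return min(item_pvalue(X[:, j], y) for j in range(X.shape[1]))

n_perm, alpha = 200, 0.05
null_min_p = [min_pvalue(X, rng.permutation(y)) for _ in range(n_perm)]
delta = np.quantile(null_min_p, alpha)        # corrected per-pattern threshold
significant = [j for j in range(X.shape[1]) if item_pvalue(X[:, j], y) <= delta]
print(delta, significant)
```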
【Paper Link】 【Pages】:735-744
【Authors】: Brendan Lucier ; Joel Oren ; Yaron Singer
【Abstract】: We consider the task of evaluating the spread of influence in large networks in the well-studied independent cascade model. We describe a novel sampling approach that can be used to design scalable algorithms with provable performance guarantees. These algorithms can be implemented in distributed computation frameworks such as MapReduce. We complement these results with a lower bound on the query complexity of influence estimation in this model. We validate the performance of these algorithms through experiments that demonstrate the efficacy of our methods and related heuristics.
【Keywords】: distributed computation; influence maximization; mapreduce; social networks; submodular
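For reference, the plain Monte Carlo estimator of expected spread under the independent cascade model is sketched below; the toy graph, seed set, and uniform edge probability are assumptions, and the paper's contribution is a sampling scheme that scales far better than this naive repetition of full simulations.
```python
# Plain Monte Carlo estimation of expected spread in the independent cascade
# model; the toy graph, seed set, and uniform edge probability are assumptions.
import random
from collections import deque

def simulate_ic(graph, seeds, p):
    """One cascade: each newly activated node tries to activate each inactive
    out-neighbor independently with probability p."""
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in graph.get(u, []):
            if v not in active and random.random() < p:
                active.add(v)
                frontier.append(v)
    return len(active)

def estimate_spread(graph, seeds, p=0.1, n_sim=1000):
    return sum(simulate_ic(graph, seeds, p) for _ in range(n_sim)) / n_sim

graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: [4], 4: []}
random.seed(0)
print(estimate_spread(graph, seeds=[0], p=0.3))
```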
【Paper Link】 【Pages】:745-754
【Authors】: Fenglong Ma ; Yaliang Li ; Qi Li ; Minghui Qiu ; Jing Gao ; Shi Zhi ; Lu Su ; Bo Zhao ; Heng Ji ; Jiawei Han
【Abstract】: In crowdsourced data aggregation tasks, there exist conflicts in the answers provided by large numbers of sources on the same set of questions. The most important challenge for this task is to estimate source reliability and select answers that are provided by high-quality sources. Existing work solves this problem by simultaneously estimating sources' reliability and inferring questions' true answers (i.e., the truths). However, these methods assume that a source has the same reliability degree on all the questions, ignoring the fact that a source's reliability may vary significantly among different topics. To capture various expertise levels on different topics, we propose FaitCrowd, a fine-grained truth discovery model for the task of aggregating conflicting data collected from multiple users/sources. FaitCrowd jointly models the process of generating question content and sources' provided answers in a probabilistic model to estimate both topical expertise and true answers simultaneously. This leads to a more precise estimation of source reliability. Therefore, FaitCrowd demonstrates better ability to obtain true answers for the questions compared with existing approaches. Experimental results on two real-world datasets show that FaitCrowd can significantly reduce the error rate of aggregation compared with state-of-the-art multi-source aggregation approaches due to its ability to learn topical expertise from question content and collected answers.
【Keywords】: crowdsourcing; source reliability; truth discovery
【Paper Link】 【Pages】:755-764
【Authors】: Mohammad Mahdian ; Okke Schrijvers ; Sergei Vassilvitskii
【Abstract】: We study the problem of selecting a set of points of interest (POIs) to show on a map. We begin with a formal model of the setting, noting that the utility of a POI may be discounted by (i) the presence of competing businesses nearby as well as (ii) its position in the set of establishments ordered by distance from the user. We present simple, approximately optimal selection algorithms, coupled with incentive compatible pricing schemes in case of advertiser supplied points of interest. Finally, we evaluate our algorithms on real data sets and show that they outperform simple baselines.
【Keywords】: advertising on maps; algorithmic cartography; negative externalities; poi placement; points of interest on maps
【Paper Link】 【Pages】:765-774
【Authors】: Qi Mao ; Li Wang ; Steve Goodison ; Yijun Sun
【Abstract】: We present a new dimensionality reduction setting for a large family of real-world problems. Unlike traditional methods, the new setting aims to explicitly represent and learn an intrinsic structure from data in a high-dimensional space, which can greatly facilitate data visualization and scientific discovery in downstream analysis. We propose a new dimensionality-reduction framework that involves the learning of a mapping function that projects data points in the original high-dimensional space to latent points in a low-dimensional space that are then used directly to construct a graph. Local geometric information of the projected data is naturally captured by the constructed graph. As a showcase, we develop a new method to obtain a discriminative and compact feature representation for clustering problems. In contrast to assumptions used in traditional clustering methods, we assume that centers of clusters should be close to each other if they are connected in a learned graph, and other cluster centers should be distant. Extensive experiments are performed that demonstrate that the proposed method is able to obtain discriminative feature representations yielding superior clustering performance, and correctly recover the intrinsic structures of various real-world datasets including curves, hierarchies and a cancer progression path.
【Keywords】: clustering; dimensionality reduction; graph structure learning; unsupervised learning
【Paper Link】 【Pages】:775-784
【Authors】: William B. March ; Bo Xiao ; Sameer Tharakan ; Chenhan D. Yu ; George Biros
【Abstract】: Since exact evaluation of a kernel matrix requires O(N^2) work, scalable learning algorithms using kernels must approximate the kernel matrix. This approximation must be robust to the kernel parameters, for example the bandwidth for the Gaussian kernel. We consider two approximation methods: Nystrom and an algebraic treecode developed in our group. Nystrom methods construct a global low-rank approximation of the kernel matrix. Treecodes approximate just the off-diagonal blocks, typically using a hierarchical decomposition. We present a theoretical error analysis of our treecode and relate it to the error of Nystrom methods. Our analysis reveals how the block-rank structure of the kernel matrix controls the performance of the treecode. We evaluate our treecode by comparing it to the classical Nystrom method and a state-of-the-art fast approximate Nystrom method. We test the kernel matrix approximation accuracy for several different bandwidths and datasets. On the MNIST2M dataset (2M points in 784 dimensions) for a Gaussian kernel with bandwidth h=1, the Nystrom methods' error is over 90% whereas our treecode delivers error less than 1%. We also test the performance of the three methods on binary classification using two models: a Bayes classifier and kernel ridge regression. Our evaluation reveals the existence of bandwidth values that should be examined in cross-validation but whose corresponding kernel matrices cannot be approximated well by Nystrom methods. In contrast, the treecode scheme performs much better for these values.
【Keywords】: bandwidth selection; kernel regression; kernel summations; nystrom methods; treecodes
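A minimal sketch of the classical Nystrom approximation that the treecode is compared against: sample m landmark points and form the rank-m approximation K ≈ C W⁺ Cᵀ. The Gaussian-kernel data, bandwidth h, and landmark count are illustrative assumptions.
```python
# Classical Nystrom low-rank kernel approximation, K ~= C pinv(W) C^T, on toy
# Gaussian-kernel data; bandwidth h and landmark count m are assumptions.
import numpy as np

def gaussian_kernel(A, B, h):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * h**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
h, m = 1.0, 100

idx = rng.choice(len(X), size=m, replace=False)      # landmark sample
C = gaussian_kernel(X, X[idx], h)                    # N x m
W = gaussian_kernel(X[idx], X[idx], h)               # m x m
K_approx = C @ np.linalg.pinv(W) @ C.T               # rank-m approximation

# Relative error checked on a random block to avoid forming the full exact K.
sub = rng.choice(len(X), size=300, replace=False)
K_sub = gaussian_kernel(X[sub], X[sub], h)
err = np.linalg.norm(K_sub - K_approx[np.ix_(sub, sub)]) / np.linalg.norm(K_sub)
print(f"relative error on sampled block: {err:.3f}")
```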
【Paper Link】 【Pages】:785-794
【Authors】: Julian J. McAuley ; Rahul Pandey ; Jure Leskovec
【Abstract】: To design a useful recommender system, it is important to understand how products relate to each other. For example, while a user is browsing mobile phones, it might make sense to recommend other phones, but once they buy a phone, we might instead want to recommend batteries, cases, or chargers. In economics, these two types of recommendations are referred to as substitutes and complements: substitutes are products that can be purchased instead of each other, while complements are products that can be purchased in addition to each other. Such relationships are essential as they help us to identify items that are relevant to a user's search. Our goal in this paper is to learn the semantics of substitutes and complements from the text of online reviews. We treat this as a supervised learning problem, trained using networks of products derived from browsing and co-purchasing logs. Methodologically, we build topic models that are trained to automatically discover topics from product reviews that are successful at predicting and explaining such relationships. Experimentally, we evaluate our system on the Amazon product catalog, a large dataset consisting of 9 million products, 237 million links, and 144 million reviews.
【Keywords】: link prediction; recommender systems; substitutes and complements; topic models
【Paper Link】 【Pages】:805-814
【Authors】: Bryan Minor ; Janardhan Rao Doppa ; Diane J. Cook
【Abstract】: We consider a novel problem called Activity Prediction, where the goal is to predict future activity occurrence times from sensor data. In this paper, we make three main contributions. First, we formulate and solve the activity prediction problem in the framework of imitation learning and reduce it to a simple regression learning problem. This approach allows us to leverage powerful regression learners; is easy to implement; and can reason about the relational and temporal structure of the problem with negligible computational overhead. Second, we present several evaluation metrics for a given activity predictor and discuss their pros and cons in the context of real-world applications. Third, we evaluate our approach using real sensor data collected from 24 smart home testbeds. We also embed the learned predictor into a mobile-device-based activity prompter and evaluate the app on multiple participants living in smart homes. Our experimental results indicate that the activity predictor learned with our approach performs better than the baseline methods, and offers a simple and reliable approach to predicting activities from sensor data.
【Keywords】: activity prediction; digital prompting; regression learning; smart environments
【Paper Link】 【Pages】:815-824
【Authors】: Michael Mitzenmacher ; Jakub Pachocki ; Richard Peng ; Charalampos E. Tsourakakis ; Shen Chen Xu
【Abstract】: Extracting dense subgraphs from large graphs is a key primitive in a variety of graph mining applications, ranging from mining social networks and the Web graph to bioinformatics [41]. In this paper we focus on a family of poly-time solvable formulations, known as the k-clique densest subgraph problem (k-Clique-DSP) [57]. When k=2, the problem becomes the well-known densest subgraph problem (DSP) [22, 31, 33, 39]. Our main contribution is a sampling scheme that gives a densest-subgraph sparsifier, yielding a randomized algorithm that produces high-quality approximations while providing significant speedups and improved space complexity. We also extend this family of formulations to bipartite graphs by introducing the (p,q)-biclique densest subgraph problem ((p,q)-Biclique-DSP), and devise an exact algorithm that can treat both clique and biclique densities in a unified way. As an example of performance, our sparsifying algorithm extracts the 5-clique densest subgraph --which is a large near-clique on 62 vertices-- from a large collaboration network. Our algorithm achieves 100% accuracy over five runs, while achieving an average speedup factor of over 10,000. Specifically, we reduce the running time from ∼2 × 10^7 seconds to an average running time of 0.15 seconds. We also use our methods to study how the k-clique densest subgraphs change as a function of time in time-evolving networks for various small values of k. We observe significant deviations between the experimental findings on real-world networks and stochastic Kronecker graphs, a random graph model that mimics real-world networks in certain aspects. We believe that our work is a significant advance in routines with rigorous theoretical guarantees for scalable extraction of large near-cliques from networks.
【Keywords】: dense subgraphs; graph mining; near-clique extraction
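To make the k=2 density objective concrete, here is a sketch of the classical greedy peeling 2-approximation for the densest subgraph problem: repeatedly delete a minimum-degree vertex and keep the densest intermediate subgraph. It is not the sparsification scheme proposed in the paper, and the toy graph is an assumption; a quadratic-time version is used for clarity.
```python
# Classical greedy peeling 2-approximation for the k=2 case (densest subgraph).
def densest_subgraph_peel(adj):
    """adj: dict node -> set of neighbors (undirected graph)."""
    nodes = set(adj)
    edges = sum(len(adj[u]) for u in nodes) // 2
    best_density, best_nodes = edges / len(nodes), set(nodes)
    while len(nodes) > 1:
        u = min(nodes, key=lambda x: len(adj[x] & nodes))   # min-degree vertex
        edges -= len(adj[u] & nodes)
        nodes.remove(u)
        density = edges / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
    return best_nodes, best_density

adj = {1: {2, 3, 4, 5}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}, 5: {1}}
print(densest_subgraph_peel(adj))    # the 4-clique {1, 2, 3, 4}, density 1.5
```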
【Paper Link】 【Pages】:825-834
【Authors】: Davide Mottin ; Francesco Bonchi ; Francesco Gullo
【Abstract】: We study a problem of graph-query reformulation enabling explorative query-driven discovery in graph databases. Given a query issued by the user, the system, apart from returning the result patterns, also proposes a number of specializations (i.e., supergraphs) of the original query to facilitate the exploration of the results. We formalize the problem of finding a set of reformulations of the input query by maximizing a linear combination of coverage (of the original query's answer set) and diversity among the specializations. We prove that our problem is hard, but also that a simple greedy algorithm achieves a (1/2)-approximation guarantee. The most challenging step of the greedy algorithm is the computation of the specialization that brings the maximum increment to the objective function. To efficiently solve this step, we show how to compute the objective-function increment of a specialization linearly in the number of its results and derive an upper bound that we exploit to devise an efficient search-space visiting strategy. An extensive evaluation on real and synthetic databases attests high efficiency and accuracy of our proposal.
【Keywords】: graph databases; graph queries; query reformulation
【Paper Link】 【Pages】:835-844
【Authors】: Jingchao Ni ; Hanghang Tong ; Wei Fan ; Xiang Zhang
【Abstract】: Integrating multiple graphs (or networks) has been shown to be a promising approach to improve the graph clustering accuracy. Various multi-view and multi-domain graph clustering methods have recently been developed to integrate multiple networks. In these methods, a network is treated as a view or domain. The key assumption is that there is a common clustering structure shared across all views (domains), and different views (domains) provide compatible and complementary information on this underlying clustering structure. However, in many emerging real-life applications, different networks have different data distributions, where the assumption that all networks share a single common clustering structure does not hold. In this paper, we propose a flexible and robust framework that allows multiple underlying clustering structures across different networks. Our method models the domain similarity as a network, which can be utilized to regularize the clustering structures in different networks. We refer to such a data model as a network of networks (NoN). We develop NoNClus, a novel method based on non-negative matrix factorization (NMF), to cluster an NoN. We provide rigorous theoretical analysis of NoNClus in terms of its correctness, convergence and complexity. Extensive experimental results on synthetic and real-life datasets show the effectiveness of our method.
【Keywords】: graph clustering; network of networks
【Paper Link】 【Pages】:845-854
【Authors】: Kirill Nikolaev ; Alexey Drutsa ; Ekaterina Gladkikh ; Alexander Ulianov ; Gleb Gusev ; Pavel Serdyukov
【Abstract】: Nowadays, the development of most leading web services is controlled by online experiments that qualify and quantify the steady stream of their updates. The challenging problem is to define an appropriate online metric of user behavior, a so-called Overall Evaluation Criterion (OEC), which is both interpretable and sensitive. The state-of-the-art approach is to choose a type of entities to observe in the behavior data, to define a key metric for these observations, and to estimate the average value of this metric over the observations in each of the system versions. A significant disadvantage of the OEC obtained in this way is that the average value of the key metric does not necessarily change, even if its distribution changes significantly. The reason is that the difference between the mean values of the key metric over the two variants of the system does not necessarily reflect the character of the change in the distribution. We develop a novel method of quantifying the change in the distribution of the key metric, which (1) is interpretable and (2) is based on the analysis of the two distributions as a whole and, for this reason, is sensitive to more ways in which the two distributions may actually differ. We provide a thorough theoretical analysis of our approach and show experimentally that, other things being equal, it produces a more sensitive OEC than the average.
【Keywords】: a/b test; distribution decomposition; effect variable
【Paper Link】 【Pages】:855-864
【Authors】: Nozomi Nori ; Hisashi Kashima ; Kazuto Yamashita ; Hiroshi Ikai ; Yuichi Imanaka
【Abstract】: Acute hospital care as performed in the intensive care unit (ICU) is characterized by its frequent, but short-term interventions for patients who are severely ill. Because clinicians have to attend to more than one patient at a time and make decisions in a limited time in acute hospital care environments, accurate prediction of the in-hospital mortality risk could help them pay more attention to patients with a higher in-hospital mortality risk, thereby improving the quality and efficiency of the care. One of the salient features of the ICU is the diversity of patients: clinicians are faced with patients who have a wide variety of diseases. However, mortality prediction for ICU patients has typically been conducted by building one common predictive model for all the diseases. In this paper, we incorporate disease-specific contexts into mortality modeling by formulating the mortality prediction problem as a multi-task learning problem in which a task corresponds to a disease. Our method effectively integrates medical domain knowledge relating to the similarity among diseases and the similarity among Electronic Health Records (EHRs) into a data-driven approach by incorporating graph Laplacians into the regularization term to encode these similarities. The experimental results on a real dataset from a hospital corroborate the effectiveness of the proposed method. Our method improved the AUC over several baselines, including logistic regression without multi-task learning and several multi-task learning methods that do not incorporate the domain knowledge. In addition, we illustrate some interesting results pertaining to disease-specific predictive features, some of which are not only consistent with existing medical domain knowledge, but also contain suggestive hypotheses that could be validated by further investigations in the medical domain.
【Keywords】: health-care; mortality risk prediction; multi-task learning
【Paper Link】 【Pages】:865-874
【Authors】: Jinoh Oh ; Wook-Shin Han ; Hwanjo Yu ; Xiaoqian Jiang
【Abstract】: Matrix factorization is one of the fundamental techniques for analyzing the latent relationship between two entities. In particular, it is used for recommendation because of its high accuracy. Efficient parallel SGD matrix factorization algorithms have been developed for large matrices to speed up the convergence of factorization. However, most of them are designed for a shared-memory environment and thus fail to factorize a large matrix that is too big to fit in memory, and their performance is also unreliable when the matrix is skewed. This paper proposes a fast and robust parallel SGD matrix factorization algorithm, called MLGF-MF, which is robust to skewed matrices and runs efficiently on block-storage devices (e.g., SSD disks) as well as shared memory. MLGF-MF uses the Multi-Level Grid File (MLGF) for partitioning the matrix and minimizes the cost of scheduling parallel SGD updates on the partitioned regions by exploiting partial-match query processing. Thereby, MLGF-MF produces reliable results efficiently even on skewed matrices. MLGF-MF is designed with asynchronous I/O permeated in the algorithm such that the CPU keeps executing without waiting for I/O to complete. Thereby, MLGF-MF overlaps the CPU and I/O processing, which eventually offsets the I/O cost and maximizes the CPU utility. Recent flash SSD disks support high-performance parallel I/O and thus are appropriate for executing the asynchronous I/O. In our extensive evaluations, MLGF-MF significantly outperforms (or converges faster than) the state-of-the-art algorithms in both shared-memory and block-storage environments. In addition, the outputs of MLGF-MF are significantly more robust to skewed matrices. Our implementation of MLGF-MF is available at http://dm.postech.ac.kr/MLGF-MF as executable files.
【Keywords】: matrix factorization; stochastic gradient descent
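A minimal serial sketch of the SGD matrix-factorization primitive that MLGF-MF parallelizes and makes I/O-aware; the rank, learning rate, regularization, and toy rating triples are illustrative assumptions rather than the paper's settings.
```python
# Serial SGD matrix factorization on observed (user, item, rating) triples.
import numpy as np

def sgd_mf(ratings, n_users, n_items, rank=8, lr=0.01, reg=0.05, epochs=200):
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, rank))    # user factors
    Q = 0.1 * rng.standard_normal((n_items, rank))    # item factors
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):     # visit entries in random order
            u, i, r = ratings[idx]
            pu = P[u].copy()
            err = r - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)      # gradient step on both factors
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 2.0)]
P, Q = sgd_mf(ratings, n_users=3, n_items=3)
print(np.round(P @ Q.T, 2))           # reconstructed rating matrix
```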
【Paper Link】 【Pages】:875-884
【Authors】: Naoto Ohsaka ; Takanori Maehara ; Ken-ichi Kawarabayashi
【Abstract】: Real-world networks, such as the World Wide Web and online social networks, are very large and are evolving rapidly. Thus tracking personalized PageRank in such evolving networks is an important challenge in network analysis and graph mining. In this paper, we propose an efficient online algorithm for tracking personalized PageRank in an evolving network. The proposed algorithm tracks personalized PageRank accurately (i.e., within a given accuracy ε > 0). Moreover, it can update the personalized PageRank scores in amortized O(1/ε) iterations for each graph modification. In addition, when m edges are randomly and sequentially inserted, the total number of iterations is expected to be O(log m/ε). We evaluated our algorithm on real-world networks. On average, for each edge insertion and deletion, our algorithm updated the personalized PageRank in 3 µs in a web graph with 105M vertices and 3.7B edges, and in 20 ms in a social network with 42M vertices and 1.5B edges. Compared with existing state-of-the-art algorithms, our algorithm is 2--290 times faster at equal accuracy.
【Keywords】: dynamic graphs; online algorithms; personalized pagerank
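For reference, a sketch of static personalized PageRank by power iteration, the quantity the proposed online algorithm maintains incrementally as edges are inserted and deleted; the toy graph, damping factor, and tolerance are assumptions.
```python
# Static personalized PageRank by power iteration.
import numpy as np

def personalized_pagerank(adj, source, alpha=0.15, tol=1e-10, max_iter=200):
    """adj: dict node -> list of out-neighbors; nodes are 0..n-1."""
    n = len(adj)
    ppr = np.full(n, 1.0 / n)
    restart = np.zeros(n)
    restart[source] = 1.0
    for _ in range(max_iter):
        new = alpha * restart
        for u, outs in adj.items():
            if outs:
                share = (1 - alpha) * ppr[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:                              # dangling node: send mass to source
                new[source] += (1 - alpha) * ppr[u]
        if np.abs(new - ppr).sum() < tol:
            return new
        ppr = new
    return ppr

adj = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
print(personalized_pagerank(adj, source=0))
```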
【Paper Link】 【Pages】:885-894
【Authors】: Shota Okumura ; Yoshiki Suzuki ; Ichiro Takeuchi
【Abstract】: We introduce a novel sensitivity analysis framework for large-scale classification problems that can be used when a small number of instances are incrementally added or removed. For quickly updating the classifier in such a situation, incremental learning algorithms have been intensively studied in the literature. Although they are much more efficient than solving the optimization problem from scratch, their computational complexity still depends on the entire training set size. This means that, if the original training set is large, completely solving an incremental learning problem might still be rather expensive. To circumvent this computational issue, we propose a novel framework that allows us to make inferences about the updated classifier without actually re-optimizing it. Specifically, the proposed framework can quickly provide lower and upper bounds on a quantity of the unknown updated classifier. The main advantage of the proposed framework is that the computational cost of computing these bounds depends only on the number of updated instances. This property is quite advantageous in a typical sensitivity analysis task where only a small number of instances are updated. In this paper we demonstrate that the proposed framework is applicable to various practical sensitivity analysis tasks, and that the bounds provided by the framework are often sufficiently tight for making the desired inferences.
【Keywords】: incremental learning; leave-one-out cross-validation; sensitivity analysis
【Paper Link】 【Pages】:895-904
【Authors】: Mingdong Ou ; Peng Cui ; Fei Wang ; Jun Wang ; Wenwu Zhu
【Abstract】: Approximating the semantic similarity between entities in the learned Hamming space is the key for supervised hashing techniques. The semantic similarities between entities are often non-transitive since they could share different latent similarity components. For example, in social networks, we connect with people for various reasons, such as sharing common interests, working in the same company, being alumni and so on. Obviously, these social connections are non-transitive if people are connected due to different reasons. However, existing supervised hashing methods treat the pairwise similarity relationships in a simple and unified way and project data into a single Hamming space, while neglecting that the non-transitive property cannot be adequately captured by a single Hamming space. In this paper, we propose a non-transitive hashing method, namely Multi-Component Hashing (MuCH), to identify the latent similarity components and cope with the non-transitive similarity relationships. MuCH generates multiple hash tables, with each hash table corresponding to a similarity component, and preserves the non-transitive similarities in the different hash tables respectively. Moreover, we propose a similarity measure, called Multi-Component Similarity, which aggregates Hamming similarities in multiple hash tables to capture the non-transitive property of semantic similarity. We conduct extensive experiments on one synthetic dataset and two public real-world datasets (i.e. DBLP and NUS-WIDE). The results clearly demonstrate that the proposed MuCH method significantly outperforms the state-of-the-art hashing methods, especially on search efficiency.
【Keywords】: hashing; non-transitive similarity; similarity components
【Paper Link】 【Pages】:905-914
【Authors】: Pan Chao ; Qiming Huang ; Michael Zhu
【Abstract】: The general goal of multivariate regression analysis is to infer about the relationship between a response variable Y and a predictor vector X. Many popularly used regression methods only focus on specific aspects of this relationship. Even though the conditional distribution P(Y|X) of Y given X fully characterizes the relationship between Y and X, estimation of the conditional density is challenging and becomes quickly infeasible when the dimension of X increases. In this article, we propose to use optimal group transformations as a general approach for exploring the relationship between Y and X. This approach can be considered an automatic procedure to identify the best characteristic of P(Y|X) under which the relationship between Y and X can be fully explored. The emphasis on using group transformations allows the approach to recover intrinsic group structures among the predictors. Furthermore, we develop kernel methods for estimating the optimal group transformations based on cross-covariance and conditional covariance operators. The statistical consistency of the estimates has been established. We refer to the proposed framework and approach as the Optimal Kernel Group Transformation (OKGT) method. Simulation study and real data applications show that the OKGT method is flexible and powerful for the general purpose of high dimensional regression analysis.
【Keywords】: conditional covariance operator; eigen decomposition; kernel method; optimal transformation; regression analysis
【Paper Link】 【Pages】:915-924
【Authors】: Christina Papagiannopoulou ; Grigorios Tsoumakas ; Ioannis Tsamardinos
【Abstract】: This work presents a probabilistic method for enforcing adherence of the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels. In particular we focus on discovering two kinds of relationships among the labels. The first one concerns pairwise positive entailment: pairs of labels, where the presence of one implies the presence of the other in all instances of a dataset. The second concerns exclusion: sets of labels that do not coexist in the same instances of the dataset. These relationships are represented as a deterministic Bayesian network. Marginal probabilities are entered as soft evidence in the network and through probabilistic inference become consistent with the discovered knowledge. Our approach offers robust improvements in mean average precision compared to the standard binary relevance approach across all 12 datasets involved in our experiments. The discovery process helps interesting implicit knowledge to emerge, which could be useful in itself.
【Keywords】: bayesian network; deterministic label relationships; knowledge discovery; label constraints; multi-label learning
【Paper Link】 【Pages】:925-934
【Authors】: Chong Peng ; Zhao Kang ; Huiqing Li ; Qiang Cheng
【Abstract】: A number of machine learning and computer vision problems, such as matrix completion and subspace clustering, require a matrix to be of low-rank. To meet this requirement, most existing methods use the nuclear norm as a convex proxy of the rank function and minimize it. However, the nuclear norm simply adds all nonzero singular values together instead of treating them equally as the rank function does, which may not be a good rank approximation when some singular values are very large. To reduce this undesirable weighting effect, we use a log-determinant function as a non-convex rank approximation which reduces the contributions of large singular values while keeping those of small singular values close to zero. We apply the method of augmented Lagrangian multipliers to optimize this non-convex rank approximation-based objective function and obtain closed-form solutions for all subproblems of minimizing different variables alternatively. The log-determinant low-rank optimization method is used to solve subspace clustering problem, for which we construct an affinity matrix based on the angular information of the low-rank representation to enhance its separability property. Extensive experimental results on face clustering and motion segmentation data demonstrate the effectiveness of the proposed method.
【Keywords】: low-rank representation; nuclear norm; rank approximation; subspace clustering
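A small numeric sketch of the motivation above: on the singular values of a toy matrix, the nuclear norm grows linearly with large singular values while a log-determinant-style surrogate compresses them. The smoothing constant eps and the synthetic matrix are illustrative assumptions, not necessarily the paper's parameterization.
```python
# Nuclear norm vs. a log-determinant-style surrogate on singular values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 40)) @ np.diag(np.linspace(5.0, 0.1, 40))
s = np.linalg.svd(X, compute_uv=False)

eps = 1.0
nuclear_norm = s.sum()                           # treats large values linearly
logdet_surrogate = np.log(1.0 + s / eps).sum()   # damps large singular values
print(len(s), round(nuclear_norm, 2), round(logdet_surrogate, 2))
# A singular value of 100 adds 100 to the nuclear norm but only log(101) ~ 4.6
# to the surrogate, so a few large values no longer dominate the rank proxy.
```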
【Paper Link】 【Pages】:935-944
【Authors】: Abdulhakim Ali Qahtan ; Basma Alharbi ; Suojin Wang ; Xiangliang Zhang
【Abstract】: Detecting changes in multidimensional data streams is an important and challenging task. In unsupervised change detection, changes are usually detected by comparing the distribution in a current (test) window with a reference window. It is thus essential to design divergence metrics and density estimators for comparing the data distributions, which are mostly done for univariate data. Detecting changes in multidimensional data streams brings difficulties to the density estimation and comparisons. In this paper, we propose a framework for detecting changes in multidimensional data streams based on principal component analysis, which is used for projecting data into a lower dimensional space, thus facilitating density estimation and change-score calculations. The proposed framework also has advantages over existing approaches by reducing computational costs with an efficient density estimator, promoting the change-score calculation by introducing effective divergence metrics, and by minimizing the efforts required from users on the threshold parameter setting by using the Page-Hinkley test. The evaluation results on synthetic and real data show that our framework outperforms two baseline methods in terms of both detection accuracy and computational costs.
【Keywords】: change detection; data streams; density estimation; principal component analysis
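A hedged sketch of the overall pipeline: project the reference and test windows onto the leading principal components of the reference window and compare the projected one-dimensional distributions. The KS statistic and window sizes used here are stand-in assumptions for the paper's density estimator, divergence metrics, and Page-Hinkley thresholding.
```python
# PCA projection of stream windows followed by a 1-D two-sample comparison.
import numpy as np
from scipy.stats import ks_2samp

def change_score(reference, test, n_components=2):
    mean = reference.mean(axis=0)
    _, _, Vt = np.linalg.svd(reference - mean, full_matrices=False)
    W = Vt[:n_components].T                      # principal directions
    scores = []
    for c in range(n_components):
        ref_proj = (reference - mean) @ W[:, c]
        test_proj = (test - mean) @ W[:, c]
        scores.append(ks_2samp(ref_proj, test_proj).statistic)
    return max(scores)

rng = np.random.default_rng(0)
before = rng.normal(0.0, 1.0, size=(500, 10))
after = rng.normal(0.8, 1.0, size=(500, 10))     # mean-shifted segment
print(change_score(before, before[250:]), change_score(before, after))
```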
【Paper Link】 【Pages】:945-954
【Authors】: Guo-Jun Qi ; Charu Aggarwal ; Deepak S. Turaga ; Daby M. Sow ; Phil Anno
【Abstract】: An important problem in large-scale sensor mining is that of selecting relevant sensors for prediction purposes. Selecting small subsets of sensors, also referred to as active sensors, often leads to lower operational costs, and it reduces the noise and information overload for prediction. Existing sensor selection and prediction models either select a set of sensors a priori, or they use adaptive algorithms to determine the most relevant sensors for prediction. Sensor data sets often show dynamically varying patterns, because of which it is suboptimal to select a fixed subset of active sensors. To address this problem, we develop a novel dynamic prediction model that uses the notion of hidden system states to dynamically select a varying subset of sensors. These hidden system states are automatically learned by our model in a data-driven manner. The proposed algorithm can rapidly switch between different sets of active sensors when the model detects the (periodic or intermittent) change in the system state. We derive the dynamic sensor selection strategy by minimizing the error rates in tracking and predicting sensor readings over time. We introduce the notion of state-stacked sparseness to select a subset of the most critical sensors as a function of evolving system state. We present experimental results on two real sensor datasets, corresponding to oil drilling rig sensors and intensive care unit (ICU) sensors, and demonstrate the superiority of our approach with respect to other models.
【Keywords】: dynamic sensor selection; state-driven model; state-stacked sparseness
【Paper Link】 【Pages】:955-964
【Authors】: Shiyou Qian ; Jian Cao ; Frédéric Le Mouël ; Issam Sahel ; Minglu Li
【Abstract】: Recommending routes for a group of competing taxi drivers is almost untouched in most route recommender systems. For this kind of problem, recommendation fairness and driving efficiency are two fundamental aspects. In the paper, we propose SCRAM, a sharing considered route assignment mechanism for fair taxi route recommendations. SCRAM aims to provide recommendation fairness for a group of competing taxi drivers, without sacrificing driving efficiency. By designing a concise route assignment mechanism, SCRAM achieves better recommendation fairness for competing taxis. By considering the sharing of road sections to avoid unnecessary competition, SCRAM is more efficient in terms of driving cost per customer (DCC). We test SCRAM based on a large number of historical taxi trajectories and validate the recommendation fairness and driving efficiency of SCRAM with extensive evaluations. Experimental results show that SCRAM achieves better recommendation fairness and higher driving efficiency than three compared approaches.
【Keywords】: assignment mechanism; fairness; recommender systems; taxis
【Paper Link】 【Pages】:965-974
【Authors】: Lu Qin ; Rong-Hua Li ; Lijun Chang ; Chengqi Zhang
【Abstract】: Mining dense subgraphs from a large graph is a fundamental graph mining task and can be widely applied in a variety of application domains such as network science, biology, graph database, web mining, graph compression, and micro-blogging systems. Here a dense subgraph is defined as a subgraph with high density (#edges / #nodes). Existing studies of this problem either focus on finding the densest subgraph or identifying an optimal clique-like dense subgraph, and they adopt a simple greedy approach to find the top-k dense subgraphs. However, their identified subgraphs cannot be used to represent the dense regions of the graph. Intuitively, to represent a dense region, the subgraph identified should be the subgraph with highest density in its local region in the graph. However, it is non-trivial to formally model a locally densest subgraph. In this paper, we aim to discover top-k such representative locally densest subgraphs of a graph. We provide an elegant parameter-free definition of a locally densest subgraph. The definition not only fits well with the intuition, but is also associated with several nice structural properties. We show that the set of locally densest subgraphs in a graph can be computed in polynomial time. We further propose three novel pruning strategies to largely reduce the search space of the algorithm. In our experiments, we use several real datasets with various graph properties to evaluate the effectiveness of our model using four quality measures and a case study. We also test our algorithms on several real web-scale graphs, one of which contains 118.14 million nodes and 1.02 billion edges, to demonstrate the high efficiency of the proposed algorithms.
【Keywords】: big data; dense subgraph; graph
【Paper Link】 【Pages】:975-984
【Authors】: Angeliki Rapti ; Spyros Sioutas ; Kostas Tsichlas ; Giannis Tzimas
【Abstract】: Suppose we have a virus or one competing idea/product that propagates over a multiple profile (e.g., social) network. Can we predict what proportion of the network will actually get "infected" (e.g., spread the idea or buy the competing product), when the nodes of the network appear to have different sensitivity based on their profile? For example, if there are two profiles A and B in a network and the nodes of profile A and profile B are susceptible to a highly spreading virus with probabilities β_A and β_B respectively, what percentage of both profiles will actually get infected from the virus at the end? To reverse the question, what are the necessary conditions so that a predefined percentage of the network is infected? We assume that nodes of different profiles can infect one another and we prove that under realistic conditions, apart from the weak profile (great sensitivity), the stronger profile (low sensitivity) will get infected as well. First, we focus on cliques with the goal to provide exact theoretical results as well as to get some intuition as to how a virus affects such a multiple profile network. Then, we move to the theoretical analysis of arbitrary networks. We provide bounds on certain properties of the network based on the probabilities of infection of each node in it when it reaches the steady state. Finally, we provide extensive experimental results that verify our theoretical results and at the same time provide more insight on the problem.
【Keywords】: epidemics; profiles; virus propagation
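A toy discrete-time simulation of the two-profile setting described above, where profile-A and profile-B nodes are infected by infectious neighbors with probabilities β_A and β_B and recover with a fixed probability per step; the SIS-style dynamics, recovery rate, and clique example are illustrative assumptions rather than the paper's exact epidemic model.
```python
# Two-profile SIS-style epidemic simulation on a toy clique.
import random

def infected_fraction(adj, profile, beta, delta=0.2, steps=200, seed=0):
    random.seed(seed)
    infected = {random.choice(list(adj))}            # single random seed node
    for _ in range(steps):
        nxt = set()
        for u in adj:
            if u in infected:
                if random.random() > delta:           # stays infected
                    nxt.add(u)
            else:
                k = sum(1 for v in adj[u] if v in infected)
                if k and random.random() < 1 - (1 - beta[profile[u]]) ** k:
                    nxt.add(u)
        infected = nxt
    return len(infected) / len(adj)

n = 20                                                # clique of 20 nodes
adj = {u: [v for v in range(n) if v != u] for u in range(n)}
profile = {u: 'A' if u < n // 2 else 'B' for u in range(n)}
print(infected_fraction(adj, profile, beta={'A': 0.4, 'B': 0.05}))
```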
【Paper Link】 【Pages】:985-994
【Authors】: Shebuti Rayana ; Leman Akoglu
【Abstract】: Online reviews capture the testimonials of "real" people and help shape the decisions of other consumers. Due to the financial gains associated with positive reviews, however, opinion spam has become a widespread problem, with often paid spam reviewers writing fake reviews to unjustly promote or demote certain products or businesses. Existing approaches to opinion spam have successfully but separately utilized linguistic clues of deception, behavioral footprints, or relational ties between agents in a review system. In this work, we propose a new holistic approach called SPEAGLE that utilizes clues from all metadata (text, timestamp, rating) as well as relational data (network), and harness them collectively under a unified framework to spot suspicious users and reviews, as well as products targeted by spam. Moreover, our method can efficiently and seamlessly integrate semi-supervision, i.e., a (small) set of labels if available, without requiring any training or changes in its underlying algorithm. We demonstrate the effectiveness and scalability of SPEAGLE on three real-world review datasets from Yelp.com with filtered (spam) and recommended (non-spam) reviews, where it significantly outperforms several baselines and state-of-the-art methods. To the best of our knowledge, this is the largest scale quantitative evaluation performed to date for the opinion spam problem.
【Keywords】: heterogeneous networks; metadata; opinion spam; scalable algorithms; semi-supervised learning
【Paper Link】 【Pages】:995-1004
【Authors】: Xiang Ren ; Ahmed El-Kishky ; Chi Wang ; Fangbo Tao ; Clare R. Voss ; Jiawei Han
【Abstract】: Entity recognition is an important but challenging research problem. In reality, many text collections are from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition with increase in name ambiguity and context sparsity, requiring entity detection without domain restriction. In this paper, we investigate entity recognition (ER) with distant-supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities. Then we predict the type of each entity mention based on the type signatures of its co-occurring relation phrases and the type indicators of its surface name, as computed over the corpus. Specifically, we formulate a joint optimization problem for two tasks, type propagation with relation phrases and multi-view relation phrase clustering. Our experiments on multiple genres---news, Yelp reviews and tweets---demonstrate the effectiveness and robustness of ClusType, with an average of 37% improvement in F1 score over the best compared method.
【Keywords】: entity recognition and typing; relation phrase clustering
【Paper Link】 【Pages】:1005-1014
【Authors】: Matteo Riondato ; Eli Upfal
【Abstract】: We present an algorithm to extract a high-quality approximation of the (top-k) frequent itemsets (FIs) from random samples of a transactional dataset. With high probability the approximation is a superset of the FIs, and no itemset with frequency much lower than the threshold is included in it. The algorithm employs progressive sampling, with a stopping condition based on bounds on the empirical Rademacher average, a key concept from statistical learning theory. The computation of the bounds uses characteristic quantities that can be obtained efficiently with a single scan of the sample. Therefore, evaluating the stopping condition is fast, and does not require an expensive mining of each sample. Our experimental evaluation confirms the practicality of our approach on real datasets, outperforming approaches based on one-shot static sampling.
【Keywords】: frequent itemsets; pattern mining; rademacher averages; sampling; statistical learning theory
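A sketch of the progressive-sampling skeleton, assuming single-item frequencies and a crude Hoeffding-plus-union-bound stopping rule in place of the paper's much tighter empirical-Rademacher bound; the transactions, eps, delta, and batch size are illustrative.
```python
# Progressive sampling: grow the sample until a deviation bound falls below eps.
import math
import random

def progressive_frequencies(transactions, n_items, eps=0.02, delta=0.05,
                            batch=1000, seed=0):
    random.seed(seed)
    sample_size, counts = 0, [0] * n_items
    while True:
        for _ in range(batch):
            for item in random.choice(transactions):  # sample one transaction
                counts[item] += 1
        sample_size += batch
        # uniform deviation bound over the n_items single-item frequencies
        bound = math.sqrt(math.log(2 * n_items / delta) / (2 * sample_size))
        if bound <= eps:
            return [c / sample_size for c in counts], sample_size

transactions = [[0, 1], [1, 2], [0, 2, 3], [3]] * 5000
freqs, used = progressive_frequencies(transactions, n_items=4)
print(used, [round(f, 3) for f in freqs])
```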
【Paper Link】 【Pages】:1015-1024
【Authors】: Yu Rong ; Hong Cheng ; Zhiyu Mo
【Abstract】: In today's social networks, a huge volume of content containing rich information, such as reviews, ratings, and microblogs, is being generated, consumed and diffused by users all the time. Given the temporal information, we can obtain the event cascade, which indicates the time sequence of the arrival of information to users. Many models have been proposed to explain how information diffuses. However, most existing models cannot give a clear explanation of why every specific event in an event cascade happens. Such an explanation is essential for us to have a deeper understanding of information diffusion as well as a better prediction of future event cascades. In order to uncover the mechanism behind the occurrence of social events, we analyze rating event data crawled from Douban.com, a Chinese social network, from 2006 to 2011. We distinguish three factors that can explain the emergence of every specific event: social, external and intrinsic influence. We then use mixed Poisson processes to model the event cascades generated by the different factors respectively and integrate the different Poisson processes with shared parameters. The proposed model, called the Combinational Mixed Poisson Process (CMPP) model, can explain not only how information diffuses in social networks, but also why a specific event happens. This model can help us to understand information diffusion from both macroscopic and microscopic perspectives. We develop an efficient Classification EM algorithm to infer the model parameters. The explanatory and predictive power of the proposed model has been demonstrated by experiments on large real data sets.
【Keywords】: event cascade; external influence; information diffusion; intrinsic influence; poisson process; social influence
【Paper Link】 【Pages】:1025-1034
【Authors】: Natali Ruchansky ; Mark Crovella ; Evimaria Terzi
【Abstract】: In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed and low rank. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. Even when the data satisfies the low-rank assumption, classical matrix-completion methods may output completions with significant error -- in that the reconstructed matrix differs significantly from the true underlying matrix. Often, this is due to the fact that the information contained in the observed entries is insufficient. In this work, we address this problem by proposing an active version of matrix completion, where queries can be made to the true underlying matrix. Subsequently, we design Order&Extend, which is the first algorithm to unify a matrix-completion approach and a querying strategy into a single algorithm. Order&Extend is able to identify and alleviate insufficient information by judiciously querying a small number of additional entries. In an extensive experimental evaluation on real-world datasets, we demonstrate that our algorithm is efficient and is able to accurately reconstruct the true matrix while asking only a small number of queries.
【Keywords】: active querying; matrix completion; recommender systems
【Paper Link】 【Pages】:1035-1044
【Authors】: Issei Sato ; Hiroshi Nakagawa
【Abstract】: The collapsed variational Bayes zero (CVB0) inference is a variational inference method improved by marginalizing out parameters, as in the collapsed Gibbs sampler. A drawback of the CVB0 inference is its memory requirements: a probability vector over latent topics must be maintained for every token in a corpus. When the total number of tokens is N and the number of topics is K, the CVB0 inference requires Ο(NK) memory. A stochastic approximation of the CVB0 (SCVB0) inference can reduce Ο(NK) to Ο(VK), where V denotes the vocabulary size. We reformulate the existing SCVB0 inference by using the stochastic divergence minimization algorithm, with which convergence can be analyzed in terms of martingale convergence theory. We also reveal a property of the CVB0 inference in terms of the leave-one-out perplexity, which leads to an estimation algorithm for the Dirichlet distribution parameters. The predictive performance of the proposed SCVB0 inference is better than that of the original SCVB0 inference on four datasets.
【Keywords】: collapsed variational bayes inference; latent dirichlet allocation; online learning; stochastic optimization; topic modeling; variational bayes inference
【Paper Link】 【Pages】:1045-1054
【Authors】: Aaron Schein ; John William Paisley ; David M. Blei ; Hanna M. Wallach
【Abstract】: We present a Bayesian tensor factorization model for inferring latent group structures from dynamic pairwise interaction patterns. For decades, political scientists have collected and analyzed records of the form "country i took action a toward country j at time t" - known as dyadic events - in order to form and test theories of international relations. We represent these event data as a tensor of counts and develop Bayesian Poisson tensor factorization to infer a low-dimensional, interpretable representation of their salient patterns. We demonstrate that our model's predictive performance is better than that of standard non-negative tensor factorization methods. We also provide a comparison of our variational updates to their maximum likelihood counterparts. In doing so, we identify a better way to form point estimates of the latent factors than that typically used in Bayesian Poisson matrix factorization. Finally, we showcase our model as an exploratory analysis tool for political scientists. We show that the inferred latent factor matrices capture interpretable multilateral relations that both conform to and inform our knowledge of international affairs.
【Keywords】: bayesian inference; dyadic data; international relations; poisson tensor factorization
【Paper Link】 【Pages】:1055-1064
【Authors】: Neil Shah ; Danai Koutra ; Tianmin Zou ; Brian Gallagher ; Christos Faloutsos
【Abstract】: How can we describe a large, dynamic graph over time? Is it random? If not, what are the most apparent deviations from randomness -- a dense block of actors that persists over time, or perhaps a star with many satellite nodes that appears with some fixed periodicity? In practice, these deviations indicate patterns -- for example, botnet attackers forming a bipartite core with their victims over the duration of an attack, family members bonding in a clique-like fashion over a difficult period of time, or research collaborations forming and fading away over the years. Which patterns exist in real-world dynamic graphs, and how can we find and rank them in terms of importance? These are exactly the problems we focus on in this work. Our main contributions are (a) formulation: we show how to formalize this problem as minimizing the encoding cost in a data compression paradigm, (b) algorithm: we propose TIMECRUNCH, an effective, scalable and parameter-free method for finding coherent, temporal patterns in dynamic graphs and (c) practicality: we apply our method to several large, diverse real-world datasets with up to 36 million edges and 6.3 million nodes. We show that TIMECRUNCH is able to compress these graphs by summarizing important temporal structures and finds patterns that agree with intuition.
【Keywords】: clustering; compression; dynamic graph; network; summarization
【Paper Link】 【Pages】:1065-1074
【Authors】: Dafna Shahaf ; Eric Horvitz ; Robert Mankoff
【Abstract】: Humor is an integral aspect of the human experience. Motivated by the prospect of creating computational models of humor, we study the influence of the language of cartoon captions on the perceived humorousness of the cartoons. Our studies are based on a large corpus of crowdsourced cartoon captions that were submitted to a contest hosted by the New Yorker. Having access to thousands of captions submitted for the same image allows us to analyze the breadth of responses of people to the same visual stimulus. We first describe how we acquire judgments about the humorousness of different captions. Then, we detail the construction of a corpus where captions deemed funnier are paired with less-funny captions for the same cartoon. We analyze the caption pairs and find significant differences between the funnier and less-funny captions. Next, we build a classifier to identify funnier captions automatically. Given two captions and a cartoon, our classifier picks the funnier one 69% of the time for captions hinging on the same joke, and 64% of the time for any pair of captions. Finally, we use the classifier to find the best captions and study how its predictions could be used to significantly reduce the load on the cartoon contest's judges.
【Keywords】: cartoon; cartoon caption; humor
【Paper Link】 【Pages】:1075-1084
【Authors】: Junming Shao ; Zhichao Han ; Qinli Yang ; Tao Zhou
【Abstract】: In this paper, we introduce a new community detection algorithm, called Attractor, which automatically spots communities in a network by examining the changes of "distances" among nodes (i.e. distance dynamics). The fundamental idea is to envision the target network as an adaptive dynamical system, where each node interacts with its neighbors. The interaction will change the distances among nodes, while the distances will affect the interactions. Such interplay eventually leads to a steady distribution of distances, where the nodes sharing the same community move together and the nodes in different communities keep far away from each other. Building upon the distance dynamics, Attractor has several remarkable advantages: (a) It provides an intuitive way to analyze the community structure of a network, and more importantly, faithfully captures the natural communities (with high quality). (b) Attractor allows detecting communities on large-scale networks due to its low time complexity (O(|E|)). (c) Attractor is capable of discovering communities of arbitrary size, and thus small-size communities or anomalies, usually existing in real-world networks, can be well pinpointed. Extensive experiments show that our algorithm allows the effective and efficient community detection and has good performance compared to state-of-the-art algorithms.
【Keywords】: community detection; interaction model; network
【Paper Link】 【Pages】:1085-1094
【Authors】: Mohammad Shokoohi-Yekta ; Yanping Chen ; Bilson J. L. Campana ; Bing Hu ; Jesin Zakaria ; Eamonn J. Keogh
【Abstract】: The ability to make predictions about future events is at the heart of much of science; so, it is not surprising that prediction has been a topic of great interest in the data mining community for the last decade. Most of the previous work has attempted to predict the future based on the current value of a stream. However, for many problems the actual values are irrelevant, whereas the shape of the current time series pattern may foretell the future. The handful of research efforts that consider this variant of the problem have met with limited success. In particular, it is now understood that most of these efforts allow the discovery of spurious rules. We believe the reason why rule discovery in real-valued time series has failed thus far is because most efforts have more or less indiscriminately applied the ideas of symbolic stream rule discovery to real-valued rule discovery. In this work, we show why these ideas are not directly suitable for rule discovery in time series. Beyond our novel definitions/representations, which allow for meaningful and extendable specifications of rules, we further show novel algorithms that allow us to quickly discover high quality rules in very large datasets that accurately predict the occurrence of future events.
【Keywords】: motif discovery; prediction; rule discovery; time series
【Paper Link】 【Pages】:1095-1104
【Authors】: Julian Shun
【Abstract】: This paper presents efficient shared-memory parallel implementations and the first comprehensive experimental study of graph eccentricity estimation algorithms in the literature. The implementations include (1) a simple algorithm based on executing two-pass breadth-first searches from a sample of vertices, (2) algorithms with sub-quadratic worst-case running time for sparse graphs and non-trivial approximation guarantees that execute breadth-first searches from a carefully chosen set of vertices, (3) algorithms based on probabilistic counters, and (4) a well-known 2-approximation algorithm that executes one breadth-first search per connected component. Our experiments on large undirected real-world graphs show that the algorithm based on two-pass breadth-first searches works surprisingly well, outperforming the other algorithms in terms of running time and/or accuracy by up to orders of magnitude. The high accuracy, efficiency, and parallelism of our best implementation allows the fast generation of eccentricity estimates for large graphs, which are useful in many applications arising in large-scale network analysis.
【Keywords】: approximation; estimation; graph diameter; graph eccentricity; graph radius; parallel algorithms; real-world graphs
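A simplified sketch of BFS-based eccentricity estimation: run BFS from a few sampled sources and keep, for each vertex, the largest distance observed, which lower-bounds its eccentricity. This illustrates the flavor of the sampled-BFS algorithms above but is not the paper's two-pass scheme; the path graph and sample size are assumptions.
```python
# BFS-based eccentricity lower bounds: ecc(v) >= dist(s, v) for any source s.
import random
from collections import deque

def bfs_distances(adj, src):
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def eccentricity_lower_bounds(adj, n_samples=4, seed=0):
    random.seed(seed)
    sources = random.sample(list(adj), min(n_samples, len(adj)))
    lb = {v: 0 for v in adj}
    for s in sources:
        for v, d in bfs_distances(adj, s).items():
            lb[v] = max(lb[v], d)
    return lb

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}   # a 5-node path
print(eccentricity_lower_bounds(adj))
```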
【Paper Link】 【Pages】:1105-1114
【Authors】: Dongjin Song ; David A. Meyer ; Dacheng Tao
【Abstract】: Signed networks, in which the relationship between two nodes can be either positive (indicating a relationship such as trust) or negative (indicating a relationship such as distrust), are becoming increasingly common. A plausible model for user behavior analytics in signed networks can be based upon the assumption that more extreme positive and negative relationships are explored and exploited before less extreme ones. Such a model implies that a personalized ranking list of latent links should place positive links on the top, negative links at the bottom, and unknown status links in between. Traditional ranking metrics, e.g., area under the receiver operating characteristic curve (AUC), are however not suitable for quantifying such a ranking list which includes positive, negative, and unknown status links. To address this issue, a generalized AUC (GAUC) which can measure both the head and tail of a ranking list has been introduced. Since GAUC weights each pairwise comparison equally and the calculation of GAUC requires quadratic time, we derive two lower bounds of GAUC which can be computed in linear time and put more emphasis on ranking positive links on the top and negative links at the bottom of a ranking list. Next, we develop two efficient latent link recommendation (ELLR) algorithms in order to recommend links by directly optimizing these two lower bounds, respectively. Finally, we compare these two ELLR algorithms with top-performing baseline methods over four benchmark datasets, among which the largest network has more than 100 thousand nodes and seven million entries. Thorough empirical studies demonstrate that the proposed ELLR algorithms outperform state-of-the-art approaches for link recommendation in signed networks at no cost in efficiency.
【Keywords】: gauc; link recommendation; recommender systems; signed networks
【Paper Link】 【Pages】:1115-1124
【Authors】: Shaoxu Song ; Chunping Li ; Xiaoquan Zhang
【Abstract】: Dirty data commonly exist. Simply discarding a large number of inaccurate points (as noises) could greatly affect clustering results. We argue that dirty data can be repaired and utilized as strong supports in clustering. To this end, we study a novel problem of clustering and repairing over dirty data at the same time. Referring to the minimum change principle in data repairing, the objective is to find a minimum modification of inaccurate points such that the large amount of dirty data can enhance the clustering. We show that the problem can be formulated as an integer linear programming (ILP) problem. Efficient approximation is then devised by a linear programming (LP) relaxation. In particular, we illustrate that an optimal solution of the LP problem can be directly obtained without calling a solver. A quadratic time approximation algorithm is developed based on the aforesaid LP solution. We further advance the algorithm to linear time cost, where a trade-off between effectiveness and efficiency is enabled. Empirical results demonstrate that both the clustering and cleaning accuracies can be improved by our approach of repairing and utilizing the dirty data in clustering.
【Keywords】: data cleaning; data repairing
【Paper Link】 【Pages】:1125-1133
【Authors】: Stergios Stergiou ; Kostas Tsioutsiouliklis
【Abstract】: The classic Set Cover problem requires selecting a minimum-size subcollection A ⊆ F from a family F of finite subsets of U such that A covers the same elements as F. It naturally occurs in many settings in web search, web mining and web advertising. The greedy algorithm that iteratively selects a set in F that covers the most uncovered elements yields an essentially optimal (1 + ln |U|)-approximation but is inherently sequential. In this work we give the first MapReduce Set Cover algorithm that scales to problem sizes of ∼ 1 trillion elements and runs in log_p Δ iterations for a nearly optimal approximation ratio of p ln Δ, where Δ is the cardinality of the largest set in F. A web crawler is a system for bulk downloading of web pages. Given a set of seed URLs, the crawler downloads and extracts the hyperlinks embedded in them and schedules the crawling of the pages addressed by those hyperlinks for a subsequent iteration. While the average page out-degree is ∼ 50, the crawled corpus grows at a much smaller rate, implying a significant outlink overlap. Using our MapReduce Set Cover heuristic as a building block, we present the first large-scale seed generation algorithm that scales to ∼ 20 billion nodes and discovers new pages at a rate ∼ 4x faster than that obtained by prior-art heuristics.
【Keywords】: map-reduce; max cover; set cover
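For reference, the sequential greedy baseline the abstract mentions fits in a few lines of Python; the paper's contribution is a MapReduce algorithm that selects many near-best sets per round, which this sketch does not attempt to capture:

    def greedy_set_cover(universe, family):
        """Classic sequential greedy: repeatedly pick the set that covers the most
        still-uncovered elements; gives a (1 + ln |U|)-approximation."""
        uncovered = set(universe)
        cover = []
        while uncovered:
            best = max(family, key=lambda s: len(uncovered & s))
            gained = uncovered & best
            if not gained:
                break                    # remaining elements cannot be covered
            cover.append(best)
            uncovered -= gained
        return cover

    U = set(range(10))
    F = [set(range(0, 6)), set(range(4, 10)), {0, 9}, {2, 3}]
    print(greedy_set_cover(U, F))        # the two large overlapping sets suffice here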
【Paper Link】 【Pages】:1135-1144
【Authors】: Yu Su ; Shengqi Yang ; Huan Sun ; Mudhakar Srivatsa ; Sue Kase ; Michelle Vanni ; Xifeng Yan
【Abstract】: The big data era is witnessing a prevalent shift of data from homogeneous to heterogeneous, from isolated to linked. Exemplar outcomes of this shift are a wide range of graph data such as information, social, and knowledge graphs. The unique characteristics of graph data are challenging traditional search techniques like SQL and keyword search. Graph query is emerging as a promising complementary search form. In this paper, we study how to improve graph query by relevance feedback. Specifically, we focus on knowledge graph query, and formulate the graph relevance feedback (GRF) problem. We propose a general GRF framework that is able to (1) tune the original ranking function based on user feedback and (2) further enrich the query itself by mining new features from user feedback. As a consequence, a query-specific ranking function is generated, which is better aligned with the user search intent. Given a newly learned ranking function based on user feedback, we further investigate whether we shall re-rank the existing answers, or choose to search from scratch. We propose a strategy to train a binary classifier to predict which action will be more beneficial for a given query. The GRF framework is applied to searching DBpedia with graph queries derived from YAGO and Wikipedia. Experimental results show that GRF can improve the mean average precision by 80% to 100%.
【Keywords】: graph query; knowledge graph; relevance feedback
【Paper Link】 【Pages】:1145-1154
【Authors】: Zhaonan Sun ; Fei Wang ; Jianying Hu
【Abstract】: Comprehensive risk assessment lies at the core of enabling proactive healthcare delivery systems. In recent years, data-driven predictive modeling approaches have been increasingly recognized as promising techniques to help enhance healthcare quality and reduce cost. In this paper, we propose a data-driven comprehensive risk prediction method, named LINKAGE, which can be used to jointly assess a set of associated risks in support of holistic care management. Our method can not only perform prediction but also discover the relationships among those risks. The advantages of the proposed model include: 1) It can leverage the relationship between risks and domains and achieve better risk prediction performance; 2) It provides a data-driven approach to understanding the relationships between risks; 3) It lets risk prediction and risk association learning share information so that both parts improve; 4) It provides flexibility to incorporate domain knowledge in learning risk associations. We validate the effectiveness of the proposed model on synthetic data and a real-world healthcare survey data set.
【Keywords】: comprehensive risk prediction; covariance matrix; generalized linear model; generalized thresholding; healthcare
【Paper Link】 【Pages】:1155-1164
【Authors】: Ben Tan ; Yangqiu Song ; ErHeng Zhong ; Qiang Yang
【Abstract】: Transfer learning, which leverages knowledge from source domains to enhance learning ability in a target domain, has been proven effective in various applications. One major limitation of transfer learning is that the source and target domains should be directly related. If there is little overlap between the two domains, performing knowledge transfer between these domains will not be effective. Inspired by human transitive inference and learning ability, whereby two seemingly unrelated concepts can be connected by a string of intermediate bridges using auxiliary concepts, in this paper we study a novel learning problem: Transitive Transfer Learning (abbreviated to TTL). TTL aims to bridge large domain distances and transfer knowledge even when the source and target domains share few factors directly. For example, when the source and target domains are documents and images respectively, TTL could use some annotated images as the intermediate domain to bridge them. To solve the TTL problem, we propose a learning framework to mimic the human learning process. The framework is composed of an intermediate domain selection component and a knowledge transfer component. Extensive empirical evidence shows that the framework yields state-of-the-art classification accuracies on several classification data sets.
【Keywords】: nonnegative matrix tri-factorizations; transfer learning; transitive transfer learning
【Paper Link】 【Pages】:1165-1174
【Authors】: Jian Tang ; Meng Qu ; Qiaozhu Mei
【Abstract】: Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector, have been attracting increasing attention due to their simplicity, scalability, and effectiveness. However, compared to sophisticated deep learning architectures such as convolutional neural networks, these methods usually yield inferior results when applied to particular machine learning tasks. One possible reason is that these text embedding methods learn the representation of text in a fully unsupervised way, without leveraging the labeled information available for the task. Although the low-dimensional representations learned are applicable to many different tasks, they are not particularly tuned for any task. In this paper, we fill this gap by proposing a semi-supervised representation learning method for text data, which we call predictive text embedding (PTE). Predictive text embedding utilizes both labeled and unlabeled data to learn the embedding of text. The labeled information and different levels of word co-occurrence information are first represented as a large-scale heterogeneous text network, which is then embedded into a low-dimensional space through a principled and efficient algorithm. This low-dimensional embedding not only preserves the semantic closeness of words and documents, but also has strong predictive power for the particular task. Compared to recent supervised approaches based on convolutional neural networks, predictive text embedding is comparable to or more effective, much more efficient, and has fewer parameters to tune.
【Keywords】: predictive text embedding; representation learning
【Paper Link】 【Pages】:1175-1184
【Authors】: Ya-Wen Teng ; Chih-Hua Tai ; Philip S. Yu ; Ming-Syan Chen
【Abstract】: Recently, the influence maximization problem has received much attention for its applications in viral marketing and product promotion. However, such influence maximization problems have not taken into account the monetary effect on the purchasing decisions of individuals. To fill this gap, in this paper, we aim at maximizing revenue by considering the quantity constraint on the promoted commodity. For this problem, we not only identify a proper small group of individuals as seeds for promotion but also determine the pricing of the commodity. To tackle the revenue maximization problem, we first introduce a strategic searching algorithm, referred to as Algorithm PRUB, which is able to derive the optimal solutions. After that, we further modify PRUB to propose a heuristic, Algorithm PRUB+IF, for obtaining feasible solutions more efficiently on larger instances. Experiments on real social networks with different valuation distributions demonstrate the effectiveness of PRUB and PRUB+IF.
【Keywords】: marketing; monetizing social networks; pricing; revenue maximization; social influence
【Paper Link】 【Pages】:1185-1194
【Authors】: Kenneth Tran ; Saghar Hosseini ; Lin Xiao ; Thomas Finley ; Mikhail Bilenko
【Abstract】: Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random-coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is typically implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.
【Keywords】:
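For readers unfamiliar with SDCA, a minimal single-threaded sketch for squared loss with L2 regularization is given below, using the standard closed-form dual coordinate update; the asynchronous parallel variant, its convergence analysis, and the out-of-memory machinery described in the abstract are not reflected here:

    import numpy as np

    def sdca_ridge(X, y, lam=0.01, epochs=50, seed=0):
        """Sequential SDCA for squared loss + L2 regularization (minimal sketch).
        Maintains dual variables alpha and the primal iterate w = X^T alpha / (lam * n)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        alpha = np.zeros(n)
        w = np.zeros(d)
        sq_norms = (X ** 2).sum(axis=1)
        for _ in range(epochs):
            for i in rng.permutation(n):            # random-coordinate updates
                # exact maximization of the dual in coordinate i (squared loss)
                delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
                alpha[i] += delta
                w += delta * X[i] / (lam * n)
        return w

    X = np.random.default_rng(1).normal(size=(200, 5))
    w_true = np.arange(1.0, 6.0)
    y = X @ w_true + 0.01 * np.random.default_rng(2).normal(size=200)
    # approaches the ridge solution, close to w_true up to a small shrinkage
    print(np.round(sdca_ridge(X, y, lam=1e-2), 2))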
【Paper Link】 【Pages】:1195-1204
【Authors】: Hastagiri P. Vanchinathan ; Andreas Marfurt ; Charles-Antoine Robelin ; Donald Kossmann ; Andreas Krause
【Abstract】: Suppose there is a large collection of items, each with an associated cost and an inherent utility that is revealed only once we commit to selecting it. Given a budget on the cumulative cost of the selected items, how can we pick a subset of maximal value? This task generalizes several important problems such as multi-arm bandits, active search and the knapsack problem. We present an algorithm, GP-SELECT, which utilizes prior knowledge about similarity between items, expressed as a kernel function. GP-SELECT uses Gaussian process prediction to balance exploration (estimating the unknown value of items) and exploitation (selecting items of high value). We extend GP-SELECT to be able to discover sets that simultaneously have high utility and are diverse. Our preference for diversity can be specified as an arbitrary monotone submodular function that quantifies the diminishing returns obtained when selecting similar items. Furthermore, we exploit the structure of the model updates to achieve an order of magnitude (up to 40X) speedup in our experiments without resorting to approximations. We provide strong guarantees on the performance of GP-SELECT and apply it to three real-world case studies of industrial relevance: (1) Refreshing a repository of prices in a Global Distribution System for the travel industry, (2) Identifying diverse, binding-affine peptides in a vaccine design task and (3) Maximizing clicks in a web-scale recommender system by recommending items to users.
【Keywords】: active learning; active search; design of experiments; kernel methods; recommender systems
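A rough picture of the exploration/exploitation loop can be given with a GP posterior and an upper-confidence selection rule under a budget. The sketch below is only that picture, under our own simplifying assumptions (RBF kernel, global budget check, no diversity term), and does not reproduce GP-SELECT's guarantees or its lazy-update speedups:

    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0):
        """Squared-exponential kernel between the rows of A and B."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def gp_select(features, costs, budget, reveal_value, beta=2.0, noise=0.1):
        """Budgeted item selection with a GP upper-confidence rule (illustration only)."""
        K = rbf_kernel(features, features)
        chosen, y_obs, spent = [], [], 0.0
        while True:
            if chosen:
                K_oo = K[np.ix_(chosen, chosen)] + noise ** 2 * np.eye(len(chosen))
                K_no = K[:, chosen]
                mu = K_no @ np.linalg.solve(K_oo, np.array(y_obs))
                var = np.diag(K) - np.einsum('ij,ji->i', K_no, np.linalg.solve(K_oo, K_no.T))
                var = np.clip(var, 1e-12, None)
            else:
                mu, var = np.zeros(len(features)), np.diag(K).copy()
            ucb = mu + beta * np.sqrt(var)            # exploit (mean) + explore (std)
            for j in chosen:                          # never re-select an item
                ucb[j] = -np.inf
            ucb[costs + spent > budget] = -np.inf     # respect the remaining budget
            i = int(np.argmax(ucb))
            if not np.isfinite(ucb[i]):
                return chosen
            chosen.append(i)
            y_obs.append(reveal_value(i))             # utility is revealed on selection
            spent += costs[i]

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(30, 2))
    costs = rng.uniform(0.5, 1.5, size=30)
    print(gp_select(feats, costs, budget=5.0, reveal_value=lambda i: feats[i].sum()))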
【Paper Link】 【Pages】:1205-1214
【Authors】: Vivek Veeriah ; Rohit Durvasula ; Guo-Jun Qi
【Abstract】: This paper explores the idea of using a deep neural network architecture with dynamically programmed layers for the brain connectome prediction problem. Understanding the brain connectome structure is a very interesting and challenging problem. It is critical in the research for epilepsy and other neuropathological diseases. We introduce a new deep learning architecture that exploits the spatial and temporal nature of the neuronal activation data. The architecture consists of a combination of a Convolutional layer and a Recurrent layer for predicting the connectome of neurons based on their time-series of activation data. The key contribution of this paper is a dynamically programmed layer that is critical in determining the alignment between the neuronal activations of pair-wise combinations of neurons.
【Keywords】: brain connectome prediction; deep learning; dynamically programmed layer; time-series alignment
【Paper Link】 【Pages】:1215-1224
【Authors】: Chenguang Wang ; Yangqiu Song ; Ahmed El-Kishky ; Dan Roth ; Ming Zhang ; Jiawei Han
【Abstract】: One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider a framework that uses world knowledge as indirect supervision. World knowledge is general-purpose knowledge, which is not designed for any specific domain. The key challenges are then how to adapt the world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and represent the data with world knowledge as a heterogeneous information network. Then we propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base automatically extracted from Wikipedia that maps its knowledge to the linguistic knowledge base WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform the state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
【Keywords】: document clustering; heterogeneous information network; knowledge base; knowledge graph; world knowledge
【Paper Link】 【Pages】:1225-1234
【Authors】: Chi Wang ; Xueqing Liu ; Yanglei Song ; Jiawei Han
【Abstract】: Automatic construction of user-desired topical hierarchies over large volumes of text data is a highly desirable but challenging task. This study proposes to give users freedom to construct topical hierarchies via interactive operations such as expanding a branch and merging several branches. Existing hierarchical topic modeling techniques are inadequate for this purpose because (1) they cannot consistently preserve the topics when the hierarchy structure is modified; and (2) the slow inference prevents swift response to user requests. In this study, we propose a novel method, called STROD, that allows efficient and consistent modification of topic hierarchies, based on a recursive generative model and a scalable tensor decomposition inference algorithm with theoretical performance guarantee. Empirical evaluation shows that STROD reduces the runtime of construction by several orders of magnitude, while generating consistent and quality hierarchies.
【Keywords】: interactive data exploration; ontology learning; tensor decomposition; topic modeling
【Paper Link】 【Pages】:1235-1244
【Authors】: Hao Wang ; Naiyan Wang ; Dit-Yan Yeung
【Abstract】: Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendations. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach, which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.
【Keywords】: deep learning; recommender systems; text mining; topic model
【Paper Link】 【Pages】:1245-1254
【Authors】: Jialei Wang ; Ryohei Fujimaki ; Yosuke Motohashi
【Abstract】: Model interpretability has been recognized to play a key role in practical data mining. Interpretable models provide significant insights on data and model behaviors and may convince end-users to employ certain models. In return for these advantages, however, there is generally a sacrifice in accuracy, i.e., the flexibility of model representation (e.g., linear, rule-based, etc.) and model complexity need to be restricted in order for users to be able to understand the results. This paper proposes oblique treed sparse additive models (OT-SpAMs). Our main focus is on developing a model that sacrifices a certain degree of interpretability for accuracy, yet achieves accuracy competitive with fully non-linear models such as kernel support vector machines (SVMs). OT-SpAMs are instances of region-specific predictive models. They divide feature spaces into regions with sparse oblique tree splitting and assign local sparse additive experts to individual regions. In order to maintain OT-SpAM interpretability, we have to keep the overall model structure simple, and this produces simultaneous model selection issues for sparse oblique region structures and sparse local experts. We address this problem by extending factorized asymptotic Bayesian inference. We demonstrate, on simulation, benchmark, and real-world datasets, that in terms of accuracy, OT-SpAMs outperform state-of-the-art interpretable models and perform competitively with kernel SVMs, while still providing results that are highly understandable.
【Keywords】: interpretable model; model selection; sparseness
【Paper Link】 【Pages】:1255-1264
【Authors】: Weiqing Wang ; Hongzhi Yin ; Ling Chen ; Yizhou Sun ; Shazia Wasim Sadiq ; Xiaofang Zhou
【Abstract】: With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important means to help people discover attractive and interesting venues and events, especially when users travel out of town. However, this recommendation is very challenging compared to traditional recommender systems. A user can visit only a limited number of spatial items, leading to a very sparse user-item matrix. Most of the items visited by a user are located within a short distance from where he/she lives, which makes it hard to recommend items when the user travels to a faraway place. Moreover, user interests and behavior patterns may vary dramatically across different geographical regions. In light of this, in this paper we propose Geo-SAGE, a geographical sparse additive generative model for spatial item recommendation. Geo-SAGE considers both user personal interests and the preference of the crowd in the target region, by exploiting both the co-occurrence pattern of spatial items and the content of spatial items. To further alleviate the data sparsity issue, Geo-SAGE exploits the geographical correlation by smoothing the crowd's preferences over a well-designed spatial index structure called the spatial pyramid. We conduct extensive experiments and the experimental results clearly demonstrate that our Geo-SAGE model outperforms the state of the art.
【Keywords】: cold start; location-based service; recommender system; sparse additive model; spatial pyramid
【Paper Link】 【Pages】:1265-1274
【Authors】: Yichen Wang ; Robert Chen ; Joydeep Ghosh ; Joshua C. Denny ; Abel N. Kho ; You Chen ; Bradley A. Malin ; Jimeng Sun
【Abstract】: Computational phenotyping is the process of converting heterogeneous electronic health records (EHRs) into meaningful clinical concepts. Unsupervised phenotyping methods have the potential to leverage a vast amount of unlabeled EHR data for phenotype discovery. However, existing unsupervised phenotyping methods do not incorporate current medical knowledge and cannot directly handle missing or noisy data. We propose Rubik, a constrained non-negative tensor factorization and completion method for phenotyping. Rubik incorporates 1) guidance constraints to align with existing medical knowledge, and 2) pairwise constraints for obtaining distinct, non-overlapping phenotypes. Rubik also has built-in tensor completion that can significantly alleviate the impact of noisy and missing data. We utilize the Alternating Direction Method of Multipliers (ADMM) framework for tensor factorization and completion, which can be easily scaled through parallel computing. We evaluate Rubik on two EHR datasets, one of which contains 647,118 records for 7,744 patients from an outpatient clinic, the other of which is a public dataset containing 1,018,614 CMS claims records for 472,645 patients. Our results show that Rubik can discover more meaningful and distinct phenotypes than the baselines. In particular, by using knowledge guidance constraints, Rubik can also discover sub-phenotypes for several major diseases. Rubik also runs around seven times faster than current state-of-the-art tensor methods. Finally, Rubik is scalable to large datasets containing millions of EHR records.
【Keywords】: computational phenotyping; constraint optimization; healthcare analytics; tensor analysis
【Paper Link】 【Pages】:1275-1284
【Authors】: Yingzi Wang ; Nicholas Jing Yuan ; Defu Lian ; Linli Xu ; Xing Xie ; Enhong Chen ; Yong Rui
【Abstract】: Mobility prediction enables appealing proactive experiences for location-aware services and offers essential intelligence to business and governments. Recent studies suggest that human mobility is highly regular and predictable. Additionally, social conformity theory indicates that people's movements are influenced by others. However, existing approaches for location prediction fail to organically combine both the regularity and conformity of human mobility in a unified model, and lack the capacity to incorporate heterogeneous mobility datasets to boost prediction performance. To address these challenges, in this paper we propose a hybrid predictive model integrating both the regularity and conformity of human mobility as well as their mutual reinforcement. In addition, we further elevate the predictive power of our model by learning location profiles from heterogeneous mobility datasets based on a gravity model. We evaluate the proposed model using several city-scale mobility datasets including location check-ins, GPS trajectories of taxis, and public transit data. The experimental results validate that our model significantly outperforms state-of-the-art approaches for mobility prediction in terms of multiple metrics such as accuracy and percentile rank. The results also suggest that the predictability of human mobility is time-varying, e.g., the overall predictability is higher on workdays than holidays while predicting users' unvisited locations is more challenging for workdays than holidays.
【Keywords】: collaborative filtering; conformity; gravity model; location prediction; location profile; regularity; spatial influence
【Paper Link】 【Pages】:1285-1294
【Authors】: Zheng Wang ; Prithwish Chakraborty ; Sumiko R. Mekaru ; John S. Brownstein ; Jieping Ye ; Naren Ramakrishnan
【Abstract】: Influenza-like illness (ILI) is among the most common diseases worldwide, and reliable forecasting of the same can have significant public health benefits. Recently, new forms of disease surveillance based upon digital data sources have been proposed and are continuing to attract attention over traditional surveillance methods. In this paper, we focus on short-term ILI case count prediction and develop a dynamic Poisson autoregressive model with exogenous input variables (DPARX) for flu forecasting. In this model, we allow the autoregressive model to change over time. In order to control the variation in the model, we construct a model similarity graph to specify the relationship between pairs of models at two time points and embed prior knowledge in terms of the structure of the graph. We formulate ILI case count forecasting as a convex optimization problem, whose objective balances the autoregressive loss and the model similarity regularization induced by the structure of the similarity graph. We then propose an efficient algorithm to solve this problem by block coordinate descent. We apply our model and the corresponding learning method on historical ILI records for 15 countries around the world using a variety of syndromic surveillance data sources. Our approach provides consistently better forecasting results than state-of-the-art models available for short-term ILI case count forecasting.
【Keywords】: autoregressive models; flu forecasting; time series methods
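To fix ideas, a static Poisson autoregressive regression with exogenous inputs can be fit by gradient ascent on the Poisson log-likelihood, as sketched below; the paper's DPARX model additionally lets the coefficients vary over time and couples them through the model similarity graph, which this sketch omits (the log1p lag transform is our own assumption):

    import numpy as np

    def fit_poisson_ar(y, X, p=2, lr=0.01, iters=5000):
        """Fit log E[y_t] = b0 + sum_k a_k * log(1 + y_{t-k}) + w . x_t by gradient
        ascent on the Poisson log-likelihood (a static stand-in, not DPARX itself)."""
        T = len(y)
        rows = []
        for t in range(p, T):
            lags = [np.log1p(y[t - k]) for k in range(1, p + 1)]
            rows.append(np.concatenate(([1.0], lags, X[t])))
        Z = np.array(rows)                      # design matrix of lagged + exogenous features
        target = np.asarray(y[p:], dtype=float)
        theta = np.zeros(Z.shape[1])
        for _ in range(iters):
            lam = np.exp(Z @ theta)             # Poisson rate per time point
            grad = Z.T @ (target - lam) / len(target)
            theta += lr * grad
        return theta

    rng = np.random.default_rng(0)
    T = 300
    X = rng.normal(size=(T, 1))
    y = np.zeros(T, dtype=int)
    for t in range(2, T):                        # simulate counts with lag + exogenous effects
        rate = np.exp(0.3 + 0.5 * np.log1p(y[t - 1]) + 0.4 * X[t, 0])
        y[t] = rng.poisson(rate)
    print(np.round(fit_poisson_ar(y, X, p=2), 2))  # roughly recovers [0.3, 0.5, 0.0, 0.4]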
【Paper Link】 【Pages】:1295-1304
【Authors】: Jörg Wicker ; Nicolas Krauter ; Bettina Derstorff ; Christof Stönner ; Efstratios Bourtsoukidis ; Thomas Klüpfel ; Jonathan Williams ; Stefan Kramer
【Abstract】: While the physiological response of humans to emotional events or stimuli is well-investigated for many modalities (like EEG, skin resistance, ...), surprisingly little is known about the exhalation of so-called Volatile Organic Compounds (VOCs) at quite low concentrations in response to such stimuli. VOCs are molecules of relatively small mass that quickly evaporate or sublimate and can be detected in the air that surrounds us. The paper introduces a new field of application for data mining, where trace gas responses of people reacting on-line to films shown in cinemas (or movie theaters) are related to the semantic content of the films themselves. To do so, we measured the VOCs from a movie theater over a whole month in intervals of thirty seconds, and annotated the screened films by a controlled vocabulary compiled from multiple sources. To gain a better understanding of the data and to reveal unknown relationships, we have built prediction models for so-called forward prediction (the prediction of future VOCs from the past), backward prediction (the prediction of past scene labels from future VOCs), which is some form of abductive reasoning, and Granger causality. Experimental results show that some VOCs and some labels can be predicted with relatively low error, and that hints of causality with low p-values can be detected in the data. The data set is publicly available at: https://github.com/jorro/smelloffear.
【Keywords】: abductive reasoning; application; atmospheric chemistry; breath analysis; causality; data mining; emotional response analysis; movie analysis
【Paper Link】 【Pages】:1305-1314
【Authors】: Wush Chi-Hsuan Wu ; Mi-Yen Yeh ; Ming-Syan Chen
【Abstract】: From the perspective of a Demand-Side Platform (DSP), the agent of advertisers, we study how to predict the winning price so that the DSP can win the bid by placing a proper bidding value in the real-time bidding (RTB) auction. We propose to leverage machine learning and statistical methods to train the winning price model from the bidding history. A major challenge is that a DSP usually suffers from the censoring of the winning price, especially for those lost bids in the past. To solve it, we utilize the censored regression model, which is widely used in survival analysis and econometrics, to fit the censored bidding data. Note, however, that the assumption of censored regression does not hold on real RTB data. As a result, we further propose a mixture model, which combines linear regression on bids with observable winning prices and censored regression on bids with censored winning prices, weighted by the winning rate of the DSP. Experimental results show that the proposed mixture model in general prominently outperforms linear regression in terms of prediction accuracy.
【Keywords】: demand-side platform; display advertising; learning with partial labels; real-time bidding
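A minimal sketch of the two ingredients is given below: a Tobit-style censored-regression likelihood (observed winning prices contribute a density term, lost bids only a tail-probability term) and a winning-rate-weighted mixture of the two regressors. The simulation, the global win-rate weight, and the optimizer choice are our simplifications, not the paper's exact formulation:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    def censored_nll(params, X, price, won):
        """Negative log-likelihood: if we won, the winning price is observed
        (density term); if we lost, we only know it exceeded our bid (survival term)."""
        w, log_sigma = params[:-1], params[-1]
        sigma = np.exp(log_sigma)
        mu = X @ w
        ll_obs = norm.logpdf(price[won], loc=mu[won], scale=sigma)
        ll_cen = norm.logsf(price[~won], loc=mu[~won], scale=sigma)   # price here = our losing bid
        return -(ll_obs.sum() + ll_cen.sum())

    rng = np.random.default_rng(0)
    n, d = 500, 3
    X = rng.normal(size=(n, d)); X[:, 0] = 1.0
    true_w = np.array([2.0, 1.0, -0.5])
    wp = X @ true_w + 0.3 * rng.normal(size=n)        # true winning prices
    bid = X @ true_w + rng.normal(size=n)             # our historical bids
    won = bid > wp
    price = np.where(won, wp, bid)                    # observed: winning price if won, else our bid

    res = minimize(censored_nll, x0=np.zeros(d + 1), args=(X, price, won), method="Nelder-Mead")
    w_cen = res.x[:-1]
    w_lin, *_ = np.linalg.lstsq(X[won], wp[won], rcond=None)  # linear fit on uncensored bids only

    win_rate = won.mean()                             # global mixture weight (a simplification)
    pred = win_rate * (X @ w_lin) + (1.0 - win_rate) * (X @ w_cen)
    print(np.round(w_cen, 2), np.round(w_lin, 2), round(float(win_rate), 2))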
【Paper Link】 【Pages】:1315-1324
【Authors】: Pengtao Xie ; Yuntian Deng ; Eric P. Xing
【Abstract】: Restricted Boltzmann Machine (RBM) has shown great effectiveness in document modeling. It utilizes hidden units to discover the latent topics and can learn compact semantic representations for documents which greatly facilitate document retrieval, clustering and classification. The popularity (or frequency) of topics in text corpora usually follows a power-law distribution where a few dominant topics occur very frequently while most topics (in the long-tail region) have low probabilities. Due to this imbalance, RBM tends to learn multiple redundant hidden units to best represent dominant topics and ignore those in the long-tail region, which renders the learned representations redundant and non-informative. To solve this problem, we propose Diversified RBM (DRBM) which diversifies the hidden units, to make them cover not only the dominant topics, but also those in the long-tail region. We define a diversity metric and use it as a regularizer to encourage the hidden units to be diverse. Since the diversity metric is hard to optimize directly, we instead optimize its lower bound and prove that maximizing the lower bound with projected gradient ascent can increase this diversity metric. Experiments on document retrieval and clustering demonstrate that with diversification, the document modeling power of DRBM can be greatly improved.
【Keywords】: diversified restricted boltzmann machine; diversity; document modeling; power-law distribution; topic modeling
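The abstract does not state the diversity metric; one common proxy for "diverse hidden units" is a penalty on pairwise cosine similarity between hidden-unit weight vectors, sketched below purely as an assumed illustration (the paper's metric and its optimized lower bound may differ):

    import numpy as np

    def pairwise_cosine_penalty(W):
        """Mean squared off-diagonal cosine similarity between hidden-unit weight
        vectors (rows of W): smaller values mean more 'diverse' hidden units.
        This specific metric is an assumption, not the paper's definition."""
        Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
        C = Wn @ Wn.T
        off = C - np.diag(np.diag(C))
        k = W.shape[0]
        return (off ** 2).sum() / (k * (k - 1))

    rng = np.random.default_rng(0)
    redundant = np.tile(rng.normal(size=(1, 20)), (5, 1)) + 0.01 * rng.normal(size=(5, 20))
    diverse = rng.normal(size=(5, 20))
    # near-duplicate units score close to 1, unrelated units score much lower
    print(round(pairwise_cosine_penalty(redundant), 3), round(pairwise_cosine_penalty(diverse), 3))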
【Paper Link】 【Pages】:1325-1334
【Authors】: Wenlei Xie ; David Bindel ; Alan J. Demers ; Johannes Gehrke
【Abstract】: Personalized PageRank is a standard tool for finding vertices in a graph that are most relevant to a query or user. To personalize PageRank, one adjusts node weights or edge weights that determine teleport probabilities and transition probabilities in a random surfer model. There are many fast methods to approximate PageRank when the node weights are personalized; however, personalization based on edge weights has been an open problem since the dawn of personalized PageRank over a decade ago. In this paper, we describe the first fast algorithm for computing PageRank on general graphs when the edge weights are personalized. Our method, which is based on model reduction, outperforms existing methods by nearly five orders of magnitude. This huge performance gain over previous work allows us --- for the very first time --- to solve learning-to-rank problems for edge weight personalization at interactive speeds, a goal that had not previously been achievable for this class of problems.
【Keywords】: personalized pagerank
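For context, node-weight personalization only changes the teleport vector in the standard power iteration shown below; edge-weight personalization changes the transition probabilities themselves, which is what makes the paper's model-reduction approach necessary. None of that machinery appears in this baseline sketch:

    import numpy as np

    def personalized_pagerank(adj, teleport, alpha=0.85, tol=1e-10, max_iter=200):
        """Power iteration for PageRank with a personalized teleport vector.
        adj[u] is the list of out-neighbors of u; teleport is a probability vector."""
        t = np.asarray(teleport, dtype=float)
        t = t / t.sum()
        x = t.copy()
        for _ in range(max_iter):
            x_new = (1.0 - alpha) * t
            for u, nbrs in enumerate(adj):
                if nbrs:
                    share = alpha * x[u] / len(nbrs)
                    for v in nbrs:
                        x_new[v] += share
                else:
                    x_new += alpha * x[u] * t        # dangling node: redistribute via teleport
            if np.abs(x_new - x).sum() < tol:
                break
            x = x_new
        return x

    adj = [[1, 2], [2], [0], [0, 2]]
    print(np.round(personalized_pagerank(adj, teleport=[1.0, 0.0, 0.0, 0.0]), 3))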
【Paper Link】 【Pages】:1335-1344
【Authors】: Eric P. Xing ; Qirong Ho ; Wei Dai ; Jin Kyu Kim ; Jinliang Wei ; Seunghak Lee ; Xun Zheng ; Pengtao Xie ; Abhimanu Kumar ; Yaoliang Yu
【Abstract】: How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graphical representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by leveraging several fundamental properties underlying ML programs that make them different from conventional operation-centric programs: error tolerance, dynamic structure, and nonuniform convergence; all stem from the optimization-centric nature shared in ML programs' mathematical definitions, and the iterative-convergent behavior of their algorithmic solutions. These properties present unique opportunities for an integrative system design, built on bounded-latency network synchronization and dynamic load-balancing scheduling, which is efficient, programmable, and enjoys provable correctness guarantees. We demonstrate how such a design in light of ML-first principles leads to significant performance improvements versus well-known implementations of several ML programs, allowing them to run in much less time and at considerably larger model sizes, on modestly-sized computer clusters.
【Keywords】: big data; big model; data-parallelism; distributed systems; machine learning; model-parallelism; theory
【Paper Link】 【Pages】:1345-1354
【Authors】: Tingyang Xu ; Jiangwen Sun ; Jinbo Bi
【Abstract】: Longitudinal analysis is important in many disciplines, such as the study of behavioral transitions in social science. Only very recently, feature selection has drawn adequate attention in the context of longitudinal modeling. Standard techniques, such as generalized estimating equations, have been modified to select features by imposing sparsity-inducing regularizers. However, they do not explicitly model how a dependent variable relies on features measured at proximal time points. Recent graphical Granger modeling can select features in lagged time points but ignores the temporal correlations within an individual's repeated measurements. We propose an approach to automatically and simultaneously determine both the relevant features and the relevant temporal points that impact the current outcome of the dependent variable. Meanwhile, the proposed model takes into account the non-i.i.d. nature of the data by estimating the within-individual correlations. This approach decomposes the model parameters into a summation of two components and imposes separate block-wise LASSO penalties on each component when building a linear model in terms of the past τ measurements of features. One component is used to select features whereas the other is used to select temporal contingent points. An accelerated gradient descent algorithm is developed to efficiently solve the related optimization problem with detailed convergence analysis and asymptotic analysis. Computational results on both synthetic and real world problems demonstrate the superior performance of the proposed approach over existing techniques.
【Keywords】: longitudinal modeling; regression; regularization methods; sparse predictive modeling
【Paper Link】 【Pages】:1355-1364
【Authors】: Feng Yan ; Olatunji Ruwase ; Yuxiong He ; Trishul M. Chilimbi
【Abstract】: Big deep neural network (DNN) models trained on large amounts of data have recently achieved the best accuracy on hard tasks, such as image and speech recognition. Training these DNNs using a cluster of commodity machines is a promising approach since training is time consuming and compute-intensive. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples with a global parameter server maintaining shared weights across these replicas. The correct choice for model and data partitioning and overall system provisioning is highly dependent on the DNN and distributed system hardware characteristics. These decisions currently require significant domain expertise and time consuming empirical state space exploration. This paper develops performance models that quantify the impact of these partitioning and provisioning decisions on overall distributed system performance and scalability. Also, we use these performance models to build a scalability optimizer that efficiently determines the optimal system configuration that minimizes DNN training time. We evaluate our performance models and scalability optimizer using a state-of-the-art distributed DNN training framework on two benchmark applications. The results show our performance models estimate DNN training time with high estimation accuracy and our scalability optimizer correctly chooses the best configurations, minimizing the training time of distributed DNNs.
【Keywords】: deep learning; distributed system; optimization; performance modeling; scalability
【Paper Link】 【Pages】:1365-1374
【Authors】: Pinar Yanardag ; S. V. N. Vishwanathan
【Abstract】: In this paper, we present Deep Graph Kernels, a unified framework to learn latent representations of sub-structures for graphs, inspired by the latest advancements in language modeling and deep learning. Our framework leverages the dependency information between sub-structures by learning their latent representations. We demonstrate instances of our framework on three popular graph kernels, namely Graphlet kernels, Weisfeiler-Lehman subtree kernels, and Shortest-Path graph kernels. Our experiments on several benchmark datasets show that Deep Graph Kernels achieve significant improvements in classification accuracy over state-of-the-art graph kernels.
【Keywords】: bioinformatics; collaboration networks; deep learning; graph kernels; r-convolution kernels; social networks; string kernels; structured data
【Paper Link】 【Pages】:1375-1384
【Authors】: Pei Yang ; Jingrui He
【Abstract】: In many real world applications such as satellite image analysis, gene function prediction, and insider threat detection, the data collected from heterogeneous sources often exhibit multiple types of heterogeneity, such as task heterogeneity, view heterogeneity, and label heterogeneity. To address this problem, we propose a Hierarchical Multi-Latent Space (HiMLS) learning approach to jointly model the triple types of heterogeneity. The basic idea is to learn a hierarchical multi-latent space by which we can simultaneously leverage the task relatedness, view consistency and the label correlations to improve the learning performance. We first propose a multi-latent space framework to model the complex heterogeneity, which is used as a building block to stack up a multi-layer structure so as to learn the hierarchical multi-latent space. In such a way, we can gradually learn the more abstract concepts in the higher level. Then, a deep learning algorithm is proposed to solve the optimization problem. The experimental results on various data sets show the effectiveness of the proposed approach.
【Keywords】: heterogeneous learning; multi-label learning; multi-task learning; multi-view learning
【Paper Link】 【Pages】:1385-1394
【Authors】: Sen Yang ; Qian Sun ; Shuiwang Ji ; Peter Wonka ; Ian Davidson ; Jieping Ye
【Abstract】: Investigations into brain connectivity aim to recover networks of brain regions connected by anatomical tracts or by functional associations. The inference of brain networks has recently attracted much interest due to the increasing availability of high-resolution brain imaging data. Sparse inverse covariance estimation with lasso and group lasso penalty has been demonstrated to be a powerful approach to discover brain networks. Motivated by the hierarchical structure of the brain networks, we consider the problem of estimating a graphical model with tree-structural regularization in this paper. The regularization encourages the graphical model to exhibit a brain-like structure. Specifically, in this hierarchical structure, hundreds of thousands of voxels serve as the leaf nodes of the tree. A node in the intermediate layer represents a region formed by voxels in the subtree rooted at that node. The whole brain is considered as the root of the tree. We propose to apply the tree-structural regularized graphical model to estimate the mouse brain network. However, the dimensionality of whole-brain data, usually on the order of hundreds of thousands, poses significant computational challenges. Efficient algorithms that are capable of estimating networks from high-dimensional data are highly desired. To address the computational challenge, we develop a screening rule which can quickly identify many zero blocks in the estimated graphical model, thereby dramatically reducing the computational cost of solving the proposed model. It is based on a novel insight on the relationship between screening and the so-called proximal operator that we first establish in this paper. We perform experiments on both synthetic data and real data from the Allen Developing Mouse Brain Atlas; results demonstrate the effectiveness and efficiency of the proposed approach.
【Keywords】: brain networks; graphical lasso; proximal operator; screening; second-order method; tree-structural regularization
【Paper Link】 【Pages】:1395-1404
【Authors】: Yang Yang ; Yizhou Sun ; Jie Tang ; Bo Ma ; Juan-Zi Li
【Abstract】: Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. Traditionally, the problem was usually addressed by first extracting major keywords corresponding to the source entity and then querying relevant entities from the target domain using those keywords. However, this method inevitably fails if the two domains have little or no overlap in content. An extreme case is that the source domain is in English and the target domain is in Chinese. In this paper, we formalize the problem as entity matching across heterogeneous sources and propose a probabilistic topic model to solve the problem. The model integrates topic extraction and entity matching, two core subtasks for dealing with the problem, into a unified model. Specifically, for handling the text disjointing problem, we use a cross-sampling process in our model to extract topics with terms coming from all the sources, and leverage existing matching relations through latent topic layers instead of at text layers. Benefiting from the proposed model, we can not only find the matched documents for a query entity, but also explain why these documents are related by showing the common topics they share. Our experiments in two real-world applications show that the proposed model can extensively improve the matching performance (+19.8% and +7.1% in the two applications respectively) compared with several alternative methods.
【Keywords】: cross-lingual matching; heterogeneous sources; topic model
【Paper Link】 【Pages】:1405-1414
【Authors】: Jinfeng Yi ; Lijun Zhang ; Tianbao Yang ; Wei Liu ; Jun Wang
【Abstract】: Semi-supervised clustering leverages side information such as pairwise constraints to guide clustering procedures. Despite promising progress, existing semi-supervised clustering approaches overlook the condition of side information being generated sequentially, which is a natural setting arising in numerous real-world applications such as social network and e-commerce system analysis. Given newly emerged constraints, classical semi-supervised clustering algorithms need to re-optimize their objectives over all available data samples and constraints, which prevents them from efficiently updating the obtained data partitions. To address this challenge, we propose an efficient dynamic semi-supervised clustering framework that casts the clustering problem into a search problem over a feasible convex set, i.e., a convex hull with its extreme points being an ensemble of m data partitions. According to the principle of ensemble clustering, the optimal partition lies in the convex hull, and can thus be uniquely represented by an m-dimensional probability simplex vector. As such, the dynamic semi-supervised clustering problem is simplified to the problem of updating a probability simplex vector subject to the newly received pairwise constraints. We then develop a computationally efficient updating procedure to update the probability simplex vector in O(m²) time, irrespective of the data size n. Our empirical studies on several real-world benchmark datasets show that the proposed algorithm outperforms the state-of-the-art semi-supervised clustering algorithms with visible performance gain and significantly reduced running time.
【Keywords】: convex hull; probability simplex; semi-supervised clustering; sequential constraints
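The core primitive the abstract describes, updating an m-dimensional probability simplex vector as constraints arrive, ultimately needs a projection back onto the simplex. The standard sorting-based Euclidean projection is sketched below as an illustration; the gradient step shown is hypothetical and the paper's actual update rule is not reproduced:

    import numpy as np

    def project_to_simplex(v):
        """Euclidean projection of v onto the probability simplex
        {x : x >= 0, sum(x) = 1} via the standard sorting-based algorithm."""
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
        theta = (1.0 - css[rho]) / (rho + 1)
        return np.maximum(v + theta, 0.0)

    # e.g. take a step on the simplex weights over m ensemble partitions,
    # then project back so they remain a valid convex combination
    m = 5
    weights = np.full(m, 1.0 / m)
    grad = np.array([0.3, -0.2, 0.1, 0.0, -0.4])     # hypothetical constraint-violation gradient
    weights = project_to_simplex(weights - 0.5 * grad)
    print(np.round(weights, 3))                      # non-negative and sums to 1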
【Paper Link】 【Pages】:1415-1424
【Authors】: Chao Zhang ; Yu Zheng ; Xiuli Ma ; Jiawei Han
【Abstract】: Recent years have witnessed the wide proliferation of geo-sensory applications wherein a bundle of sensors are deployed at different locations to cooperatively monitor the target condition. Given massive geo-sensory data, we study the problem of mining spatial co-evolving patterns (SCPs), i.e., groups of sensors that are spatially correlated and co-evolve frequently in their readings. SCP mining is of great importance to various real-world applications, yet it is challenging because (1) the truly interesting evolutions are often flooded by numerous trivial fluctuations in the geo-sensory time series; and (2) the pattern search space is extremely large due to the spatiotemporal combinatorial nature of SCP. In this paper, we propose a two-stage method called Assember. In the first stage, Assember filters trivial fluctuations using wavelet transform and detects frequent evolutions for individual sensors via a segment-and-group approach. In the second stage, Assember generates SCPs by assembling the frequent evolutions of individual sensors. Leveraging the spatial constraint, it conceptually organizes all the SCPs into a novel structure called the SCP search tree, which facilitates the effective pruning of the search space to generate SCPs efficiently. Our experiments on both real and synthetic data sets show that Assember is effective, efficient, and scalable.
【Keywords】: co-evolving pattern; sensor network; spatiotemporal data
【Paper Link】 【Pages】:1425-1434
【Authors】: Hao Zhang ; Gunhee Kim ; Eric P. Xing
【Abstract】: We propose a dynamic topic model for monitoring the temporal evolution of market competition by jointly leveraging tweets and their associated images. For a market of interest (e.g. luxury goods), we aim at automatically detecting the latent topics (e.g. bags, clothes, luxurious) that are competitively shared by multiple brands (e.g. Burberry, Prada, and Chanel), and tracking the temporal evolution of the brands' stakes over the shared topics. One of the key applications of our work is social media monitoring that can provide companies with temporal summaries of highly overlapped or discriminative topics with their major competitors. We design our model to correctly address three major challenges: multiview representation of text and images, modeling of competitiveness of multiple brands over shared topics, and tracking their temporal evolution. As far as we know, no previous model can satisfy all three challenges. For evaluation, we analyze about 10 million tweets and 8 million associated images of 23 brands in the two categories of luxury and beer. Through experiments, we show that the proposed approach is more successful than other candidate methods for the topic modeling of competition. We also quantitatively demonstrate the generalization power of the proposed method for three prediction tasks.
【Keywords】: dynamic topic models; market competition; text and images
【Paper Link】 【Pages】:1435-1444
【Authors】: Jiawei Zhang ; Philip S. Yu ; Yuanhua Lv
【Abstract】: Nowadays, to facilitate communication and cooperation among employees, many companies have adopted a new family of online social networks, called "enterprise social networks" (ESNs). ESNs can provide employees with various professional services to help them deal with daily work issues. Meanwhile, employees in companies are usually organized into different hierarchies according to the relative ranks of their positions. The company internal management structure can be outlined visually with the organizational chart, which is normally kept confidential out of privacy and security concerns. In this paper, we want to study the IOC (Inference of Organizational Chart) problem to identify the company internal organizational chart based on the heterogeneous online ESN launched in it. IOC is very challenging to address as, to guarantee smooth operations, the internal organizational charts of companies need to meet certain structural requirements (about their depth and width). To solve the IOC problem, a novel unsupervised method Create (ChArT REcovEr) is proposed in this paper, which consists of 3 steps: (1) social stratification of ESN users into different social classes, (2) supervision link inference from managers to subordinates, and (3) consecutive social classes matching to prune the redundant supervision links. Extensive experiments conducted on a real-world online ESN dataset demonstrate that Create can perform very well in addressing the IOC problem.
【Keywords】: data mining; enterprise social network; organization chart inference
【Paper Link】 【Pages】:1445-1454
【Authors】: Jing Zhang ; Jie Tang ; Cong Ma ; Hanghang Tong ; Yu Jing ; Juanzi Li
【Abstract】: Estimating similarity between vertices is a fundamental issue in network analysis across various domains, such as social networks and biological networks. Methods based on common neighbors and structural contexts have received much attention. However, both categories of methods are difficult to scale up to handle large networks (with billions of nodes). In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices. The algorithm is based on a novel idea of random path. Specifically, given a network, we perform R random walks, each starting from a randomly picked vertex and walking T steps. Theoretically, the algorithm guarantees that the sampling size R = O(2ε⁻² log² T) depends on the error-bound ε, the confidence level (1 − δ), and the path length T of each random walk. We perform extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300× faster than the state-of-the-art methods. We also use two applications, identity resolution and structural hole spanner finding, to evaluate the accuracy of the estimated similarities. Our results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.
【Keywords】: random path; social network; vertex similarity
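The random-path mechanism can be conveyed with a tiny sequential sketch: run R random walks of T steps and score vertex pairs by how often they land on the same walk. The exact estimator, its weighting, and the stated sampling bound are the paper's; the sketch below only conveys the mechanism:

    import random
    from collections import defaultdict
    from itertools import combinations

    def random_path_similarity(adj, R=2000, T=5, seed=0):
        """Estimate vertex similarity by co-occurrence on R random walks of T steps.
        sim[(u, v)] approximates how often u and v appear on the same short path."""
        rng = random.Random(seed)
        counts = defaultdict(int)
        nodes = list(adj)
        for _ in range(R):
            u = rng.choice(nodes)
            path = {u}
            for _ in range(T):
                if not adj[u]:
                    break
                u = rng.choice(adj[u])
                path.add(u)
            for a, b in combinations(sorted(path), 2):
                counts[(a, b)] += 1
        return {pair: c / R for pair, c in counts.items()}

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    sim = random_path_similarity(adj)
    print(sorted(sim.items(), key=lambda kv: -kv[1])[:3])   # most similar vertex pairs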
【Paper Link】 【Pages】:1455-1464
【Authors】: Wei Zhang ; Jianyong Wang
【Abstract】: Event-based social networks (EBSNs), in which organizers publish events to attract other users in the local city to attend offline, have emerged in recent years and grown rapidly. Due to the large volume of events in EBSNs, event recommendation is essential. A few recent works focus on this task, but almost all of these methods require that each event to be recommended has already been registered by some users to attend. Thus they ignore two essential characteristics of events in EBSNs: (1) a large number of new events are published every day, which means many events have few participants in the beginning, and (2) events have life cycles, which means outdated events should not be recommended. Overall, event recommendation in EBSNs inevitably faces the cold-start problem. In this work, we address the new problem of cold-start local event recommendation in EBSNs. We propose a collective Bayesian Poisson factorization (CBPF) model for handling this problem. CBPF takes the recently proposed Bayesian Poisson factorization as its basic unit to model user response to events, social relations, and content text separately. It then jointly connects these units following the idea of the standard collective matrix factorization model. Moreover, in our model event textual content, organizer, and location information are utilized to learn representations of cold-start events for predicting user response to them. Besides, an efficient coordinate ascent algorithm is adopted to learn the model. We conducted comprehensive experiments on real datasets crawled from EBSNs and the results demonstrate that our proposed model is effective and outperforms several alternative methods.
【Keywords】: bayesian poisson factorization; cold-start recommendation; event recommendation; event-based social networks
【Paper Link】 【Pages】:1465-1474
【Authors】: Weinan Zhang ; Jun Wang
【Abstract】: We study and formulate arbitrage in display advertising. Real-Time Bidding (RTB) mimics stock spot exchanges and utilises computers to algorithmically buy display ads per impression via a real-time auction. Despite the new automation, the ad markets are still informationally inefficient due to the heavily fragmented marketplaces. Two display impressions with similar or identical effectiveness (e.g., measured by conversion or click-through rates for a targeted audience) may sell for quite different prices at different market segments or pricing schemes. In this paper, we propose a novel data mining paradigm called Statistical Arbitrage Mining (SAM) focusing on mining and exploiting price discrepancies between two pricing schemes. In essence, our SAMer is a meta-bidder that hedges advertisers' risk between CPA (cost per action)-based campaigns and CPM (cost per mille impressions)-based ad inventories; it statistically assesses the potential profit and cost for an incoming CPM bid request against a portfolio of CPA campaigns based on the estimated conversion rate, bid landscape and other statistics learned from historical data. In SAM, (i) functional optimisation is utilised to seek for optimal bidding to maximise the expected arbitrage net profit, and (ii) a portfolio-based risk management solution is leveraged to reallocate bid volume and budget across the set of campaigns to make a risk and return trade-off. We propose to jointly optimise both components in an EM fashion with high efficiency to help the meta-bidder successfully catch the transient statistical arbitrage opportunities in RTB. Both the offline experiments on a real-world large-scale dataset and online A/B tests on a commercial platform demonstrate the effectiveness of our proposed solution in exploiting arbitrage in various model settings and market environments.
【Keywords】: display ads; real-time bidding; statistical arbitrage
【Paper Link】 【Pages】:1475-1484
【Authors】: Wenlu Zhang ; Rongjian Li ; Tao Zeng ; Qian Sun ; Sudhir Kumar ; Jieping Ye ; Shuiwang Ji
【Abstract】: A central theme in learning from image data is to develop appropriate image representations for the specific task at hand. Traditional methods used handcrafted local features combined with high-level image representations to generate image-level representations. Thus, a practical challenge is to determine what features are appropriate for specific tasks. For example, in the study of gene expression patterns in Drosophila melanogaster, texture features based on wavelets were particularly effective for determining the developmental stages from in situ hybridization (ISH) images. Such image representation is however not suitable for controlled vocabulary (CV) term annotation because each CV term is often associated with only a part of an image. Here, we developed problem-independent feature extraction methods to generate hierarchical representations for ISH images. Our approach is based on the deep convolutional neural networks (CNNs) that can act on image pixels directly. To make the extracted features generic, the models were trained using a natural image set with millions of labeled examples. These models were transferred to the ISH image domain and used directly as feature extractors to compute image representations. Furthermore, we employed multi-task learning method to fine-tune the pre-trained models with labeled ISH images, and also extracted features from the fine-tuned models. Experimental results showed that feature representations computed by deep models based on transfer and multi-task learning significantly outperformed other methods for annotating gene expression patterns at different stage ranges. We also demonstrated that the intermediate layers of deep models produced the best gene expression pattern representations.
【Keywords】: bioinformatics; deep learning; image analysis; multi-task learning; transfer learning
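
【Illustrative sketch】: The transfer-learning step described above, reduced to its simplest form in Python with PyTorch/torchvision (not the authors' code): take a convolutional network of the kind trained on millions of natural images, drop its classification layer, and use the remaining layers as a feature extractor for ISH images. The backbone choice (ResNet-18), the random weights, and the tensor shapes are placeholder assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# In practice the backbone would carry weights pre-trained on a large natural-image set;
# here it is built with random weights purely to show the feature-extraction pattern.
backbone = models.resnet18()
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
feature_extractor.eval()

with torch.no_grad():
    ish_batch = torch.randn(4, 3, 224, 224)             # stand-ins for preprocessed ISH images
    features = feature_extractor(ish_batch).flatten(1)  # (4, 512) image representations

# `features` can now feed per-CV-term classifiers, or the backbone can be
# fine-tuned jointly on labeled ISH images in a multi-task setup.
```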
【Paper Link】 【Pages】:1485-1494
【Authors】: Yutao Zhang ; Jie Tang ; Zhilin Yang ; Jian Pei ; Philip S. Yu
【Abstract】: More often than not, people are active in more than one social network. Identifying users from multiple heterogeneous social networks and integrating the different networks is a fundamental issue in many applications. The existing methods tackle this problem by estimating pairwise similarity between users in two networks. However, those methods suffer from potential inconsistency of matchings between multiple networks. In this paper, we propose COSNET (COnnecting heterogeneous Social NETworks with local and global consistency), a novel energy-based model, to address this problem by considering both local and global consistency among multiple networks. An efficient subgradient algorithm is developed to train the model by converting the original energy-based objective function into its dual form. We evaluate the proposed model on two different genres of data collections: SNS and Academia, each consisting of multiple heterogeneous social networks. Our experimental results validate the effectiveness and efficiency of the proposed model. On both data collections, the proposed COSNET method significantly outperforms several alternative methods by up to 10-30% (p << 0.001, t-test) in terms of F1-score. We also demonstrate that applying the integration results produced by our method can improve the accuracy of expert finding, an important task in social networks.
【Keywords】: energy-based model; network integration; social network
【Paper Link】 【Pages】:1495-1502
【Authors】: Huasha Zhao ; Biye Jiang ; John F. Canny ; Bobby Jaros
【Abstract】: Gibbs sampling is a workhorse for Bayesian inference but has several limitations when used for parameter estimation, and is often much slower than non-sampling inference methods. SAME (State Augmentation for Marginal Estimation) [15, 8] is an approach to MAP parameter estimation which gives improved parameter estimates over direct Gibbs sampling. SAME can be viewed as cooling the posterior parameter distribution and allows annealed search for the MAP parameters, often yielding very high quality estimates. But it does so at the expense of additional samples per iteration and generally slower performance. On the other hand, SAME dramatically increases the parallelism in the sampling schedule, and is an excellent match for modern (SIMD) hardware. In this paper we explore the application of SAME to graphical model inference on modern hardware. We show that combining SAME with factored sample representation (or approximation) gives throughput competitive with the fastest symbolic methods, but with potentially better quality. We describe experiments on Latent Dirichlet Allocation, achieving speeds similar to the fastest reported methods (online Variational Bayes) and lower cross-validated loss than other LDA implementations. The method is simple to implement and should be applicable to many other models.
【Keywords】: gibbs sampling; latent dirichlet allocation; simulated annealing
【Paper Link】 【Pages】:1503-1512
【Authors】: Liang Zhao ; Qian Sun ; Jieping Ye ; Feng Chen ; Chang-Tien Lu ; Naren Ramakrishnan
【Abstract】: Spatial event forecasting from social media is an important problem but encounters critical challenges, such as dynamic patterns of features (keywords) and geographic heterogeneity (e.g., spatial correlations, imbalanced samples, and different populations in different locations). Most existing approaches (e.g., LASSO regression, dynamic query expansion, and burst detection) are designed to address some of these challenges, but not all of them. This paper proposes a novel multi-task learning framework which aims to concurrently address all the challenges. Specifically, given a collection of locations (e.g., cities), we propose to build forecasting models for all locations simultaneously by extracting and utilizing appropriate shared information that effectively increases the sample size for each location, thus improving the forecasting performance. We combine both static features derived from a predefined vocabulary by domain experts and dynamic features generated from dynamic query expansion in a multi-task feature learning framework; we investigate different strategies to balance homogeneity and diversity between static and dynamic terms. Efficient algorithms based on Iterative Group Hard Thresholding are developed to achieve efficient and effective model training and prediction. Extensive experimental evaluations on Twitter data from four different countries in Latin America demonstrated the effectiveness of our proposed approach.
【Keywords】: dynamic query expansion; event forecasting; hard thresholding; lasso; multi-task learning
【Paper Link】 【Pages】:1513-1522
【Authors】: Qingyuan Zhao ; Murat A. Erdogdu ; Hera Y. He ; Anand Rajaraman ; Jure Leskovec
【Abstract】: Social networking websites allow users to create and share content. Big information cascades of post resharing can form as users of these sites reshare others' posts with their friends and followers. One of the central challenges in understanding such cascading behaviors is in forecasting information outbreaks, where a single post becomes widely popular by being reshared by many users. In this paper, we focus on predicting the final number of reshares of a given post. We build on the theory of self-exciting point processes to develop a statistical model that allows us to make accurate predictions. Our model requires no training or expensive feature engineering. It results in a simple and efficiently computable formula that allows us to answer questions, in real-time, such as: Given a post's resharing history so far, what is our current estimate of its final number of reshares? Is the post resharing cascade past the initial stage of explosive growth? And, which posts will be the most reshared in the future? We validate our model using one month of complete Twitter data and demonstrate a strong improvement in predictive accuracy over existing approaches. Our model gives only 15% relative error in predicting final size of an average information cascade after observing it for just one hour.
【Keywords】: cascade prediction; contagion; information diffusion; self-exciting point process; social media
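
【Illustrative sketch】: A toy Python version of the self-exciting idea above (not the paper's estimator): each observed reshare i at time t_i by a user with n_i followers contributes future intensity through a memory kernel, and the final size is extrapolated as the reshares seen so far plus the expected direct descendants of those reshares. The exponential kernel, its decay rate, and the infectiousness estimate are simplifying assumptions, and second-generation reshares are ignored.

```python
import numpy as np

def predict_final_size(times, followers, t_now, p_hat, theta=0.1):
    """times: reshare timestamps observed so far (hours); followers: follower count of
    each resharer; p_hat: estimated per-exposure reshare probability; theta: decay rate
    of an exponential reaction-time kernel. Returns a rough final-size estimate."""
    times = np.asarray(times, dtype=float)
    followers = np.asarray(followers, dtype=float)
    # fraction of each resharer's exposures that have NOT yet had a chance to react
    remaining = np.exp(-theta * (t_now - times))
    expected_future = p_hat * np.sum(followers * remaining)
    return len(times) + expected_future

# one hour of observation: 3 reshares by users with 120, 4000 and 35 followers
print(predict_final_size(times=[0.05, 0.3, 0.8], followers=[120, 4000, 35],
                         t_now=1.0, p_hat=0.002))
```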
【Paper Link】 【Pages】:1523-1532
【Authors】: Xun Zheng ; Yaoliang Yu ; Eric P. Xing
【Abstract】: Topic models are effective probabilistic tools for processing large collections of unstructured data. With the exponential growth of modern industrial data, and consequentially also with our ambition to explore much bigger models, there is a real pressing need to significantly scale up topic modeling algorithms, which has been taken up in lots of previous works, culminating in the recent fast Markov chain Monte Carlo sampling algorithms in [10, 23] for the unsupervised latent Dirichlet allocation (LDA) formulations. In this work we extend the recent sampling advances for unsupervised LDA models to supervised tasks. We focus on the Gibbs MedLDA model [27] that is able to simultaneously discover latent structures and make accurate predictions. By combining a set of sampling techniques we are able to reduce the O(K^3 + DK^2 + DNK) complexity in [27] to O(DK + DN) when there are K topics and D documents with average length N. To our best knowledge, this is the first linear time sampling algorithm for supervised topic models. Our algorithm requires minimal modifications to incorporate most loss functions in a variety of supervised tasks, and we observe in our experiments an order of magnitude speedup over the current state-of-the-art implementation, while achieving similar prediction performances. The open-source C++ implementation of the proposed algorithm is available at https://github.com/xunzheng/light_medlda.
【Keywords】: inference; large margin classification; mcmc; regression; scale mixtures; topic models
【Paper Link】 【Pages】:1533-1542
【Authors】: Yan Zheng ; Jeff M. Phillips
【Abstract】: Kernel density estimates are a robust way to reconstruct a continuous distribution from a discrete point set. Typically their effectiveness is measured either in L1 or L2 error. In this paper we investigate the challenges in using L∞ (or worst case) error, a stronger measure than L1 or L2. We present efficient solutions to two linked challenges: how to evaluate the L∞ error between two kernel density estimates and how to choose the bandwidth parameter for a kernel density estimate built on a subsample of a large data set.
【Keywords】: bandwidth selection; coresets; kernel density estimates
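
【Illustrative sketch】: A brute-force Python baseline for the two linked tasks described above (the paper's contribution is doing this efficiently and with guarantees): approximate the L∞ error between two kernel density estimates by evaluating both on a dense grid, and pick the subsample bandwidth that minimises that error against the full-data KDE. The Gaussian data, grid resolution, and candidate bandwidth factors are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
full_data = rng.normal(size=5000)
subsample = rng.choice(full_data, size=400, replace=False)

grid = np.linspace(full_data.min() - 1, full_data.max() + 1, 4000)
kde_full = gaussian_kde(full_data)

def linf_error(bandwidth_factor):
    """Grid approximation of sup_x |kde_full(x) - kde_sub(x)| for a given bandwidth."""
    kde_sub = gaussian_kde(subsample, bw_method=bandwidth_factor)
    return float(np.max(np.abs(kde_full(grid) - kde_sub(grid))))

candidates = np.linspace(0.05, 1.0, 20)
best_bandwidth = min(candidates, key=linf_error)
print(best_bandwidth)
```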
【Paper Link】 【Pages】:1543-1552
【Authors】: Shi Zhi ; Bo Zhao ; Wenzhu Tong ; Jing Gao ; Dian Yu ; Heng Ji ; Jiawei Han
【Abstract】: When integrating information from multiple sources, it is common to encounter conflicting answers to the same question. Truth discovery is to infer the most accurate and complete integrated answers from conflicting sources. In some cases, there exist questions for which the true answers are excluded from the candidate answers provided by all sources. Without any prior knowledge, these questions, named no-truth questions, are difficult to distinguish from the questions that have true answers, named has-truth questions. In particular, these no-truth questions degrade the precision of the answer integration system. We address such a challenge by introducing source quality, which is made up of three fine-grained measures: silent rate, false spoken rate and true spoken rate. By incorporating these three measures, we propose a probabilistic graphical model, which simultaneously infers truth as well as source quality without any a priori training involving ground truth answers. Moreover, since inferring this graphical model requires parameter tuning of the prior of truth, we propose an initialization scheme based upon a quantity named truth existence score, which synthesizes two indicators, namely, participation rate and consistency rate. Compared with existing methods, our method can effectively filter out no-truth questions, which results in more accurate source quality estimation. Consequently, our method provides more accurate and complete answers to both has-truth and no-truth questions. Experiments on three real-world datasets illustrate the notable advantage of our method over existing state-of-the-art truth discovery methods.
【Keywords】: information extraction; information integration system; knowledge base; knowledge graph; probabilistic graphical model; source quality; truth discovery; truth existence; truth finding
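
【Illustrative sketch】: One plausible way to compute the three fine-grained source-quality measures named above from an answer table, written in plain Python; the exact definitions in the paper may differ, and the answer data and labels here are invented. Under this reading, a source is "silent" on a question it does not answer, "true spoken" when its answer matches the currently inferred truth, and "false spoken" otherwise.

```python
def source_quality(answers, truths):
    """answers: {question_id: answer or None} for one source;
    truths: {question_id: currently inferred true answer, or None for no-truth questions}."""
    total = len(truths)
    silent = sum(1 for q in truths if answers.get(q) is None)
    spoken = [(q, answers[q]) for q in truths if answers.get(q) is not None]
    true_spoken = sum(1 for q, a in spoken if truths[q] is not None and a == truths[q])
    false_spoken = len(spoken) - true_spoken
    return {
        "silent_rate": silent / total,
        "true_spoken_rate": true_spoken / total,
        "false_spoken_rate": false_spoken / total,
    }

answers = {"q1": "Paris", "q2": None, "q3": "1999"}
truths = {"q1": "Paris", "q2": "blue", "q3": None}   # q3 is a no-truth question
print(source_quality(answers, truths))
```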
【Paper Link】 【Pages】:1553-1562
【Authors】: Li Zhou ; David G. Andersen ; Mu Li ; Alexander J. Smola
【Abstract】: In this paper we present a novel data structure for sparse vectors based on Cuckoo hashing. It is highly memory efficient and allows for random access at near dense vector level rates. This allows us to solve sparse l1 programming problems exactly and without preprocessing at a cost that is identical to dense linear algebra both in terms of memory and speed. Our approach provides a feasible alternative to the hash kernel and it excels whenever exact solutions are required, such as for feature selection.
【Keywords】: hashing; linear models; sparse vectors
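
【Illustrative sketch】: A minimal pure-Python cuckoo hash map for the nonzero entries of a sparse vector, to make the idea above concrete. It is not the paper's data structure (which is engineered for near dense-vector access rates); the table sizes, eviction limit, and hash seeds are arbitrary choices for illustration.

```python
import random

class CuckooSparseVector:
    """Two-table cuckoo hash storing (index -> value) pairs of a sparse vector."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]
        self.seeds = [random.randrange(1 << 30), random.randrange(1 << 30)]

    def _slot(self, which, key):
        return hash((self.seeds[which], key)) % self.capacity

    def get(self, key, default=0.0):
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default                      # absent index reads as an implicit zero

    def put(self, key, value):
        for which in (0, 1):                # overwrite if the key is already stored
            slot = self._slot(which, key)
            entry = self.tables[which][slot]
            if entry is not None and entry[0] == key:
                self.tables[which][slot] = (key, value)
                return
        item = (key, value)
        for _ in range(32):                 # bounded chain of evictions
            for which in (0, 1):
                slot = self._slot(which, item[0])
                if self.tables[which][slot] is None:
                    self.tables[which][slot] = item
                    return
                # evict the occupant and try to place it in its alternate table
                self.tables[which][slot], item = item, self.tables[which][slot]
        self._grow()                        # likely cycle: rebuild with fresh seeds
        self.put(*item)

    def _grow(self):
        entries = [e for table in self.tables for e in table if e is not None]
        self.__init__(capacity=2 * self.capacity)
        for k, v in entries:
            self.put(k, v)

v = CuckooSparseVector()
for idx, val in [(3, 1.5), (10**7, -2.0), (42, 0.25)]:
    v.put(idx, val)
print(v.get(42), v.get(7))                  # 0.25 0.0
```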
【Paper Link】 【Pages】:1563-1572
【Authors】: Yang Zhou ; Ling Liu ; David Buttler
【Abstract】: Meta paths are good mechanisms to improve the quality of graph analysis on heterogeneous information networks. This paper presents a meta path graph clustering framework, VEPATHCLUSTER, that combines meta path vertex-centric clustering with meta path edge-centric clustering for improving the clustering quality of heterogeneous networks. First, we propose an edge-centric path graph model to capture the meta-path dependencies between pairwise path edges. We model a heterogeneous network containing M types of meta paths as M vertex-centric path graphs and M edge-centric path graphs. Second, we propose a clustering-based multigraph model to capture the fine-grained clustering-based relationships between pairwise vertices and between pairwise path edges. We perform clustering analysis on both a unified vertex-centric path graph and each edge-centric path graph to generate vertex clustering and edge clusterings of the original heterogeneous network respectively. Third, a reinforcement algorithm is provided to tightly integrate vertex-centric clustering and edge-centric clustering by mutually enhancing each other. Finally, an iterative learning strategy is presented to dynamically refine both vertex-centric clustering and edge-centric clustering by continuously learning the contributions and adjusting the weights of different path graphs.
【Keywords】: edge-centric random walk; meta path graph clustering; vertex/edge-centric clustering; vertex/edge-centric path graph/multigraph
【Paper Link】 【Pages】:1573-1582
【Authors】: Wen-Yuan Zhu ; Wen-Chih Peng ; Ling-Jyh Chen ; Kai Zheng ; Xiaofang Zhou
【Abstract】: With the explosion of smartphones and social network services, location-based social networks (LBSNs) are increasingly seen as tools for businesses (e.g., restaurants, hotels) to promote their products and services. In this paper, we investigate the key techniques that can help businesses promote their locations by advertising wisely through the underlying LBSNs. In order to maximize the benefit of location promotion, we formalize it as an influence maximization problem in an LBSN, i.e., given a target location and an LBSN, which set of k users (called seeds) should be targeted with advertising initially such that they can successfully propagate the promotion and attract the most other users to visit the target location. Existing studies have proposed different ways to calculate the information propagation probability, that is how likely a user may influence another, in the setting of static social networks. However, it is more challenging to derive the propagation probability in an LBSN since it is heavily affected by the target location and the user mobility, both of which are dynamic and query dependent. This paper proposes two user mobility models, namely Gaussian-based and distance-based mobility models, to capture the check-in behavior of an individual LBSN user, based on which location-aware propagation probabilities can be derived respectively. Extensive experiments based on two real LBSN datasets have demonstrated the superior effectiveness of our proposals over existing static models of propagation probabilities in truly reflecting the information propagation in LBSNs.
【Keywords】: check-in behavior; influence maximization; location-based social network; propagation probability
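
【Illustrative sketch】: The downstream seed-selection step, shown as the standard greedy algorithm under an independent cascade model in Python. The paper's contribution is deriving the location-aware edge probabilities prob[(u, v)] from the Gaussian-based or distance-based mobility models; here those probabilities, the toy graph, and the budget are simply assumed as given numbers.

```python
import random

def simulate_spread(graph, seeds, prob, rounds=200):
    """Monte Carlo estimate of the expected number of users reached under an
    independent cascade with edge probabilities prob[(u, v)]."""
    total = 0
    for _ in range(rounds):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and random.random() < prob[(u, v)]:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / rounds

def greedy_seeds(graph, prob, k):
    """Pick k seeds by greedily maximizing the marginal gain in simulated spread."""
    seeds = []
    for _ in range(k):
        remaining = [u for u in graph if u not in seeds]
        best = max(remaining, key=lambda u: simulate_spread(graph, seeds + [u], prob))
        seeds.append(best)
    return seeds

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
prob = {(u, v): 0.3 for u in graph for v in graph[u]}
print(greedy_seeds(graph, prob, k=2))
```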
【Paper Link】 【Pages】:1583-1592
【Authors】: Yada Zhu ; Hongxia Yang ; Jingrui He
【Abstract】: This paper targets the problem of cargo pricing optimization in the air cargo business. Given the features associated with a pair of origination and destination, how can we simultaneously predict both the optimal price for the bid stage and the outcome of the transaction (win rate) in the decision stage? In addition, it is often the case that the matrix representing pairs of originations and destinations has a block structure, i.e., the originations and destinations can be co-clustered such that the predictive models are similar within the same co-cluster, and exhibit significant variation among different co-clusters. How can we uncover the co-clusters of originations and destinations while constructing the dual predictive models for the two stages? We take the first step at addressing these problems. In particular, we propose a probabilistic framework to simultaneously construct dual predictive models and uncover the co-clusters of originations and destinations. It maximizes the conditional probability of observing the responses from both the quotation stage and the decision stage, given the features and the co-clusters. By introducing an auxiliary distribution based on the co-clustering assumption, such conditional probability can be converted into an objective function. To minimize the objective function, we propose the cocoa algorithm, which will generate both the suite of predictive models for all the pairs of originations and destinations, as well as the co-clusters consisting of similar pairs. Experimental results on both synthetic data and real data from cargo price bidding demonstrate the effectiveness and efficiency of the proposed algorithm.
【Keywords】: co-clustering; dual predictive models
【Paper Link】 【Pages】:1593-1602
【Authors】: Honglei Zhuang ; Aditya G. Parameswaran ; Dan Roth ; Jiawei Han
【Abstract】: Crowdsourcing is the de-facto standard for gathering annotated data. While, in theory, data annotation tasks are assumed to be attempted by workers independently, in practice, data annotation tasks are often grouped into batches to be presented and annotated by workers together, in order to save on the time or cost overhead of providing instructions or necessary background. Thus, even though independence is usually assumed between annotations on data items within the same batch, in most cases, a worker's judgment on a data item can still be affected by other data items within the batch, leading to additional errors in collected labels. In this paper, we study the data annotation bias when data items are presented as batches to be judged by workers simultaneously. We propose a novel worker model to characterize the annotating behavior on data batches, and present how to train the worker model on annotation data sets. We also present a debiasing technique to remove the effect of such annotation bias from adversely affecting the accuracy of labels obtained. Our experimental results on synthetic and real-world data sets demonstrate that our proposed method can achieve up to +57% improvement in F1-score compared to the standard majority voting baseline.
【Keywords】: annotation bias; crowdsourcing; worker accuracy model
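
【Illustrative sketch】: The majority-voting baseline that the debiasing method above is compared against, in a few lines of Python. Annotations are assumed to arrive as (item, worker, label) triples; the batch structure that induces the bias is deliberately ignored here, which is exactly what the paper's worker model corrects for.

```python
from collections import Counter, defaultdict

def majority_vote(annotations):
    """annotations: iterable of (item_id, worker_id, label). Returns {item_id: label}."""
    votes = defaultdict(Counter)
    for item, _worker, label in annotations:
        votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

annotations = [("doc1", "w1", 1), ("doc1", "w2", 1), ("doc1", "w3", 0),
               ("doc2", "w1", 0), ("doc2", "w3", 0)]
print(majority_vote(annotations))   # {'doc1': 1, 'doc2': 0}
```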
【Paper Link】 【Pages】:1603-1612
【Authors】: Kostas Zoumpatianos ; Yin Lou ; Themis Palpanas ; Johannes Gehrke
【Abstract】: Data series are a prevalent data type that has attracted lots of interest in recent years. Most of the research has focused on how to efficiently support similarity or nearest neighbor queries over large data series collections (an important data mining task), and several data series summarization and indexing methods have been proposed in order to solve this problem. Nevertheless, up to this point very little attention has been paid to properly evaluating such index structures, with most previous work relying solely on randomly selected data series to use as queries (with/without adding noise). In this work, we show that random workloads are inherently not suitable for the task at hand and we argue that there is a need for carefully generating a query workload. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, effectively evaluating and comparing data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections.
【Keywords】: data series; indexing; similarity search; workloads
【Paper Link】 【Pages】:1621
【Authors】: Deepak Agarwal
【Abstract】: Scaling web applications like recommendation systems, search and computational advertising is challenging. Such systems have to make an astronomical number of decisions every day on what to serve users when they are visiting the website and/or using the mobile app. Machine learning and statistical modeling approaches that can obtain insights by continuously processing large amounts of data emitted at very high frequency by these applications have emerged as the method of choice. However, there are three challenges to scaling such methods: (a) scientific, (b) infrastructure, and (c) organizational. I will provide an overview of these challenges and the strategies we have adopted at LinkedIn to address them. Throughout, I will illustrate with examples from real-world applications at LinkedIn.
【Keywords】: machine learning; statistical modeling; web applications
【Paper Link】 【Pages】:1623
【Authors】: Amr Awadallah
【Abstract】: As Hadoop and the surrounding projects & vendors mature, their impact on the data management sector is growing. Amr will talk about his views on how that impact will change over the next five years. How central will Hadoop be to the data center of 2020? What industries will benefit most? Which technologies are at risk of displacement or encroachment?
【Keywords】: data center technologies; data management; hadoop
【Paper Link】 【Pages】:1625
【Authors】: Vasant Dhar
【Abstract】: Computers are making more and more decisions for us, and increasingly so in areas that require human judgment. There is a palpable increase in machine intelligence across the touch points of our lives, driven by the proliferation of data feeding into intelligent algorithms capable of learning useful patterns and acting on them. A natural question to ask is how we should be thinking about the role of computers in managing our money. Should we trust our money to a robot? In an era of big data and machines to make sense of it all, do machines have an inherent advantage over humans? There is a surge of interest in Artificial Intelligence for financial prediction. Should we pay attention? Or is this an area where human judgment and input is always essential?
【Keywords】: artificial intelligence; computational finance; quantitative trading
【Paper Link】 【Pages】:1627
【Authors】: Waqar Hasan ; Min Wang
【Abstract】: Visa is the payments technology that forms the backbone of the world's financial systems by handling more than 7 trillion dollars of payments annually and our data reflects how the world spends money. We will describe technical achievements we have made in the area of fraud and cover some open challenges in data science.
【Keywords】: data science
【Paper Link】 【Pages】:1629
【Authors】: George John
【Abstract】: In 2008, Rocket Fuel's founders saw a gap in the digital advertising market. None of the existing players were building autonomous systems based on big data and artificial intelligence, but instead they were offering fairly simple technology and relying on human campaign managers to drive success. Five years later in 2013, Rocket Fuel had the best technology IPO of the year on NASDAQ, reported $240 million in revenue, and was ranked by accounting firm Deloitte as the #1 fastest-growing technology company in North America. Along the way we learned that it's okay to be bold in our expectations of what is possible with fully autonomous systems, we learned that mainstream customers will buy advanced technology if it's delivered in a familiar way, and we also learned that it's incredibly difficult to debug the complex "robot psychology" when a number of complex autonomous systems interact. We also had excellent luck and timing: as we were building the company, real-time ad impression-level auctions with machine-to-machine buying and selling became commonplace, and marketers became increasingly focused on delivering better results for their company and delivering better personalized and relevant digital experiences for their customers. The case study presentation will present a fast-paced overview of the business and technology context for Rocket Fuel at inception and at present, key learnings and decisions, and the road ahead.
【Keywords】: advertising; artificial intelligence; big data analytics; computational advertising; real-time bidding
【Paper Link】 【Pages】:1631
【Authors】: Anil Kamath
【Abstract】: Is my marketing working? How much is marketing helping the business, and which campaigns and channels are effective? The key challenge for the Chief Marketing Officer is to tie investment in marketing to business results. In an increasingly complex marketing environment, marketing organizations are being called upon to prove and optimize the return on marketing investment across different paid, earned, and owned marketing channels. In this talk we will show how data science and optimization techniques can be applied to cross-channel data to attribute marketing effectiveness, drive media planning and real-time optimization of campaigns. Using terabytes of multi-channel data, we answer questions such as: what is the impact of different marketing campaigns on our business, how should we allocate our marketing dollars between different channels, when should we spend them, and how do we execute our marketing campaigns based on the synergies of the different channels.
【Keywords】: data science; marketing mix modeling; optimization
【Paper Link】 【Pages】:1633
【Authors】: Bassel Ojjeh
【Abstract】: Financial services and healthcare companies could be the biggest beneficiaries of big data. Their real-time decision engines can be vastly improved by leveraging the latest advances in big data analytics. However, these companies are challenged in leveraging Open Software Systems (OSS). This presentation covers how, in collaboration with financial services and healthcare institutions, we built an OSS project to deliver a real-time decisioning engine for their respective applications. I will address two key issues. First, I will describe the strategy behind our hiring process to attract millennial big data developers and the results of this endeavor. Second, I will recount the collaboration effort that we had with our large clients and the various milestones we achieved during that process. I will explain the goals regarding big data analysis that our large clients presented to us and how we accomplished those goals. In particular, I will discuss how we leveraged open source to deliver a real-time decisioning software product called Fatafat to these institutions. An advantage of developing applications in Fatafat is that it is already integrated with Hadoop, Kafka for real-time data streaming, HBase and Cassandra for NoSQL data storage. I will talk about how these companies benefited from Fatafat and some of the challenges we had in the design of this software. I will provide quantifiable improvements in key metrics driven by Fatafat, as well as interesting, unsolved problems/challenges that need to be addressed for faster and wider adoption of OSS by these companies.
【Keywords】: financial services; healthcare; open source software; real-time decision engines
【Paper Link】 【Pages】:1635
【Authors】: Joseph Sirosh
【Abstract】: Several exciting trends are driving the birth of the intelligent cloud. The vast majority of world's data is now connected data resident in the cloud. The majority of world's new software is now connected software, also resident in or using the cloud. New cloud based Machine Learning as a Service platforms help transform data into intelligence and build cloud-hosted intelligent APIs for connected software applications. Face analysis, computer vision, text analysis, speech recognition, and more traditional analytics such as churn prediction, recommendations, anomaly detection, forecasting, and clustering are all available now as cloud APIs, and far more are being created at a rapid pace. Cloud hosted marketplaces for crowdsourcing intelligent APIs have been launched. In this talk I will review what these trends mean for the future of data science and show examples of revolutionary applications that you can build using cloud platforms.
【Keywords】: cloud computing; intelligent apis; machine learning as a service
【Paper Link】 【Pages】:1637
【Authors】: Christopher White
【Abstract】: DARPA has been investing in data science and building open source tools for applications ranging from counter threat finance, through radar operations and cancer research, to anti-human trafficking. This presentation will cover previous work at DARPA, experience building real-world applications for defense and law enforcement to analyze data, and the future of computer science as an enabler for content discovery, information extraction, relevance determination, and information visualization. The talk will be a mix of background, detailed examples, and software demonstration. It will cover the importance of anchoring in applications, minimization of design-to-testing time, development with users-in-the-loop, error tolerance of machine learning, design for diverse user populations, and the necessity of open source software integration. It will conclude by covering a few next directions for special projects at Microsoft.
【Keywords】: data mining applications; open source tools
【Paper Link】 【Pages】:1639
【Authors】: Qiang Yang
【Abstract】: It is extremely important in many application domains to have accurate models of user behavior. Data mining allows user models to be constructed based on vast available data automatically. User modeling has found applications in mobile APP recommendations, social networking, financial product marketing and customer service in telecommunications. Successful user modeling should be aware of several critical issues: who are the target users? How should the solutions be updated when new data come in? How should user feedback be handled? What are the "pain" points of users? In this talk, I will discuss my own experience on user modeling with big data. I will draw examples from telecommunications and the Internet industry, contrasting and highlighting some lessons learned in these industries.
【Keywords】: internet; recommendations; telecommunications; user modeling
【Paper Link】 【Pages】:1641-1650
【Authors】: Panagiotis Adamopoulos ; Vilma Todri
【Abstract】: This paper studies a novel social media venture and seeks to understand the effectiveness of marketing strategies in social media platforms by evaluating their impact on participating brands and organizations. We use a real-world data set and employ a promising research approach combining econometric with predictive modeling techniques in a causal estimation framework that allows for more accurate counterfactuals. Based on the results of the presented analysis and focusing on the long-term business value of marketing strategies in social media, we find that promotional events leveraging implicit or explicit advocacy in social media platforms result in significant abnormal returns for the participating brand, in terms of expanding the social media fan base of the firm. The effect is also economically significant as it corresponds to an increase of several thousand additional new followers per day for an average size brand. We also precisely quantify the impact of various promotion characteristics and demonstrate what types of promotions are more effective and for which brands, while suggesting specific tactical strategies. For instance, despite the competition for consumers' attention, brands and marketers should broadcast marketing messages on social networks during the time of peak usage in order to maximize their returns. Overall, we provide actionable insights with major implications for firms and social media platforms and contribute to the related literature as we discover new rich findings enabled by the employed causal estimation framework.
【Keywords】: business analytics; business value; causal inference; counterfactual; event study; social commerce; social media
【Paper Link】 【Pages】:1651-1660
【Authors】: Deepak Agarwal ; Bee-Chung Chen ; Qi He ; Zhenhao Hua ; Guy Lebanon ; Yiming Ma ; Pannagadatta Shivaswamy ; Hsiao-Ping Tseng ; Jaewon Yang ; Liang Zhang
【Abstract】: LinkedIn dynamically delivers update activities from a user's interpersonal network to more than 300 million members in the personalized feed that ranks activities according to their "relevance" to the user. This paper discloses the implementation details behind this personalized feed system at LinkedIn which cannot be found in related work, and addresses the scalability and data sparsity challenges for deploying the system online. More specifically, we focus on the personalization models by generating three kinds of affinity scores: Viewer-ActivityType Affinity, Viewer-Actor Affinity, and Viewer-Actor-ActivityType Affinity. Extensive experiments based on online bucket tests (A/B experiments) and offline evaluation illustrate the effect of our personalization models in the LinkedIn feed.
【Keywords】: feed relevance; large scale learning; personalization
【Paper Link】 【Pages】:1661-1670
【Authors】: Rakesh Agrawal ; Behzad Golshan ; Evangelos E. Papalexakis
【Abstract】: Access to diverse perspectives nurtures an informed citizenry. Google and Bing have emerged as the duopoly that largely arbitrates which English language documents are seen by web searchers. A recent study shows that there is now a large overlap in the top organic search results produced by them. Thus, citizens may no longer be able to gain different perspectives by using different search engines. We present the results of our empirical study that indicates that by mining Twitter data one can obtain search results that are quite distinct from those produced by Google and Bing. Additionally, our user study found that these results were quite informative. The gauntlet is now on search engines to test whether our findings hold in their infrastructure for different social networks and whether enabling diversity has sufficient business imperative for them.
【Keywords】: bing; google; search engine; search result comparison; social media search; twitter; web search
【Paper Link】 【Pages】:1671-1680
【Authors】: Marco Arlorio ; Jean Daniel Coïsson ; Giorgio Leonardi ; Monica Locatelli ; Luigi Portinale
【Abstract】: This paper discusses the data mining approach followed in a project called TRAQUASwine, aimed at the definition of methods for data analytical assessment of the authenticity and protection, against fake versions, of some of the highest value Nebbiolo-based wines from the Piedmont region in Italy. This is a big issue in the wine market, where commercial frauds related to such products are estimated to be worth millions of Euros. The objective is twofold: to show that the problem can be addressed without expensive and hyper-specialized wine analyses, and to demonstrate the actual usefulness of classification algorithms for data mining on the resulting chemical profiles. Following Wagstaff's proposal for practical exploitation of machine learning (and data mining) approaches, we describe how data have been collected and prepared for the production of different datasets, how suitable classification models have been identified and how the interpretation of the results suggests the emergence of an active role of classification techniques, based on standard chemical profiling, for the assessment of the authenticity of the wines targeted by the study.
【Keywords】: compliance and fraud; multi-label and multi-class learning
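
【Illustrative sketch】: A scikit-learn pipeline of the kind such a study evaluates, in Python: standard chemical-profile features go through scaling and a classifier, assessed with cross-validated F1. The random-forest choice, the synthetic feature matrix, and the label encoding (1 = authentic target wine) are placeholders; the paper compares several classification models on real chemical profiles.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))     # one row per bottle: standard chemical profiling measurements
y = rng.integers(0, 2, size=120)   # 1 = authentic target wine, 0 = other / fake

model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=200, random_state=0))
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(scores.mean())
```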
【Paper Link】 【Pages】:1681-1690
【Authors】: Yukino Baba ; Hisashi Kashima ; Yasunobu Nohara ; Eiko Kai ; Partha Pratim Ghosh ; Rafiqul Islam Maruf ; Ashir Ahmed ; Masahiro Kuroda ; Sozo Inoue ; Tatsuo Hiramatsu ; Michio Kimura ; Shuji Shimizu ; Kunihisa Kobayashi ; Koji Tsuda ; Masashi Sugiyama ; Mathieu Blondel ; Naonori Ueda ; Masaru Kitsuregawa ; Naoki Nakashima
【Abstract】: Non-communicable diseases (NCDs) are no longer just a problem for high-income countries, but they are also a problem that affects developing countries. Preventive medicine is definitely the key to combat NCDs; however, the cost of preventive programs is a critical issue affecting the popularization of these medicine programs in developing countries. In this study, we investigate predictive modeling for providing a low-cost preventive medicine program. In our two-year-long field study in Bangladesh, we collected the health checkup results of 15,075 subjects, the data of 6,607 prescriptions, and the follow-up examination results of 2,109 subjects. We address three prediction problems, namely subject risk prediction, drug recommendation, and future risk prediction, by using machine learning techniques; our multiple-classifier approach successfully reduced the costs of health checkups, a multi-task learning method provided accurate recommendation for specific types of drugs, and an active learning method achieved an efficient assignment of healthcare workers for the follow-up care of subjects.
【Keywords】: data mining; healthcare; preventive medicine
【Paper Link】 【Pages】:1691-1700
【Authors】: Senjuti Basu Roy ; Ankur Teredesai ; Kiyana Zolfaghar ; Rui Liu ; David Hazel ; Stacey Newman ; Albert Marinez
【Abstract】: Congestive Heart Failure (CHF) is a serious chronic condition often leading to 50% mortality within 5 years. Improper treatment and post-discharge care of CHF patients leads to frequent repeat hospitalizations (i.e., readmissions). Accurately predicting a patient's risk-of-readmission enables care-providers to plan resources, perform factor analysis, and improve patient quality of life. In this paper, we describe a supervised learning framework, Dynamic Hierarchical Classification (DHC), for patient risk-of-readmission prediction. Learning the hierarchy of classifiers is often the most challenging component of such classification schemes. The novelty of our approach is to algorithmically generate various layers and combine them to predict overall 30-day risk-of-readmission. While the components of DHC are generic, in this work, we focus on congestive heart failure (CHF), a pressing chronic condition. Since healthcare data is diverse and rich and each source and feature-subset provides different insights into a complex problem, our DHC based prediction approach intelligently leverages each source and feature-subset to optimize different objectives (such as, Recall or AUC) for CHF risk-of-readmission. DHC's algorithmic layering capability is trained and tested over two real world datasets and is currently integrated into the clinical decision support tools at MultiCare Health System (MHS), a major provider of healthcare services in the northwestern US. It is integrated into a QlikView App (with EMR integration planned for Q2) and currently scores patients every day, helping to mitigate readmissions and improve quality of care, leading to healthier outcomes and cost savings.
【Keywords】: deployed solutions; healthcare; hierarchical classification model
【Paper Link】 【Pages】:1701-1710
【Authors】: Josep Lluis Berral ; Nicolás Poggi ; David Carrera ; Aaron Call ; Rob Reinauer ; Daron Green
【Abstract】: This article presents ALOJA-Machine Learning (ALOJA-ML), an extension to the ALOJA project that uses machine learning techniques to interpret Hadoop benchmark performance data and performance tuning; here we detail the approach, efficacy of the model and initial results. The ALOJA-ML project is the latest phase of a long-term collaboration between BSC and Microsoft, to automate the characterization of cost-effectiveness on Big Data deployments, focusing on Hadoop. Hadoop presents a complex execution environment, where costs and performance depend on a large number of software (SW) configurations and on multiple hardware (HW) deployment choices. Recently the ALOJA project presented an open, vendor-neutral repository, featuring over 16,000 Hadoop executions. These results are accompanied by a test bed and tools to deploy and evaluate the cost-effectiveness of the different hardware configurations, parameter tunings, and Cloud services. Despite early success within ALOJA from expert-guided benchmarking, it became clear that a genuinely comprehensive study requires automation of modeling procedures to allow a systematic analysis of large and resource-constrained search spaces. ALOJA-ML provides such an automated system allowing knowledge discovery by modeling Hadoop executions from observed benchmarks across a broad set of configuration parameters. The resulting empirically-derived performance models can be used to forecast execution behavior of various workloads; they allow a-priori prediction of the execution times for new configurations and HW choices and they offer a route to model-based anomaly detection. In addition, these models can guide the benchmarking exploration efficiently, by automatically prioritizing candidate future benchmark tests. Insights from ALOJA-ML's models can be used to reduce the operational time on clusters, speed-up the data acquisition and knowledge discovery process, and importantly, reduce running costs. In addition to learning from the methodology presented in this work, the community can benefit in general from ALOJA data-sets, framework, and derived insights to improve the design and deployment of Big Data applications.
【Keywords】: data-center management; execution experiences; hadoop; machine learning; modeling and prediction
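
【Illustrative sketch】: The modelling step described above reduced to a short Python example: learn a regression model that maps Hadoop software/hardware configuration features to observed execution time, then use it for a-priori prediction on an unseen configuration. The feature names, synthetic data, and the random-forest regressor are assumptions for illustration, not the ALOJA-ML models themselves.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# hypothetical columns: mappers, io_buffer, compression, ssd, cluster_size -- all scaled to [0, 1]
X = rng.uniform(size=(2000, 5))
y = 300 + 200 * X[:, 0] - 150 * X[:, 3] + 30 * rng.normal(size=2000)   # runtime in seconds

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))                   # held-out R^2

new_config = np.array([[0.8, 0.2, 1.0, 1.0, 0.5]])   # an untried deployment choice
print(model.predict(new_config))                     # forecast execution time
```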
【Paper Link】 【Pages】:1711-1720
【Authors】: Mirela Madalina Botezatu ; Jasmina Bogojeska ; Ioana Giurgiu ; Hagen Völzer ; Dorothea Wiesmann
【Abstract】: We present a novel technique that optimizes the dispatching of incident tickets to the agents in an IT Service Support Environment. Unlike the common skill-based dispatching, our approach also takes empirical evidence on the agent's speed from historical data into account. Our solution consists of two parts. First, a novel technique clusters historic tickets into incident categories that are discriminative in terms of agent's performance. Second, a dispatching policy selects, for an incoming ticket, the fastest available agent according to the target cluster. We show that, for ticket data collected from several Service Delivery Units, our new dispatching technique can reduce service time by between 35% and 44%.
【Keywords】: combined affinity matrix; fuzzy clustering; ticket clustering; graph cut; spectral clustering; ticket dispatching
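
【Illustrative sketch】: The dispatching policy described above, stripped to a small pandas example in Python: estimate each agent's historical speed per incident category (cluster), then route an incoming ticket to the fastest available agent for its category. The ticket history, category names, and fallback rule are invented for illustration; the clustering step that produces the categories is not shown.

```python
import pandas as pd

history = pd.DataFrame({                 # resolved tickets with their service times
    "cluster": ["network", "network", "storage", "storage", "network"],
    "agent":   ["a1", "a2", "a1", "a3", "a2"],
    "hours":   [4.0, 2.5, 6.0, 3.0, 3.5],
})
speed = history.groupby(["cluster", "agent"])["hours"].mean()

def dispatch(ticket_cluster, available_agents):
    """Return the available agent with the lowest mean service time for this category."""
    ranked = speed.loc[ticket_cluster].reindex(available_agents).dropna()
    return ranked.idxmin() if not ranked.empty else available_agents[0]

print(dispatch("network", ["a1", "a2"]))   # -> 'a2' (mean 3.0h vs 4.0h for 'a1')
```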
【Paper Link】 【Pages】:1721-1730
【Authors】: Rich Caruana ; Yin Lou ; Johannes Gehrke ; Paul Koch ; Marc Sturm ; Noemie Elhadad
【Abstract】: In machine learning often a tradeoff must be made between accuracy and intelligibility. More accurate models such as boosted trees, random forests, and neural nets usually are not intelligible, but more intelligible models such as logistic regression, naive-Bayes, and single decision trees often have significantly worse accuracy. This tradeoff sometimes limits the accuracy of models that can be applied in mission-critical applications such as healthcare where being able to understand, validate, edit, and trust a learned model is important. We present two case studies where high-performance generalized additive models with pairwise interactions (GA2Ms) are applied to real healthcare problems yielding intelligible models with state-of-the-art accuracy. In the pneumonia risk prediction case study, the intelligible model uncovers surprising patterns in the data that previously had prevented complex learned models from being fielded in this domain, but because it is intelligible and modular allows these patterns to be recognized and removed. In the 30-day hospital readmission case study, we show that the same methods scale to large datasets containing hundreds of thousands of patients and thousands of attributes while remaining intelligible and providing accuracy comparable to the best (unintelligible) machine learning methods.
【Keywords】: additive models; classification; healthcare; intelligibility; interaction detection; logistic regression; risk prediction
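
【Illustrative sketch】: The spirit of an intelligible additive model, hand-rolled in Python with scikit-learn: fit f(x) = bias + sum_j f_j(x_j) by cyclically boosting shallow per-feature trees on the residuals, so each feature's learned shape function can be inspected on its own. This is a regression-flavoured toy, without the pairwise interaction terms, logistic link, or bagging used by the GA2M models above; data and hyperparameters are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_additive_model(X, y, n_rounds=50, lr=0.1):
    """Fit f(x) = bias + sum_j f_j(x_j) by cyclic gradient boosting, one feature at a time."""
    n, d = X.shape
    bias = y.mean()
    shape_functions = [[] for _ in range(d)]       # fitted shallow trees per feature
    pred = np.full(n, bias)
    for _ in range(n_rounds):
        for j in range(d):
            residual = y - pred
            tree = DecisionTreeRegressor(max_depth=2).fit(X[:, [j]], residual)
            shape_functions[j].append(tree)
            pred += lr * tree.predict(X[:, [j]])
    return bias, shape_functions

def predict(bias, shape_functions, X, lr=0.1):
    pred = np.full(len(X), bias)
    for j, trees in enumerate(shape_functions):
        for tree in trees:
            pred += lr * tree.predict(X[:, [j]])   # each feature contributes additively
    return pred

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)
bias, shapes = fit_additive_model(X, y)
print(np.mean((predict(bias, shapes, X) - y) ** 2))   # training MSE of the additive fit
```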
【Paper Link】 【Pages】:1731-1740
【Authors】: Emily Denton ; Jason Weston ; Manohar Paluri ; Lubomir D. Bourdev ; Rob Fergus
【Abstract】: Understanding the content of users' image posts is a particularly interesting problem in social networks and web settings. Current machine learning techniques focus mostly on curated training sets of image-label pairs, and perform image classification given the pixels within the image. In this work we instead leverage the wealth of information available from users: firstly, we employ user hashtags to capture the description of image content; and secondly, we make use of valuable contextual information about the user. We show how user metadata (age, gender, etc.) combined with image features derived from a convolutional neural network can be used to perform hashtag prediction. We explore two ways of combining these heterogeneous features into a learning framework: (i) simple concatenation; and (ii) a 3-way multiplicative gating, where the image model is conditioned on the user metadata. We apply these models to a large dataset of de-identified Facebook posts and demonstrate that modeling the user can significantly improve the tag prediction quality over current state-of-the-art methods.
【Keywords】: deep learning; hashtagging; large scale image annotation; social media; user modeling
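
【Illustrative sketch】: The two fusion strategies compared above, written out in NumPy (Python): (i) plain concatenation of image features and user-metadata features, and (ii) a factored multiplicative gating in which projected image and user representations are multiplied elementwise, so the image representation is conditioned on the user. The dimensions, random weights, and this particular factorisation are assumptions; the paper's exact gating parameterisation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512))   # image features from a convolutional network (batch of 8)
usr = rng.normal(size=(8, 32))    # user metadata features (age, gender, ...)

# (i) simple concatenation, fed to a downstream tag classifier
fused_concat = np.concatenate([img, usr], axis=1)          # shape (8, 544)

# (ii) multiplicative gating: project both sides, then combine elementwise
d = 128
W_img = rng.normal(scale=0.01, size=(512, d))
W_usr = rng.normal(scale=0.01, size=(32, d))
fused_gated = (img @ W_img) * (usr @ W_usr)                # shape (8, 128)
```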
【Paper Link】 【Pages】:1741-1750
【Authors】: Amit Dhurandhar ; Bruce Graves ; Rajesh Kumar Ravi ; Gopikrishnan Maniachari ; Markus Ettl
【Abstract】: An accredited biennial 2014 study by the Association of Certified Fraud Examiners claims that on average 5% of a company's revenue is lost because of unchecked fraud every year. The reason for such heavy losses is that it takes around 18 months for a fraud to be caught and audits catch only 3% of the actual fraud. This highlights the need for better tools and processes to be able to quickly and cheaply identify potential malefactors. In this paper, we describe a robust tool to identify procurement related fraud/risk, though the general design and the analytical components could be adapted to detecting fraud in other domains. Besides analyzing standard transactional data, our solution analyzes multiple public and private data sources leading to wider coverage of fraud types than what generally exists in the marketplace. Moreover, our approach is more principled in the sense that the learning component, which is based on investigation feedback, has formal guarantees. Though such a tool is ever evolving, a deployment of this tool over the past 12 months has found many interesting cases from a compliance risk and fraud point of view across more than 150 countries and 65000+ vendors, increasing the number of true positives found by over 80% compared with other state-of-the-art tools that the domain experts were previously using.
【Keywords】: big data; collusion; fraud; online learning; procurement; risk; social network
【Paper Link】 【Pages】:1751-1758
【Authors】: Brendan Andrew Duncan ; Charles Peter Elkan
【Abstract】: This paper shows how to learn probabilistic classifiers that model how sales prospects proceed through stages from first awareness to final success or failure. Specifically, we present two models, called DQM for direct qualification model and FFM for full funnel model, that can be used to rank initial leads based on their probability of conversion to a sales opportunity, probability of successful sale, and/or expected revenue. Training uses the large amount of historical data collected by customer relationship management or marketing automation software. The trained models can replace traditional lead scoring systems, which are hand-tuned and therefore error-prone and not probabilistic. DQM and FFM are designed to overcome the selection bias caused by available data being based on a traditional lead scoring system. Experimental results are shown on real sales data from two companies. Features in the training data include demographic and behavioral information about each lead. For both companies, both methods achieve high AUC scores. For one company, they result in a 307% increase in the number of successful sales, as well as a dramatic increase in total revenue. In addition, we describe the results of the DQM method in actual use. These results show that the method has additional benefits that include decreased time needed to qualify leads, and a decreased number of calls placed to schedule a product demo. The proposed methods find high-quality leads earlier in the sales process because they focus on features that measure the fit of potential customers with the product being sold, in addition to their behavior.
【Keywords】: decision trees; gradient boosted trees; machine learning; marketing; predictive lead scoring; sales
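
【Illustrative sketch】: The basic probabilistic lead-ranking idea behind the models above, as a short scikit-learn example in Python: fit a gradient-boosted-tree classifier on historical leads (demographic and behavioral features, label = converted or not) and rank new leads by predicted conversion probability. The synthetic data and the plain classifier are placeholders; DQM/FFM additionally model funnel stages and correct for the selection bias of the legacy scoring system.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                       # demographic + behavioral lead features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 1.5).astype(int)   # 1 = converted

model = GradientBoostingClassifier(random_state=0).fit(X[:4000], y[:4000])
scores = model.predict_proba(X[4000:])[:, 1]          # P(conversion) for unseen leads
priority_order = np.argsort(-scores)                  # contact the highest-scoring leads first
print(priority_order[:10])
```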
【Paper Link】 【Pages】:1759-1768
【Authors】: Varun R. Embar ; Indrajit Bhattacharya ; Vinayaka Pandit ; Roman Vaculín
【Abstract】: Various industries are turning to social media to identify key influencers on topics of interest. Following this trend, the All England Lawn Tennis and Croquet Club (AELTC) is keen to analyze the `social pulse' around the famous Wimbledon Championships. IBM developed and deployed social influence analysis capability for AELTC during the 2014 edition of the Championship. The design and implementation of influence analysis technology in the real world involves several challenges. In this paper, we define various functional and usability criteria that social influence scores should satisfy, and propose a multi-dimensional definition of influence that satisfies these criteria. We highlight the need to identify both all-time influencers and recent influencers, and track user influences over multiple time-scales for this purpose. We also stress the importance of aspect-specific influence analysis, and investigate an approach that uses an aspect hierarchy that annotates tweets with topics or aspects before analyzing them for influence. We also describe interesting insights discovered by our tool and the lessons that we learnt from this engagement.
【Keywords】: aspect-specific; ibm; multi-dimensional; multiple time-scales; online updates; social influence analysis; wimbledon
【Paper Link】 【Pages】:1769-1778
【Authors】: Shobeir Fakhraei ; James R. Foulds ; Madhusudana V. S. Shashanka ; Lise Getoor
【Abstract】: Detecting unsolicited content and the spammers who create it is a long-standing challenge that affects all of us on a daily basis. The recent growth of richly-structured social networks has provided new challenges and opportunities in the spam detection landscape. Motivated by the Tagged.com social network, we develop methods to identify spammers in evolving multi-relational social networks. We model a social network as a time-stamped multi-relational graph where vertices represent users, and edges represent different activities between them. To identify spammer accounts, our approach makes use of structural features, sequence modelling, and collective reasoning. We leverage relational sequence information using k-gram features and probabilistic modelling with a mixture of Markov models. Furthermore, in order to perform collective reasoning and improve the predictive power of a noisy abuse reporting system, we develop a statistical relational model using hinge-loss Markov random fields (HL-MRFs), a class of probabilistic graphical models which are highly scalable. We use Graphlab Create and Probabilistic Soft Logic (PSL) to prototype and experimentally evaluate our solutions on internet-scale data from Tagged.com. Our experiments demonstrate the effectiveness of our approach, and show that models which incorporate the multi-relational nature of the social network significantly gain predictive performance over those that do not.
【Keywords】: collective classification; graph mining; graphlab; heterogeneous networks; k-grams; hinge-loss markov random fields (hl-mrf); multi-relational networks; probabilistic soft logic (psl); sequence mining; social networks; social spam; spam; tree-augmented naive bayes
【Paper Link】 【Pages】:1779-1788
【Authors】: Ronen Feldman ; Oded Netzer ; Aviv Peretz ; Binyamin Rosenfeld
【Abstract】: We present an end-to-end text mining methodology for relation extraction of adverse drug reactions (ADRs) from medical forums on the Web. Our methodology is novel in that it combines three major characteristics: (i) an underlying concept of using a head-driven phrase structure grammar (HPSG) based parser; (ii) domain-specific relation patterns, the acquisition of which is done primarily using unsupervised methods applied to a large, unlabeled text corpus; and (iii) automated post-processing algorithms for enhancing the set of extracted relations. We empirically demonstrate the ability of our proposed approach to predict ADRs prior to their reporting by the Food and Drug Administration (FDA). Put differently, we put our approach to a predictive test by demonstrating that our methodology can credibly point to ADRs that were not uncovered in clinical trials for evaluating new drugs that come to market but were only reported later on by the FDA as a label change.
【Keywords】: adverse drug reactions; hpsg; medical forums; pharmaceutical drugs; text mining
【Paper Link】 【Pages】:1789-1798
【Authors】: Antonino Freno ; Martin Saveski ; Rodolphe Jenatton ; Cédric Archambeau
【Abstract】: Purchase logs collected in e-commerce platforms provide rich information about customer preferences. These logs can be leveraged to improve the quality of product recommendations by feeding them to machine-learned ranking models. However, a variety of deployment constraints limit the naive applicability of machine learning to this problem. First, the amount and the dimensionality of the data make in-memory learning simply not possible. Second, the drift of customers' preferences over time requires retraining the ranking model regularly with freshly collected data. This limits the time that is available for training to prohibitively short intervals. Third, ranking in real-time is necessary whenever the query complexity prevents us from caching the predictions. This constraint requires minimizing prediction time (or equivalently maximizing the data throughput), which in turn may prevent us from achieving the accuracy necessary in web-scale industrial applications. In this paper, we investigate how the practical challenges faced in this setting can be tackled via an online learning to rank approach. Sparse models will be the key to reduce prediction latency, whereas one-pass stochastic optimization will minimize the training time and restrict the memory footprint. Interestingly, and perhaps surprisingly, extensive experiments show that one-pass learning preserves most of the predictive performance. Additionally, we study a variety of online learning algorithms that enforce sparsity and provide insights to help the practitioner make an informed decision about which approach to pick. We report results on a massive purchase log dataset from the Amazon retail website, as well as on several benchmarks from the LETOR corpus.
【Keywords】: one-pass learning; online learning to rank; recommendation systems; sparse models
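
【Illustrative sketch】: A single-pass, sparsity-enforcing online learner in plain Python, to make the constraints above concrete: one stochastic-gradient sweep over the log with per-update L1 soft-thresholding keeps both training time and the weight vector (hence prediction latency) small. This is a pointwise logistic stand-in with made-up data; the paper studies ranking losses and compares several sparse online algorithms.

```python
import math
from collections import defaultdict

def one_pass_sparse_logreg(stream, lr=0.1, l1=1e-4):
    """One SGD pass over (sparse_features, label) pairs with soft-thresholding after
    each update, so weights not supported by the data collapse to exactly zero."""
    w = defaultdict(float)
    for feats, y in stream:                   # feats: {feature_id: value}, y in {0, 1}
        z = sum(w[f] * v for f, v in feats.items())
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y                             # gradient of the logistic loss w.r.t. z
        for f, v in feats.items():
            updated = w[f] - lr * g * v
            shrunk = math.copysign(max(abs(updated) - lr * l1, 0.0), updated)
            if shrunk == 0.0:
                w.pop(f, None)                # keep the model sparse
            else:
                w[f] = shrunk
    return dict(w)

# toy stream, consumed exactly once (one-pass learning)
stream = [({1: 1.0, 7: 0.5}, 1), ({2: 1.0, 7: 0.5}, 0)] * 500
weights = one_pass_sparse_logreg(iter(stream))
print(sorted(weights.items()))                # inspect which features kept nonzero weight
```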
【Paper Link】 【Pages】:1799-1808
【Authors】: Oana Goga ; Patrick Loiseau ; Robin Sommer ; Renata Teixeira ; Krishna P. Gummadi
【Abstract】: Matching the profiles of a user across multiple online social networks brings opportunities for new services and applications as well as new insights on user online behavior, yet it raises serious privacy concerns. Prior literature has shown that it is possible to accurately match profiles, but its evaluation focused only on sampled datasets. In this paper, we study the extent to which we can reliably match profiles in practice, across real-world social networks, by exploiting public attributes, i.e., information users publicly provide about themselves. Today's social networks have hundreds of millions of users, which brings completely new challenges as a reliable matching scheme must identify the correct matching profile out of the millions of possible profiles. We first define a set of properties for profile attributes--Availability, Consistency, non-Impersonability, and Discriminability (ACID)--that are both necessary and sufficient to determine the reliability of a matching scheme. Using these properties, we propose a method to evaluate the accuracy of matching schemes in real practical cases. Our results show that the accuracy in practice is significantly lower than the one reported in prior literature. When considering entire social networks, there is a non-negligible number of profiles that belong to different users but have similar attributes, which leads to many false matches. Our paper sheds light on the limits of matching profiles in the real world and illustrates the correct methodology to evaluate matching schemes in realistic scenarios.
【Keywords】: matching accounts; online social networks; reliability
【Paper Link】 【Pages】:1809-1818
【Authors】: Mihajlo Grbovic ; Vladan Radosavljevic ; Nemanja Djuric ; Narayan Bhamidipati ; Jaikit Savla ; Varun Bhagwan ; Doug Sharp
【Abstract】: In recent years online advertising has become increasingly ubiquitous and effective. Advertisements shown to visitors fund sites and apps that publish digital content, manage social networks, and operate e-mail services. Given such large variety of internet resources, determining an appropriate type of advertising for a given platform has become critical to financial success. Native advertisements, namely ads that are similar in look and feel to content, have had great success in news and social feeds. However, to date there has not been a winning formula for ads in e-mail clients. In this paper we describe a system that leverages user purchase history determined from e-mail receipts to deliver highly personalized product ads to Yahoo Mail users. We propose to use a novel neural language-based algorithm specifically tailored for delivering effective product recommendations, which was evaluated against baselines that included showing popular products and products predicted based on co-occurrence. We conducted rigorous offline testing using a large-scale product purchase data set, covering purchases of more than 29 million users from 172 e-commerce websites. Ads in the form of product recommendations were successfully tested on online traffic, where we observed a steady 9% lift in click-through rates over other ad formats in mail, as well as comparable lift in conversion rates. Following successful tests, the system was launched into production during the holiday season of 2014.
【Keywords】: audience modeling; computational advertising; data mining
【Paper Link】 【Pages】:1819-1828
【Authors】: Mihajlo Grbovic ; Vladan Radosavljevic ; Nemanja Djuric ; Narayan Bhamidipati ; Ananth Nagarajan
【Abstract】: As one of the leading platforms for creative content, Tumblr offers advertisers a unique way of creating brand identity. Advertisers can tell their story through images, animation, text, music, video, and more, and can promote that content by sponsoring it to appear as an advertisement in the users' live feeds. In this paper, we present a framework that enabled two of the key targeted advertising components for Tumblr, gender and interest targeting. We describe the main challenges encountered during the development of the framework, which include the creation of a ground truth for training gender prediction models, as well as mapping Tumblr content to a predefined interest taxonomy. For purposes of inferring user interests, we propose a novel semi-supervised neural language model for categorization of Tumblr content (i.e., post tags and post keywords). The model was trained on a large-scale data set consisting of 6.8 billion user posts, with a very limited number of categorized keywords, and was shown to have superior performance over the baseline approaches. We successfully deployed gender and interest targeting capability in Yahoo production systems, delivering inference for users that covers more than 90% of daily activities on Tumblr. Online performance results indicate advantages of the proposed approach, where we observed 20% increase in user engagement with sponsored posts in comparison to untargeted campaigns.
【Keywords】: audience modeling; computational advertising; data mining
【Paper Link】 【Pages】:1829-1838
【Authors】: Ben Green ; Alejandra Caro ; Matthew Conway ; Robert Manduca ; Tom Plagge ; Abby Miller
【Abstract】: After decades of urban investment dominated by sprawl and outward growth, municipal governments in the United States are responsible for the upkeep of urban neighborhoods that have not received sufficient resources or maintenance in many years. One of city governments' biggest challenges is to revitalize decaying neighborhoods given only limited resources. In this paper, we apply data science techniques to administrative data to help the City of Memphis, Tennessee improve distressed neighborhoods. We develop new methods to efficiently identify homes in need of rehabilitation and to predict the impacts of potential investments on neighborhoods. Our analyses allow Memphis to design neighborhood-improvement strategies that generate greater impacts on communities. Since our work uses data that most US cities already collect, our models and methods are highly portable and inexpensive to implement. We also discuss the challenges we encountered while analyzing government data and deploying our tools, and highlight important steps to improve future data-driven efforts in urban policy.
【Keywords】: neighborhood distress; social good; urban revitalization
【Paper Link】 【Pages】:1839-1847
【Authors】: Daniel N. Hill ; Robert Moakler ; Alan E. Hubbard ; Vadim Tsemekhman ; Foster J. Provost ; Kiril Tsemekhman
【Abstract】: Predictive models are often employed to decide actions in interactive online systems. For example, ads are selectively served to users who are modeled as being inclined to purchase the product being advertised. News feed items are populated based on a model of the user's interests. A common consequence of these predictive models is the creation of a spurious correlation, or confounding, between the action and its desired outcome. In the above examples, the targeted users are likely to buy the product or find the news item regardless of the intervention. This presents a challenge for measuring the true impact of these systems. Here we present a novel framework for estimating causal effects that relies on neither randomized experiments nor adjusting for the potentially explosive number of variables used in predictive models. We propose the identification and instrumentation of events that mediate the effect of the action. When the effect of an action depends on a mediating event that is not subject to the same confounders, the problem of causal estimation is greatly simplified. We demonstrate this approach in display advertising using ad viewability as a natural experiment that mediates the impact of served ads. Approximately 45% of display ad impressions never make it into a viewable portion of the user's browser. We show that an analysis based on ad viewability can massively reduce the amount of bias in estimating campaign lift. We integrate the use of negative controls as well as the identification and adjustment for residual confounding to further reduce the bias in estimated lift to less than 10%. A system using these techniques is deployed to monitor the daily causal impact of many large-scale advertising campaigns.
【Keywords】: causal analysis; display advertising; natural experiments; negative controls; performance measurement
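A toy illustration of the viewability idea described above: among users who were served an ad, impressions that never became viewable can stand in for the unexposed counterfactual, so the viewable-versus-unviewable difference approximates campaign lift. This is only a sketch under the (strong) assumption that viewability is independent of baseline conversion propensity; all field names and numbers are hypothetical.

    # Sketch: estimating relative lift by comparing served-and-viewable
    # impressions against served-but-never-viewable ones. Hypothetical data.
    import random

    def lift_from_viewability(impressions):
        """impressions: list of dicts with boolean 'viewable' and 'converted'."""
        def rate(group):
            return sum(i["converted"] for i in group) / max(len(group), 1)
        viewable = [i for i in impressions if i["viewable"]]
        unviewable = [i for i in impressions if not i["viewable"]]
        baseline = rate(unviewable)          # proxy for "no effective exposure"
        exposed = rate(viewable)
        return (exposed - baseline) / baseline if baseline > 0 else float("nan")

    if __name__ == "__main__":
        random.seed(0)
        # Synthetic impressions in which conversion is independent of viewability,
        # so the estimated lift should come out close to zero.
        sample = [{"viewable": random.random() < 0.55,
                   "converted": random.random() < 0.02} for _ in range(100000)]
        print("estimated relative lift:", round(lift_from_viewability(sample), 3))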
【Paper Link】 【Pages】:1849-1858
【Authors】: Henning Hohnhold ; Deirdre O'Brien ; Diane Tang
【Abstract】: Over the past 10+ years, online companies large and small have adopted widespread A/B testing as a robust data-based method for evaluating potential product improvements. In online experimentation, it is straightforward to measure the short-term effect, i.e., the impact observed during the experiment. However, the short-term effect is not always predictive of the long-term effect, i.e., the final impact once the product has fully launched and users have changed their behavior in response. Thus, the challenge is how to determine the long-term user impact while still being able to make decisions in a timely manner. We tackle that challenge in this paper by first developing an experimental methodology for quantifying long-term user learning. We then apply this methodology to ads shown on Google search, more specifically, to determine and quantify the drivers of ads blindness and sightedness, the phenomenon of users changing their inherent propensity to click on or interact with ads. We use these results to create a model that uses metrics measurable in the short term to predict the long-term effect. We learn that user satisfaction is paramount: ads blindness and sightedness are driven by the quality of previously viewed or clicked ads, as measured by both ad relevance and landing page quality. Focusing on user satisfaction not only ensures happier users but also makes business sense, as our results illustrate. We describe two major applications of our findings: a conceptual change to our search ads auction that further increased the importance of ads quality, and a 50% reduction of the ad load on Google's mobile search interface. The results presented in this paper are generalizable in two major ways. First, the methodology may be used to quantify user learning effects and to evaluate online experiments in contexts other than ads. Second, the ads blindness/sightedness results indicate that a focus on user satisfaction could help to reduce the ad load on the internet at large with long-term neutral, or even positive, business impact.
【Keywords】: controlled experiments; randomized experiments
【Paper Link】 【Pages】:1859-1868
【Authors】: Thomas Holleczek ; Dang The Anh ; Shanyang Yin ; Yunye Jin ; Spiros Antonatos ; Han Leong Goh ; Samantha Low ; Amy Shi Nash
【Abstract】: Understanding how people use public transport is important for the operation and future planning of the underlying transport networks. We have therefore developed and deployed a traffic measurement system for a key player in the transportation industry to gain insights into crowd behavior for planning purposes. The system has been in operation for several months and reports, at hourly intervals, (1) the crowdedness of subway stations, (2) the flows of people inside interchange stations, and (3) the expected travel time for each possible route in the subway network of Singapore. The core of our system is an efficient algorithm which detects individual subway trips from anonymized real-time data generated by the location based system of Singtel, the country's largest telecommunications company. To assess the accuracy of our system, we engaged an independent market research company to conduct a field study--a manual count of the number of passengers boarding and disembarking at a selected station on three separate days. A strong correlation between the calculations of our algorithm and the manual counts was found. One of our key findings is that travelers do not always choose the route with the shortest travel time in the subway network of Singapore. We have therefore also been developing a mobile app which allows users to plan their trips based on the average travel time between stations.
【Keywords】: call detail records (cdrs); cellular networks; monitoring system; public transport
【Paper Link】 【Pages】:1869-1878
【Authors】: Elena Ikonomovska ; Sina Jafarpour ; Ali Dasdan
【Abstract】: We study online meta-learners for real-time bid prediction that predict by selecting a single best predictor among several subordinate prediction algorithms, here called "experts". These predictors belong to the family of context-dependent past performance estimators that make a prediction only when the instance to be predicted falls within their areas of expertise. Within the advertising ecosystem, it is very common for the contextual information to be incomplete, hence, it is natural for some of the experts to abstain from making predictions on some of the instances. Experts' areas of expertise can overlap, which makes their predictions less suitable for merging; as such, they lend themselves better to the problem of best expert selection. In addition, their performance varies over time, which gives the expert selection problem a non-stochastic, adversarial flavor. In this paper we propose to use probability sampling (via Thompson Sampling) as a meta-learning algorithm that samples from the pool of experts for the purpose of bid prediction. We show performance results from the comparison of our approach to multiple state-of-the-art algorithms using exploration scavenging on a log file of over 300 million ad impressions, as well as comparison to a baseline rule-based model using production traffic from a leading DSP platform.
【Keywords】: bayesian online learning; multi-armed bandits; online advertising; online algorithms; randomized probability matching
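The abstract frames expert selection as sampling from a pool of experts that may abstain; below is a minimal, generic Thompson Sampling selector over Bernoulli rewards, not the authors' system. The expert pool, reward definition, and abstention model are hypothetical.

    # Minimal sketch of Thompson Sampling as a meta-learner that picks one
    # expert per instance, where experts may abstain outside their area of
    # expertise. Rewards are treated as Bernoulli (e.g., "the prediction was
    # within tolerance"); the Beta posteriors below are per-expert.
    import random

    class ThompsonExpertSelector:
        def __init__(self, n_experts):
            self.wins = [1.0] * n_experts    # Beta prior alpha = 1
            self.losses = [1.0] * n_experts  # Beta prior beta  = 1

        def select(self, eligible):
            """eligible: indices of experts that do not abstain on this instance."""
            samples = {i: random.betavariate(self.wins[i], self.losses[i])
                       for i in eligible}
            return max(samples, key=samples.get)

        def update(self, expert, success):
            if success:
                self.wins[expert] += 1
            else:
                self.losses[expert] += 1

    if __name__ == "__main__":
        random.seed(1)
        true_quality = [0.3, 0.6, 0.45]          # hypothetical expert accuracies
        selector = ThompsonExpertSelector(3)
        for _ in range(5000):
            eligible = [i for i in range(3) if random.random() < 0.8]  # some abstain
            if not eligible:
                continue
            chosen = selector.select(eligible)
            selector.update(chosen, random.random() < true_quality[chosen])
        print("posterior means:",
              [round(w / (w + l), 3) for w, l in zip(selector.wins, selector.losses)])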
【Paper Link】 【Pages】:1879-1888
【Authors】: Peng Jiang ; Yadong Zhu ; Yi Zhang ; Quan Yuan
【Abstract】: Although marketing researchers and sociologists have recognized the large impact of life stage on consumers' purchasing behaviors, existing recommender systems have not taken this impact into consideration. In this paper, we find a clear correlation between life stage and purchasing behavior in many e-commerce categories. For example, a mum may look for different products as her baby grows older. Motivated by this, we introduce the concept of life stage into recommender systems and propose to predict a user's current life stage and recommend products accordingly. We propose a new Maximum Entropy Semi-Markov Model to segment and label consumer life stages based on the observed purchasing data over time. In the mom-baby product category, where the life stage transition is deterministic, we develop an efficient approximate solution using large-scale logistic regression and a Viterbi-like algorithm. We also propose a Gaussian mixture model to efficiently handle the multi-kid life stage prediction problem. We integrate the predicted life stage information into the recommender system behind taobao.com, the largest online shopping website. Both offline and online experiments demonstrate the effectiveness of the proposed life-stage based recommendation approach.
【Keywords】: e-commerce; life stage; recommender system
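Since the mom-baby category is described above as having a deterministic life-stage progression decoded with a Viterbi-like algorithm, here is a generic Viterbi decoder for a left-to-right stage model. The stage names, transition log-probabilities, and emission scores are illustrative stand-ins, not the paper's Maximum Entropy Semi-Markov Model.

    # Generic Viterbi decoding over a left-to-right stage model, as a rough
    # analogue of labelling a purchase history with life stages. Only "stay in
    # the current stage" or "advance to the next stage" moves are allowed.
    import math

    STAGES = ["pregnancy", "0-6m", "6-12m", "1-2y"]   # hypothetical stages

    def viterbi(emission_logprob, stay_logprob=math.log(0.7),
                advance_logprob=math.log(0.3)):
        """emission_logprob: T x K matrix of log P(observation_t | stage_k)."""
        T, K = len(emission_logprob), len(STAGES)
        score = [[-math.inf] * K for _ in range(T)]
        back = [[0] * K for _ in range(T)]
        score[0][0] = emission_logprob[0][0]          # start in the first stage
        for t in range(1, T):
            for k in range(K):
                stay = score[t - 1][k] + stay_logprob
                adv = score[t - 1][k - 1] + advance_logprob if k > 0 else -math.inf
                best_prev, best = (k, stay) if stay >= adv else (k - 1, adv)
                score[t][k] = best + emission_logprob[t][k]
                back[t][k] = best_prev
        k = max(range(K), key=lambda j: score[T - 1][j])  # backtrack from the best end
        path = [k]
        for t in range(T - 1, 0, -1):
            k = back[t][k]
            path.append(k)
        return [STAGES[k] for k in reversed(path)]

    if __name__ == "__main__":
        # Toy emission scores favouring a gradual progression through the stages;
        # expected to recover all four stages in order.
        obs = [[-0.1, -3.0, -6.0, -9.0], [-2.0, -0.1, -4.0, -7.0],
               [-4.0, -2.0, -0.1, -3.0], [-7.0, -4.0, -2.0, -0.1]]
        print(viterbi(obs))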
【Paper Link】 【Pages】:1889-1898
【Authors】: Yushi Jing ; David Liu ; Dmitry Kislyuk ; Andrew Zhai ; Jiajing Xu ; Jeff Donahue ; Sarah Tavel
【Abstract】: We demonstrate that, with the availability of distributed computation platforms such as Amazon Web Services and open-source tools, it is possible for a small engineering team to build, launch and maintain a cost-effective, large-scale visual search system. We also demonstrate, through a comprehensive set of live experiments at Pinterest, that content recommendation powered by visual search improves user engagement. By sharing our implementation details and learnings from launching a commercial visual search engine from scratch, we hope visual search becomes more widely incorporated into today's commercial applications.
【Keywords】: computer vision; deep learning; distributed systems; information retrieval
【Paper Link】 【Pages】:1899-1908
【Authors】: Gunhee Kim ; Leonid Sigal
【Abstract】: We present an approach for generating pictorial storylines from large collections of online photo streams shared by visitors to theme parks (e.g. Disneyland), along with publicly available information such as visitor's maps. The story graph visualizes various events and activities recurring across visitors' photo sets, in the form of hierarchically branching narrative structure associated with attractions and districts in theme parks. We first estimate story elements of each photo stream, including the detection of faces and supporting objects, and attraction-based localization. We then create spatio-temporal story graphs via an inference of sparse time-varying directed graphs. Through quantitative evaluation and crowdsourcing-based user studies via Amazon Mechanical Turk, we show that the story graphs serve as a more convenient mid-level data structure to perform photo-based recommendation tasks than other alternatives. We also present storybook-like demo examples regarding exploration, recommendation, and temporal analysis, which may be among the most beneficial uses of the story graphs for visitors.
【Keywords】: photo storylines; summarization and exploration of multimedia data; user modeling
【Paper Link】 【Pages】:1909-1918
【Authors】: Himabindu Lakkaraju ; Everaldo Aguiar ; Carl Shan ; David Miller ; Nasir Bhanpuri ; Rayid Ghani ; Kecia L. Addison
【Abstract】: Many school districts have developed successful intervention programs to help students graduate high school on time. However, identifying and prioritizing students who need those interventions the most remains challenging. This paper describes a machine learning framework to identify such students, discusses features that are useful for this task, applies several classification algorithms, and evaluates them using metrics important to school administrators. To help test this framework and make it practically useful, we partnered with two U.S. school districts with a combined enrollment of approximately 200,000 students. We together designed several evaluation metrics to assess the goodness of machine learning algorithms from an educator's perspective. This paper focuses on students at risk of not finishing high school on time, but our framework lays a strong foundation for future work on other adverse academic outcomes.
【Keywords】: applications; education; evaluation metrics; risk prediction
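One concrete example of an evaluation metric aligned with an educator's perspective is precision among the k highest-risk students, since intervention capacity is limited. The function below is a generic illustration of that kind of metric and is not claimed to be one of the metrics designed with the districts; the scores and labels are hypothetical.

    # Precision at top-k: of the k students flagged as highest risk, what
    # fraction actually experienced the adverse outcome?
    def precision_at_k(risk_scores, outcomes, k):
        """risk_scores: predicted risk per student; outcomes: 1 if the student
        did not graduate on time, else 0."""
        ranked = sorted(zip(risk_scores, outcomes), key=lambda p: p[0], reverse=True)
        return sum(outcome for _, outcome in ranked[:k]) / k

    if __name__ == "__main__":
        scores = [0.91, 0.10, 0.75, 0.33, 0.68]
        labels = [1, 0, 1, 0, 0]
        print(precision_at_k(scores, labels, k=2))   # -> 1.0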
【Paper Link】 【Pages】:1919-1928
【Authors】: Yair Lakretz ; Gal Chechik ; Naama Friedmann ; Michal Rosen-Zvi
【Abstract】: Reading is a complex cognitive process, errors in which may assume diverse forms. In this study, introducing a novel approach, we use two families of probabilistic graphical models to analyze patterns of reading errors made by dyslexic people: an LDA-based model and two Naïve Bayes models which differ by their assumptions about the generation process of reading errors. The models are trained on a large corpus of reading errors. Results show that a Naïve Bayes model achieves the highest accuracy when compared to labels given by clinicians (AUC = 0.801 ± 0.05), thus providing the first automated and objective diagnosis tool for dyslexia that is based solely on reading-error data. Results also show that the LDA-based model best captures patterns of reading errors and could therefore contribute to the understanding of dyslexia and to future improvement of the diagnostic procedure. Finally, we draw on our results to shed light on a theoretical debate about the definition and heterogeneity of dyslexia. Our results support a model that assumes multiple dyslexia subtypes, consistent with a heterogeneous view of dyslexia.
【Keywords】: diagnosis; dyslexia; latent dirichlet allocation; naïve bayes; probabilistic graphical models
【Paper Link】 【Pages】:1929-1938
【Authors】: Mounia Lalmas ; Janette Lehmann ; Guy Shaked ; Fabrizio Silvestri ; Gabriele Tolomei
【Abstract】: Click-through rate (CTR) is the most common metric used to assess the performance of an online advert; another measure of an advert's performance is the user's post-click experience. In this paper, we describe the method we have implemented in Yahoo Gemini to measure the post-click experience on Yahoo mobile news streams via an automatic analysis of advert landing pages. We measure the post-click experience by means of two well-known metrics, dwell time and bounce rate. We show that these metrics can be used as proxies of an advert's post-click experience, and that a negative post-click experience has a negative effect on user engagement and future ad clicks. We then put forward an approach that analyses advert landing pages, and show how these can affect dwell time and bounce rate. Finally, we develop a prediction model for advert quality based on dwell time, which was deployed on the Yahoo mobile news stream app running on iOS. The results show that, using dwell time as a proxy of post-click experience, we can prioritise higher quality ads. We demonstrate the impact of this on users via A/B testing.
【Keywords】: advertising quality; dwell time prediction; mobile advertising; native advertising; post-click experience
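A back-of-the-envelope computation of the two post-click metrics named above, dwell time and bounce rate, from a hypothetical click log; the bounce threshold and log format are assumptions, and the paper's exact operational definitions may differ.

    # Dwell time is how long the user stays on the ad landing page; a bounce is
    # a dwell shorter than a threshold. Input records are (click_ts, return_ts)
    # pairs in seconds; both names are illustrative.
    def post_click_metrics(clicks, bounce_threshold_s=5.0):
        dwell_times = [ret - click for click, ret in clicks]
        bounces = sum(1 for d in dwell_times if d < bounce_threshold_s)
        return {
            "avg_dwell_time_s": sum(dwell_times) / len(dwell_times),
            "bounce_rate": bounces / len(dwell_times),
        }

    if __name__ == "__main__":
        log = [(0.0, 2.1), (10.0, 95.0), (200.0, 203.0), (300.0, 420.0)]
        print(post_click_metrics(log))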
【Paper Link】 【Pages】:1939-1947
【Authors】: Nikolay Laptev ; Saeed Amizadeh ; Ian Flint
【Abstract】: This paper introduces a generic and scalable framework for automated anomaly detection on large-scale time-series data. Early detection of anomalies plays a key role in maintaining the consistency of a person's data and protects corporations against malicious attackers. Current state-of-the-art anomaly detection approaches suffer from limited scalability, use-case restrictions, difficulty of use, and a large number of false positives. Our system at Yahoo, EGADS, uses a collection of anomaly detection and forecasting models with an anomaly filtering layer for accurate and scalable anomaly detection on time series. We compare our approach against other anomaly detection systems on real and synthetic data with varying time-series characteristics. We found that our framework allows for a 50-60% improvement in precision and recall for a variety of use cases. Both the data and the framework are being open-sourced. The open-sourcing of the data, in particular, represents the first effort of its kind to establish a standard benchmark for anomaly detection.
【Keywords】: anomaly detection; scalable anomaly detection; time-series
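A stripped-down sketch of the forecast-then-filter pattern the abstract describes: fit a forecaster, score each point by its residual, and flag large deviations. EGADS combines many models; this sketch uses only a moving-average forecast and a z-score rule for illustration, with synthetic data.

    import numpy as np

    def detect_anomalies(series, window=24, z_threshold=3.0):
        """Flag points whose deviation from a moving-average forecast exceeds
        z_threshold standard deviations of the recent history."""
        series = np.asarray(series, dtype=float)
        anomalies = []
        for t in range(window, len(series)):
            history = series[t - window:t]
            forecast = history.mean()                 # trivial forecasting model
            residual = series[t] - forecast
            sigma = history.std() + 1e-9
            if abs(residual) / sigma > z_threshold:   # anomaly filtering rule
                anomalies.append(t)
        return anomalies

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        ts = rng.normal(100, 2, size=500)
        ts[400] += 25                                  # inject a spike
        print(detect_anomalies(ts))                    # expected to include 400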
【Paper Link】 【Pages】:1949-1958
【Authors】: Joonseok Lee ; Ariel Fuxman ; Bo Zhao ; Yuanhua Lv
【Abstract】: Users today are constantly switching back and forth from applications where they consume or create content (such as e-books and productivity suites like Microsoft Office and Google Docs) to search engines where they satisfy their information needs. Unfortunately, though, this leads to a suboptimal user experience as the search engine lacks any knowledge about the content that the user is authoring or consuming in the application. As a result, productivity suites are starting to incorporate features that let the user "explore while they work". Existing work in the literature that can be applied to this problem takes a standard bag-of-words information retrieval approach, which consists of automatically creating a query that includes not only the target phrase or entity chosen by the user but also relevant terms from the context. While these approaches have been successful, they are inherently limited to returning results (documents) that have a syntactic match with the keywords in the query. We argue that the limitations of these approaches can be overcome by leveraging semantic signals from a knowledge graph built from knowledge bases such as Wikipedia. We present a system called Lewis for retrieving contextually relevant entity results leveraging a knowledge graph, and perform a large scale crowdsourcing experiment in the context of an e-reader scenario, which shows that Lewis can outperform the state-of-the-art contextual entity recommendation systems by more than 20% in terms of the MAP score.
【Keywords】: context; context-selection betweenness; entity recommendation; knowledge base; semantic
【Paper Link】 【Pages】:1959-1968
【Authors】: Cheng Li ; Yue Lu ; Qiaozhu Mei ; Dong Wang ; Sandeep Pandey
【Abstract】: We present the problem of click-through prediction for advertising in the Twitter timeline, which displays a stream of Tweets from accounts a user chooses to follow. Traditional computational advertising usually appears in two forms: sponsored search that places ads onto the search result page when a query is issued to a search engine, and contextual advertising that places ads onto a regular, usually static Web page. Compared with these two paradigms, placing ads into a Tweet stream is particularly challenging given the nature of the data stream: the context into which an ad can be placed updates dynamically and never replicates. Every ad is therefore placed into a unique context. This makes the information available for training a machine learning model extremely sparse. In this study, we propose a learning-to-rank method which not only addresses the sparsity of training signals but also can be trained and updated online. The proposed method is evaluated using both offline experiments and online A/B tests, which involve very large collections of Twitter data and real Twitter users. Results of the experiments prove the effectiveness and efficiency of our solution, and its superiority over the current production model adopted by Twitter.
【Keywords】: click-through prediction; online advertising; social media stream
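As a rough illustration of learning online from extremely sparse signals, the sketch below updates a pointwise logistic click model one example at a time over dictionary-encoded features. It is a simplified stand-in, not the paper's learning-to-rank formulation; feature names and the toy stream are invented.

    import math
    from collections import defaultdict

    class OnlineSparseLogistic:
        """Logistic regression with SGD updates over sparse feature dicts."""
        def __init__(self, lr=0.05, l2=1e-6):
            self.w = defaultdict(float)
            self.lr, self.l2 = lr, l2

        def predict(self, features):
            z = sum(self.w[f] * v for f, v in features.items())
            return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))

        def update(self, features, clicked):
            g = self.predict(features) - clicked      # gradient of log loss wrt z
            for f, v in features.items():
                self.w[f] -= self.lr * (g * v + self.l2 * self.w[f])

    if __name__ == "__main__":
        model = OnlineSparseLogistic()
        stream = [({"adv:42": 1.0, "user_topic:sports": 1.0}, 1),
                  ({"adv:42": 1.0, "user_topic:cooking": 1.0}, 0)] * 200
        for feats, label in stream:
            model.update(feats, label)
        print(round(model.predict({"adv:42": 1.0, "user_topic:sports": 1.0}), 3))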
【Paper Link】 【Pages】:1969-1978
【Authors】: Ying Li ; Jose D. Contreras ; Luis J. Salazar
【Abstract】: We present the research, and product development and deployment, of Voice Analyzer by Jobaline Inc. This is a patent-pending technology that analyzes voice data and predicts human emotions elicited by the paralinguistic elements of a voice. Human voice characteristics, such as tone, complement verbal communication. In several contexts of communication, "how" things are said is just as important as "what" is being said. This paper provides an overview of our deployed system, the raw data, the data processing steps, and the prediction algorithms we experimented with. A case study is included where, given a voice clip, our model predicts the degree to which a listener will find the voice "engaging". Our prediction results were verified through independent market research, with 75% agreement on how an average listener would feel. One application of Jobaline Voice Analyzer technology is assisting companies in hiring workers in the service industry, where customers' emotional response to workers' voices may affect the service outcome. Jobaline Voice Analyzer is deployed in production as a product offering to our clients to help them identify workers who will better engage with their customers. We will also share some discoveries and lessons learned.
【Keywords】: deployed system; predictive modeling; speech signal processing; voice analysis
【Paper Link】 【Pages】:1979-1988
【Authors】: Shigeru Maya ; Kai Morino ; Hiroshi Murata ; Ryo Asaoka ; Kenji Yamanishi
【Abstract】: In this paper, we propose a method to cluster the spatial patterns of the visual field in glaucoma patients in order to analyze the progression patterns of glaucoma. The degree of progression in the visual field of glaucoma patients can be divided into several regions by straight-line boundaries; we call this specific structure the Direct Product Structure in this paper. Since we can observe the direct product structure in the visual fields, we propose a bottom-up hierarchical clustering method to embed this structure into the clustering structure. In our method, according to the minimum description length (MDL) principle, we select the best cluster division so that the total code length required for encoding the data as well as the clustering structure is minimized. We can thereby select clusters that are robust to noise in the position of the direct product structure. We demonstrate the effectiveness of our method using an artificial dataset and a real glaucoma dataset. Our proposed method performed better than existing methods for both datasets. For the real glaucoma dataset in particular, our method discovered characteristic progression patterns of glaucoma as specific features of clusters. These patterns agree with clinical knowledge. Furthermore, we show that our clusters can be applied to improve the accuracy of predicting glaucoma progression. Thus, our clusters contain rich information about glaucoma and hence can contribute to further development in glaucoma research.
【Keywords】: glaucoma; hierarchical clustering; normalized maximum likelihood distribution; the minimum description length
【Paper Link】 【Pages】:1989-1998
【Authors】: Xu Miao ; Chun-Te Chu ; Lijun Tang ; Yitong Zhou ; Joel Young ; Anmol Bhasin
【Abstract】: Personalization is a long-standing problem in data mining and machine learning. Companies make personalized product recommendations to millions of users every second. In addition to the recommendation problem, with the emergence of personal devices, many conventional problems, e.g., recognition, need to be personalized as well. Moreover, as the number of users grows very large, solving personalization becomes quite challenging. In this paper, we formalize the generic personalization problem as an optimization problem. We propose several ADMM algorithms to solve this problem in a distributed way, including a new Asynchronous ADMM that removes all synchronization barriers to maximize training throughput. We provide a mathematical analysis to show that the proposed Asynchronous ADMM algorithm has a linear convergence rate, which is the best to our knowledge. The distributed personalization allows training to be performed either in a cluster or even on a user's device. This can improve privacy protection, as no personal data is uploaded, while personal models can still be shared with each other. We apply this approach to two industry problems, Facial Expression Recognition and Job Recommendation. Experiments demonstrate more than 30% relative error reduction on both problems. Asynchronous ADMM allows faster training for problems with millions of users since it eliminates all network I/O waiting time to maximize cluster CPU throughput. Experiments demonstrate training that is 4 times faster than the original synchronous ADMM algorithm.
【Keywords】: admm; big data; distributed optimization; personalization
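A minimal sketch of consensus ADMM for personalization: each user keeps a local model fit to their own data while a shared consensus variable ties users together. Local objectives here are plain least squares so the x-update has a closed form, and the loop is synchronous, unlike the asynchronous variant the paper proposes; all data are synthetic.

    import numpy as np

    def consensus_admm(user_data, dim, rho=1.0, iters=50):
        """user_data: list of (A_i, b_i) per user, A_i of shape (n_i, dim)."""
        n_users = len(user_data)
        x = np.zeros((n_users, dim))   # personalized models
        u = np.zeros((n_users, dim))   # scaled dual variables
        z = np.zeros(dim)              # shared consensus model
        for _ in range(iters):
            for i, (A, b) in enumerate(user_data):
                # x-update: argmin ||A x - b||^2 + (rho/2) ||x - z + u_i||^2
                lhs = 2 * A.T @ A + rho * np.eye(dim)
                rhs = 2 * A.T @ b + rho * (z - u[i])
                x[i] = np.linalg.solve(lhs, rhs)
            z = (x + u).mean(axis=0)   # z-update (unregularized average)
            u += x - z                 # dual update
        return x, z

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        shared = rng.normal(size=3)
        users = []
        for _ in range(20):
            A = rng.normal(size=(30, 3))
            personal = shared + 0.1 * rng.normal(size=3)   # users deviate slightly
            users.append((A, A @ personal + 0.01 * rng.normal(size=30)))
        x, z = consensus_admm(users, dim=3)
        print("consensus vs. true shared model:", np.round(z - shared, 3))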
【Paper Link】 【Pages】:1999-2008
【Authors】: Rajendu Mitra ; Ramachandra Kota ; Sambaran Bandyopadhyay ; Vijay Arya ; Brian Sullivan ; Richard Mueller ; Heather Storey ; Gerard Labut
【Abstract】: The connectivity model of a power distribution network can easily become outdated due to system changes occurring in the field. Maintaining and sustaining an accurate connectivity model is a key challenge for distribution utilities worldwide. This work shows that voltage time series measurements collected from customer smart meters exhibit correlations that are consistent with the hierarchical structure of the distribution network. These correlations may be leveraged to cluster customers based on common ancestry and help verify and correct an existing connectivity model. Additionally, customers may be clustered in combination with voltage data from circuit metering points, spatial data from the geographical information system, and any existing but partially accurate connectivity model to infer customer to transformer and phase connectivity relationships with high accuracy. We report analysis and validation results based on data collected from multiple feeders of a large electric distribution network in North America. To the best of our knowledge, this is the first large scale measurement study of customer voltage data and its use in inferring network connectivity information.
【Keywords】: clustering; data mining; power distribution grids; topology inference; voltage time series
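An illustrative version of the correlation-based grouping idea above: cluster smart-meter voltage series by correlation distance so that customers behind the same transformer tend to land in the same cluster. The linkage method, cut threshold, and synthetic series are assumptions, not the paper's procedure.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_by_voltage_correlation(voltages, threshold=0.3):
        """voltages: array of shape (n_customers, n_timesteps)."""
        corr = np.corrcoef(voltages)
        dist = 1.0 - corr                           # correlation distance
        iu = np.triu_indices_from(dist, k=1)        # condensed form for scipy
        Z = linkage(dist[iu], method="average")
        return fcluster(Z, t=threshold, criterion="distance")

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        t = np.linspace(0, 8 * np.pi, 500)
        feeder_a = np.sin(t) + 0.1 * rng.normal(size=(5, 500))   # one transformer
        feeder_b = np.cos(t) + 0.1 * rng.normal(size=(5, 500))   # another transformer
        labels = cluster_by_voltage_correlation(np.vstack([feeder_a, feeder_b]))
        print(labels)   # expected: two distinct cluster ids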
【Paper Link】 【Pages】:2009-2018
【Authors】: Marjan Momtazpour ; Jinghe Zhang ; Saifur Rahman ; Ratnesh K. Sharma ; Naren Ramakrishnan
【Abstract】: The analysis of large scale data logged from complex cyber-physical systems, such as microgrids, often entails the discovery of invariants capturing functional as well as operational relationships underlying such large systems. We describe a latent factor approach to infer invariants underlying system variables and how we can leverage these relationships to monitor a cyber-physical system. In particular we illustrate how this approach helps rapidly identify outliers during system operation.
【Keywords】: latent factors; outlier detection; regression; system invariants
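A compact stand-in for the latent-factor monitoring pattern: learn a low-dimensional subspace from normal operating data and flag observations whose reconstruction error is unusually large. PCA via SVD is used here purely for illustration; the paper's invariant model may differ, and the data are synthetic.

    import numpy as np

    def fit_monitor(normal_data, n_components=2):
        """Learn a latent subspace (the 'invariants') from normal operation."""
        mean = normal_data.mean(axis=0)
        _, _, vt = np.linalg.svd(normal_data - mean, full_matrices=False)
        return mean, vt[:n_components]

    def reconstruction_error(x, mean, basis):
        centered = x - mean
        recon = centered @ basis.T @ basis          # project back onto the subspace
        return float(np.linalg.norm(centered - recon))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        latent = rng.normal(size=(1000, 2))
        mixing = rng.normal(size=(2, 8))
        normal = latent @ mixing + 0.05 * rng.normal(size=(1000, 8))
        mean, basis = fit_monitor(normal)
        ok = normal[0]
        broken = ok + np.array([0, 0, 3.0, 0, 0, 0, 0, 0])  # violates the invariant
        print(reconstruction_error(ok, mean, basis),
              reconstruction_error(broken, mean, basis))    # second error is larger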
【Paper Link】 【Pages】:2019-2028
【Authors】: Meenakshi Nagarajan ; Angela D. Wilkins ; Benjamin J. Bachman ; Ilya B. Novikov ; Shenghua Bao ; Peter J. Haas ; María E. Terrón-Díaz ; Sumit Bhatia ; Anbu K. Adikesavan ; Jacques J. Labrie ; Sam Regenbogen ; Christie M. Buchovecky ; Curtis R. Pickering ; Linda Kato ; Andreas Martin Lisewski ; Ana Lelescu ; Houyin Zhang ; Stephen Boyer ; Griff Weber ; Ying Chen ; Lawrence A. Donehower ; W. Scott Spangler ; Olivier Lichtarge
【Abstract】: We present KnIT, the Knowledge Integration Toolkit, a system for accelerating scientific discovery and predicting previously unknown protein-protein interactions. Such predictions enrich biological research and are pertinent to drug discovery and the understanding of disease. Unlike a prior study, KnIT is now fully automated and demonstrably scalable. It extracts information from the scientific literature, automatically identifying direct and indirect references to protein interactions, which is knowledge that can be represented in network form. It then reasons over this network with techniques such as matrix factorization and graph diffusion to predict new, previously unknown interactions. The accuracy and scope of KnIT's knowledge extractions are validated using comparisons to structured, manually curated data sources as well as by performing retrospective studies that predict subsequent literature discoveries using literature available prior to a given date. The KnIT methodology is a step towards automated hypothesis generation from text, with potential application to other scientific domains.
【Keywords】: hypothesis generation; scientific discovery; text mining
【Paper Link】 【Pages】:2029-2038
【Authors】: Vinod Nair ; Ameya Raul ; Shwetabh Khanduja ; Vikas Bahirwani ; Sundararajan Sellamanickam ; S. Sathiya Keerthi ; Steve Herbert ; Sudheer Dhulipalla
【Abstract】: We propose a machine learning based framework for building a hierarchical monitoring system to detect and diagnose service issues. We demonstrate its use for building a monitoring system for a distributed data storage and computing service consisting of tens of thousands of machines. Our solution has been deployed in production as an end-to-end system, starting from telemetry data collection from individual machines, to a visualization tool for service operators to examine the detection outputs. Evaluation results are presented on detecting 19 customer-impacting issues in the past three months.
【Keywords】: high-dimensional time series; service monitoring; unsupervised learning
【Paper Link】 【Pages】:2039-2047
【Authors】: Eric Potash ; Joe Brew ; Alexander Loewi ; Subhabrata Majumdar ; Andrew Reece ; Joe Walsh ; Eric Rozier ; Emile Jorgenson ; Raed Mansour ; Rayid Ghani
【Abstract】: Lead poisoning is a major public health problem that affects hundreds of thousands of children in the United States every year. A common approach to identifying lead hazards is to test all children for elevated blood lead levels and then investigate and remediate the homes of children with elevated tests. This can prevent exposure to lead of future residents, but only after a child has been poisoned. This paper describes joint work with the Chicago Department of Public Health (CDPH) in which we build a model that predicts the risk of a child to being poisoned so that an intervention can take place before that happens. Using two decades of blood lead level tests, home lead inspections, property value assessments, and census data, our model allows inspectors to prioritize houses on an intractably long list of potential hazards and identify children who are at the highest risk. This work has been described by CDPH as pioneering in the use of machine learning and predictive analytics in public health and has the potential to have a significant impact on both health and economic outcomes for communities across the US.
【Keywords】: lead poisoning; machine learning; public health; public policy; social good
【Paper Link】 【Pages】:2049-2058
【Authors】: Kevin B. Pratt
【Abstract】: We demonstrate a protocol for proving strongly that a black-box machine learning technique robustly predicts the future in dynamic, indefinite contexts. We propose necessary components of the proof protocol and demonstrate results visualizations to support evaluation of the proof components. Components include contemporaneously verifiable discrete predictions, deterministic computability of longitudinal predictions, imposition of realistic costs and domain constraints, exposure to diverse contexts, statistically significant excess benefits relative to a priori benchmarks and Monte Carlo trials, insignificant decay of excess benefits, pathology detection and an extended real-time trial "in the wild." We apply the protocol to a big data machine learning technique deployed since 2011 that finds persistent, exploitable opportunities in many of 41 segments of US financial markets, the existence of which substantially contradicts the Efficient Market Hypothesis.
【Keywords】: big data; black box; dynamic context; experiment design; hypothesis testing; longitudinal study; market forecast; pattern analysis; prediction; simulation
【Paper Link】 【Pages】:2059-2068
【Authors】: Johann Schleier-Smith
【Abstract】: Machine learning techniques have proved effective in recommender systems and other applications, yet teams working to deploy them lack many of the advantages that those in more established software disciplines today take for granted. The well-known Agile methodology advances projects in a chain of rapid development cycles, with subsequent steps often informed by production experiments. Support for such workflow in machine learning applications remains primitive. The platform developed at if(we) embodies a specific machine learning approach and a rigorous data architecture constraint, so allowing teams to work in rapid iterative cycles. We require models to consume data from a time-ordered event history, and we focus on facilitating creative feature engineering. We make it practical for data scientists to use the same model code in development and in production deployment, and make it practical for them to collaborate on complex models. We deliver real-time recommendations at scale, returning top results from among 10,000,000 candidates with sub-second response times and incorporating new updates in just a few seconds. Using the approach and architecture described here, our team can routinely go from ideas for new models to production-validated results within two weeks.
【Keywords】: agile; machine learning; recommender systems
【Paper Link】 【Pages】:2069-2078
【Authors】: Manu Sethi ; Yupeng Yan ; Anand Rangarajan ; Ranga Raju Vatsavai ; Sanjay Ranka
【Abstract】: Urban neighborhood classification using very high resolution (VHR) remote sensing imagery is a challenging and emerging application. A semi-supervised learning approach for identifying neighborhoods is presented which employs superpixel tessellation representations of VHR imagery. The image representation utilizes homogeneous and irregularly shaped regions termed superpixels and derives novel features based on intensity histograms, geometry, corner and superpixel density, and scale of tessellation. The semi-supervised learning approach uses a support vector machine (SVM) to obtain a preliminary classification, which is then refined using graph Laplacian propagation. Several intermediate stages in the pipeline are presented to showcase the important features of this approach. We evaluated this approach on four different geographic settings with varying neighborhood types and compared it with the recent Gaussian Multiple Learning algorithm. This evaluation shows several advantages, including model building, accuracy, and efficiency, which make it a great choice for deployment in large-scale applications like global human settlement mapping and population distribution (e.g., LandScan), and change detection.
【Keywords】: neighborhoods; remote sensing; segmentation
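A sketch of the classify-then-refine step mentioned above: preliminary class scores (e.g., from an SVM) are smoothed over a superpixel adjacency graph so that neighboring regions receive consistent labels. The toy graph, scores, and propagation weight alpha are illustrative.

    import numpy as np

    def laplacian_propagation(W, initial_scores, alpha=0.8, iters=100):
        """W: symmetric adjacency matrix over superpixels.
        initial_scores: (n, n_classes) preliminary classifier scores."""
        d = W.sum(axis=1)
        d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        S = d_inv_sqrt @ W @ d_inv_sqrt               # symmetrically normalized affinity
        F = initial_scores.copy()
        for _ in range(iters):
            F = alpha * S @ F + (1 - alpha) * initial_scores
        return F.argmax(axis=1)

    if __name__ == "__main__":
        # Chain of 4 superpixels; the noisy middle score is corrected by its neighbours.
        W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
        scores = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.7, 0.3]])
        print(laplacian_propagation(W, scores))   # expected: all class 0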
【Paper Link】 【Pages】:2079-2088
【Authors】: Elham Shaabani ; Ashkan Aleali ; Paulo Shakarian ; John Bertetto
【Abstract】: Gang violence is a major problem in the United States, accounting for a large fraction of homicides and other violent crime. In this paper, we study the problem of early identification of violent gang members. Our approach relies on modified centrality measures that take into account additional data about the individuals in the social network of co-arrestees, which together with other arrest metadata provide a rich set of features for a classification algorithm. We show that our approach obtains high precision and recall (0.89 and 0.78, respectively) in the case where the entire network is known, and outperforms current approaches used by law enforcement in the case where the network is discovered over time by virtue of new arrests - mimicking real-world law-enforcement operations. Operational issues are also discussed as we prepare to leverage this method in an operational environment.
【Keywords】: criminology; social network analysis
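A simple illustration of deriving per-individual features from a co-arrest network with standard centrality measures, which could then feed a classifier. The paper's modified centrality measures additionally fold in arrest metadata, which is omitted here, and the toy network is invented.

    import networkx as nx

    def centrality_features(co_arrests):
        """co_arrests: iterable of (person_a, person_b) pairs from arrest records.
        Returns a [degree, betweenness, closeness] feature vector per person."""
        G = nx.Graph()
        G.add_edges_from(co_arrests)
        degree = nx.degree_centrality(G)
        betweenness = nx.betweenness_centrality(G)
        closeness = nx.closeness_centrality(G)
        return {node: [degree[node], betweenness[node], closeness[node]]
                for node in G.nodes()}

    if __name__ == "__main__":
        edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d"), ("d", "e")]
        for person, feats in centrality_features(edges).items():
            print(person, [round(f, 3) for f in feats])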
【Paper Link】 【Pages】:2089-2097
【Authors】: Vinay Shashidhar ; Nishant Pandey ; Varun Aggarwal
【Abstract】: In this paper, we address the problem of grading spontaneous speech using a combination of machine learning and crowdsourcing. Traditional machine learning techniques solve the stated problem inadequately because automatic speaker-independent speech transcription is inaccurate. The features derived from it are also inaccurate, and so is the machine learning model developed for speech evaluation. We propose a framework that combines machine learning with crowdsourcing. This entails identifying human intelligence tasks in the feature derivation step and using crowdsourcing to get them completed. We post the task of speech transcription to a large community of online workers (the crowd). We also get spoken English grades from the crowd. We achieve 95% transcription accuracy by combining transcriptions from multiple crowd workers. Speech and prosody features are derived by force-aligning the speech samples on these highly accurate transcriptions. Additionally, we derive surface and semantic level features directly from the transcription. We demonstrate the efficacy of our approach by predicting expert-graded speech samples of 566 adult non-native speakers across two different countries - India and the Philippines. Using the regression modeling technique, we are able to achieve a Pearson correlation of 0.79 on the Philippines set and 0.74 on the Indian set with expert grades, an accuracy much higher than any previously reported machine learning approach. Our approach has an accuracy that rivals that of expert agreement. We show the value of the system through a case study in a real-world industrial deployment. This work is timely given the huge requirement for spoken English training and assessment.
【Keywords】: crowdsourcing; machine learning; speech recognition; spoken english evaluation; spontaneous speech grading
【Paper Link】 【Pages】:2099-2108
【Authors】: Jianqiang Shen ; Sahin Cem Geyik ; Ali Dasdan
【Abstract】: In digital advertising, advertisers want to reach the right audience over media channels such as display, mobile, video, or social at the appropriate cost. The right audience for an advertiser consists of existing customers as well as valuable prospects, those that can potentially be turned into future customers. Identifying valuable prospects is called the audience extension problem because advertisers find new customers by extending the desirable criteria for their starting point, which is their existing audience or customers. The complexity of the audience extension problem stems from the difficulty of defining desirable criteria objectively, the number of desirable criteria (such as similarity, diversity, performance) to simultaneously satisfy, and the expected runtime (a few minutes) to find a solution over billions of cookie-based users. In this paper, we formally define the audience extension problem, propose an algorithm that extends a given audience set efficiently under multiple desirable criteria, and experimentally validate its performance. Instead of iterating over individual users, the algorithm takes in Boolean rules that define the seed audience and returns a new set of Boolean rules that corresponds to the extended audience that satisfy the multiple criteria.
【Keywords】: audience extension; online advertising; targeting
【Paper Link】 【Pages】:2109-2118
【Authors】: Virginia Smith ; Miriam Connor ; Isabelle Stanton
【Abstract】: tl;dr: Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in terms of topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale study with quantifiable and measurable information on longform, giving new insight into questions posed by the media on the past and current state of this famed literary medium.
【Keywords】: feature engineering; machine learning; natural language processing; web mining
【Paper Link】 【Pages】:2119-2126
【Authors】: Sriram Somanchi ; Samrachana Adhikari ; Allen Lin ; Elena Eneva ; Rayid Ghani
【Abstract】: Code Blue is an emergency code used in hospitals to indicate when a patient goes into cardiac arrest and needs resuscitation. When Code Blue is called, an on-call medical team staffed by physicians and nurses is paged and rushes in to try to save the patient's life. It is an intense, chaotic, and resource-intensive process, and despite the considerable effort, survival rates are still less than 20% [4]. Research indicates that patients actually start showing clinical signs of deterioration some time before going into cardiac arrest [1][2][3], making early prediction, and possibly intervention, feasible. In this paper, we describe our work, in partnership with NorthShore University HealthSystem, that preemptively flags patients who are likely to go into cardiac arrest, using signals extracted from demographic information, hospitalization history, vitals and laboratory measurements in patient-level electronic medical records. We find that early prediction of Code Blue is possible, and our methods perform significantly better than the state-of-the-art method currently used by hospitals (MEWS - Modified Early Warning Score) [4]. Based on these results, this system is now being considered for deployment in hospital settings.
【Keywords】: cardiac arrest; code blue; data mining; early prediction; electronic medical records; machine learning; svm
【Paper Link】 【Pages】:2127-2136
【Authors】: Nemanja Spasojevic ; Zhisheng Li ; Adithya Rao ; Prantik Bhattacharyya
【Abstract】: For many users on social networks, one of the goals when broadcasting content is to reach a large audience. The probability of receiving reactions to a message differs for each user and depends on various factors, such as location, daily and weekly behavior patterns and the visibility of the message. While previous work has focused on overall network dynamics and message flow cascades, the problem of recommending personalized posting times has remained an under-explored topic of research. In this study, we formulate a when-to-post problem, where the objective is to find the best times for a user to post on social networks in order to maximize the probability of audience responses. To understand the complexity of the problem, we examine user behavior in terms of post-to-reaction times, and compare cross-network and cross-city weekly reaction behavior for users in different cities, on both Twitter and Facebook. We perform this analysis on over a billion posted messages and observed reactions, and propose multiple approaches for generating personalized posting schedules. We empirically assess these schedules on a sampled user set of 0.5 million active users and more than 25 million messages observed over a 56 day period. We show that users see a reaction gain of up to 17% on Facebook and 4% on Twitter when the recommended posting times are used. We open the dataset used in this study, which includes timestamps for over 144 million posts and over 1.1 billion reactions. The personalized schedules derived here are used in a fully deployed production system to recommend posting times for millions of users every day.
【Keywords】: behavior analysis; online social networks; personalization; posting times; recommender systems; user modeling
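A toy version of the when-to-post idea: estimate, for a single user, the historical reaction rate of each (weekday, hour) slot and recommend the top slots. The paper evaluates several schedule-generation approaches; this is only the simplest empirical-rate variant, over hypothetical log records.

    from collections import defaultdict

    def best_posting_slots(posts, top_n=3):
        """posts: list of (weekday, hour, n_reactions) tuples for one user."""
        totals = defaultdict(lambda: [0, 0])          # slot -> [reactions, posts]
        for weekday, hour, reactions in posts:
            totals[(weekday, hour)][0] += reactions
            totals[(weekday, hour)][1] += 1
        rates = {slot: r / n for slot, (r, n) in totals.items()}
        return sorted(rates, key=rates.get, reverse=True)[:top_n]

    if __name__ == "__main__":
        history = [("Mon", 9, 4), ("Mon", 9, 6), ("Tue", 20, 15),
                   ("Tue", 20, 11), ("Sat", 13, 2)]
        print(best_posting_slots(history))   # expected to rank ("Tue", 20) first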
【Paper Link】 【Pages】:2137-2146
【Authors】: Andrew Stanton ; Amanda Thart ; Ashish Jain ; Priyank Vyas ; Arpan Chatterjee ; Paulo Shakarian
【Abstract】: The Islamic State of Iraq and al-Sham (ISIS) is a dominant insurgent group operating in Iraq and Syria that rose to prominence when it took over Mosul in June 2014. In this paper, we present a data-driven approach to analyzing this group using a dataset consisting of 2200 incidents of military activity surrounding ISIS and the forces that oppose it (including Iraqi, Syrian, and the American-led coalition). We combine ideas from logic programming and causal reasoning to mine for association rules for which we present evidence of causality. We present relationships that link ISIS vehicle-borne improvised explosive device (VBIED) activity in Syria with military operations in Iraq, coalition air strikes, and ISIS IED activity, as well as rules that may serve as indicators of spikes in indirect fire, suicide attacks, and arrests.
【Keywords】: causality; rule learning; security informatics
【Paper Link】 【Pages】:2147-2156
【Authors】: Qian Sun ; Mohammad Shafkat Amin ; Baoshi Yan ; Craig Martell ; Vita Markman ; Anmol Bhasin ; Jieping Ye
【Abstract】: LinkedIn Groups provide a platform on which professionals with similar backgrounds, targets, and specialities can share content, take part in discussions, and establish opinions on industry topics. As in most online social communities, spam content in LinkedIn Groups poses great challenges to the user experience and could eventually lead to substantial loss of active users. Building an intelligent and scalable spam detection system is highly desirable but faces difficulties such as a lack of labeled training data, particularly for languages other than English. In this paper, we take spam (Spanish) job-posting detection as the target problem and build a generic machine learning pipeline for multi-lingual spam detection. The main components are feature generation and knowledge migration via transfer learning. Specifically, in the feature generation phase, a relatively large labeled data set is generated via machine translation. Together with a large set of unlabeled human-written Spanish data, unigram features are generated based on frequency. In the second phase, machine-translated data are properly reweighted to capture their discrepancy from human-written data, and classifiers can be built on top of them. To make effective use of the small portion of labeled data available in human-written Spanish, an adaptive transfer learning algorithm is proposed to further improve the performance. We evaluate the proposed method on LinkedIn's production data, and the promising results verify the efficacy of our proposed algorithm. The pipeline is ready for production.
【Keywords】: classification; nlp; text mining; transfer learning
【Paper Link】 【Pages】:2157-2166
【Authors】: Vincent S. Tseng ; Jia-Ching Ying ; Che-Wei Huang ; Yimin Kao ; Kuan-Ta Chen
【Abstract】: In recent years, fraud has been increasing rapidly with the development of modern technology and global communication. Although many studies have addressed the fraud detection problem, these existing works focus only on formulating fraud detection as a binary classification problem. Due to the limited information provided by telecommunication records, such classifier-based approaches for fraudulent phone call detection normally do not work well. In this paper, we develop a graph-mining-based fraudulent phone call detection framework for a mobile application that automatically annotates fraudulent phone numbers with a "fraud" tag, which is a crucial prerequisite for distinguishing fraudulent phone calls from normal phone calls. Our detection approach performs a weighted HITS algorithm to learn the trust value of a remote phone number. Based on telecommunication records, we build two kinds of directed bipartite graphs, i) CPG and ii) UPG, to represent the telecommunication behavior of users. To weight the edges of the CPG and UPG, we extract features for each pair of user and remote phone number in two different yet complementary aspects: 1) duration relatedness (DR) between user and phone number; and 2) frequency relatedness (FR) between user and phone number. Upon the weighted CPG and UPG, we determine a trust value for each remote phone number. Finally, we conduct a comprehensive experimental study based on a dataset collected through an anti-fraud mobile application, Whoscall. The results demonstrate the effectiveness of our weighted HITS-based approach and show the strength of taking both DR and FR into account in feature extraction.
【Keywords】: fraudulent phone call detection; telecommunication fraud; trust value mining; weighted hits algorithm
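A minimal weighted HITS iteration on a user/phone-number bipartite graph, in the spirit of assigning each remote number a trust value; low-trust numbers would then be candidates for a "fraud" tag. The edge weights below are arbitrary stand-ins for duration and frequency relatedness, and the toy call graph is invented.

    def weighted_hits(edges, iters=50):
        """edges: dict mapping (user, phone) -> weight."""
        users = {u for u, _ in edges}
        phones = {p for _, p in edges}
        trust = {p: 1.0 for p in phones}      # "authority" score of a phone number
        experience = {u: 1.0 for u in users}  # "hub" score of a user
        for _ in range(iters):
            trust = {p: sum(w * experience[u] for (u, q), w in edges.items() if q == p)
                     for p in phones}
            experience = {u: sum(w * trust[p] for (v, p), w in edges.items() if v == u)
                          for u in users}
            # Normalize to keep scores bounded across iterations.
            t_norm = sum(trust.values()) or 1.0
            e_norm = sum(experience.values()) or 1.0
            trust = {p: s / t_norm for p, s in trust.items()}
            experience = {u: s / e_norm for u, s in experience.items()}
        return trust

    if __name__ == "__main__":
        calls = {("alice", "+886-2-1234"): 0.9, ("bob", "+886-2-1234"): 0.8,
                 ("alice", "+886-9-0000"): 0.1, ("carol", "+886-9-0000"): 0.05}
        print(weighted_hits(calls))   # the heavily-weighted number scores higher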
【Paper Link】 【Pages】:2167-2176
【Authors】: Liudmila Ulanova ; Tan Yan ; Haifeng Chen ; Guofei Jiang ; Eamonn J. Keogh ; Kai Zhang
【Abstract】: The long-term operation of physical systems inevitably leads to their wearing out, and may cause degradations in performance or the unexpected failure of the entire system. To reduce the possibility of such unanticipated failures, the system must be monitored for tell-tale symptoms of degradation that are suggestive of imminent failure. In this work, we introduce a novel time series analysis technique that allows the decomposition of a time series into trend and fluctuation components, providing the monitoring software with actionable information about changes in the system's behavior over time. We analyze the underlying problem and formulate it as a Quadratic Programming (QP) problem that can be solved with existing QP solvers. However, when the profiling resolution is high, as generally required by real-world applications, such a decomposition becomes intractable for general QP solvers. To speed up problem solving, we further transform the problem and present a novel QP formulation, Non-negative QP, and demonstrate a tractable solution that bypasses the use of slow general QP solvers. We demonstrate our ideas on both synthetic and real datasets, showing that our method allows us to accurately extract the degradation phenomenon from time series. We further demonstrate the generality of our ideas by applying them beyond classic machine prognostics to problems in identifying the influence of news events on currency exchange rates and stock prices. We fully implement our profiling system and deploy it in several physical systems, such as chemical plants and nuclear power plants, where it greatly helps detect degradation phenomena and diagnose the corresponding components.
【Keywords】: degradation profiling; optimization; time series analysis
【Paper Link】 【Pages】:2177-2185
【Authors】: Bhanu Chandra Vattikonda ; Santhosh Kodipaka ; Hongyan Zhou ; Vacha Dave ; Saikat Guha ; Alex C. Snoeren
【Abstract】: Search engines derive revenue by displaying sponsored results along with organic results in response to user queries. In general, search engines run a per-query, on-line auction amongst interested advertisers to select sponsored results to display. In doing so, they must carefully balance the revenue derived from sponsored results against potential degradation in user experience due to less-relevant results. Hence, major search engines attempt to analyze the relevance of potential sponsored results to the user's query using supervised learning algorithms. Past work has employed a bag-of-words approach using features extracted from both the query and potential sponsored result to train the ranker. We show that using features that capture the advertiser's intent can significantly improve the performance of relevance ranking. In particular, we consider the ad keyword the advertiser submits as part of the auction process as a direct expression of intent. We leverage the search engine itself to interpret the ad keyword by submitting the ad keyword as an independent query and incorporating the results as features when determining the relevance of the advertiser's sponsored result to the user's original query. We achieve 43.2% improvement in precision-recall AUC over the best previously published baseline and 2.7% improvement in the production system of a large search engine.
【Keywords】: ad relevance; online advertising; sponsored search
【Paper Link】 【Pages】:2187-2196
【Authors】: Vasilis Verroios ; Panagiotis Papadimitriou ; Ramesh Johari ; Hector Garcia-Molina
【Abstract】: An important problem that online work marketplaces face is grouping clients into clusters, so that in each cluster clients are similar with respect to their hiring criteria. Such a separation allows the marketplace to "learn" more accurately the hiring criteria in each cluster and recommend the right contractor to each client, for a successful collaboration. We propose a Maximum Likelihood definition of the "optimal" client clustering along with an efficient Expectation-Maximization clustering algorithm that can be applied in large marketplaces. Our results on the job hirings at oDesk over a seven-month period show that our client-clustering approach yields significant gains compared to "learning" the same hiring criteria for all clients. In addition, we analyze the clustering results to find interesting differences between the hiring criteria in the different groups of clients.
【Keywords】: client clustering; crowdsourcing; hiring criteria; hiring modeling; job marketplaces; logistic regression clustering; outsourcing; sparse logistic regression; work marketplaces
【Paper Link】 【Pages】:2197-2206
【Authors】: Qing Wang ; Hengshu Zhu ; Wei Hu ; Zhiyong Shen ; Yuan Yao
【Abstract】: Analyzing team tactics plays an important role in the professional soccer industry. Recently, the growing ability to track the mobility of the ball and players has made it possible to accumulate extensive match logs, which open an avenue for better tactical analysis. However, traditional methods for tactical analysis largely rely on the knowledge and manual labor of domain experts. To this end, in this paper we propose an unsupervised approach to automatically discerning the typical tactics, i.e., tactical patterns, of soccer teams by mining historical match logs. To be specific, we first develop a novel model named the Team Tactic Topic Model (T3M) for learning latent tactical patterns, which can model the locations and passing relations of players simultaneously. Furthermore, we demonstrate several potential applications enabled by the proposed T3M, such as automatic tactical pattern discovery, pass segment annotation, and spatial analysis of player roles. Finally, we implement an intelligent demo system to empirically evaluate our approach based on data collected from La Liga 2013-2014. Indeed, by visualizing the results obtained from T3M, we can observe many meaningful tactical patterns and interesting discoveries, such as the tactics with which a team is more likely to score a goal and how a team's playing tactics change across sequential matches in a season.
【Keywords】: professional soccer; tactical patterns; topic model
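T3M itself is the authors' custom model; as a loose, hedged analogue of the topic-modelling step it builds on, the sketch below treats each match (or match segment) as a "document" of discretized pass tokens (e.g. "zone3->zone5") and learns latent "tactics" as topics with an off-the-shelf LDA. The tokenization scheme and function names are illustrative assumptions.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def mine_tactical_topics(pass_token_docs, n_tactics=8):
        """pass_token_docs: one whitespace-joined string of pass tokens per match,
        e.g. "zone3->zone5 zone5->zone8 zone8->shot"."""
        vec = CountVectorizer(token_pattern=r"\S+")
        counts = vec.fit_transform(pass_token_docs)
        lda = LatentDirichletAllocation(n_components=n_tactics, random_state=0)
        doc_topics = lda.fit_transform(counts)        # per-match tactic mixture
        return doc_topics, lda.components_, vec.get_feature_names_out()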
【Paper Link】 【Pages】:2207-2215
【Authors】: Xinyu Wei ; Patrick Lucey ; Stuart Morgan ; Peter Carr ; Machar Reid ; Sridha Sridharan
【Abstract】: In professional sport, an enormous amount of fine-grain performance data can be generated at near millisecond intervals in the form of vision-based tracking data. One of the first sports to embrace this technology has been tennis, where Hawk-Eye technology has been used both to aid umpiring decisions and to visualize shot trajectories for broadcast purposes. These data have tremendous untapped applications in "opponent planning", where a large amount of recent data is used to learn the contextual behavior patterns of individual players and ultimately predict the likelihood of a particular type of serve. The type of serve a player selects may be contingent on the match context (e.g., whether the player is facing break point or serving for the match), the characteristics of the player (e.g., a very fast serve, heavy topspin or kick, or slice serves into the body), and the characteristics of the opponent (e.g., a preference for playing from the baseline or for "chip-and-charge" play at the net). In this paper we present a method which recommends the most likely serves of a player in a given context. We show that by utilizing a "style prior" we can improve the prediction/recommendation. Such an approach also allows us to quantify the similarity between players, which is useful for enriching the dataset for future prediction. We conduct our analysis on Hawk-Eye data collected from three recent Australian Open Grand Slam tournaments and show how our approach can be used in practice.
【Keywords】: ball tracking; hawk-eye; player behaviour analysis; serve prediction; style; tennis
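An illustrative Dirichlet-multinomial reading of the "style prior" idea above: a player's serve-type distribution in a given context is smoothed toward a prior pooled from stylistically similar players. This is a hedged simplification for intuition, not the authors' exact model; the function name and prior-strength parameter are assumptions.

    import numpy as np

    def predict_serve_distribution(context_counts, style_prior, alpha=5.0):
        """context_counts: observed counts of each serve type for this player in
        this match context; style_prior: serve-type probabilities pooled from
        similar players; alpha: how strongly to trust the style prior."""
        counts = np.asarray(context_counts, dtype=float)
        posterior = counts + alpha * np.asarray(style_prior, dtype=float)
        return posterior / posterior.sum()

    # e.g. predict_serve_distribution([3, 1, 0], style_prior=[0.5, 0.3, 0.2])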
【Paper Link】 【Pages】:2217-2226
【Authors】: Jian Xu ; Kuang-chih Lee ; Wentong Li ; Hang Qi ; Quan Lu
【Abstract】: In targeted online advertising, advertisers seek to maximize campaign performance under delivery constraints within a budget schedule. Most advertisers prefer to impose a delivery constraint that spends the budget smoothly over time, in order to reach a wider range of audiences and have a sustained impact. Since a large share of impressions is traded through public auctions today, price elasticity and the bid landscape between demand and supply change quite dynamically. It is therefore challenging to perform smooth pacing control and maximize campaign performance simultaneously. In this paper, we propose a smart pacing approach in which the delivery pace of each campaign is learned from both offline and online data to achieve smooth delivery and optimal performance goals. We also present the implementation of the proposed approach in a real DSP system. Experimental evaluations on both real online ad campaigns and offline simulations show that our approach can effectively improve campaign performance and achieve delivery goals.
【Keywords】: budget pacing; campaign optimization; demand-side platform
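A generic feedback-control sketch of smooth budget pacing, not the paper's "smart pacing" algorithm: the bid participation rate is nudged up or down so that cumulative spend tracks a uniform delivery schedule. The step size and rate bounds are illustrative assumptions.

    def update_pacing_rate(pacing_rate, spend_so_far, budget, elapsed_frac,
                           step=0.1, min_rate=0.01, max_rate=1.0):
        """pacing_rate: current probability of participating in an auction;
        elapsed_frac: fraction of the budget schedule that has elapsed."""
        target_spend = budget * elapsed_frac          # uniform target spend curve
        if spend_so_far > target_spend:               # ahead of schedule: slow down
            pacing_rate *= (1.0 - step)
        else:                                         # behind schedule: speed up
            pacing_rate *= (1.0 + step)
        return min(max(pacing_rate, min_rate), max_rate)

    # Called periodically (e.g. every few minutes) per campaign; bids are then
    # submitted with probability pacing_rate.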
【Paper Link】 【Pages】:2227-2236
【Authors】: Ya Xu ; Nanyu Chen ; Addrian Fernandez ; Omar Sinno ; Anmol Bhasin
【Abstract】: A/B testing, also known as bucket testing, split testing, or controlled experimentation, is a standard way to evaluate user engagement or satisfaction with a new service, feature, or product. It is widely used by online websites, including social network sites such as Facebook, LinkedIn, and Twitter, to make data-driven decisions. At LinkedIn, we have seen tremendous growth of controlled experiments over time, with now over 400 concurrent experiments running per day. General A/B testing frameworks and methodologies, including challenges and pitfalls, have been discussed extensively in several previous KDD papers [7, 8, 9, 10]. In this paper, we describe in depth the experimentation platform we have built at LinkedIn and the challenges that arise particularly when running A/B tests at large scale in a social network setting. We start with an introduction to the experimentation platform and how it handles each step of the A/B testing process at LinkedIn, from designing and deploying experiments to analyzing them. We then discuss several more sophisticated A/B testing scenarios, such as running offline experiments and addressing the network effect, where one user's action can influence that of another. Lastly, we discuss features and processes that are crucial for building a strong experimentation culture.
【Keywords】: a/b testing; controlled experiments; measurement; network a/b testing; online experiments; social network
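For context, the sketch below shows the most basic analysis underlying an A/B test: a two-proportion z-test comparing a binary metric (e.g., click-through) between treatment and control. A production platform such as the one described above layers variance reduction, multiple-testing control, and network-effect corrections on top of this; the numbers in the usage comment are hypothetical.

    from math import sqrt
    from scipy.stats import norm

    def two_proportion_ztest(conv_t, n_t, conv_c, n_c):
        """conv_t/n_t: conversions and sample size in treatment; conv_c/n_c: control."""
        p_t, p_c = conv_t / n_t, conv_c / n_c
        p_pool = (conv_t + conv_c) / (n_t + n_c)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
        z = (p_t - p_c) / se
        p_value = 2 * norm.sf(abs(z))                 # two-sided p-value
        return p_t - p_c, z, p_value

    # e.g. two_proportion_ztest(conv_t=1250, n_t=100000, conv_c=1180, n_c=100000)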
【Paper Link】 【Pages】:2237-2246
【Authors】: Kui Yu ; Dawei Wang ; Wei Ding ; Jian Pei ; David L. Small ; Shafiqul Islam ; Xindong Wu
【Abstract】: Reliable tornado forecasting with a long lead time can greatly support emergency response and is of vital importance for the economy and society. The large number of meteorological variables in spatiotemporal domains and the complex relationships among them remain the top difficulties for long-lead tornado forecasting. Standard data mining approaches to high dimensionality are usually designed to discover a single set of features, without alternative options that let domain scientists select more reliable and physically interpretable variables. In this work, we provide a new solution that uses the concept of multiple Markov boundaries in local causal discovery to identify multiple sets of precursors for tornado forecasting. Specifically, our algorithm first confines the extremely large feature space to a small core feature space, and then mines from that core multiple sets of precursors that may contribute equally to tornado forecasting. With these multiple sets of precursors, we are able to report to domain scientists the predictive yet practical set of precursors. An extensive empirical study is conducted on eight benchmark data sets and on historical tornado data near Oklahoma City, OK in the United States. Experimental results show that the tornado precursors we identify can help improve the reliability of long-lead-time forecasting of catastrophic tornadoes.
【Keywords】: core feature space; distributed feature data; multiple markov boundaries; tornado forecasting
【Paper Link】 【Pages】:2247-2256
【Authors】: Chao Yuan ; Matthias Behmann ; Bernhard Meerbeck
【Abstract】: The goal of combustion optimization of a coal-fired boiler is to improve its operating efficiency while reducing emissions at the same time. Being able to measure key combustion ingredients, such as O2, CO, and H2O, is crucial for the feedback loop this task requires. One state-of-the-art laser technique, Tunable Diode Laser Absorption Spectroscopy (TDLAS), can measure the average value of gas concentration along a laser beam path. An active research direction in TDLAS is how to reconstruct gas concentration images from these path averages. In reality, however, the number of such paths is usually very limited, leading to an extremely under-constrained estimation problem. Another, often overlooked, aspect of the problem is how to arrange the paths so that the reconstructed image is more accurate. We propose a Bayesian approach based on Gaussian processes (GPs) to address both the image reconstruction and path arrangement problems simultaneously. Specifically, we use the GP posterior mean as the reconstructed image, and the average posterior pixel variance as the objective function for optimizing the path arrangement. Our algorithms have been integrated into the Siemens SPPA-P3000 control system, which provides real-time combustion optimization of boilers around the world.
【Keywords】: combustion optimization; gaussian process; tdlas
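A linear-Gaussian sketch of the two quantities the abstract describes: with a GP prior over pixel concentrations and each laser path observing a noisy weighted average of pixels (one row of A), the posterior mean gives a reconstructed image and the average posterior pixel variance scores a candidate path arrangement. The RBF kernel, noise level, and matrix shapes are illustrative assumptions, not the paper's exact configuration.

    import numpy as np

    def rbf_kernel(coords, lengthscale=2.0, var=1.0):
        """coords: (pixels, 2) pixel-center coordinates."""
        d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
        return var * np.exp(-0.5 * d2 / lengthscale ** 2)

    def gp_reconstruct(A, y, K, noise_var=1e-2):
        """A: (paths, pixels) path-averaging matrix; y: (paths,) measurements;
        K: (pixels, pixels) prior covariance, e.g. from rbf_kernel."""
        S = A @ K @ A.T + noise_var * np.eye(A.shape[0])
        mean = K @ A.T @ np.linalg.solve(S, y)            # reconstructed image (flattened)
        cov = K - K @ A.T @ np.linalg.solve(S, A @ K)
        return mean, float(np.diag(cov).mean())           # image, path-layout score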
【Paper Link】 【Pages】:2257-2266
【Authors】: Weinan Zhang ; Amr Ahmed ; Jie Yang ; Vanja Josifovski ; Alexander J. Smola
【Abstract】: Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g., purchase or event details) into templates. Extracting structured data from B2C emails allows users to track important information across devices. However, it also poses several challenges, due to the short response times required for massive data volumes, the diversity and complexity of templates, and privacy and legal constraints. Most notably, email data is legally protected content, which means no one except the receiver can review the messages or information derived from them. In this paper we first introduce a system which extracts structured information automatically without requiring human review of any personal content. We then focus on how to annotate product names in the extracted texts, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structured learning methods, such as Conditional Random Fields (CRFs), solve this problem well. To accomplish this task, we propose a hybrid approach that trains a CRF model using the labels predicted by binary classifiers (weak learners). Because the performance of the weak learners can be low, we apply the Expectation-Maximization (EM) algorithm to the CRF to remove noise and improve accuracy, without the need to label or inspect specific emails. In our experiments, the EM-CRF model significantly improves product name annotation over both the weak learners and plain CRFs.
【Keywords】: business intelligence; structured information extraction
【Paper Link】 【Pages】:2267-2276
【Authors】: Yu Zheng ; Xiuwen Yi ; Ming Li ; Ruiyuan Li ; Zhangqing Shan ; Eric Chang ; Tianrui Li
【Abstract】: In this paper, we forecast the readings of an air quality monitoring station over the next 48 hours, using a data-driven method that considers current meteorological data, weather forecasts, and air quality data from the station itself and from other stations within a few hundred kilometers. Our predictive model comprises four major components: 1) a linear regression-based temporal predictor to model the local factors of air quality, 2) a neural network-based spatial predictor to model global factors, 3) a dynamic aggregator that combines the predictions of the spatial and temporal predictors according to meteorological data, and 4) an inflection predictor to capture sudden changes in air quality. We evaluate our model with data from 43 cities in China, surpassing the results of multiple baseline methods. We have deployed a system with the Chinese Ministry of Environmental Protection, providing 48-hour fine-grained air quality forecasts for four major Chinese cities every hour. The forecast function is also available on Microsoft Bing Maps and the Microsoft Azure cloud platform. Our technology is general and can be applied to other cities around the world.
【Keywords】: air quality forecast; big data; urban air; urban computing
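An illustrative sketch of the "dynamic aggregator" idea, not the deployed system: a gate driven by current meteorological features decides how much to trust the local temporal predictor versus the spatial (neighbouring-station) predictor for each forecast. The class name and the logistic-regression gate are assumptions for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class DynamicAggregator:
        def __init__(self):
            self.gate = LogisticRegression(max_iter=1000)

        def fit(self, met_features, temporal_pred, spatial_pred, y_true):
            # Train the gate to prefer whichever predictor was more accurate
            # under the observed meteorological conditions.
            spatial_better = (np.abs(spatial_pred - y_true) <
                              np.abs(temporal_pred - y_true)).astype(int)
            self.gate.fit(met_features, spatial_better)
            return self

        def predict(self, met_features, temporal_pred, spatial_pred):
            w = self.gate.predict_proba(met_features)[:, 1]   # weight on spatial predictor
            return w * spatial_pred + (1 - w) * temporal_pred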
【Paper Link】 【Pages】:2277-2286
【Authors】: Erheng Zhong ; Nathan Liu ; Yue Shi ; Suju Rajan
【Abstract】: Content recommendation systems are typically based on one of the following paradigms: user-based customization; recommendations based on collaborative filtering or low-rank matrix factorization; or systems that impute user interest profiles from content browsing behavior and retrieve items similar to those profiles. All of these systems share a distinct disadvantage, namely data sparsity and cold-start on items or users. Furthermore, very few content recommendation solutions explicitly model the wealth of information in implicit negative feedback from users. In this paper, we propose a hybrid solution that uses a latent factor model to infer user interest vectors. The hybrid approach enables us to overcome both the data sparsity and cold-start problems. Our proposed method is learned purely from implicit user feedback, both positive and negative. Exploiting the information in the negative feedback allows the generated user profiles to be discriminative. We also provide a Map/Reduce-based implementation that scales our solution to real-world recommendation problems. We demonstrate the efficacy of our proposed approach with both offline experiments and A/B tests on live traffic on Yahoo properties.
【Keywords】: collaborative filtering; content-based recommendation; large-scale recommender systems; latent factor models; user profiling
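A compact single-machine sketch of a latent factor model trained on implicit positive and negative feedback with a logistic loss, in the spirit of the approach above; the paper's actual model and its Map/Reduce training are more involved, and the hyperparameters here are arbitrary assumptions.

    import numpy as np

    def train_latent_factors(interactions, n_users, n_items, dim=32,
                             lr=0.05, reg=0.01, epochs=10, seed=0):
        """interactions: list of (user_id, item_id, label) with label 1 for
        positive and 0 for negative implicit feedback."""
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((n_users, dim))
        V = 0.1 * rng.standard_normal((n_items, dim))
        for _ in range(epochs):
            for idx in rng.permutation(len(interactions)):
                u, i, label = interactions[idx]
                pred = 1.0 / (1.0 + np.exp(-U[u] @ V[i]))   # logistic prediction
                err = label - pred                          # gradient of log loss
                u_row = U[u].copy()
                U[u] += lr * (err * V[i] - reg * U[u])
                V[i] += lr * (err * u_row - reg * V[i])
        return U, V

    # Scoring: the affinity of user u for item i is the sigmoid of U[u] @ V[i].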
【Paper Link】 【Pages】:2287-2296
【Authors】: Wenliang Zhong ; Rong Jin ; Cheng Yang ; Xiaowei Yan ; Qi Zhang ; Qiang Li
【Abstract】: A large number of recommender systems have been developed to serve users with interesting news, ads, products, or other content. One main limitation of existing work is that it does not take into account the inventory sizes of the items being recommended. As a result, popular items are likely to go out of stock soon after being recommended and sold to many users, significantly affecting the impact of recommendation and the user experience. This observation motivates us to develop a novel inventory-aware recommender system. It jointly optimizes the recommended items for all users based on both user preference and the inventory sizes of different items. It requires solving a non-smooth optimization problem that involves estimating an n x n matrix, where n is the number of items. With the proliferation of items, this approach can quickly become computationally infeasible. We address this challenge by developing a dual method that reduces the number of variables from n^2 to n, significantly improving computational efficiency. We also extend this approach to the online setting, which is particularly important for big promotion events. Our empirical studies on real benchmark data with 100 million user visits from Tmall verify the effectiveness of the proposed approach.
【Keywords】: online learning; recommendation; tmall
【Paper Link】 【Pages】:2297-2303
【Authors】: Zhengyi Zhou ; David S. Matteson
【Abstract】: Predicting ambulance demand accurately at fine time and location scales is critical for ambulance fleet management and dynamic deployment. Large-scale datasets in this setting typically exhibit complex spatio-temporal dynamics and sparsity at high resolutions. We propose a predictive method using spatio-temporal kernel density estimation (stKDE) to address these challenges, and provide spatial density predictions for ambulance demand in Toronto, Canada as it varies over hourly intervals. Specifically, we weight the spatial kernel of each historical observation by its informativeness to the current predictive task. We construct spatio-temporal weight functions to incorporate various temporal and spatial patterns in ambulance demand, including location-specific seasonalities and short-term serial dependence. This allows us to draw out the most helpful historical data, and exploit spatio-temporal patterns in the data for accurate and fast predictions. We further provide efficient estimation and customizable prediction procedures. stKDE is easy to use and interpret by non-specialized personnel from the emergency medical service industry. It also has significantly higher statistical accuracy than the current industry practice, with a comparable amount of computational expense.
【Keywords】: emergency medical service; kernel density estimation; non-homogeneous poisson point process
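A bare-bones weighted kernel density sketch of the stKDE idea above: each historical observation contributes a Gaussian spatial kernel, scaled by a weight encoding how informative it is for the current prediction task (e.g., same hour of day, same weekday, recency). The bandwidth and weighting scheme are illustrative assumptions, not the paper's fitted values.

    import numpy as np

    def weighted_kde(eval_points, hist_points, weights, bandwidth=0.5):
        """eval_points: (m, 2) locations to score; hist_points: (n, 2) historical
        demand locations; weights: (n,) informativeness weights."""
        d2 = ((eval_points[:, None, :] - hist_points[None, :, :]) ** 2).sum(-1)
        kern = np.exp(-0.5 * d2 / bandwidth ** 2) / (2 * np.pi * bandwidth ** 2)
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        return kern @ w          # predicted demand density at each evaluation point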
【Paper Link】 【Pages】:2307-2308
【Authors】: Shlomo Berkovsky ; Jill Freyne
【Abstract】: The quantity of accessible information has been growing rapidly and has far exceeded human processing capabilities. The sheer abundance of information often prevents users from discovering the desired information, or hinders them from making informed and correct choices. This highlights the pressing need for intelligent personalized applications that simplify information access and discovery by taking into account users' preferences and needs. One type of personalized application that has recently become tremendously popular in research and industry is the recommender system. Recommender systems provide users with personalized recommendations about information and products they may be interested in examining or purchasing. Extensive research into recommender systems has yielded a variety of techniques, which have been published at a variety of conferences and adopted by numerous websites. This tutorial will provide participants with a broad overview and thorough understanding of algorithms and practically deployed Web and mobile applications of personalized technologies.
【Keywords】: recommender systems; user modeling; web personalization
【Paper Link】 【Pages】:2309-2310
【Authors】: Alex Beutel ; Leman Akoglu ; Christos Faloutsos
【Abstract】: How can we model users' preferences? How do anomalies, fraud, and spam affect our models of normal users? How can we modify our models to catch fraudsters? In this tutorial we will answer these questions, connecting graph analysis tools for user behavior modeling to anomaly and fraud detection. In particular, we will focus on the application of subgraph analysis, label propagation, and latent factor models to static, evolving, and attributed graphs. For each of these techniques we will give a brief explanation of the algorithms and the intuition behind them. We will then give examples of recent research using these techniques to model, understand, and predict normal behavior. With this intuition for how these methods are applied to graphs and user behavior, we will focus on state-of-the-art research showing how the outcomes of these methods are affected by fraud, and how they have been used to catch fraudsters.
【Keywords】: anomalous behavior; fraud detection; outlier detection; recommendation systems; user behavior modeling
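A small label-propagation sketch, one of the graph techniques the tutorial above connects to fraud detection: scores from a few labelled fraud/honest accounts are iteratively spread to their neighbours while the labelled seeds stay clamped. The scoring convention and iteration count are illustrative assumptions.

    from collections import defaultdict

    def propagate_labels(edges, seed_scores, n_iter=20):
        """edges: list of (u, v) pairs; seed_scores: dict node -> score in [0, 1]
        for labelled accounts (1 = known fraud, 0 = known honest)."""
        adj = defaultdict(list)
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        scores = {node: seed_scores.get(node, 0.5) for node in adj}
        for _ in range(n_iter):
            new_scores = dict(scores)
            for node, nbrs in adj.items():
                if node not in seed_scores:               # labelled seeds stay clamped
                    new_scores[node] = sum(scores[n] for n in nbrs) / len(nbrs)
            scores = new_scores
        return scores   # higher scores flag accounts closer to known fraud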
【Paper Link】 【Pages】:2311-2312
【Authors】: Xin Fu ; Hernán Asorey
【Abstract】: Data Science is an increasingly popular area of Knowledge Discovery and Data Mining. Leading consumer Web companies such as Amazon, Facebook, eBay, Google and LinkedIn, as well as B2B companies like Salesforce, possess petabytes of data. Through effective mining of this data, they create products and services that benefit millions of users and generate a tremendous amount of business value. It is widely acknowledged that Data Scientists play key roles in the creation of these products, from pattern identification, idea generation and product prototyping to experiment design and launch decisions. Nonetheless, they also face common challenges, such as the gap between creating a prototype and turning it into a scalable product, or the frustration of generating innovative product ideas that do not get adopted. The organizers of this tutorial have many years of experience leading Data Science teams at some of the most successful consumer Web companies. In this tutorial, we introduce the framework that we created to nurture data-driven product innovations. The core of this framework is a focus on scale and impact: we take the audience through a discussion of how to balance velocity and scale, product innovation and product operation, and theoretical research and practical impact. We also share guidelines for successful data-driven product innovation, with real examples from our experience. We end the tutorial by discussing the organizational perspective of data-driven product innovation: how to structure Data Science teams so Data Scientists collaborate effectively with other functions, and how to hire and grow talent into Data Scientist roles.
【Keywords】: data product; data science; innovation
【Paper Link】 【Pages】:2313-2314
【Authors】: Aristides Gionis ; Charalampos E. Tsourakakis
【Abstract】: Finding dense subgraphs is a fundamental graph-theoretic problem that lies at the heart of numerous graph-mining applications, ranging from finding communities in social networks, to detecting regulatory motifs in DNA, to identifying real-time stories in news. The problem of finding dense subgraphs has been studied extensively in theoretical computer science and, recently, due to its relevance in real-world applications, has attracted considerable attention in the data-mining community. In this tutorial we aim to provide a comprehensive overview of (i) the major algorithmic techniques for finding dense subgraphs in large graphs and (ii) graph mining applications that rely on dense subgraph extraction. We will present fundamental concepts and algorithms dating back to the 1980s, as well as the latest advances in the area, from both theoretical and practical points of view. We will motivate the problem of finding dense subgraphs by discussing how it can be used in real-world applications. We will discuss different density definitions and the complexity of the corresponding optimization problems. We will also present efficient algorithms for different density measures and under different computational models, focusing in particular on scalable streaming, distributed, and MapReduce algorithms. Finally, we will discuss problem variants and extensions, and provide pointers to future research directions.
【Keywords】: dense subgraphs; graph mining
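For concreteness, the sketch below implements the classic greedy peeling algorithm (Charikar's 2-approximation for the densest subgraph under the |E|/|V| objective), one of the fundamental techniques such a tutorial covers: repeatedly remove a minimum-degree vertex and keep the intermediate subgraph with the highest density.

    from collections import defaultdict

    def densest_subgraph(edges):
        """Greedy peeling under the edge-to-vertex-ratio density |E|/|V|."""
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        nodes = set(adj)
        m = sum(len(nb) for nb in adj.values()) // 2
        best_density, best_nodes = m / max(len(nodes), 1), set(nodes)
        while nodes:
            u = min(nodes, key=lambda x: len(adj[x]))     # peel a minimum-degree vertex
            m -= len(adj[u])
            for v in adj[u]:
                adj[v].discard(u)
            nodes.remove(u)
            del adj[u]
            if nodes and m / len(nodes) > best_density:
                best_density, best_nodes = m / len(nodes), set(nodes)
        return best_nodes, best_density

    # e.g. densest_subgraph([(1, 2), (2, 3), (1, 3), (4, 5)]) -> ({1, 2, 3}, 1.0)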
【Paper Link】 【Pages】:2315-2316
【Authors】: Manuel Gomez-Rodriguez ; Le Song
【Abstract】: In recent years, there has been an increasing effort to develop realistic models, and learning and inference algorithms, to understand, predict, and influence diffusion over networks. This has been driven in part by the increasing availability and granularity of large-scale diffusion data, which, in principle, allows for understanding and modeling not only macroscopic diffusion but also microscopic (node-level) diffusion. To this aim, a bottom-up approach has typically been considered, which starts from how particular ideas, pieces of information, products, or, more generally, contagions spread locally from node to node, apparently at random, to later produce global, macroscopic patterns at a network level. However, this bottom-up approach also raises significant modeling, algorithmic, and computational challenges, which require leveraging methods from machine learning, probabilistic modeling, temporal point processes, and graph theory, as well as the nascent field of network science. In this tutorial, we will present several diffusion models designed for fine-grained, large-scale diffusion and social event data, present some canonical research problems in the context of diffusion, and introduce state-of-the-art algorithms for solving several of these problems, in particular network estimation, influence estimation and control, and rumor source identification.
【Keywords】: diffusion networks; information networks; information propagation; social influence; social networks
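A minimal simulation of the independent cascade model, one of the canonical node-to-node diffusion models underlying the estimation and influence problems such a tutorial discusses; the uniform per-edge activation probability is a simplifying assumption.

    import random

    def independent_cascade(graph, seeds, p=0.1, rng=None):
        """graph: dict node -> iterable of neighbours; seeds: initially active
        nodes; p: per-edge activation probability."""
        rng = rng or random.Random(0)
        active, frontier = set(seeds), list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v in graph.get(u, ()):
                    if v not in active and rng.random() < p:   # one attempt per edge
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        return active

    # The expected influence of a seed set can be estimated by averaging
    # len(independent_cascade(graph, seeds)) over many random runs.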
【Paper Link】 【Pages】:2317-2318
【Authors】: Yan Liu ; Sanjay Chawla
【Abstract】: Anomaly detection is of critical importance for preventing malicious activities such as bullying, terrorist attack planning, and fraudulent information dissemination. With the recent popularity of social media, new types of anomalous behaviors have arisen, raising concerns among various parties. While a large body of work has been dedicated to traditional anomaly detection problems, we observe a surge of research interest in the new realm of social media anomaly detection. In this tutorial, we survey existing work on social media anomaly detection, focusing on the new anomalous phenomena in social media and the most recent techniques for detecting these special types of anomalies. We aim to provide a general overview of the problem domain, common formulations, existing methodologies, and future directions.
【Keywords】: anomaly detection; social media analysis
【Paper Link】 【Pages】:2319-2320
【Authors】: Xiang Ren ; Ahmed El-Kishky ; Chi Wang ; Jiawei Han
【Abstract】: In today's computerized and information-based society, we are inundated with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to a wide range of textual information from social media. To unlock the value of these unstructured text data from various domains, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods for recognizing typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their types (e.g., people, product, food) in a scalable way. We demonstrate on real datasets, including news articles and tweets, how these typed entities aid in knowledge discovery and management.
【Keywords】: clustering; entity recognition; entity typing; relation phrase
【Paper Link】 【Pages】:2321-2322
【Authors】: Matteo Riondato ; Eli Upfal
【Abstract】: Rademacher Averages and the Vapnik-Chervonenkis dimension are fundamental concepts from statistical learning theory. They allow one to study simultaneous deviation bounds of empirical averages from their expectations for classes of functions, by considering properties of the functions, of their domain (the dataset), and of the sampling process. In this tutorial, we survey the use of Rademacher Averages and the VC-dimension in sampling-based algorithms for graph analysis and pattern mining. We start from their theoretical foundations at the core of machine learning, then show a generic recipe for formulating data mining problems in a way that allows these concepts to be used in efficient randomized algorithms for those problems. Finally, we show examples of applying the recipe to graph problems (connectivity, shortest paths, betweenness centrality) and to pattern mining. Our goal is to expose the usefulness of these techniques to the data mining researcher, and to encourage research in the area.
【Keywords】: betweenness centrality; frequent itemsets; graph mining; pattern mining; randomized algorithms; tutorial
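As a hedged illustration of the central quantity above, the sketch below gives a direct Monte Carlo estimate of the empirical Rademacher average of a finite function family on a sample, E_sigma[ sup_f (1/n) * sum_i sigma_i * f(x_i) ]; the matrix encoding of the family (e.g., itemset indicators over transactions) is an assumption for illustration.

    import numpy as np

    def empirical_rademacher(values, n_draws=1000, seed=0):
        """values: (n_functions, n_samples) matrix with values[f, i] = f(x_i),
        e.g. the indicator that itemset f appears in transaction i."""
        rng = np.random.default_rng(seed)
        n = values.shape[1]
        sup_means = []
        for _ in range(n_draws):
            sigma = rng.choice([-1.0, 1.0], size=n)         # Rademacher signs
            sup_means.append((values @ sigma / n).max())    # supremum over the family
        return float(np.mean(sup_means))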
【Paper Link】 【Pages】:2323-2324
【Authors】: James G. Shanahan ; Laing Dai
【Abstract】: Apache Spark is an open-source cluster computing framework for big data processing. It has emerged as the next-generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce's linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications); it is much easier to program in, thanks to its rich APIs in Python, Java, and Scala (and, shortly, R) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing. This tutorial will provide an accessible introduction to Spark and its potential to revolutionize academic and commercial data science practices.
【Keywords】: data science; distributed systems; hadoop; hdfs; large scale machine learning; map reduce; spark
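A small PySpark example of the programming model the tutorial introduces, using resilient distributed datasets transformed with map/reduce-style operators; it assumes a local pyspark installation, and the toy input lines are purely illustrative.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount-sketch")
    lines = sc.parallelize(["to be or not to be", "that is the question"])
    counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                   .map(lambda word: (word, 1))          # emit (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))     # sum counts per word
    print(counts.collect())                              # e.g. [('to', 2), ('be', 2), ...]
    sc.stop()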
【Paper Link】 【Pages】:2325
【Authors】: Myra Spiliopoulou ; Pedro Pereira Rodrigues ; Ernestina Menasalvas Ruiz
【Abstract】: In 2015, we are experiencing a proliferation of scientific publications, conferences, and funding programs on KDD for medicine and healthcare. However, medical scholars and practitioners work differently from KDD researchers: their research is mostly hypothesis-driven, not data-driven. KDD researchers need to understand how medical researchers and practitioners work, what questions they have and what methods they use, and how mining methods can fit into their research frame and their everyday business. The purpose of this tutorial is to contribute to this learning process. We address medicine and healthcare, where the expertise of KDD scholars is needed and familiarity with medical research basics is a prerequisite. We aim to provide basics for (1) mining in epidemiology and (2) mining in the hospital. To a lesser extent, we also address (3) preparing and annotating Electronic Health Records for mining.
【Keywords】: clinical decision support; electronic health records; epidemiological mining; medical mining
【Paper Link】 【Pages】:2327
【Authors】: Tianbao Yang ; Qihang Lin ; Rong Jin
【Abstract】: As the scale and dimensionality of data continue to grow in many applications of data analytics (e.g., bioinformatics, finance, computer vision, medical informatics), it becomes critical to develop efficient and effective algorithms to solve numerous machine learning and data mining problems. This tutorial will focus on simple yet practically effective techniques and algorithms for big data analytics. In the first part, we plan to present the state-of-the-art large-scale optimization algorithms, including various stochastic gradient descent methods, stochastic coordinate descent methods and distributed optimization algorithms, for solving various machine learning problems. In the second part, we will focus on randomized approximation algorithms for learning from large-scale data. We will discuss i) randomized algorithms for low-rank matrix approximation; ii) approximation techniques for solving kernel learning problems; iii) randomized reduction methods for addressing the high-dimensional challenge. Along with the description of algorithms, we will also present some empirical results to facilitate understanding of different algorithms and comparison between them.
【Keywords】: machine learning; optimization; randomized approximation; randomized reduction
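A minimal mini-batch stochastic gradient descent sketch for l2-regularized logistic regression, the simplest member of the family of large-scale optimization methods the first part of this tutorial surveys; the step size, batch size, and regularization constant are arbitrary illustrative choices.

    import numpy as np

    def sgd_logreg(X, y, lr=0.1, reg=1e-4, epochs=10, batch=64, seed=0):
        """X: (n, d) feature matrix; y: (n,) labels in {0, 1}."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for idx in np.array_split(rng.permutation(n), max(n // batch, 1)):
                p = 1.0 / (1.0 + np.exp(-X[idx] @ w))             # predicted probabilities
                grad = X[idx].T @ (p - y[idx]) / len(idx) + reg * w
                w -= lr * grad
        return w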
【Paper Link】 【Pages】:2329-2330
【Authors】: Katharina Morik ; Hugh Durrant-Whyte ; Gary Hill ; Dietmar Müller ; Tanya Y. Berger-Wolf
【Abstract】: The panel session 'Data Driven Science' discusses the application and use of knowledge discovery, machine learning, and data analytics across the science disciplines: in the natural, physical, medical, and social sciences, from physics to geology and from neuroscience to population health. Knowledge discovery methods are finding broad application in all areas of scientific endeavor: to explore experimental data, to discover new models, and to propose new scientific theories and ideas. In addition, the availability of ever larger scientific data sets is driving a new data-driven paradigm for modeling complex phenomena in the physical, natural, and social sciences. The purpose of this panel is to bring together users of knowledge discovery, machine learning, and data analytics methods across the science disciplines, to understand what tools and methods are proving effective in areas such as data exploration and modeling, to uncover common problems that can be addressed by the KDD community, and to explore the emerging data-driven paradigm in science.
【Keywords】: data driven science; data mining; data science; knowledge discovery