The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA - August 24 - 27, 2014. ACM 【DBLP Link】
【Paper Link】 【Pages】:1
【Authors】: Oren Etzioni
【Abstract】: Deep learning has catapulted to the front page of the New York Times, formed the core of the so-called 'Google brain', and achieved impressive results in vision, speech recognition, and elsewhere. Yet researchers have offered simple conundrums that deep learning doesn't address. For example, consider the sentence: 'The large ball crashed right through the table because it was made of Styrofoam.' What was made of Styrofoam? The large ball? Or the table? The answer is obviously 'the table', but if we change the word 'Styrofoam' to 'steel', the answer is clearly 'the large ball'. To automatically answer this type of question, our computers require an extensive body of knowledge. We believe that text mining can provide the requisite body of knowledge. My talk will describe work at the new Allen Institute for AI towards building the next-generation of text-mining systems.
【Keywords】: data mining; deep learning; text-mining
【Paper Link】 【Pages】:2
【Authors】: Eric Horvitz
【Abstract】: Deep societal benefits will spring from advances in data availability and in computational procedures for mining insights and inferences from large data sets. I will describe efforts to harness data for making predictions and guiding decisions, touching on work in transportation, healthcare, online services, and interactive systems. I will start with efforts to learn and field predictive models that forecast flows of traffic in greater city regions. Moving from the ground to the air, I will discuss fusing data from aircraft to make inferences about atmospheric conditions and using these results to enhance air transport. I will then focus on experiences with building and fielding predictive models in clinical medicine. I will show how inferences about outcomes and interventions can provide insights and guide decision making. Moving beyond data captured by hospitals, I will discuss the promise of transforming anonymized behavioral data drawn from web services into large-scale sensor networks for public health, including efforts to identify adverse effects of medications and to understand illness in populations. I will conclude by describing how we can use machine learning to leverage the complementarity of human and machine intellect to solve challenging problems in science and society.
【Keywords】: decision analysis; healthcare; machine learning; probabilistic graphical models; probabilistic inference; transportation
【Paper Link】 【Pages】:3
【Authors】: Eric E. Schadt
【Abstract】: Throughout the biomedical and life sciences research community, advanced integrative biology algorithms are employed to integrate large scale data across many different high-dimensional datatypes to construct predictive network models of disease. The causal inference approaches we employ for this purpose well complement the types of natural artificial intelligence/machine learning approaches that have become nearly standard in the life and biomedical sciences for building classifiers for a range of problems, from disease classification and subtype stratification, to the identification of responders and non-responders for a given treatment strategy. By building a causal network model that spans multiple scales (from the molecular to the cellular, to the tissue/organ, to the organism and community) we can understand the flow of information and how best to modulate that flow to improve human wellbeing, whether better diagnosing and treating disease or improving overall health(1-4). More specifically, we have constructed predictive network models for Alzheimer's disease, along with other common human diseases such as obesity, diabetes, heart disease, and inflammatory bowel disease, and cancer, and demonstrated a causal network common across all of these diseases(3, 5-10). Not only do we demonstrate that our predictive models uncover important mechanisms of disease and mechanistic connections among different diseases, but that they have led to a natural way to prioritize therapeutic points of intervention and provide optimal molecular phenotypes for high throughput screening. Our application of these models in a number of disease areas has led to the identification of novel genes that are causal for disease and that may serve as efficacious points of therapeutic intervention, as well as to personalized treatment strategies that provide a more quantitative and accurate approach to tailoring treatments to specific forms of disease.
【Keywords】: causal network inference; drug discovery; multiscale biology; predictive network modeling; systems biology
【Paper Link】 【Pages】:4
【Authors】: Sendhil Mullainathan
【Abstract】: Social scientists increasingly criticize the use of machine learning techniques to understand human behavior. Criticisms include: (1) They are atheoretical and hence of limited scientific value; (2) They do not address causality and are hence of limited policy value; and (3) They are uninterpretable and hence of limited generalizability value (outside contexts very narrowly similar to the training dataset). These criticisms, I argue, miss the enormous opportunity offered by ML techniques to fundamentally improve the practice of empirical social science. Yet each criticism does contain a grain of truth and overcoming them will require innovations to existing methodologies. Some of these innovations are being developed today and some are yet to be tackled. I will in this talk sketch (1) what these innovations look like or should look like; (2) why they are needed; and (3) the technical challenges they raise. I will illustrate my points using a set of applications that range from financial markets to social policy problems to computational models of basic psychological processes. This talk describes joint work with Jon Kleinberg and individual projects with Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, Anuj Shah, Chenhao Tan, Mike Yeomans and Tom Zimmerman.
【Keywords】: n/a
【Paper Link】 【Pages】:5-14
【Authors】: Xuan Song ; Quanshi Zhang ; Yoshihide Sekimoto ; Ryosuke Shibasaki
【Abstract】: The frequency and intensity of natural disasters has significantly increased over the past decades and this trend is predicted to continue. Facing these possible and unexpected disasters, accurately predicting human emergency behavior and their mobility will become the critical issue for planning effective humanitarian relief, disaster management, and long-term societal reconstruction. In this paper, we build up a large human mobility database (GPS records of 1.6 million users over one year) and several different datasets to capture and analyze human emergency behavior and their mobility following the Great East Japan Earthquake and Fukushima nuclear accident. Based on our empirical analysis through these data, we find that human behavior and their mobility following large-scale disaster sometimes correlate with their mobility patterns during normal times, and are also highly impacted by their social relationship, intensity of disaster, damage level, government appointed shelters, news reporting, large population flow and etc. On the basis of these findings, we develop a model of human behavior that takes into account these factors for accurately predicting human emergency behavior and their mobility following large-scale disaster. The experimental results and validations demonstrate the efficiency of our behavior model, and suggest that human behavior and their movements during disasters may be significantly more predictable than previously thought.
【Keywords】: disaster informatics; human mobility; spatio-temporal data mining
【Paper Link】 【Pages】:15-24
【Authors】: Yuxiao Dong ; Yang Yang ; Jie Tang ; Yang Yang ; Nitesh V. Chawla
【Abstract】: Demographics are widely used in marketing to characterize different types of customers. However, in practice, demographic information such as age, gender, and location is usually unavailable due to privacy and other reasons. In this paper, we aim to harness the power of big data to automatically infer users' demographics based on their daily mobile communication patterns. Our study is based on a real-world large mobile network of more than 7,000,000 users and over 1,000,000,000 communication records (CALL and SMS). We discover several interesting social strategies that mobile users frequently use to maintain their social connections. First, young people are very active in broadening their social circles, while seniors tend to keep close but more stable connections. Second, female users put more attention on cross-generation interactions than male users, though interactions between male and female users are frequent. Third, a persistent same-gender triadic pattern over one's lifetime is discovered for the first time, while more complex opposite-gender triadic patterns are only exhibited among young people. We further study to what extent users' demographics can be inferred from their mobile communications. As a special case, we formalize a problem of double dependent-variable prediction-inferring user gender and age simultaneously. We propose the WhoAmI method, a Double Dependent-Variable Factor Graph Model, to address this problem by considering not only the effects of features on gender/age, but also the interrelation between gender and age. Our experiments show that the proposed WhoAmI method significantly improves the prediction accuracy by up to 10% compared with several alternative methods.
【Keywords】: demographic prediction; human communication; mobile social network; social strategy
【Paper Link】 【Pages】:25-34
【Authors】: Yilun Wang ; Yu Zheng ; Yexiang Xue
【Abstract】: In this paper, we propose a citywide and real-time model for estimating the travel time of any path (represented as a sequence of connected road segments) in real time in a city, based on the GPS trajectories of vehicles received in current time slots and over a period of history as well as map data sources. Though this is a strategically important task in many traffic monitoring and routing systems, the problem has not been well solved yet given the following three challenges. The first is the data sparsity problem, i.e., many road segments may not be traveled by any GPS-equipped vehicles in present time slot. In most cases, we cannot find a trajectory exactly traversing a query path either. Second, for the fragment of a path with trajectories, they are multiple ways of using (or combining) the trajectories to estimate the corresponding travel time. Finding an optimal combination is a challenging problem, subject to a tradeoff between the length of a path and the number of trajectories traversing the path (i.e., support). Third, we need to instantly answer users' queries which may occur in any part of a given city. This calls for an efficient, scalable and effective solution that can enable a citywide and real-time travel time estimation. To address these challenges, we model different drivers' travel times on different road segments in different time slots with a three dimension tensor. Combined with geospatial, temporal and historical contexts learned from trajectories and map data, we fill in the tensor's missing values through a context-aware tensor decomposition approach. We then devise and prove an object function to model the aforementioned tradeoff, with which we find the most optimal concatenation of trajectories for an estimate through a dynamic programming solution. In addition, we propose using frequent trajectory patterns (mined from historical trajectories) to scale down the candidates of concatenation and a suffix-tree-based index to manage the trajectories received in the present time slot. We evaluate our method based on extensive experiments, using GPS trajectories generated by more than 32,000 taxis over a period of two months. The results demonstrate the effectiveness, efficiency and scalability of our method beyond baseline approaches.
【Keywords】: spatial trajectories; spatio-temporal data mining; tensor decomposition; travel time estimation; urban computing
【Paper Link】 【Pages】:35-44
【Authors】: Moshe Lichman ; Padhraic Smyth
【Abstract】: Location-based data is increasingly prevalent with the rapid increase and adoption of mobile devices. In this paper we address the problem of learning spatial density models, focusing specifically on individual-level data. Modeling and predicting a spatial distribution for an individual is a challenging problem given both (a) the typical sparsity of data at the individual level and (b) the heterogeneity of spatial mobility patterns across individuals. We investigate the application of kernel density estimation (KDE) to this problem using a mixture model approach that can interpolate between an individual's data and broader patterns in the population as a whole. The mixture-KDE approach is evaluated on two large geolocation/check-in data sets, from Twitter and Gowalla, with comparisons to non-KDE baselines, using both log-likelihood and detection of simulated identity theft as evaluation metrics. Our experimental results indicate that the mixture-KDE method provides a useful and accurate methodology for capturing and predicting individual-level spatial patterns in the presence of noisy and sparse data.
【Keywords】: anomaly/novelty detection; kernel density estimation; probabilistic methods; social media; spatial; user modeling
【Paper Link】 【Pages】:45-54
【Authors】: Meng Qu ; Hengshu Zhu ; Junming Liu ; Guannan Liu ; Hui Xiong
【Abstract】: The GPS technology and new forms of urban geography have changed the paradigm for mobile services. As such, the abundant availability of GPS traces has enabled new ways of doing taxi business. Indeed, recent efforts have been made on developing mobile recommender systems for taxi drivers using Taxi GPS traces. These systems can recommend a sequence of pick-up points for the purpose of maximizing the probability of identifying a customer with the shortest driving distance. However, in the real world, the income of taxi drivers is strongly correlated with the effective driving hours. In other words, it is more critical for taxi drivers to know the actual driving routes to minimize the driving time before finding a customer. To this end, in this paper, we propose to develop a cost-effective recommender system for taxi drivers. The design goal is to maximize their profits when following the recommended routes for finding passengers. Specifically, we first design a net profit objective function for evaluating the potential profits of the driving routes. Then, we develop a graph representation of road networks by mining the historical taxi GPS traces and provide a Brute-Force strategy to generate optimal driving route for recommendation. However, a critical challenge along this line is the high computational cost of the graph based approach. Therefore, we develop a novel recursion strategy based on the special form of the net profit function for searching optimal candidate routes efficiently. Particularly, instead of recommending a sequence of pick-up points and letting the driver decide how to get to those points, our recommender system is capable of providing an entire driving route, and the drivers are able to find a customer for the largest potential profit by following the recommendations. This makes our recommender system more practical and profitable than other existing recommender systems. Finally, we carry out extensive experiments on a real-world data set collected from the San Francisco Bay area and the experimental results clearly validate the effectiveness of the proposed recommender system.
【Keywords】: cost-effective; mobile recommender systems; taxi drivers
【Paper Link】 【Pages】:55-64
【Authors】: Yubin Park ; Joydeep Ghosh
【Abstract】: In the past few years, the government and other agencies have publicly released a prodigious amount of data that can be potentially mined to benefit the society at large. However, data such as health records are typically only provided at aggregated levels (e.g. per State, per Hospital Referral Region, etc.) to protect privacy. Unfortunately aggregation can severely diminish the utility of such data when modeling or analysis is desired at a per-individual basis. So, not surprisingly, despite the increasing abundance of aggregate data, there have been very few successful attempts in exploiting them for individual-level analyses. This paper introduces LUDIA, a novel low-rank approximation algorithm that utilizes aggregation constraints in addition to auxiliary information in order to estimate or "reconstruct" the original individual-level values from aggregate data. If the reconstructed data are statistically similar to the original individual-level data, off-the-shelf individual-level models can be readily and reliably applied for subsequent predictive or descriptive analytics. LUDIA is more robust to nonlinear estimates and random effects than other reconstruction algorithms. It solves a Sylvester equation and leverages multi-level (also known as hierarchical or mixed-effect) modeling approaches efficiently. A novel graphical model is also introduced to provide a probabilistic viewpoint of LUDIA. Experimental results using a Texas inpatient dataset show that individual-level data can be reasonably reconstructed from county-, hospital-, and zip code-level aggregate data. Several factors affecting the reconstruction quality are discussed, along with the implications of this work for current aggregation guidelines.
【Keywords】: data aggregation; low rank approximation; multi-level model
【Paper Link】 【Pages】:65-74
【Authors】: Subhabrata Mukherjee ; Gerhard Weikum ; Cristian Danescu-Niculescu-Mizil
【Abstract】: Online health communities are a valuable source of information for patients and physicians. However, such user-generated resources are often plagued by inaccuracies and misinformation. In this work we propose a method for automatically establishing the credibility of user-generated medical statements and the trustworthiness of their authors by exploiting linguistic cues and distant supervision from expert sources. To this end we introduce a probabilistic graphical model that jointly learns user trustworthiness, statement credibility, and language objectivity. We apply this methodology to the task of extracting rare or unknown side-effects of medical drugs --- this being one of the problems where large scale non-expert data has the potential to complement expert medical knowledge. We show that our method can reliably extract side-effects and filter out false statements, while identifying trustworthy users that are likely to contribute valuable medical information.
【Keywords】: credibility; objectivity; probabilistic graphical models; trustworthiness; veracity
【Paper Link】 【Pages】:75-84
【Authors】: Marzyeh Ghassemi ; Tristan Naumann ; Finale Doshi-Velez ; Nicole Brimmer ; Rohit Joshi ; Anna Rumshisky ; Peter Szolovits
【Abstract】: Accurate knowledge of a patient's disease state and trajectory is critical in a clinical setting. Modern electronic healthcare records contain an increasingly large amount of data, and the ability to automatically identify the factors that influence patient outcomes stand to greatly improve the efficiency and quality of care. We examined the use of latent variable models (viz. Latent Dirichlet Allocation) to decompose free-text hospital notes into meaningful features, and the predictive power of these features for patient mortality. We considered three prediction regimes: (1) baseline prediction, (2) dynamic (time-varying) outcome prediction, and (3) retrospective outcome prediction. In each, our prediction task differs from the familiar time-varying situation whereby data accumulates; since fewer patients have long ICU stays, as we move forward in time fewer patients are available and the prediction task becomes increasingly difficult. We found that latent topic-derived features were effective in determining patient mortality under three timelines: in-hospital, 30 day post-discharge, and 1 year post-discharge mortality. Our results demonstrated that the latent topic features important in predicting hospital mortality are very different from those that are important in post-discharge mortality. In general, latent topic features were more predictive than structured features, and a combination of the two performed best. The time-varying models that combined latent topic features and baseline features had AUCs that reached 0.85, 0.80, and 0.77 for in-hospital, 30 day post-discharge and 1 year post-discharge mortality respectively. Our results agreed with other work suggesting that the first 24 hours of patient information are often the most predictive of hospital mortality. Retrospective models that used a combination of latent topic features and structured features achieved AUCs of 0.96, 0.82, and 0.81 for in-hospital, 30 day, and 1-year mortality prediction. Our work focuses on the dynamic (time-varying) setting because models from this regime could facilitate an on-going severity stratification system that helps direct care-staff resources and inform treatment strategies.
【Keywords】: data mining for social good; healthcare and medicine; support vector machines; text; topic; graphical and latent variable models
【Paper Link】 【Pages】:85-94
【Authors】: Xiang Wang ; David Sontag ; Fei Wang
【Abstract】: Chronic diseases, such as Alzheimer's Disease, Diabetes, and Chronic Obstructive Pulmonary Disease, usually progress slowly over a long period of time, causing increasing burden to the patients, their families, and the healthcare system. A better understanding of their progression is instrumental in early diagnosis and personalized care. Modeling disease progression based on real-world evidence is a very challenging task due to the incompleteness and irregularity of the observations, as well as the heterogeneity of the patient conditions. In this paper, we propose a probabilistic disease progression model that address these challenges. As compared to existing disease progression models, the advantage of our model is three-fold: 1) it learns a continuous-time progression model from discrete-time observations with non-equal intervals; 2) it learns the full progression trajectory from a set of incomplete records that only cover short segments of the progression; 3) it learns a compact set of medical concepts as the bridge between the hidden progression process and the observed medical evidence, which are usually extremely sparse and noisy. We demonstrate the capabilities of our model by applying it to a real-world COPD patient cohort and deriving some interesting clinical insights.
【Keywords】: bayesian network; disease progression modeling; markov jump process; medical informatics
【Paper Link】 【Pages】:95-104
【Authors】: Evangelos E. Papalexakis ; Alona Fyshe ; Nicholas D. Sidiropoulos ; Partha Pratim Talukdar ; Tom M. Mitchell ; Christos Faloutsos
【Abstract】: Given a simple noun such as {\em apple}, and a question such as "is it edible?", what processes take place in the human brain? More specifically, given the stimulus, what are the interactions between (groups of) neurons (also known as functional connectivity) and how can we automatically infer those interactions, given measurements of the brain activity? Furthermore, how does this connectivity differ across different human subjects? In this work we present a simple, novel good-enough brain model, or GeBM in short, and a novel algorithm Sparse-SysId, which are able to effectively model the dynamics of the neuron interactions and infer the functional connectivity. Moreover, GeBM is able to simulate basic psychological phenomena such as habituation and priming (whose definition we provide in the main text). We evaluate GeBM by using both synthetic and real brain data. Using the real data, GeBM produces brain activity patterns that are strikingly similar to the real ones, and the inferred functional connectivity is able to provide neuroscientific insights towards a better understanding of the way that neurons interact with each other, as well as detect regularities and outliers in multi-subject brain activity measurements.
【Keywords】: brain activity analysis; brain functional connectivity; control theory; neuroscience; system identification
【Paper Link】 【Pages】:105-114
【Authors】: Yasuko Matsubara ; Yasushi Sakurai ; Willem G. van Panhuis ; Christos Faloutsos
【Abstract】: Given a large collection of epidemiological data consisting of the count of d contagious diseases for l locations of duration n, how can we find patterns, rules and outliers? For example, the Project Tycho provides open access to the count infections for U.S. states from 1888 to 2013, for 56 contagious diseases (e.g., measles, influenza), which include missing values, possible recording errors, sudden spikes (or dives) of infections, etc. So how can we find a combined model, for all these diseases, locations, and time-ticks? In this paper, we present FUNNEL, a unifying analytical model for large scale epidemiological data, as well as a novel fitting algorithm, FUNNELFIT, which solves the above problem. Our method has the following properties: (a) Sense-making: it detects important patterns of epidemics, such as periodicities, the appearance of vaccines, external shock events, and more; (b) Parameter-free: our modeling framework frees the user from providing parameter values; (c) Scalable: FUNNELFIT is carefully designed to be linear on the input size; (d) General: our model is general and practical, which can be applied to various types of epidemics, including computer-virus propagation, as well as human diseases. Extensive experiments on real data demonstrate that FUNNELFIT does indeed discover important properties of epidemics: (P1) disease seasonality, e.g., influenza spikes in January, Lyme disease spikes in July and the absence of yearly periodicity for gonorrhea; (P2) disease reduction effect, e.g., the appearance of vaccines; (P3) local/state-level sensitivity, e.g., many measles cases in NY; (P4) external shock events, e.g., historical flu pandemics; (P5) detect incongruous values, i.e., data reporting errors.
【Keywords】: automatic mining; epidemics; time-series
【Paper Link】 【Pages】:115-124
【Authors】: Joyce C. Ho ; Joydeep Ghosh ; Jimeng Sun
【Abstract】: The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to phenotypes, or medical concepts, that clinical researchers need or use. Existing phenotyping approaches typically require labor intensive supervision from medical experts. We propose Marble, a novel sparse non-negative tensor factorization method to derive phenotype candidates with virtually no human supervision. Marble decomposes the observed tensor into two terms, a bias tensor and an interaction tensor. The bias tensor represents the baseline characteristics common amongst the overall population and the interaction tensor defines the phenotypes. We demonstrate the capability of our proposed model on both simulated and patient data from a publicly available clinical database. Our results show that Marble derived phenotypes provide at least a 42.8% reduction in the number of non-zero element and also retains predictive power for classification purposes. Furthermore, the resulting phenotypes and baseline characteristics from real EHR data are consistent with known characteristics of the patient population. Thus it can potentially be used to rapidly characterize, predict, and manage a large number of diseases, thereby promising a novel, data-driven solution that can benefit very large segments of the population.
【Keywords】: EHR phenotyping; application; dimensionality reduction; tensor
【Paper Link】 【Pages】:125-134
【Authors】: Chih-Chun Chia ; Zeeshan Syed
【Abstract】: Cardiac disease is the leading cause of death around the world; with ischemic heart disease alone claiming 7 million lives in 2011. This burden can be attributed, in part, to the absence of biomarkers that can reliably identify high risk patients and match them to treatments that are appropriate for them. In recent clinical studies, we have demonstrated the ability of computation to extract information with substantial prognostic utility that is typically disregarded in time-series data collected from cardiac patients. Of particular interest are subtle variations in long-term electrocardiographic (ECG) data that are usually overlooked as noise but provide a useful assessment of myocardial instability. In multiple clinical cohorts, we have developed the pathophysiological basis for studying probabilistic variations in long-term ECG and demonstrated the ability of this information to effectively risk stratify patients at risk of dying following heart attacks. In this paper, we extend this work and focus on the question of how to reduce its computational complexity for scalable use in large datasets or energy constrained embedded devices. Our basic approach to uncovering pathological structure within the ECG focuses on characterizing beat-to-beat time-warped shape deformations of the ECG using a modified dynamic time-warping (DTW) and Lomb-Scargle periodogram-based algorithm. As part of our efforts to scale this work up, we explore a novel approach to address the quadratic runtime of DTW. We achieve this by developing the idea of adaptive downsampling to reduce the size of the inputs presented to DTW, and describe changes to the dynamic programming problem underlying DTW to exploit adaptively downsampled ECG signals. When evaluated on data from 765 patients in the DISPERSE2-TIMI33 trial, our results show that high morphologic variability is associated with an 8- to 9-fold increased risk of death within 90 days of a heart attack. Moreover, the use of adaptive downsampling with a modified DTW formulation achieves a 7- to almost 20-fold reduction in runtime relative to DTW, without a significant change in biomarker discrimination.
【Keywords】: adaptive downsampling; cardiovascular disease; electrocardiogram; noise mining; time-warping
【Paper Link】 【Pages】:135-144
【Authors】: Jiayu Zhou ; Fei Wang ; Jianying Hu ; Jieping Ye
【Abstract】: Inferring phenotypic patterns from population-scale clinical data is a core computational task in the development of personalized medicine. One important source of data on which to conduct this type of research is patient Electronic Medical Records (EMR). However, the patient EMRs are typically sparse and noisy, which creates significant challenges if we use them directly to represent patient phenotypes. In this paper, we propose a data driven phenotyping framework called Pacifier (PAtient reCord densIFIER), where we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. Specifically, we assume that each derived phenotype is composed of a subset of the medical features contained in original patient EMR, whose value evolves smoothly over time. We propose two formulations to achieve such goal. One is Individual Basis Approach (IBA), which assumes the phenotypes are different for every patient. The other is Shared Basis Approach (SBA), which assumes the patient population shares a common set of phenotypes. We develop an efficient optimization algorithm that is capable of resolving both problems efficiently. Finally we validate Pacifier on two real world EMR cohorts for the tasks of early prediction of Congestive Heart Failure (CHF) and End Stage Renal Disease (ESRD). Our results show that the predictive performance in both tasks can be improved significantly by the proposed algorithms (average AUC score improved from 0.689 to 0.816 on CHF, and from 0.756 to 0.838 on ESRD respectively, on diagnosis group granularity). We also illustrate some interesting phenotypes derived from our data.
【Keywords】: densification; matrix completion; medical informatics; phenotyping; sparse learning
【Paper Link】 【Pages】:145-154
【Authors】: Fei Wang ; Ping Zhang ; Buyue Qian ; Xiang Wang ; Ian Davidson
【Abstract】: Logistic regression is one core predictive modeling technique that has been used extensively in health and biomedical problems. Recently a lot of research has been focusing on enforcing sparsity on the learned model to enhance its effectiveness and interpretability, which results in sparse logistic regression model. However, no matter the original or sparse logistic regression, they require the inputs to be in vector form. This limits the applicability of logistic regression in the problems when the data cannot be naturally represented vectors (e.g., functional magnetic resonance imaging and electroencephalography signals). To handle the cases when the data are in the form of multi-dimensional arrays, we propose MulSLR: Multilinear Sparse Logistic Regression. MulSLR can be viewed as a high order extension of sparse logistic regression. Instead of solving one classification vector as in conventional logistic regression, we solve for K classification vectors in MulSLR (K is the number of modes in the data). We propose a block proximal descent approach to solve the problem and prove its convergence. The convergence rate of the proposed algorithm is also analyzed. Finally we validate the efficiency and effectiveness of MulSLR on predicting the onset risk of patients with Alzheimer's disease and heart failure.
【Keywords】: healthcare; logistic regression; multilinear; proximal gradient
【Paper Link】 【Pages】:155-162
【Authors】: James C. Ross ; Peter J. Castaldi ; Michael H. Cho ; Jennifer G. Dy
【Abstract】: Chronic obstructive pulmonary disease (COPD) is a lung disease characterized by airflow limitation usually associated with an inflammatory response to noxious particles, such as cigarette smoke. COPD is currently the third leading cause of death in the United States and is the only leading cause of death that is increasing in prevalence. It also represents an enormous financial burden to society, costing tens of billions of dollars annually in the U.S. It is widely accepted by the medical community that COPD is a heterogeneous disease, with substantial evidence indicating that genetic variation contributes to varying levels of disease susceptibility. This heterogeneity makes it difficult to predict health decline and develop targeted treatments for better patient care. Although researchers have made several attempts to discover disease subtypes, results have been inconclusive, in part because standard clustering methods have not properly dealt with disease manifestations that may worsen with increased exposure. In this paper we introduce a transformative way of looking at the COPD subtyping task. Specifically, we model the relationship between risk factors (such as age and smoke exposure) and manifestations of disease severity using Gaussian Processes, which allow us to represent so-called "disease trajectories". We also posit that individuals can be associated with multiple disease types (latent clusters), which we assume are influenced by genetics. Furthermore, we predict that only subsets of the numerous disease-related quantitative features are useful for describing each latent subtype. We model these associations using two separate beta process priors, and we describe a variational inference approach to discover the most probable latent cluster assignments. Results are validated with associations to genetic markers.
【Keywords】: Bayesian nonparametrics; beta processes; disease trajectories; gaussian processes; latent clusters
【Paper Link】 【Pages】:163-172
【Authors】: Quan Yuan ; Gao Cong ; Chin-Yew Lin
【Abstract】: With the rapid development of online social networks, a growing number of people are willing to share their group activities, e.g. having dinners with colleagues, and watching movies with spouses. This motivates the studies on group recommendation, which aims to recommend items for a group of users. Group recommendation is a challenging problem because different group members have different preferences, and how to make a trade-off among their preferences for recommendation is still an open problem. In this paper, we propose a probabilistic model named COM (COnsensus Model) to model the generative process of group activities, and make group recommendations. Intuitively, users in a group may have different influences, and those who are expert in topics relevant to the group are usually more influential. In addition, users in a group may behave differently as group members from as individuals. COM is designed based on these intuitions, and is able to incorporate both users' selection history and personal considerations of content factors. When making recommendations, COM estimates the preference of a group to an item by aggregating the preferences of the group members with different weights. We conduct extensive experiments on four datasets, and the results show that the proposed model is effective in making group recommendations, and outperforms baseline methods significantly.
【Keywords】: collaborative filtering; group recommendation; topic models
【Paper Link】 【Pages】:173-182
【Authors】: Laurent Charlin ; Richard S. Zemel ; Hugo Larochelle
【Abstract】: We introduce a novel graphical model, the collaborative score topic model (CSTM), for personal recommendations of textual documents. CSTM's chief novelty lies in its learned model of individual libraries, or sets of documents, associated with each user. Overall, CSTM is a joint directed probabilistic model of user-item scores (ratings), and the textual side information in the user libraries and the items. Creating a generative description of scores and the text allows CSTM to perform well in a wide variety of data regimes, smoothly combining the side information with observed ratings as the number of ratings available for a given user ranges from none to many. Experiments on real-world datasets demonstrate CSTM's performance. We further demonstrate its utility in an application for personal recommendations of posters which we deployed at the NIPS 2013 conference.
【Keywords】: cold start; collaborative filtering; document recommendations; side information; topic modeling
【Paper Link】 【Pages】:183-192
【Authors】: Yupeng Gu ; Yizhou Sun ; Ning Jiang ; Bingyu Wang ; Ting Chen
【Abstract】: Ideal point estimation that estimates legislators' ideological positions and understands their voting behavior has attracted studies from political science and computer science. Typically, a legislator is assigned a global ideal point based on her voting or other social behavior. However, it is quite normal that people may have different positions on different policy dimensions. For example, some people may be more liberal on economic issues while more conservative on cultural issues. In this paper, we propose a novel topic-factorized ideal point estimation model for a legislative voting network in a unified framework. First, we model the ideal points of legislators and bills for each topic instead of assigning them to a global one. Second, the generation of topics are guided by the voting matrix in addition to the text information contained in bills. A unified model that combines voting behavior modeling and topic modeling is presented, and an iterative learning algorithm is proposed to learn the topics of bills as well as the topic-factorized ideal points of legislators and bills. By comparing with the state-of-the-art ideal point estimation models, our method has a much better explanation power in terms of held-out log-likelihood and other measures. Besides, case studies show that the topic-factorized ideal points coincide with human intuition. Finally, we illustrate how to use these topic-factorized ideal points to predict voting results for unseen bills.
【Keywords】: ideal point estimation; legislative voting network; topic model; voting prediction
【Paper Link】 【Pages】:193-202
【Authors】: Qiming Diao ; Minghui Qiu ; Chao-Yuan Wu ; Alexander J. Smola ; Jing Jiang ; Chong Wang
【Abstract】: Recommendation and review sites offer a wealth of information beyond ratings. For instance, on IMDb users leave reviews, commenting on different aspects of a movie (e.g. actors, plot, visual effects), and expressing their sentiments (positive or negative) on these aspects in their reviews. This suggests that uncovering aspects and sentiments will allow us to gain a better understanding of users, movies, and the process involved in generating ratings. The ability to answer questions such as "Does this user care more about the plot or about the special effects?" or "What is the quality of the movie in terms of acting?" helps us to understand why certain ratings are generated. This can be used to provide more meaningful recommendations. In this work we propose a probabilistic model based on collaborative filtering and topic modeling. It allows us to capture the interest distribution of users and the content distribution for movies; it provides a link between interest and relevance on a per-aspect basis and it allows us to differentiate between positive and negative sentiments on a per-aspect basis. Unlike prior work our approach is entirely unsupervised and does not require knowledge of the aspect specific ratings or genres for inference. We evaluate our model on a live copy crawled from IMDb. Our model offers superior performance by joint modeling. Moreover, we are able to address the cold start problem -- by utilizing the information inherent in reviews our model demonstrates improvement for new users and movies.
【Keywords】: collaborative filtering; integrated modeling; sentiment analysis; topic models
【Paper Link】 【Pages】:203-212
【Authors】: Mahbub Hasan ; Abhijith Kashyap ; Vagelis Hristidis ; Vassilis J. Tsotras
【Abstract】: Ambiguous queries, which are typical on search engines and recommendation systems, often return a large number of results from multiple interpretations. Given that many users often perform their searches on limited size screens (e.g. mobile phones), an important problem is which results to display first. Recent work has suggested displaying a set of results (Top-k) based on their relevance score with respect to the query and their diversity with respect to each other. However, previous works balance relevance and diversity mostly by a predefined fixed way. In this paper, we show that for different search tasks there is a different ideal balance of relevance and diversity. We propose a principled method for adaptive diversification of query results that minimizes the user effort to find the desired results, by dynamically balancing the relevance and diversity at each query step (e.g. when refining the query or viewing the next page of results). We introduce a navigation cost model as a means to estimate the effort required to navigate the query-results, and show that the problem of estimating the ideal amount of diversification at each step is NP-Hard. We propose an efficient approximate algorithm to select a near-optimal subset of the query results that minimizes the expected user effort. Finally we demonstrate the efficacy and efficiency of our solution in minimizing user effort, compared to state-of-the-art ranking methods, by means of an extensive experimental evaluation and a comprehensive user study on Amazon Mechanical Turk.
【Keywords】: diversity; mobile applications; user effort minimization
【Paper Link】 【Pages】:213-222
【Authors】: Xiao He ; Jing Feng ; Bettina Konte ; Son T. Mai ; Claudia Plant
【Abstract】: Clustering categorical data poses some unique challenges: Due to missing order and spacing among the categories, selecting a suitable similarity measure is a difficult task. Many existing techniques require the user to specify input parameters which are difficult to estimate. Moreover, many techniques are limited to detect clusters in the full-dimensional data space. Only few methods exist for subspace clustering and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compress the data. Optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is especially designed to detect subspace clusters on categorical data which may overlap in objects and/or attributes; i.e. objects can be assigned to different clusters in different subspaces and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
【Keywords】: categorical data; minimum description length; relevant subspace clustering
【Paper Link】 【Pages】:223-232
【Authors】: Murat Dundar ; Halid Ziya Yerebakan ; Bartek Rajwa
【Abstract】: We present a clustering algorithm for discovering rare yet significant recurring classes across a batch of samples in the presence of random effects. We model each sample data by an infinite mixture of Dirichlet-process Gaussian-mixture models (DPMs) with each DPM representing the noisy realization of its corresponding class distribution in a given sample. We introduce dependencies across multiple samples by placing a global Dirichlet process prior over individual DPMs. This hierarchical prior introduces a sharing mechanism across samples and allows for identifying local realizations of classes across samples. We use collapsed Gibbs sampler for inference to recover local DPMs and identify their class associations. We demonstrate the utility of the proposed algorithm, processing a flow cytometry data set containing two extremely rare cell populations, and report results that significantly outperform competing techniques. The source code of the proposed algorithm is available on the web via the link:http://cs.iupui.edu/~dundar/aspire.htm.
【Keywords】: anomaly detection; batch clustering; hierarchical dirichlet process; random effects; rare classes; recurring classes
【Paper Link】 【Pages】:233-242
【Authors】: Jianhua Yin ; Jianyong Wang
【Abstract】: Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr. to GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge. GSDMM can also cope with the sparse and high-dimensional problem of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM can achieve significantly better performance than three other clustering models.
【Keywords】: dirichlet multinomial mixture; gibbs sampling; short text clustering
【Paper Link】 【Pages】:243-252
【Authors】: Andreas Züfle ; Tobias Emrich ; Klaus Arthur Schmid ; Nikos Mamoulis ; Arthur Zimek ; Matthias Renz
【Abstract】: This paper targets the problem of computing meaningful clusterings from uncertain data sets. Existing methods for clustering uncertain data compute a single clustering without any indication of its quality and reliability; thus, decisions based on their results are questionable. In this paper, we describe a framework, based on possible-worlds semantics; when applied on an uncertain dataset, it computes a set of representative clusterings, each of which has a probabilistic guarantee not to exceed some maximum distance to the ground truth clustering, i.e., the clustering of the actual (but unknown) data. Our framework can be combined with any existing clustering algorithm and it is the first to provide quality guarantees about its result. In addition, our experimental evaluation shows that our representative clusterings have a much smaller deviation from the ground truth clustering than existing approaches, thus reducing the effect of uncertainty.
【Keywords】: clustering; possible worlds; uncertain data
【Paper Link】 【Pages】:253-262
【Authors】: Stephan Günnemann ; Ines Färber ; Matthias Rüdiger ; Thomas Seidl
【Abstract】: Since data is often multi-faceted in its very nature, it might not adequately be summarized by just a single clustering. To better capture the data's complexity, methods aiming at the detection of multiple, alternative clusterings have been proposed. Independent of this research area, semi-supervised clustering techniques have shown to substantially improve clustering results for single-view clustering by integrating prior knowledge. In this paper, we join both research areas and present a solution for integrating prior knowledge in the process of detecting multiple clusterings. We propose a Bayesian framework modeling multiple clusterings of the data by multiple mixture distributions, each responsible for an individual set of relevant dimensions. In addition, our model is able to handle prior knowledge in the form of instance-level constraints indicating which objects should or should not be grouped together. Since a priori the assignment of constraints to specific views is not necessarily known, our technique automatically determines their membership. For efficient learning, we propose the algorithm SMVC using variational Bayesian methods. With experiments on various real-world data, we demonstrate SMVC's potential to detect multiple clustering views and its capability to improve the result by exploiting prior knowledge.
【Keywords】: constraints; semi-supervised learning; subspace clustering
【Paper Link】 【Pages】:263-272
【Authors】: Yashoteja Prabhu ; Manik Varma
【Abstract】: The objective in extreme multi-label classification is to learn a classifier that can automatically tag a data point with the most relevant subset of labels from a large label set. Extreme multi-label classification is an important research problem since not only does it enable the tackling of applications with many labels but it also allows the reformulation of ranking problems with certain advantages over existing formulations. Our objective, in this paper, is to develop an extreme multi-label classifier that is faster to train and more accurate at prediction than the state-of-the-art Multi-label Random Forest (MLRF) algorithm [2] and the Label Partitioning for Sub-linear Ranking (LPSR) algorithm [35]. MLRF and LPSR learn a hierarchy to deal with the large number of labels but optimize task independent measures, such as the Gini index or clustering error, in order to learn the hierarchy. Our proposed FastXML algorithm achieves significantly higher accuracies by directly optimizing an nDCG based ranking loss function. We also develop an alternating minimization algorithm for efficiently optimizing the proposed formulation. Experiments reveal that FastXML can be trained on problems with more than a million labels on a standard desktop in eight hours using a single core and in an hour using multiple cores.
【Keywords】: extreme classification; multi-label learning; ranking
【Paper Link】 【Pages】:273-282
【Authors】: Shaodan Zhai ; Tian Xia ; Shaojun Wang
【Abstract】: We present a direct multi-class boosting (DMCBoost) method for classification with the following properties: (i) instead of reducing the multi-class classification task to a set of binary classification tasks, DMCBoost directly solves the multi-class classification problem, and only requires very weak base classifiers; (ii) DMCBoost builds an ensemble classifier by directly optimizing the non-convex performance measures, including the empirical classification error and margin functions, without resorting to any upper bounds or approximations. As a non-convex optimization method, DMCBoost shows competitive or better results than state-of-the-art convex relaxation boosting methods, and it performs especially well on the noisy cases.
【Keywords】: boosting; classification; direct optimization; noise tolerance; supervised learning
【Paper Link】 【Pages】:283-292
【Authors】: Yashu Liu ; Jie Wang ; Jieping Ye
【Abstract】: Linear regression is a widely used tool in data mining and machine learning. In many applications, fitting a regression model with only linear effects may not be sufficient for predictive or explanatory purposes. One strategy which has recently received increasing attention in statistics is to include feature interactions to capture the nonlinearity in the regression model. Such model has been applied successfully in many biomedical applications. One major challenge in the use of such model is that the data dimensionality is significantly higher than the original data, resulting in the small sample size large dimension problem. Recently, weak hierarchical Lasso, a sparse interaction regression model, is proposed that produces sparse and hierarchical structured estimator by exploiting the Lasso penalty and a set of hierarchical constraints. However, the hierarchical constraints make it a non-convex problem and the existing method finds the solution of its convex relaxation, which needs additional conditions to guarantee the hierarchical structure. In this paper, we propose to directly solve the non-convex weak hierarchical Lasso by making use of the GIST (General Iterative Shrinkage and Thresholding) optimization framework which has been shown to be efficient for solving non-convex sparse formulations. The key step in GIST is to compute a sequence of proximal operators. One of our key technical contributions is to show that the proximal operator associated with the non-convex weak hierarchical Lasso admits a closed form solution. However, a naive approach for solving each subproblem of the proximal operator leads to a quadratic time complexity, which is not desirable for large size problems. To this end, we further develop an efficient algorithm for computing the subproblems with a linearithmic time complexity. We have conducted extensive experiments on both synthetic and real data sets. Results show that our proposed algorithm is much more efficient and effective than its convex relaxation.
【Keywords】: non-convex; proximal operator; sparse learning; weak hierarchical lasso
【Paper Link】 【Pages】:293-302
【Authors】: Doyen Sahoo ; Steven C. H. Hoi ; Bin Li
【Abstract】: Kernel-based regression represents an important family of learning techniques for solving challenging regression tasks with non-linear patterns. Despite being studied extensively, most of the existing work suffers from two major drawbacks: (i) they are often designed for solving regression tasks in a batch learning setting, making them not only computationally inefficient and but also poorly scalable in real-world applications where data arrives sequentially; and (ii) they usually assume a fixed kernel function is given prior to the learning task, which could result in poor performance if the chosen kernel is inappropriate. To overcome these drawbacks, this paper presents a novel scheme of Online Multiple Kernel Regression (OMKR), which sequentially learns the kernel-based regressor in an online and scalable fashion, and dynamically explore a pool of multiple diverse kernels to avoid suffering from a single fixed poor kernel so as to remedy the drawback of manual/heuristic kernel selection. The OMKR problem is more challenging than regular kernel-based regression tasks since we have to on-the-fly determine both the optimal kernel-based regressor for each individual kernel and the best combination of the multiple kernel regressors. In this paper, we propose a family of OMKR algorithms for regression and discuss their application to time series prediction tasks. We also analyze the theoretical bounds of the proposed OMKR method and conduct extensive experiments to evaluate its empirical performance on both real-world regression and times series prediction tasks.
【Keywords】: kernel regression; multiple kernel learning; online learning; time series prediction
【Paper Link】 【Pages】:303-312
【Authors】: Sihong Xie ; Jing Gao ; Wei Fan ; Deepak S. Turaga ; Philip S. Yu
【Abstract】: In data mining applications such as crowdsourcing and privacy-preserving data mining, one may wish to obtain consolidated predictions out of multiple models without access to features of the data. Besides, multiple models usually carry complementary predictive information, model combination can potentially provide more robust and accurate predictions by correcting independent errors from individual models. Various methods have been proposed to combine predictions such that the final predictions are maximally agreed upon by multiple base models. Though this maximum consensus principle has been shown to be successful, simply maximizing consensus can lead to less discriminative predictions and overfit the inevitable noise due to imperfect base models. We argue that proper regularization for model combination approaches is needed to alleviate such overfitting effect. Specifically, we analyze the hypothesis spaces of several model combination methods and identify the trade-off between model consensus and generalization ability. We propose a novel model called Regularized Consensus Maximization (RCM), which is formulated as an optimization problem to combine the maximum consensus and large margin principles. We theoretically show that RCM has a smaller upper bound on generalization error compared to the version without regularization. Experiments show that the proposed algorithm outperforms a wide spectrum of state-of-the-art model combination methods on 11 tasks.
【Keywords】: ensemble; generalization error; large margin
【Paper Link】 【Pages】:313-322
【Authors】: Teng Zhang ; Zhi-Hua Zhou
【Abstract】: Support vector machine (SVM) has been one of the most popular learning algorithms, with the central idea of maximizing the minimum margin, i.e., the smallest distance from the instances to the classification boundary. Recent theoretical results, however, disclosed that maximizing the minimum margin does not necessarily lead to better generalization performances, and instead, the margin distribution has been proven to be more crucial. In this paper, we propose the Large margin Distribution Machine (LDM), which tries to achieve a better generalization performance by optimizing the margin distribution. We characterize the margin distribution by the first- and second-order statistics, i.e., the margin mean and variance. The LDM is a general learning approach which can be used in any place where SVM can be applied, and its superiority is verified both theoretically and empirically in this paper.
【Keywords】: classification; margin distribution; minimum margin
【Paper Link】 【Pages】:323-332
【Authors】: Qi Qian ; Juhua Hu ; Rong Jin ; Jian Pei ; Shenghuo Zhu
【Abstract】: Distance metric learning (DML) aims to learn a distance metric better than Euclidean distance. It has been successfully applied to various tasks, e.g., classification, clustering and information retrieval. Many DML algorithms suffer from the over-fitting problem because of a large number of parameters to be determined in DML. In this paper, we exploit the dropout technique, which has been successfully applied in deep learning to alleviate the over-fitting problem, for DML. Different from the previous studies that only apply dropout to training data, we apply dropout to both the learned metrics and the training data. We illustrate that application of dropout to DML is essentially equivalent to matrix norm based regularization. Compared with the standard regularization scheme in DML, dropout is advantageous in simulating the structured regularizers which have shown consistently better performance than non structured regularizers. We verify, both empirically and theoretically, that dropout is effective in regulating the learned metric to avoid the over-fitting problem. Last, we examine the idea of wrapping the dropout technique in the state-of-art DML methods and observe that the dropout technique can significantly improve the performance of the original DML methods.
【Keywords】: distance metric learning; dropout
【Paper Link】 【Pages】:333-342
【Authors】: Siong Thye Goh ; Cynthia Rudin
【Abstract】: The vast majority of real world classification problems are imbalanced, meaning there are far fewer data from the class of interest (the positive class) than from other classes. We propose two machine learning algorithms to handle highly imbalanced classification problems. The classifiers are disjunctions of conjunctions, and are created as unions of parallel axis rectangles around the positive examples, and thus have the benefit of being interpretable. The first algorithm uses mixed integer programming to optimize a weighted balance between positive and negative class accuracies. Regularization is introduced to improve generalization performance. The second method uses an approximation in order to assist with scalability. Specifically, it follows a \textit{characterize then discriminate} approach, where the positive class is characterized first by boxes, and then each box boundary becomes a separate discriminative classifier. This method has the computational advantages that it can be easily parallelized, and considers only the relevant regions of feature space.
【Keywords】: classification; decision trees; imbalanced data
【Paper Link】 【Pages】:343-352
【Authors】: Cheng-Hao Tsai ; Chieh-Yen Lin ; Chih-Jen Lin
【Abstract】: In classification, if a small number of instances is added or removed, incremental and decremental techniques can be applied to quickly update the model. However, the design of incremental and decremental algorithms involves many considerations. In this paper, we focus on linear classifiers including logistic regression and linear SVM because of their simplicity over kernel or other methods. By applying a warm start strategy, we investigate issues such as using primal or dual formulation, choosing optimization methods, and creating practical implementations. Through theoretical analysis and practical experiments, we conclude that a warm start setting on a high-order optimization method for primal formulations is more suitable than others for incremental and decremental learning of linear classification.
【Keywords】: decremental learning; incremental learning; linear classification; warm start
【Paper Link】 【Pages】:353-361
【Authors】: Junbo Zhang ; Guangjian Tian ; Yadong Mu ; Wei Fan
【Abstract】: Deep learning well demonstrates its potential in learning latent feature representations. Recent years have witnessed an increasing enthusiasm for regularizing deep neural networks by incorporating various side information, such as user-provided labels or pairwise constraints. However, the effectiveness and parameter sensitivity of such algorithms have been major obstacles for putting them into practice. The major contribution of our work is the exposition of a novel supervised deep learning algorithm, which distinguishes from two unique traits. First, it regularizes the network construction by utilizing similarity or dissimilarity constraints between data pairs, rather than sample-specific annotations. Such kind of side information is more flexible and greatly mitigates the workload of annotators. Secondly, unlike prior works, our proposed algorithm decouples the supervision information and intrinsic data structure. We design two heterogeneous networks, each of which encodes either supervision or unsupervised data structure respectively. Specifically, we term the supervision-oriented network as "auxiliary network" since it is principally used for facilitating the parameter learning of the other one and will be removed when handling out-of-sample data. The two networks are complementary to each other and bridged by enforcing the correlation of their parameters. We name the proposed algorithm SUpervision-Guided AutoencodeR (SUGAR). Comparing prior works on unsupervised deep networks and supervised learning, SUGAR better balances numerical tractability and the flexible utilization of supervision information. The classification performance on MNIST digits and eight benchmark datasets demonstrates that SUGAR can effectively improve the performance by using the auxiliary networks, on both shallow and deep architectures. Particularly, when multiple SUGARs are stacked, the performance is significantly boosted. On the selected benchmarks, ours achieve up to 11.35% relative accuracy improvement compared to the state-of-the-art models.
【Keywords】: autoencoder; deep neural networks; supervision
【Paper Link】 【Pages】:362-371
【Authors】: Tahereh Babaie ; Sanjay Chawla ; Romesh G. Abeysuriya
【Abstract】: We introduce a new problem, the Online Selective Anomaly Detection (OSAD), to model a specific scenario emerging from research in sleep science. Scientists have segmented sleep into several stages and stage two is characterized by two patterns (or anomalies) in the EEG time series recorded on sleep subjects. These two patterns are sleep spindle (SS) and K-complex. The OSAD problem was introduced to design a residual system, where all anomalies (known and unknown) are detected but the system only triggers an alarm when non-SS anomalies appear. The solution of the OSAD problem required us to combine techniques from both data mining and control theory. Experiments on data from real subjects attest to the effectiveness of our approach.
【Keywords】: anomaly/novelty detection; dynamic residue model; mining rich data types; sleep EEG anomalies
【Paper Link】 【Pages】:372-381
【Authors】: Qi Rose Yu ; Xinran He ; Yan Liu
【Abstract】: Traditional anomaly detection on social media mostly focuses on individual point anomalies while anomalous phenomena usually occur in groups. Therefore it is valuable to study the collective behavior of individuals and detect group anomalies. Existing group anomaly detection approaches rely on the assumption that the groups are known, which can hardly be true in real world social media applications. In this paper, we take a generative approach by proposing a hierarchical Bayes model: Group Latent Anomaly Detection (GLAD) model. GLAD takes both pair-wise and point-wise data as input, automatically infers the groups and detects group anomalies simultaneously. To account for the dynamic properties of the social media data, we further generalize GLAD to its dynamic extension d-GLAD. We conduct extensive experiments to evaluate our models on both synthetic and real world datasets. The empirical results demonstrate that our approach is effective and robust in discovering latent groups and detecting group anomalies.
【Keywords】: anomaly detection; hierarchical bayes modeling; social media analysis
【Paper Link】 【Pages】:382-391
【Authors】: Dehua Cheng ; Mohammad Taha Bahadori ; Yan Liu
【Abstract】: Discovering temporal dependence structure from multivariate time series has established its importance in many applications. We observe that when we look in reversed order of time, the temporal dependence structure of the time series is usually preserved after switching the roles of cause and effect. Inspired by this observation, we create a new time series by reversing the time stamps of original time series and combine both time series to improve the performance of temporal dependence recovery. We also provide theoretical justification for the proposed algorithm for several existing time series models. We test our approach on both synthetic and real world datasets. The experimental results confirm that this surprisingly simple approach is indeed effective under various circumstances.
【Keywords】: generalized linear model; granger causality; time series analysis
【Paper Link】 【Pages】:392-401
【Authors】: Josif Grabocka ; Nicolas Schilling ; Martin Wistuba ; Lars Schmidt-Thieme
【Abstract】: Shapelets are discriminative sub-sequences of time series that best predict the target variable. For this reason, shapelet discovery has recently attracted considerable interest within the time-series research community. Currently shapelets are found by evaluating the prediction qualities of numerous candidates extracted from the series segments. In contrast to the state-of-the-art, this paper proposes a novel perspective in terms of learning shapelets. A new mathematical formalization of the task via a classification objective function is proposed and a tailored stochastic gradient learning algorithm is applied. The proposed method enables learning near-to-optimal shapelets directly without the need to try out lots of candidates. Furthermore, our method can learn true top-K shapelets by capturing their interaction. Extensive experimentation demonstrates statistically significant improvement in terms of wins and ranks against 13 baselines over 28 time-series datasets.
【Keywords】: shapelets; supervised feature extraction; time-series classification
【Paper Link】 【Pages】:402-411
【Authors】: Mohamed F. Ghalwash ; Vladan Radosavljevic ; Zoran Obradovic
【Abstract】: Early classification of time series is prevalent in many time-sensitive applications such as, but not limited to, early warning of disease outcome and early warning of crisis in stock market. \textcolor{black}{ For example,} early diagnosis allows physicians to design appropriate therapeutic strategies at early stages of diseases. However, practical adaptation of early classification of time series requires an easy to understand explanation (interpretability) and a measure of confidence of the prediction results (uncertainty estimates). These two aspects were not jointly addressed in previous time series early classification studies, such that a difficult choice of selecting one of these aspects is required. In this study, we propose a simple and yet effective method to provide uncertainty estimates for an interpretable early classification method. The question we address here is "how to provide estimates of uncertainty in regard to interpretable early prediction." In our extensive evaluation on twenty time series datasets we showed that the proposed method has several advantages over the state-of-the-art method that provides reliability estimates in early classification. Namely, the proposed method is more effective than the state-of-the-art method, is simple to implement, and provides interpretable results.
【Keywords】: earliness; interpretability; reliability; time series; uncertainty
【Paper Link】 【Pages】:412-421
【Authors】: Junming Shao ; Zahra Ahmadi ; Stefan Kramer
【Abstract】: Data stream mining has gained growing attentions due to its wide emerging applications such as target marketing, email filtering and network intrusion detection. In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion. Instead of learning a single model on a sliding window or ensemble learning, SyncStream captures evolving concepts by dynamically maintaining a set of prototypes in a new data structure called the P-tree. The prototypes are obtained by error-driven representativeness learning and synchronization-inspired constrained clustering. To identify abrupt concept drift in data streams, PCA and statistics based heuristic approaches are employed. SyncStream has several attractive benefits: (a) It is capable of dynamically modeling evolving concepts from even a small set of prototypes and is robust against noisy examples. (b) Owing to synchronization-based constrained clustering and the P-Tree, it supports an efficient and effective data representation and maintenance. (c) Gradual and abrupt concept drift can be effectively detected. Empirical results shows that our method achieves good predictive performance compared to state-of-the-art algorithms and that it requires much less time than another instance-based stream mining algorithm.
【Keywords】: classification; concept drift; data stream; synchronization
【Paper Link】 【Pages】:422-431
【Authors】: Yanwei Yu ; Lei Cao ; Elke A. Rundensteiner ; Qin Wang
【Abstract】: The detection of abnormal moving objects over high-volume trajectory streams is critical for real time applications ranging from military surveillance to transportation management. Yet this problem remains largely unexplored. In this work, we first propose classes of novel trajectory outlier definitions that model the anomalous behavior of moving objects for a large range of real time applications. Our theoretical analysis and empirical study on the Beijing Taxi and GMTI (Ground Moving Target Indicator) datasets demonstrate its effectiveness in capturing abnormal moving objects. Furthermore we propose a general strategy for efficiently detecting the new outlier classes. It features three fundamental optimization principles designed to minimize the detection costs. Our comprehensive experimental studies demonstrate that our proposed strategy drives the detection costs 100-fold down into practical realm for applications producing high volume trajectory streams to utilize.
【Keywords】: moving object; outlier; trajectory stream
【Paper Link】 【Pages】:432-441
【Authors】: Charu C. Aggarwal
【Abstract】: In many applications, classification labels may not be associated with a single instance of records, but may be associated with a data set of records. The class behavior may not be possible to infer effectively from a single record, but may be only be inferred by an aggregate set of records. Therefore, in this problem, the class label is associated with a set of instances both in the training and test data. Therefore, the problem may be understood to be that of classifying a set of data sets. Typically, the classification behavior may only be inferred from the overall patterns of data distribution, and very little information is embedded in any given record for classification purposes. We refer to this problem as the setwise classification problem. The problem can be extremely challenging in scenarios where the data is received in the form of a stream, and the records within any particular data set may not necessarily be received contiguously. In this paper, we present a first approach for real time and streaming classification of such data. We present experimental results illustrating the effectiveness of the approach.
【Keywords】: data classification; data streams
【Paper Link】 【Pages】:442-451
【Authors】: Daniel Ting
【Abstract】: Counting the number of distinct elements in a large dataset is a common task in web applications and databases. This problem is difficult in limited memory settings where storing a large hash table table is intractable. This paper advances the state of the art in probabilistic methods for estimating the number of distinct elements in a streaming setting New streaming algorithms are given that provably beat the "optimal" errors for Min-count and HyperLogLog while using the same sketch. This paper also contributes to the understanding and theory of probabilistic cardinality estimation introducing the concept of an area cutting process and the martingale estimator. These ideas lead to theoretical analyses of both old and new sketches and estimators and show the new estimators are optimal for several streaming settings while also providing accurate error bounds that match those obtained via simulation. Furthermore, the area cutting process provides a geometric intuition behind all methods for counting distinct elements which are not affected by duplicates. This intuition leads to a new sketch, Discrete Max-count, and the analysis of a class of sketches, self-similar area cutting decompositions that have attractive properties and unbiased estimators for both streaming and non-streaming settings. Together, these contributions lead to multi-faceted advances in sketch construction, cardinality and error estimation, the theory, and intuition for the problem of approximate counting of distinct elements for both the streaming and non-streaming cases.
【Keywords】: cardinality estimation; distinct elements; martingale; randomized algorithms
【Paper Link】 【Pages】:452-461
【Authors】: Andrew S. Lan ; Christoph Studer ; Richard G. Baraniuk
【Abstract】: We propose SPARFA-Trace, a new machine learning-based framework for time-varying learning and content analytics for educational applications. We develop a novel message passing-based, blind, approximate Kalman filter for sparse factor analysis (SPARFA) that jointly traces learner concept knowledge over time, analyzes learner concept knowledge state transitions (induced by interacting with learning resources, such as textbook sections, lecture videos, etc., or the forgetting effect), and estimates the content organization and difficulty of the questions in assessments. These quantities are estimated solely from binary-valued (correct/incorrect) graded learner response data and the specific actions each learner performs (e.g., answering a question or studying a learning resource) at each time instant. Experimental results on two online course datasets demonstrate that SPARFA-Trace is capable of tracing each learner's concept knowledge evolution over time, analyzing the quality and content organization of learning resources, and estimating the question--concept associations and the question difficulties. Moreover, we show that SPARFA-Trace achieves comparable or better performance in predicting unobserved learner responses compared to existing collaborative filtering and knowledge tracing methods.
【Keywords】: expectation maximization; kalman filter; learning analytics; personalized learning; sparse factor analysis
【Paper Link】 【Pages】:462-471
【Authors】: Dan Kushnir
【Abstract】: This paper presents an efficient active-transductive approach for classification. A common approach of active learning algorithms is to focus on querying points near the class boundary in order to refine it. However, for certain data distributions, this approach has been shown to lead to uninformative samples. More recent approaches consider combining data exploration with traditional refinement techniques. These techniques typically require tuning sampling of unexplored regions with refinement of detected class boundaries. They also involve significant computational costs for the exploration of informative query candidates. We present a novel iterative active learning algorithm designed to overcome these shortcomings by using a linear running-time active-transductive learning approach that naturally switches from exploration to refinement. The passive classifier employed in our algorithm builds a random-walk on the data graph based on a modified graph geometry that combines the data distribution with current label hypothesis; while the query component uses the uncertainty of the evolving hypothesis. Our supporting theory draws the link between the spectral properties of our iteration matrix and a solution to the minimal-cut problem for a fused hypothesis-data graph. Experiments demonstrate computational complexity that is orders of magnitude lower than state-of-the-art, and competitive results on benchmark data and real churn prediction data.
【Keywords】: active learning; classification; graph; transductive learning; uncertainty sampling
【Paper Link】 【Pages】:472-481
【Authors】: Deepak Vasisht ; Andreas C. Damianou ; Manik Varma ; Ashish Kapoor
【Abstract】: We study the problem of active learning for multilabel classification. We focus on the real-world scenario where the average number of positive (relevant) labels per data point is small leading to positive label sparsity. Carrying out mutual information based near-optimal active learning in this setting is a challenging task since the computational complexity involved is exponential in the total number of labels. We propose a novel inference algorithm for the sparse Bayesian multilabel model of [17]. The benefit of this alternate inference scheme is that it enables a natural approximation of the mutual information objective. We prove that the approximation leads to an identical solution to the exact optimization problem but at a fraction of the optimization cost. This allows us to carry out efficient, non-myopic, and near-optimal active learning for sparse multilabel classification. Extensive experiments reveal the effectiveness of the method.
【Keywords】: active learning; multi-label learning; mutual information
【Paper Link】 【Pages】:482-491
【Authors】: De Wang ; Feiping Nie ; Heng Huang
【Abstract】: Most semi-supervised learning models propagate the labels over the Laplacian graph, where the graph should be built beforehand. However, the computational cost of constructing the Laplacian graph matrix is very high. On the other hand, when we do classification, data points lying around the decision boundary (boundary points) are noisy for learning the correct classifier and deteriorate the classification performance. To address these two challenges, in this paper, we propose an adaptive semi-supervised learning model. Different from previous semi-supervised learning approaches, our new model needn't construct the graph Laplacian matrix. Thus, our method avoids the huge computational cost required by previous methods, and achieves a computational complexity linear to the number of data points. Therefore, our method is scalable to large-scale data. Moreover, the proposed model adaptively suppresses the weights of boundary points, such that our new model is robust to the boundary points. An efficient algorithm is derived to alternatively optimize the model parameter and class probability distribution of the unlabeled data, such that the induction of classifier and the transduction of labels are adaptively unified into one framework. Extensive experimental results on six real-world data sets show that the proposed semi-supervised learning model outperforms other related methods in most cases.
【Keywords】: large-scale semi-supervised learning; semi-supervised learning; unified inductive and transductive model
【Paper Link】 【Pages】:492-501
【Authors】: Akshay Gadde ; Aamir Anis ; Antonio Ortega
【Abstract】: We consider the problem of offline, pool-based active semi-supervised learning on graphs. This problem is important when the labeled data is scarce and expensive whereas unlabeled data is easily available. The data points are represented by the vertices of an undirected graph with the similarity between them captured by the edge weights. Given a target number of nodes to label, the goal is to choose those nodes that are most informative and then predict the unknown labels. We propose a novel framework for this problem based on our recent results on sampling theory for graph signals. A graph signal is a real-valued function defined on each node of the graph. A notion of frequency for such signals can be defined using the spectrum of the graph Laplacian matrix. The sampling theory for graph signals aims to extend the traditional Nyquist-Shannon sampling theory by allowing us to identify the class of graph signals that can be reconstructed from their values on a subset of vertices. This approach allows us to define a criterion for active learning based on sampling set selection which aims at maximizing the frequency of the signals that can be reconstructed from their samples on the set. Experiments show the effectiveness of our method.
【Keywords】: active semi-supervised learning; graph signal filtering; graph signal processing; sampling theory
【Paper Link】 【Pages】:502-511
【Authors】: Jialei Wang ; Nathan Srebro ; James Evans
【Abstract】: We consider the problem of Collaborative Permutation Recovery, i.e. recovering multiple permutations over objects (e.g. preference rankings over different options) from limited pairwise comparisons. We tackle both the problem of how to recover multiple related permutations from limited observations, and the active learning problem of which pairwise comparison queries to ask so as to allow better recovery. There has been much work on recovering single permutations from pairwise comparisons, but we show that considering several related permutations jointly we can leverage their relatedness so as to reduce the number of comparisons needed compared to reconstructing each permutation separately. To do so, we take a collaborative filtering / matrix completion approach and use a trace-norm or max-norm regularized matrix learning model. Our approach can also be seen as a collaborative learning version of Jamieson and Nowak's recent work on constrained permutation recovery, where instead of basing the recovery on known features, we learn the best features de novo.
【Keywords】: active learning; collaborative ranking; matrix factorization
【Paper Link】 【Pages】:512-521
【Authors】: Xuan Vinh Nguyen ; Jeffrey Chan ; Simone Romano ; James Bailey
【Abstract】: Most current mutual information (MI) based feature selection techniques are greedy in nature thus are prone to sub-optimal decisions. Potential performance improvements could be gained by systematically posing MI-based feature selection as a global optimization problem. A rare attempt at providing a global solution for the MI-based feature selection is the recently proposed Quadratic Programming Feature Selection (QPFS) approach. We point out that the QPFS formulation faces several non-trivial issues, in particular, how to properly treat feature `self-redundancy' while ensuring the convexity of the objective function. In this paper, we take a systematic approach to the problem of global MI-based feature selection. We show how the resulting NP-hard global optimization problem could be efficiently approximately solved via spectral relaxation and semi-definite programming techniques. We experimentally demonstrate the efficiency and effectiveness of these novel feature selection frameworks.
【Keywords】: feature selection; global optimization; mutual information; semi-definite programming; spectral relaxation
【Paper Link】 【Pages】:522-531
【Authors】: Zhixiang Eddie Xu ; Gao Huang ; Kilian Q. Weinberger ; Alice X. Zheng
【Abstract】: A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straight-forward to implement as it is based on a modification of Gradient Boosted Trees. We evaluate GBFS on several real world data sets and show that it matches or outperforms other state of the art feature selection algorithms. Yet it scales to larger data set sizes and naturally allows for domain-specific side information.
【Keywords】: feature selection; gradient boosting; large-scale
【Paper Link】 【Pages】:532-541
【Authors】: Shuo Xiang ; Tao Yang ; Jieping Ye
【Abstract】: Selecting an informative subset of features has important applications in many data mining tasks especially for high-dimensional data. Recently, simultaneous selection of features and feature groups (a.k.a bi-level selection) becomes increasingly popular since it not only reduces the number of features but also unveils the underlying grouping effect in the data, which is a valuable functionality in many applications such as bioinformatics and web data mining. One major challenge of bi-level selection (or even feature selection only) is that computing a globally optimal solution requires a prohibitive computational cost. To overcome such a challenge, current research mainly falls into two categories. The first one focuses on finding suitable continuous computational surrogates for the discrete functions and this leads to various convex and nonconvex optimization models. Although efficient, convex models usually deliver sub-optimal performance while nonconvex models on the other hand require significantly more computational effort. Another direction is to use greedy algorithms to solve the discrete optimization directly. However, existing algorithms are proposed to handle single-level selection only and it remains challenging to extend these methods to handle bi-level selection. In this paper, we fulfill this gap by introducing an efficient sparse group hard thresholding algorithm. Our main contributions are: (1) we propose a novel bi-level selection model and show that the key combinatorial problem admits a globally optimal solution using dynamic programming; (2) we provide an error bound between our solution and the globally optimal under the RIP (Restricted Isometry Property) theoretical framework. Our experiments on synthetic and real data demonstrate that the proposed algorithm produces encouraging performance while keeping comparable computational efficiency to convex relaxation models.
【Keywords】: bi-level learning; com- binatorics; dynamic programming; feature selection; optimization; supervised learning
【Paper Link】 【Pages】:542-551
【Authors】: Zheng Zhao ; Jun Liu ; James Cox
【Abstract】: Sparse support vector machine (SVM) is a robust predictive model that can effectively remove noise and preserve signals. Like Lasso, it can efficiently learn a solution path based on a set of predefined parameters and therefore provides strong support for model selection. Sparse SVM has been successfully applied in a variety of data mining applications including text mining, bioinformatics, and image processing. The emergence of big-data analysis poses new challenges for model selection with large-scale data that consist of tens of millions samples and features. In this paper, a novel screening technique is proposed to accelerate model selection for l1-regularized l2-SVM and effectively improve its scalability. This technique can precisely identify inactive features in the optimal solution of a l1-regularized l2-SVM model and remove them before training. The technique makes use of the variational inequality and provides a closed-form solution for screening inactive features in different situations. Every feature that is removed by the screening technique is guaranteed to be inactive in the optimal solution. Therefore, when l1-regularized l2-SVM uses the features selected by the technique, it achieves exactly the same result as when it uses the full feature set. Because the technique can remove a large number of inactive features, it can greatly increase the efficiency of model selection for l1-regularized l2-SVM. Experimental results on five high-dimensional benchmark data sets demonstrate the power of the proposed technique.
【Keywords】: feature selection; screening; sparse support vector machine
【Paper Link】 【Pages】:552-561
【Authors】: Sanjay Purushotham ; Martin Renqiang Min ; C.-C. Jay Kuo ; Rachel Ostroff
【Abstract】: Identifying interpretable discriminative high-order feature interactions given limited training data in high dimensions is challenging in both machine learning and data mining. In this paper, we propose a factorization based sparse learning framework termed FHIM for identifying high-order feature interactions in linear and logistic regression models, and study several optimization methods for solving them. Unlike previous sparse learning methods, our model FHIM recovers both the main effects and the interaction terms accurately without imposing tree-structured hierarchical constraints. Furthermore, we show that FHIM has oracle properties when extended to generalized linear regression models with pairwise interactions. Experiments on simulated data show that FHIM outperforms the state-of-the-art sparse lear-ning techniques. Further experiments on our experimentally generated data from patient blood samples using a novel SOMAmer (Slow Off-rate Modified Aptamer) technology show that, FHIM performs blood-based cancer diagnosis and bio-marker discovery for Renal Cell Carcinoma much better than other competing methods, and it identifies interpretable block-wise high-order gene interactions predictive of cancer stages of samples. A literature survey shows that the interactions identified by FHIM play important roles in cancer development.
【Keywords】: feature selection; sparse learning
【Paper Link】 【Pages】:562-571
【Authors】: Dehua Cheng ; Yan Liu
【Abstract】: The hierarchical Dirichlet process (HDP) is an intuitive and elegant technique to model data with latent groups. However, it has not been widely used for practical applications due to the high computational costs associated with inference. In this paper, we propose an effective parallel Gibbs sampling algorithm for HDP by exploring its connections with the gamma-gamma-Poisson process. Specifically, we develop a novel framework that combines bootstrap and Reversible Jump MCMC algorithm to enable parallel variable updates. We also provide theoretical convergence analysis based on Gibbs sampling with asynchronous variable updates. Experiment results on both synthetic datasets and two large-scale text collections show that our algorithm can achieve considerable speedup as well as better inference accuracy for HDP compared with existing parallel sampling algorithms.
【Keywords】: hierarchical dirichlet process; parallel inference; topic model
【Paper Link】 【Pages】:572-581
【Authors】: Tamraparni Dasu ; Ji Meng Loh ; Divesh Srivastava
【Abstract】: Data glitches are unusual observations that do not conform to data quality expectations, be they logical, semantic or statistical. By applying data integrity constraints, potentially large sections of data could be flagged as being noncompliant. Ignoring or repairing significant sections of the data could fundamentally bias the results and conclusions drawn from analyses. In the context of Big Data where large numbers and volumes of feeds from disparate sources are integrated, it is likely that significant portions of seemingly noncompliant data are actually legitimate usable data. In this paper, we introduce the notion of Empirical Glitch Explanations - concise, multi-dimensional descriptions of subsets of potentially dirty data - and propose a scalable method for empirically generating such explanatory characterizations. The explanations could serve two valuable functions: (1) Provide a way of identifying legitimate data and releasing it back into the pool of clean data. In doing so, we reduce cleaning-related statistical distortion of the data; (2) Used to refine existing data quality constraints and generate and formalize domain knowledge. We conduct experiments using real and simulated data to demonstrate the scalability of our method and the robustness of explanations. In addition, we use two real world examples to demonstrate the utility of the explanations where we reclaim over 99% of the suspicious data, keeping data repair related statistical distortion close to 0.
【Keywords】: crossover subsampling; data quality; glitch explanations; quantitative data cleaning
【Paper Link】 【Pages】:582-590
【Authors】: Hongxia Yang ; Jingrui He
【Abstract】: Traditional data mining techniques are designed to model a single type of heterogeneity, such as multi-task learning for modeling task heterogeneity, multi-view learning for modeling view heterogeneity, etc. Recently, a variety of real applications emerged, which exhibit dual heterogeneity, namely both task heterogeneity and view heterogeneity. Examples include insider threat detection across multiple organizations, web image classification in different domains, etc. Existing methods for addressing such problems typically assume that multiple tasks are equally related and multiple views are equally consistent, which limits their application in complex settings with varying task relatedness and view consistency. In this paper, we advance state-of-the-art techniques by adaptively modeling task relatedness and view consistency via a nonparametric Bayes model: we model task relatedness using normal penalty with sparse covariances, and view consistency using matrix Dirichlet process. Based on this model, we propose the NOBLE algorithm using an efficient Gibbs sampler. Experimental results on multiple real data sets demonstrate the effectiveness of the proposed algorithm.
【Keywords】: Gibbs sampler; multi-task multi-view; nonparametric Bayes modeling
【Paper Link】 【Pages】:591-600
【Authors】: Chien-Liang Liu ; Tsung-Hsun Tsai ; Chia-Hoang Lee
【Abstract】: Processing large volumes of streaming data in near-real-time is becoming increasingly important as the Internet, sensor networks and network traffic grow. Online machine learning is a typical means of dealing with streaming data, since it allows the classification model to learn one instance of data at a time. Although many online learning methods have been developed since the development of the Perceptron algorithm, existing online methods assume that the number of classes is available in advance of classification process. However, this assumption is unrealistic for large scale or streaming data sets. This work proposes an online Chinese restaurant process (CRP) algorithm, which is an online and nonparametric algorithm, to tackle this problem. This work proposes a relaxing function as part of the prior and updates the parameters with the likelihood function in terms of the consistency between the true label information and predicted result. This work presents two Gibbs sampling algorithms to perform posterior inference. In the experiments, the online CRP is applied to three massive data sets, and compared with several online learning and batch learning algorithms. One of the data sets is obtained from Wikipedia, which comprises approximately two million documents. The experimental results reveal that the proposed online CRP performs well and efficiently on massive data sets. Finally, this work proposes two methods to update the hyperparameter $\alpha$ of the online CRP. The first method is based on the posterior distribution of $\alpha$, and the second exploits the property of online learning, namely adapting to change, to adjust $\alpha$ dynamically.
【Keywords】: adaptive learning; chinese restaurant process; nonparametric; online learning
【Paper Link】 【Pages】:601-610
【Authors】: Xin Dong ; Evgeniy Gabrilovich ; Geremy Heitz ; Wilko Horn ; Ni Lao ; Kevin Murphy ; Thomas Strohmann ; Shaohua Sun ; Wei Zhang
【Abstract】: Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft's Satori, and Google's Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
【Keywords】: information extraction; knowledge bases; machine learning; probabilistic models
【Paper Link】 【Pages】:611-620
【Authors】: Shusen Wang ; Chao Zhang ; Hui Qian ; Zhihua Zhang
【Abstract】: The Nystrom method is an efficient approach to enabling large-scale kernel methods. The Nystrom method generates a fast approximation to any large-scale symmetric positive semidefinete (SPSD) matrix using only a few columns of the SPSD matrix. However, since the Nystrom approximation is low-rank, when the spectrum of the SPSD matrix decays slowly, the Nystrom approximation is of low accuracy. In this paper, we propose a variant of the Nystrom method called the modified Nystrom by spectral shifting (SS-Nystrom). The SS-Nystrom method works well no matter whether the spectrum of SPSD matrix decays fast or slow. We prove that our SS-Nystrom has a much stronger error bound than the standard and modified Nystrom methods, and that SS-Nystrom can be even more accurate than the truncated SVD of the same scale in some cases. We also devise an algorithm such that the SS-Nystrom approximation can be computed nearly as efficient as the modified Nystrom approximation. Finally, our SS-Nystrom method demonstrates significant improvements over the standard and modified Nystrom methods on several real-world datasets.
【Keywords】: kernel approximation; large-scale machine learning; the nyström method
【Paper Link】 【Pages】:621-630
【Authors】: Wenlin Chen ; Yixin Chen ; Kilian Q. Weinberger
【Abstract】: In this paper, we propose a novel supervised learning method, Fast Flux Discriminant (FFD), for large-scale nonlinear classification. Compared with other existing methods, FFD has unmatched advantages, as it attains the efficiency and interpretability of linear models as well as the accuracy of nonlinear models. It is also sparse and naturally handles mixed data types. It works by decomposing the kernel density estimation in the entire feature space into selected low-dimensional subspaces. Since there are many possible subspaces, we propose a submodular optimization framework for subspace selection. The selected subspace predictions are then transformed to new features on which a linear model can be learned. Besides, since the transformed features naturally expect non-negative weights, we only require smooth optimization even with the L1 regularization. Unlike other nonlinear models such as kernel methods, the FFD model is interpretable as it gives importance weights on the original features. Its training and testing are also much faster than traditional kernel models. We carry out extensive empirical studies on real-world datasets and show that the proposed model achieves state-of-the-art classification results with sparsity, interpretability, and exceptional scalability. Our model can be learned in minutes on datasets with millions of samples, for which most existing nonlinear methods will be prohibitively expensive in space and time.
【Keywords】: classification; interpretability; sparsity; submodularity
【Paper Link】 【Pages】:631-640
【Authors】: Mingwang Tang ; Feifei Li
【Abstract】: Histogram construction is a fundamental problem in data management, and a good histogram supports numerous mining operations. Recent work has extended histograms to probabilistic data. However, constructing histograms for probabilistic data can be extremely expensive, and existing studies suffer from limited scalability. This work designs novel approximation methods to construct scalable histograms on probabilistic data. We show that our methods provide constant approximations compared to the optimal histograms produced by the state-of-the-art in the worst case. We also extend our methods to parallel and distributed settings so that they can run gracefully in a cluster of commodity machines. We introduced novel synopses to reduce communication cost when running our methods in such settings. Extensive experiments on large real data sets have demonstrated the superb scalability and efficiency achieved by our methods, when compared to the state-of-the-art methods. They also achieved excellent approximation quality in practice.
【Keywords】: histogram; probabilistic database; scalable method
【Paper Link】 【Pages】:641-650
【Authors】: Flavio Chierichetti ; Nilesh N. Dalvi ; Ravi Kumar
【Abstract】: Correlation clustering is a basic primitive in data miner's toolkit with applications ranging from entity matching to social network analysis. The goal in correlation clustering is, given a graph with signed edges, partition the nodes into clusters to minimize the number of disagreements. In this paper we obtain a new algorithm for correlation clustering. Our algorithm is easily implementable in computational models such as MapReduce and streaming, and runs in a small number of rounds. In addition, we show that our algorithm obtains an almost 3-approximation to the optimal correlation clustering. Experiments on huge graphs demonstrate the scalability of our algorithm and its applicability to data mining problems.
【Keywords】: scalable clustering; signed networks.
【Paper Link】 【Pages】:651-660
【Authors】: Christos Anagnostopoulos ; Peter Triantafillou
【Abstract】: Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine (`Godzilla'), which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla by employing a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses much smaller partitions of the original dataset? If so, it would be preferable for obvious performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which is the desired subset of cohorts to engage per imputation? But efficiency and scalability is just one key concern! Is it possible to do the above while ensuring comparable or even better than Godzilla's imputation estimation errors? In this paper we derive answers to these fundamentals questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to that of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcase our efficiency, scalability, and accuracy claims.
【Keywords】: big data; clustering; missing value
【Paper Link】 【Pages】:661-670
【Authors】: Mu Li ; Tong Zhang ; Yuqiang Chen ; Alexander J. Smola
【Abstract】: Stochastic gradient descent (SGD) is a popular technique for large-scale optimization problems in machine learning. In order to parallelize SGD, minibatch training needs to be employed to reduce the communication cost. However, an increase in minibatch size typically decreases the rate of convergence. This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch. We prove that the convergence rate does not decrease with increasing minibatch size. Experiments demonstrate that with suitable implementations of approximate optimization, the resulting algorithm can outperform standard SGD in many scenarios.
【Keywords】: big data; distributed computing; machine learning; minibatch; stochastic gradient descent
【Paper Link】 【Pages】:671-680
【Authors】: Ashwinkumar Badanidiyuru ; Baharan Mirzasoleiman ; Amin Karbasi ; Andreas Krause
【Abstract】: How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. I.e., we would like to select a subset of say k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with constant factor 1/2-ε approximation guarantee to the optimum solution, requiring only a single pass through the data, and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
【Keywords】: streaming algorithms; submodular functions
【Paper Link】 【Pages】:681-690
【Authors】: Edith Cohen
【Abstract】: Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as measurements, snapshots of a system, content, traffic matrices, and activity logs are collected repeatedly. Random sampling, which can be efficiently performed over streamed or distributed data, is an important tool for scalable data analysis. The sample constitutes an extremely flexible summary, which naturally supports domain queries and scalable estimation of statistics, which can be specified after the sample is generated. The effectiveness of a sample as a summary, however, hinges on the estimators we have. We derive novel estimators for estimating $L_p$ distance from sampled data. Our estimators apply with the most common weighted sampling schemes: Poisson Probability Proportional to Size (PPS) and its fixed sample size variants. They also apply when the samples of different data sets are independent or coordinated. Our estimators are admissible (Pareto optimal in terms of variance) and have compelling properties. We study the performance of our Manhattan and Euclidean distance (p=1,2) estimators on diverse datasets, demonstrating scalability and accuracy even when a small fraction of the data is sampled. Our work, for the first time, facilitates effective distance estimation over sampled data.
【Keywords】: bottom-k sampling; distance queries; euclidean distance; manhattan distance; poisson sampling
【Paper Link】 【Pages】:691-700
【Authors】: Yi Li ; Zhengyu Wang ; David P. Woodruff
【Abstract】: We study the problem of determining if an input matrix A εRm x n can be well-approximated by a low rank matrix. Specifically, we study the problem of quickly estimating the rank or stable rank of A, the latter often providing a more robust measure of the rank. Since we seek significantly sublinear time algorithms, we cast these problems in the property testing framework. In this framework, A either has low rank or stable rank, or is far from having this property. The algorithm should read only a small number of entries or rows of A and decide which case A is in with high probability. If neither case occurs, the output is allowed to be arbitrary. We consider two notions of being far: (1) A requires changing at least an ε-fraction of its entries, or (2) A requires changing at least an ε-fraction of its rows. We call the former the "entry model" and the latter the "row model". We show: For testing if a matrix has rank at most d in the entry model, we improve the previous number of entries of A that need to be read from O(d2/ε2) (Krauthgamer and Sasson, SODA 2003) to O(d2/ε). Our algorithm is the first to adaptively query the entries of A, which for constant d we show is necessary to achieve O(1/ε) queries. For the important case of d = 1 we also give a new non-adaptive algorithm, improving the previous O(1/ε2) queries to O(log2(1/ε) / ε). For testing if a matrix has rank at most d in the row model, we prove an Ω(d/ε) lower bound on the number of rows that need to be read, even for adaptive algorithms. Our lower bound matches a non-adaptive upper bound of Krauthgamer and Sasson. For testing if a matrix has stable rank at most d in the row model or requires changing an ε/d-fraction of its rows in order to have stable rank at most d, we prove that reading θ(d/ε2) rows is necessary and sufficient. We also give an empirical evaluation of our rank and stable rank algorithms on real and synthetic datasets.
【Keywords】: dimensionality reduction; principal component analysis; property testing; robustness; stable rank
【Paper Link】 【Pages】:701-710
【Authors】: Bryan Perozzi ; Rami Al-Rfou' ; Steven Skiena
【Abstract】: We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk's latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk's representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk's representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection.
【Keywords】: deep learning; latent representations; learning with partial labels; network classification; online learning; social networks
【Paper Link】 【Pages】:711-720
【Authors】: Sunita Sarawagi ; Soumen Chakrabarti
【Abstract】: Over 40% of columns in hundreds of millions of Web tables contain numeric quantities. Tables are a richer source of structured knowledge than free text. We harness Web tables to answer queries whose target is a quantity with natural variation, such as net worth of zuckerburg, battery life of ipad, half life of plutonium, and calories in pizza. Our goal is to respond to such queries with a ranked list of quantity distributions, suitably represented. Apart from the challenges of informal schema and noisy extractions, which have been known since tables were used for non-quantity information extraction, we face additional problems of noisy number formats, as well as unit specifications that are often contextual and ambiguous. Early "hardening" of extraction decisions at a table level leads to poor accuracy. Instead, we use a probabilistic context free grammar (PCFG) based unit extractor on the tables, and retain several top-scoring extractions of quantity and numerals. Then we inject these into a new collective inference framework that makes global decisions about the relevance of candidate table snippets, the interpretation of the query's target quantity type, the value distributions to be ranked and presented, and the degree of consensus that can be built to support the proposed quantity distributions. Experiments with over 25 million Web tables and 350 diverse queries show robust, large benefits from our quantity catalog, unit extractor, and collective inference.
【Keywords】: collective inference; number and unit extraction; web tables
【Paper Link】 【Pages】:721-730
【Authors】: Bin Wu ; ErHeng Zhong ; Ben Tan ; Andrew Horner ; Qiang Yang
【Abstract】: Time-sync video tagging aims to automatically generate tags for each video shot. It can improve the user's experience in previewing a video's timeline structure compared to traditional schemes that tag an entire video clip. In this paper, we propose a new application which extracts time-sync video tags by automatically exploiting crowdsourced comments from video websites such as Nico Nico Douga, where videos are commented on by online crowd users in a time-sync manner. The challenge of the proposed application is that users with bias interact with one another frequently and bring noise into the data, while the comments are too sparse to compensate for the noise. Previous techniques are unable to handle this task well as they consider video semantics independently, which may overfit the sparse comments in each shot and thus fail to provide accurate modeling. To resolve these issues, we propose a novel temporal and personalized topic model that jointly considers temporal dependencies between video semantics, users' interaction in commenting, and users' preferences as prior knowledge. Our proposed model shares knowledge across video shots via users to enrich the short comments, and peels off user interaction and user bias to solve the noisy-comment problem. Log-likelihood analyses and user studies on large datasets show that the proposed model outperforms several state-of-the-art baselines in video tagging quality. Case studies also demonstrate our model's capability of extracting tags from the crowdsourced short and noisy comments.
【Keywords】: crowdsourcing; temporal and personalized model; topic modeling; video tagging
【Paper Link】 【Pages】:731-740
【Authors】: Liangda Li ; Hongbo Deng ; Anlei Dong ; Yi Chang ; Hongyuan Zha
【Abstract】: We consider a search task as a set of queries that serve the same user information need. Analyzing search tasks from user query streams plays an important role in building a set of modern tools to improve search engine performance. In this paper, we propose a probabilistic method for identifying and labeling search tasks based on the following intuitive observations: queries that are issued temporally close by users in many sequences of queries are likely to belong to the same search task, meanwhile, different users having the same information needs tend to submit topically coherent search queries. To capture the above intuitions, we directly model query temporal patterns using a special class of point processes called Hawkes processes, and combine topic models with Hawkes processes for simultaneously identifying and labeling search tasks. Essentially, Hawkes processes utilize their self-exciting properties to identify search tasks if influence exists among a sequence of queries for individual users, while the topic model exploits query co-occurrence across different users to discover the latent information needed for labeling search tasks. More importantly, there is mutual reinforcement between Hawkes processes and the topic model in the unified model that enhances the performance of both. We evaluate our method based on both synthetic data and real-world query log data. In addition, we also apply our model to query clustering and search task identification. By comparing with state-of-the-art methods, the results demonstrate that the improvement in our proposed approach is consistent and promising.
【Keywords】: hawkes process; latent dirichlet allocation; search task; variational inference
【Paper Link】 【Pages】:741-750
【Authors】: Oleksandr Polozov ; Sumit Gulwani
【Abstract】: We show how to programmatically model processes that humans use when extracting answers to queries (e.g., "Who invented typewriter?", "List of Washington national parks") from semi-structured Web pages returned by a search engine. This modeling enables various applications including automating repetitive search tasks, and helping search engine developers design micro-segments of factoid questions. We describe the design and implementation of a domain-specific language that enables extracting data from a webpage based on its structure, visual layout, and linguistic patterns. We also describe an algorithm to rank multiple answers extracted from multiple webpages. On 100,000+ queries (across 7 micro-segments) obtained from Bing logs, our system LaSEWeb answered queries with an average recall of 71%. Also, the desired answer(s) were present in top-3 suggestions for 95%+ cases.
【Keywords】: domain-specific languages; question answering.; semi-structured data; structure extraction; web programming
【Paper Link】 【Pages】:751-760
【Authors】: Shangsong Liang ; Zhaochun Ren ; Maarten de Rijke
【Abstract】: This paper is concerned with the problem of personalized diversification of search results, with the goal of enhancing the performance of both plain diversification and plain personalization algorithms. In previous work, the problem has mainly been tackled by means of unsupervised learning. To further enhance the performance, we propose a supervised learning strategy. Specifically, we set up a structured learning framework for conducting supervised personalized diversification, in which we add features extracted directly from the tokens of documents and those utilized by unsupervised personalized diversification algorithms, and, importantly, those generated from our proposed user-interest latent Dirichlet topic model. Based on our proposed topic model whether a document can cater to a user's interest can be estimated in our learning strategy. We also define two constraints in our structured learning framework to ensure that search results are both diversified and consistent with a user's interest. We conduct experiments on an open personalized diversification dataset and find that our supervised learning strategy outperforms unsupervised personalized diversification methods as well as other plain personalization and plain diversification methods.
【Keywords】: ad hoc retrieval; diversity; personalization; structured svms
【Paper Link】 【Pages】:761-770
【Authors】: Pinghua Gong ; Jiayu Zhou ; Wei Fan ; Jieping Ye
【Abstract】: Multi-task feature learning has been proposed to improve the generalization performance by learning the shared features among multiple related tasks and it has been successfully applied to many real-world problems in machine learning, data mining, computer vision and bioinformatics. Most existing multi-task feature learning models simply assume a common noise level for all tasks, which may not be the case in real applications. Recently, a Calibrated Multivariate Regression (CMR) model has been proposed, which calibrates different tasks with respect to their noise levels and achieves superior prediction performance over the non-calibrated one. A major challenge is how to solve the CMR model efficiently as it is formulated as a composite optimization problem consisting of two non-smooth terms. In this paper, we propose a variant of the calibrated multi-task feature learning formulation by including a squared norm regularizer. We show that the dual problem of the proposed formulation is a smooth optimization problem with a piecewise sphere constraint. The simplicity of the dual problem enables us to develop fast dual optimization algorithms with low per-iteration cost. We also provide a detailed convergence analysis for the proposed dual optimization algorithm. Empirical studies demonstrate that, the dual optimization algorithm quickly converges and it is much more efficient than the primal optimization algorithm. Moreover, the calibrated multi-task feature learning algorithms with and without the squared norm regularizer achieve similar prediction performance and both outperform the non-calibrated ones. Thus, the proposed variant not only enables us to develop fast optimization algorithms, but also keeps the superior prediction performance of the calibrated multi-task feature learning over the non-calibrated one.
【Keywords】: accelerated gradient descent; calibration; dual problem; feature selection; multi-task learning
【Paper Link】 【Pages】:771-780
【Authors】: Tianyi Zhou ; Dacheng Tao
【Abstract】: This paper proposes multi-task copula (MTC) that can handle a much wider class of tasks than mean regression with Gaussian noise in most former multi-task learning (MTL). While former MTL emphasizes shared structure among models, MTC aims at joint prediction to exploit inter-output correlation. Given input, the outputs of MTC are allowed to follow arbitrary joint continuous distribution. MTC captures the joint likelihood of multi-output by learning the marginal of each output firstly and then a sparse and smooth output dependency graph function. While the former can be achieved by classical MTL, learning graphs dynamically varying with input is quite a challenge. We address this issue by developing sparse graph regression (SpaGraphR), a non-parametric estimator incorporating kernel smoothing, maximum likelihood, and sparse graph structure to gain fast learning algorithm. It starts from a few seed graphs on a few input points, and then updates the graphs on other input points by a fast operator via coarse-to-fine propagation. Due to the power of copula in modeling semi-parametric distributions, SpaGraphR can model a rich class of dynamic non-Gaussian correlations. We show that MTC can address more flexible and difficult tasks that do not fit the assumptions of former MTL nicely, and can fully exploit their relatedness. Experiments on robotic control and stock price prediction justify its appealing performance in challenging MTL problems.
【Keywords】: copula; covariance regression; multi-task learning; semi-parametric model; sparse structured learning
【Paper Link】 【Pages】:781-790
【Authors】: Mianwei Zhou ; Kevin Chen-Chuan Chang
【Abstract】: For document scoring, although learning to rank and domain adaptation are treated as two different problems in previous works, we discover that they actually share the same challenge of adapting keyword contribution across different queries or domains. In this paper, we propose to study the cross-task document scoring problem, where a task refers to a query to rank or a domain to adapt to, as the first attempt to unify these two problems. Existing solutions for learning to rank and domain adaptation either leave the heavy burden of adapting keyword contribution to feature designers, or are difficult to be generalized. To resolve such limitations, we abstract the keyword scoring principle, pointing out that the contribution of a keyword essentially depends on, first, its importance to a task and, second, its importance to the document. For determining these two aspects of keyword importance, we further propose the concept of feature decoupling, suggesting using two types of easy-to-design features: meta-features and intra-features. Towards learning a scorer based on the decoupled features, we require that our framework fulfill inferred sparsity to eliminate the interference of noisy keywords, and employ distant supervision to tackle the lack of keyword labels. We propose the Tree-structured Boltzmann Machine (T-RBM), a novel two-stage Markov Network, as our solution. Experiments on three different applications confirm the effectiveness of T-RBM, which achieves significant improvement compared with four state-of-the-art baseline methods.
【Keywords】: domain adaptation; feature decoupling; learning to rank; tree-structured restricted boltzmann machine
【Paper Link】 【Pages】:791-800
【Authors】: Ying Wei ; Yangqiu Song ; Yi Zhen ; Bo Liu ; Qiang Yang
【Abstract】: Hashing has enjoyed a great success in large-scale similarity search. Recently, researchers have studied the multi-modal hashing to meet the need of similarity search across different types of media. However, most of the existing methods are applied to search across multi-views among which explicit bridge information is provided. Given a heterogeneous media search task, we observe that abundant multi-view data can be found on the Web which can serve as an auxiliary bridge. In this paper, we propose a Heterogeneous Translated Hashing (HTH) method with such auxiliary bridge incorporated not only to improve current multi-view search but also to enable similarity search across heterogeneous media which have no direct correspondence. HTH simultaneously learns hash functions embedding heterogeneous media into different Hamming spaces, and translators aligning these spaces. Unlike almost all existing methods that map heterogeneous data in a common Hamming space, mapping to different spaces provides more flexible and discriminative ability. We empirically verify the effectiveness and efficiency of our algorithm on two real world large datasets, one publicly available dataset of Flickr and the other MIRFLICKR-Yahoo Answers dataset.
【Keywords】: hash function learning; heterogeneous translated hashing; scalability
【Paper Link】 【Pages】:801-810
【Authors】: Chung-Yi Li ; Shou-De Lin
【Abstract】: Given two homogeneous rating matrices with some overlapped users/items whose mappings are unknown, this paper aims at answering two questions. First, can we identify the unknown mapping between the users and/or items? Second, can we further utilize the identified mappings to improve the quality of recommendation in either domain? Our solution integrates a latent space matching procedure and a refining process based on the optimization of prediction to identify the matching. Then, we further design a transfer-based method to improve the recommendation performance. Using both synthetic and real data, we have done extensive experiments given different real life scenarios to verify the effectiveness of our models. The code and other materials are available at http://www.csie.ntu.edu.tw/~r00922051/matching/
【Keywords】: collaborative filtering; matrix factorization; transfer learning
【Paper Link】 【Pages】:811-820
【Authors】: Wei Lu ; Stratis Ioannidis ; Smriti Bhagat ; Laks V. S. Lakshmanan
【Abstract】: People's interests are dynamically evolving, often affected by external factors such as trends promoted by the media or adopted by their friends. In this work, we model interest evolution through dynamic interest cascades: we consider a scenario where a user's interests may be affected by (a) the interests of other users in her social circle, as well as (b) suggestions she receives from a recommender system. In the latter case, we model user reactions through either attraction or aversion towards past suggestions. We study this interest evolution process, and the utility accrued by recommendations, as a function of the system's recommendation strategy. We show that, in steady state, the optimal strategy can be computed as the solution of a semi-definite program (SDP). Using datasets of user ratings, we provide evidence for the existence of aversion and attraction in real-life data, and show that our optimal strategy can lead to significantly improved recommendations over systems that ignore aversion and attraction.
【Keywords】: attraction; aversion; interest evolution; recommender systems
【Paper Link】 【Pages】:821-830
【Authors】: Xiang Ren ; Jialu Liu ; Xiao Yu ; Urvashi Khandelwal ; Quanquan Gu ; Lidan Wang ; Jiawei Han
【Abstract】: Citation recommendation is an interesting but challenging research problem. Most existing studies assume that all papers adopt the same criterion and follow the same behavioral pattern in deciding relevance and authority of a paper. However, in reality, papers have distinct citation behavioral patterns when looking for different references, depending on paper content, authors and target venues. In this study, we investigate the problem in the context of heterogeneous bibliographic networks and propose a novel cluster-based citation recommendation framework, called ClusCite, which explores the principle that citations tend to be softly clustered into interest groups based on multiple types of relationships in the network. Therefore, we predict each query's citations based on related interest groups, each having its own model for paper authority and relevance. Specifically, we learn group memberships for objects and the significance of relevance features for each interest group, while also propagating relative authority between objects, by solving a joint optimization problem. Experiments on both DBLP and PubMed datasets demonstrate the power of the proposed approach, with 17.68% improvement in Recall@50 and 9.57% growth in MRR over the best performing baseline.
【Keywords】: citation behavioral pattern; citation recommendation; clustering; heterogeneous information network
【Paper Link】 【Pages】:831-840
【Authors】: Defu Lian ; Cong Zhao ; Xing Xie ; Guangzhong Sun ; Enhong Chen ; Yong Rui
【Abstract】: Point-of-Interest (POI) recommendation has become an important means to help people discover attractive locations. However, extreme sparsity of user-POI matrices creates a severe challenge. To cope with this challenge, viewing mobility records on location-based social networks (LBSNs) as implicit feedback for POI recommendation, we first propose to exploit weighted matrix factorization for this task since it usually serves collaborative filtering with implicit feedback better. Besides, researchers have recently discovered a spatial clustering phenomenon in human mobility behavior on the LBSNs, i.e., individual visiting locations tend to cluster together, and also demonstrated its effectiveness in POI recommendation, thus we incorporate it into the factorization model. Particularly, we augment users' and POIs' latent factors in the factorization model with activity area vectors of users and influence area vectors of POIs, respectively. Based on such an augmented model, we not only capture the spatial clustering phenomenon in terms of two-dimensional kernel density estimation, but we also explain why the introduction of such a phenomenon into matrix factorization helps to deal with the challenge from matrix sparsity. We then evaluate the proposed algorithm on a large-scale LBSN dataset. The results indicate that weighted matrix factorization is superior to other forms of factorization models and that incorporating the spatial clustering phenomenon into matrix factorization improves recommendation performance.
【Keywords】: kernel density estimation; location recommendation; location-based social network; weighted matrix factorization
【Paper Link】 【Pages】:841-850
【Authors】: Stephan Günnemann ; Nikou Günnemann ; Christos Faloutsos
【Abstract】: Rating data is ubiquitous on websites such as Amazon, TripAdvisor, or Yelp. Since ratings are not static but given at various points in time, a temporal analysis of rating data provides deeper insights into the evolution of a product's quality. In this work, we tackle the following question: Given the time stamped rating data for a product or service, how can we detect the general rating behavior of users as well as time intervals where the ratings behave anomalous? We propose a Bayesian model that represents the rating data as sequence of categorical mixture models. In contrast to existing methods, our method does not require any aggregation of the input but it operates on the original time stamped data. To capture the dynamic effects of the ratings, the categorical mixtures are temporally constrained: Anomalies can occur in specific time intervals only and the general rating behavior should evolve smoothly over time. Our method automatically determines the intervals where anomalies occur, and it captures the temporal effects of the general behavior by using a state space model on the natural parameters of the categorical distributions. For learning our model, we propose an efficient algorithm combining principles from variational inference and dynamic programming. In our experimental study we show the effectiveness of our method and we present interesting discoveries on multiple real world datasets.
【Keywords】: anomaly detection; categorical mixtures; robust mining
【Paper Link】 【Pages】:851-860
【Authors】: Silei Xu ; John Chi-Shing Lui
【Abstract】: It is often crucial for manufacturers to decide what products to produce so that they can increase their market share in an increasingly fierce market. To decide which products to produce, manufacturers need to analyze the consumers' requirements and how consumers make their purchase decisions so that the new products will be competitive in the market. In this paper, we first present a general distance-based product adoption model to capture consumers' purchase behavior. Using this model, various distance metrics can be used to describe different real life purchase behavior. We then provide a learning algorithm to decide which set of distance metrics one should use when we are given some historical purchase data. Based on the product adoption model, we formalize the k most marketable products (or k-MMP) selection problem and formally prove that the problem is NP-hard. To tackle this problem, we propose an efficient greedy-based approximation algorithm with a provable solution guarantee. Using submodularity analysis, we prove that our approximation algorithm can achieve at least 63% of the optimal solution. We apply our algorithm on both synthetic datasets and real-world datasets (TripAdvisor.com), and show that our algorithm can easily achieve five or more orders of speedup over the exhaustive search and achieve about 96% of the optimal solution on average. Our experiments also show the significant impact of different distance metrics on the results, and how proper distance metrics can improve the accuracy of product selection.
【Keywords】: approximation algorithm; consumer behavior; model learning; product selection; submodular set function
【Paper Link】 【Pages】:861-870
【Authors】: Yongxin Tong ; Caleb Chen Cao ; Lei Chen
【Abstract】: In recent years, with the widespread usage of Web 2.0 techniques, crowdsourcing plays an important role in offering human intelligence in various service websites, such as Yahoo! Answer and Quora. With the increasing amount of crowd-oriented service data, an important task is to analyze latest hot topics and track topic evolution over time. However, the existing techniques in text mining cannot effectively work due to the unique structure of crowd-oriented service data, task-response pairs, which consists of the task and its corresponding responses. In particular, existing approaches become ineffective with the ever-increasing crowd-oriented service data that accumulate along the time. In this paper, we first study the problem of discovering topics over crowd-oriented service data. Then we propose a new probabilistic topic model, the Topic Crowd Service Model (TCS model), to effectively discover latent topics from massive crowd-oriented service data. In particular, in order to train TCS efficiently, we design a novel parameter inference algorithm, the Bucket Parameter Estimation (BPE), which utilizes belief propagation and a new sketching technique, called Pairwise Sketch (pSketch). Finally, we conduct extensive experiments to verify the effectiveness and efficiency of the TCS model and the BPE algorithm.
【Keywords】: crowd-oriented service; crowdsourcing; probabilistic topic model
【Paper Link】 【Pages】:871-880
【Authors】: Erich Schubert ; Michael Weiler ; Hans-Peter Kriegel
【Abstract】: Social media such as Twitter or weblogs are a popular source for live textual data. Much of this popularity is due to the fast rate at which this data arrives, and there are a number of global events - such as the Arab Spring - where Twitter is reported to have had a major influence. However, existing methods for emerging topic detection are often only able to detect events of a global magnitude such as natural disasters or celebrity deaths, and can monitor user-selected keywords or operate on a curated set of hashtags only. Interesting emerging topics may, however, be of much smaller magnitude and may involve the combination of two or more words that themselves are not unusually hot at that time. Our contributions to the detection of emerging trends are three-fold first of all, we propose a significance measure that can be used to detect emerging topics early, long before they become "hot tags", by drawing upon experience from outlier detection. Secondly, by using hash tables in a heavy-hitters type algorithm for establishing a noise baseline, we show how to track even all keyword pairs using only a fixed amount of memory. Finally, we aggregate the detected co-trends into larger topics using clustering approaches, as often as a single event will cause multiple word combinations to trend at the same time.
【Keywords】: hashing; outlier detection; text-mining; trend detection
【Paper Link】 【Pages】:881-890
【Authors】: Wray L. Buntine ; Swapnil Mishra
【Abstract】: In topic modelling, various alternative priors have been developed, for instance asymmetric and symmetric priors for the document-topic and topic-word matrices respectively, the hierarchical Dirichlet process prior for the document-topic matrix and the hierarchical Pitman-Yor process prior for the topic-word matrix. For information retrieval, language models exhibiting word burstiness are important. Indeed, this burstiness effect has been show to help topic models as well, and this requires additional word probability vectors for each document. Here we show how to combine these ideas to develop high-performing non-parametric topic models exhibiting burstiness based on standard Gibbs sampling. Experiments are done to explore the behavior of the models under different conditions and to compare the algorithms with previously published. The full non-parametric topic models with burstiness are only a small factor slower than standard Gibbs sampling for LDA and require double the memory, making them very competitive. We look at the comparative behaviour of different models and present some experimental insights.
【Keywords】: experimental results; non-parametric prior; text; topic modelling
【Paper Link】 【Pages】:891-900
【Authors】: Aaron Q. Li ; Amr Ahmed ; Sujith Ravi ; Alexander J. Smola
【Abstract】: Inference in topic models typically involves a sampling step to associate latent variables with observations. Unfortunately the generative model loses sparsity as the amount of data increases, requiring O(k) operations per word for k topics. In this paper we propose an algorithm which scales linearly with the number of actually instantiated topics kd in the document. For large document collections and in structured hierarchical models kd ll k. This yields an order of magnitude speedup. Our method applies to a wide variety of statistical models such as PDP [16,4] and HDP [19]. At its core is the idea that dense, slowly changing distributions can be approximated efficiently by the combination of a Metropolis-Hastings step, use of sparsity, and amortized constant time sampling via Walker's alias method.
【Keywords】: alias method; sampling; scalability; topic models
【Paper Link】 【Pages】:901-910
【Authors】: Mikalai Tsytsarau ; Themis Palpanas ; Malú Castellanos
【Abstract】: The analysis of social sentiment expressed on the Web is becoming increasingly relevant to a variety of applications, and it is important to understand the underlying mechanisms which drive the evolution of sentiments in one way or another, in order to be able to predict these changes in the future. In this paper, we study the dynamics of news events and their relation to changes of sentiment expressed on relevant topics. We propose a novel framework, which models the behavior of news and social media in response to events as a convolution between event's importance and media response function, specific to media and event type. This framework is suitable for detecting time and duration of events, as well as their impact and dynamics, from time series of publication volume. These data can greatly enhance events analysis; for instance, they can help distinguish important events from unimportant, or predict sentiment and stock market shifts. As an example of such application, we extracted news events for a variety of topics and then correlated this data with the corresponding sentiment time series, revealing the connection between sentiment shifts and event dynamics.
【Keywords】: information spread; news dynamics; sentiment analysis; social media
【Paper Link】 【Pages】:911-920
【Authors】: Qian Xiao ; Rui Chen ; Kian-Lee Tan
【Abstract】: Information networks, such as social media and email networks, often contain sensitive information. Releasing such network data could seriously jeopardize individual privacy. Therefore, we need to sanitize network data before the release. In this paper, we present a novel data sanitization solution that infers a network's structure in a differentially private manner. We observe that, by estimating the connection probabilities between vertices instead of considering the observed edges directly, the noise scale enforced by differential privacy can be greatly reduced. Our proposed method infers the network structure by using a statistical hierarchical random graph (HRG) model. The guarantee of differential privacy is achieved by sampling possible HRG structures in the model space via Markov chain Monte Carlo (MCMC). We theoretically prove that the sensitivity of such inference is only O(log n), where n is the number of vertices in a network. This bound implies less noise to be injected than those of existing works. We experimentally evaluate our approach on four real-life network datasets and show that our solution effectively preserves essential network structural properties like degree distribution, shortest path length distribution and influential nodes.
【Keywords】: differential privacy; network data; structural inference
【Paper Link】 【Pages】:921-930
【Authors】: Wentian Lu ; Gerome Miklau
【Abstract】: The effective analysis of social networks and graph-structured data is often limited by the privacy concerns of individuals whose data make up these networks. Differential privacy offers individuals a rigorous and appealing guarantee of privacy. But while differentially private algorithms for computing basic graph properties have been proposed, most graph modeling tasks common in the data mining community cannot yet be carried out privately. In this work we propose algorithms for privately estimating the parameters of exponential random graph models (ERGMs). We break the estimation problem into two steps: computing private sufficient statistics, then using them to estimate the model parameters. We consider specific alternating statistics that are in common use for ERGM models and describe a method for estimating them privately by adding noise proportional to a high-confidence bound on their local sensitivity. In addition, we propose an estimation algorithm that considers the noise distribution of the private statistics and offers better accuracy than performing standard parameter estimation using the private statistics.
【Keywords】: differential privacy; exponential random graph model
【Paper Link】 【Pages】:931-940
【Authors】: Jaewoo Lee ; Christopher W. Clifton
【Abstract】: Frequent itemset mining is a core data mining task and has been studied extensively. Although by their nature, frequent itemsets are aggregates over many individuals and would not seem to pose a privacy threat, an attacker with strong background information can learn private individual information from frequent itemsets. This has lead to differentially private frequent itemset mining, which protects privacy by giving inexact answers. We give an approach that first identifies top-k frequent itemsets, then uses them to construct a compact, differentially private FP-tree. Once the noisy FP-tree is built, the (privatized) support of all frequent itemsets can be derived from it without access to the original data. Experimental results show that the proposed algorithm gives substantially better results than prior approaches, especially for high levels of privacy.
【Keywords】: differential privacy; fp-tree; frequent itemset
【Paper Link】 【Pages】:941-950
【Authors】: Meng Jiang ; Peng Cui ; Alex Beutel ; Christos Faloutsos ; Shiqiang Yang
【Abstract】: Given a directed graph of millions of nodes, how can we automatically spot anomalous, suspicious nodes, judging only from their connectivity patterns? Suspicious graph patterns show up in many applications, from Twitter users who buy fake followers, manipulating the social network, to botnet members performing distributed denial of service attacks, disturbing the network traffic graph. We propose a fast and effective method, CatchSync, which exploits two of the tell-tale signs left in graphs by fraudsters: (a) synchronized behavior: suspicious nodes have extremely similar behavior pattern, because they are often required to perform some task together (such as follow the same user); and (b) rare behavior: their connectivity patterns are very different from the majority. We introduce novel measures to quantify both concepts ("synchronicity" and "normality") and we propose a parameter-free algorithm that works on the resulting synchronicity-normality plots. Thanks to careful design, CatchSync has the following desirable properties: (a) it is scalable to large datasets, being linear on the graph size; (b) it is parameter free; and (c) it is side-information-oblivious: it can operate using only the topology, without needing labeled data, nor timing information, etc., while still capable of using side information, if available. We applied CatchSync on two large, real datasets 1-billion-edge Twitter social graph and 3-billion-edge Tencent Weibo social graph, and several synthetic ones; CatchSync consistently outperforms existing competitors, both in detection accuracy by 36% on Twitter and 20% on Tencent Weibo, as well as in speed.
【Keywords】: anomalous behavior; outlier detection; zombie follower
【Paper Link】 【Pages】:951-960
【Authors】: Hengshu Zhu ; Hui Xiong ; Yong Ge ; Enhong Chen
【Abstract】: With the rapid prevalence of smart mobile devices, the number of mobile Apps available has exploded over the past few years. To facilitate the choice of mobile Apps, existing mobile App recommender systems typically recommend popular mobile Apps to mobile users. However, mobile Apps are highly varied and often poorly understood, particularly for their activities and functions related to privacy and security. Therefore, more and more mobile users are reluctant to adopt mobile Apps due to the risk of privacy invasion and other security concerns. To fill this crucial void, in this paper, we propose to develop a mobile App recommender system with privacy and security awareness. The design goal is to equip the recommender system with the functionality which allows to automatically detect and evaluate the security risk of mobile Apps. Then, the recommender system can provide App recommendations by considering both the Apps' popularity and the users' security preferences. Specifically, a mobile App can lead to security risk because insecure data access permissions have been implemented in this App. Therefore, we first develop the techniques to automatically detect the potential security risk for each mobile App by exploiting the requested permissions. Then, we propose a flexible approach based on modern portfolio theory for recommending Apps by striking a balance between the Apps' popularity and the users' security concerns, and build an App hash tree to efficiently recommend Apps. Finally, we evaluate our approach with extensive experiments on a large-scale data set collected from Google Play. The experimental results clearly validate the effectiveness of our approach.
【Keywords】: mobile apps; recommender systems; security and privacy
【Paper Link】 【Pages】:967-976
【Authors】: Xiaomin Fang ; Rong Pan
【Abstract】: As tensors provide a natural representation for the higher-order relations, tensor factorization techniques such as Tucker decomposition and CANDECOMP/PARAFAC decomposition have been applied to many fields. Tucker decomposition has strong capacity of expression, but the time complexity is unpractical for the large-scale real problems. On the other hand, CANDECOMP/PARAFAC decomposition is linear in the feature dimensionality, but the assumption is so strong that it abandons some important information. Besides, both of TD and CP decompose a tensor into several factor matrices. However, the factor matrices are not natural for the representation of the higher-order relations. To overcome these problems, we propose a near linear tensor factorization approach, which decompose a tensor into factor tensors in order to model the higher-order relations, without loss of important information. In addition, to reduce the time complexity and the number of the parameters, we decompose each slice of the factor tensors into two smaller matrices. We conduct experiments on both synthetic datasets and real datasets. The experimental results on the synthetic datasets validate that our model has strong capacity of expression. The results on the real datasets show that our approach outperforms the state-of-the-art tensor factorization methods.
【Keywords】: DTT; tensor decomposition
【Paper Link】 【Pages】:977-986
【Authors】: Feiping Nie ; Xiaoqian Wang ; Heng Huang
【Abstract】: Many clustering methods partition the data groups based on the input data similarity matrix. Thus, the clustering results highly depend on the data similarity learning. Because the similarity measurement and data clustering are often conducted in two separated steps, the learned data similarity may not be the optimal one for data clustering and lead to the suboptimal results. In this paper, we propose a novel clustering model to learn the data similarity matrix and clustering structure simultaneously. Our new model learns the data similarity matrix by assigning the adaptive and optimal neighbors for each data point based on the local distances. Meanwhile, the new rank constraint is imposed to the Laplacian matrix of the data similarity matrix, such that the connected components in the resulted similarity matrix are exactly equal to the cluster number. We derive an efficient algorithm to optimize the proposed challenging problem, and show the theoretical analysis on the connections between our method and the K-means clustering, and spectral clustering. We also further extend the new clustering model for the projected clustering to handle the high-dimensional data. Extensive empirical results on both synthetic data and real-world benchmark data sets show that our new clustering methods consistently outperforms the related clustering approaches.
【Keywords】: adaptive neighbors; block diagonal similarity matrix; clustering; clustering with dimensionality reduction
【Paper Link】 【Pages】:987-996
【Authors】: Xilun Chen ; K. Selçuk Candan
【Abstract】: Singular Value Decomposition (SVD) is computationally costly and therefore a naive implementation does not scale to the needs of scenarios where data evolves continuously. While there are various on-line analysis and incremental decomposition techniques, these may not accurately represent the data or may be slow for the needs of many applications. To address these challenges, in this paper, we propose a Low-rank, Windowed, Incremental SVD (LWI-SVD) algorithm, which (a) leverages efficient and accurate low-rank approximations to speed up incremental SVD updates and (b) uses a window-based approach to aggregate multiple incoming updates (insertions or deletions of rows and columns) and, thus, reduces on- line processing costs. We also present an LWI-SVD with restarts (LWI2-SVD) algorithm which leverages a novel highly efficient partial reconstruction based change detection scheme to support timely refreshing of the decomposition with significant changes in the data and prevent accumulation of errors over time. Experiment results, including comparisons to other state of the art techniques on different data sets and under different parameter settings, confirm that LWI-SVD and LWI2-SVD are both efficient and accurate in maintaining decompositions.
【Keywords】: data streams; incremental singular value decomposition
【Paper Link】 【Pages】:997-1006
【Authors】: Dimitris S. Papailiopoulos ; Anastasios T. Kyrillidis ; Christos Boutsidis
【Abstract】: We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate". In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such sampling can be provably as accurate as its randomized counterparts, if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which many times matches or outperforms the state-of-the-art techniques.
【Keywords】: deterministic sampling; leverage scores; low-rank matrix approximation; power law distributions; subset selection
【Paper Link】 【Pages】:1007-1016
【Authors】: Tuan M. V. Le ; Hady Wirawan Lauw
【Abstract】: Visualization of high-dimensional data such as text documents is widely applicable. The traditional means is to find an appropriate embedding of the high-dimensional representation in a low-dimensional visualizable space. As topic modeling is a useful form of dimensionality reduction that preserves the semantics in documents, recent approaches aim for a visualization that is consistent with both the original word space, as well as the semantic topic space. In this paper, we address the semantic visualization problem. Given a corpus of documents, the objective is to simultaneously learn the topic distributions as well as the visualization coordinates of documents. We propose to develop a semantic visualization model that approximates L2-normalized data directly. The key is to associate each document with three representations: a coordinate in the visualization space, a multinomial distribution in the topic space, and a directional vector in a high-dimensional unit hypersphere in the word space. We join these representations in a unified generative model, and describe its parameter estimation through variational inference. Comprehensive experiments on real-life text datasets show that the proposed method outperforms the existing baselines on objective evaluation metrics for visualization quality and topic interpretability.
【Keywords】: L2-normalized vector; dimensionality reduction; generative model; semantic visualization; spherical semantic embedding; spherical space; topic model
【Paper Link】 【Pages】:1017-1026
【Authors】: Rakesh Agrawal ; Behzad Golshan ; Evimaria Terzi
【Abstract】: Given a class of large number of students, each exhibiting a different ability level, how can we group them into sections so that the overall gain for students is maximized? This question has been a topic of central concern and debate amongst social scientists and policy makers for a long time. We propose a framework for rigorously studying this question, taking a computational perspective. We present a formal definition of the grouping problem and investigate some of its variants. Such variants are determined by the desired number of groups as well as the definition of the gain for each student in the group. We focus on two natural instantiations of the gain function and we show that for both of them the problem of identifying a single group of students that maximizes the gain among its members can be solved in polynomial time. The corresponding partitioning problem, where the goal is to partition the students into non-overlapping groups appear to be much harder. However, the algorithms for the single-group version can be leveraged for solving the more complex partitioning problem. Our experiments with generated data coming from different distributions demonstrate that our algorithm is significantly better than the current strategies in vogue for dividing students in a class into sections.
【Keywords】: MOOC; clustering; groups; large classes; massive courses; partitioning; teams
【Paper Link】 【Pages】:1027-1036
【Authors】: Jingbo Shang ; Yu Zheng ; Wenzhu Tong ; Eric Chang ; Yong Yu
【Abstract】: This paper instantly infers the gas consumption and pollution emission of vehicles traveling on a city's road network in a current time slot, using GPS trajectories from a sample of vehicles (e.g., taxicabs). The knowledge can be used to suggest cost-efficient driving routes as well as identifying road segments where gas has been wasted significantly. The instant estimation of the emissions from vehicles can enable pollution alerts and help diagnose the root cause of air pollution in the long run. In our method, we first compute the travel speed of each road segment using the GPS trajectories received recently. As many road segments are not traversed by trajectories (i.e., data sparsity), we propose a Travel Speed Estimation (TSE) model based on a context-aware matrix factorization approach. TSE leverages features learned from other data sources, e.g., map data and historical trajectories, to deal with the data sparsity problem. We then propose a Traffic Volume Inference (TVI) model to infer the number of vehicles passing each road segment per minute. TVI is an unsupervised Bayesian Network that incorporates multiple factors, such as travel speed, weather conditions and geographical features of a road. Given the travel speed and traffic volume of a road segment, gas consumption and emissions can be calculated based on existing environmental theories. We evaluate our method based on extensive experiments using GPS trajectories generated by over 32,000 taxis in Beijing over a period of two months. The results demonstrate the advantages of our method over baselines, validating the contribution of its components and finding interesting discoveries for the benefit of society.
【Keywords】: gas consumption; spatial trajectories.; traffic pollution; urban computing
【Paper Link】 【Pages】:1037-1046
【Authors】: Karthik Raman ; Thorsten Joachims
【Abstract】: Massive Online Open Courses have the potential to revolutionize higher education with their wide outreach and accessibility, but they require instructors to come up with scalable alternates to traditional student evaluation. Peer grading -- having students assess each other -- is a promising approach to tackling the problem of evaluation at scale, since the number of "graders" naturally scales with the number of students. However, students are not trained in grading, which means that one cannot expect the same level of grading skills as in traditional settings. Drawing on broad evidence that ordinal feedback is easier to provide and more reliable than cardinal feedback [5, 38, 29, 9], it is therefore desirable to allow peer graders to make ordinal statements (e.g. "project X is better than project Y") and not require them to make cardinal statements (e.g. "project X is a B-"). Thus, in this paper we study the problem of automatically inferring student grades from ordinal peer feedback, as opposed to existing methods that require cardinal peer feedback. We formulate the ordinal peer grading problem as a type of rank aggregation problem, and explore several probabilistic models under which to estimate student grades and grader reliability. We study the applicability of these methods using peer grading data collected from a real class --- with instructor and TA grades as a baseline --- and demonstrate the efficacy of ordinal feedback techniques in comparison to existing cardinal peer grading methods. Finally, we compare these peer-grading techniques to traditional evaluation techniques.
【Keywords】: ordinal feedback; peer grading; rank aggregation
【Paper Link】 【Pages】:1047-1056
【Authors】: Yanjie Fu ; Hui Xiong ; Yong Ge ; Zijun Yao ; Yu Zheng ; Zhi-Hua Zhou
【Abstract】: It is traditionally a challenge for home buyers to understand, compare and contrast the investment values of real estates. While a number of estate appraisal methods have been developed to value real property, the performances of these methods have been limited by the traditional data sources for estate appraisal. However, with the development of new ways of collecting estate-related mobile data, there is a potential to leverage geographic dependencies of estates for enhancing estate appraisal. Indeed, the geographic dependencies of the value of an estate can be from the characteristics of its own neighborhood (individual), the values of its nearby estates (peer), and the prosperity of the affiliated latent business area (zone). To this end, in this paper, we propose a geographic method, named ClusRanking, for estate appraisal by leveraging the mutual enforcement of ranking and clustering power. ClusRanking is able to exploit geographic individual, peer, and zone dependencies in a probabilistic ranking model. Specifically, we first extract the geographic utility of estates from geography data, estimate the neighborhood popularity of estates by mining taxicab trajectory data, and model the influence of latent business areas via ClusRanking. Also, we use a linear model to fuse these three influential factors and predict estate investment values. Moreover, we simultaneously consider individual, peer and zone dependencies, and derive an estate-specific ranking likelihood as the objective function. Finally, we conduct a comprehensive evaluation with real-world estate related data, and the experimental results demonstrate the effectiveness of our method.
【Keywords】: clusranking; geographic dependencies; real estate appraisal
【Paper Link】 【Pages】:1057-1066
【Authors】: Bo Zong ; Yinghui Wu ; Jie Song ; Ambuj K. Singh ; Hasan Çam ; Jiawei Han ; Xifeng Yan
【Abstract】: Performance monitor software for data centers typically generates a great number of alert sequences. These alert sequences indicate abnormal network events. Given a set of observed alert sequences, it is important to identify the most critical alerts that are potentially the causes of others. While the need for mining critical alerts over large scale alert sequences is evident, most alert analysis techniques stop at modeling and mining the causal relations among the alerts. This paper studies the critical alert mining problem: Given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized. We show that the problem is intractable; therefore, we resort to approximation and heuristic algorithms. First, we develop an approximation algorithm that obtains a near-optimal alert set in quadratic time, and propose pruning techniques to improve its runtime performance. Moreover, we show a faster approximation exists, when the alerts follow certain causal structure. Second, we propose two fast heuristic algorithms based on tree sampling techniques. On real-life data, these algorithms identify a critical alert from up to 270,000 mined causal relations in 5 seconds; meanwhile, they preserve more than 80% of solution quality, and are up to 5,000 times faster than their approximation counterparts.
【Keywords】: critical alert mining; data center troubleshooting; data mining; graph mining; root cause analysis
【Paper Link】 【Pages】:1067-1076
【Authors】: Caleb Chen Cao ; Lei Chen ; Hosagrahar Visvesvaraya Jagadish
【Abstract】: We often care about people's degrees of belief about certain events: e.g. causality between an action and the outcomes, odds distribution among the outcome of a horse race and so on. It is well recognized that the best form to elicit opinion from human is probability distribution instead of simple voting, because the form of distribution retains the delicate information that an opinion expresses. In the past, opinion elicitation has relied on experts, who are expensive and not always available. More recently, crowdsourcing has gained prominence as an inexpensive way to get a great deal of human input. However, traditional crowdsourcing has primarily focused on issuing very simple (e.g. binary decision) tasks to the crowd. In this paper, we study how to use crowds for Opinion Elicitation. There are three major challenges to eliciting opinion information in the form of probability distributions: how to measure the quality of distribution; how to aggregate the distributions; and, how to strategically implement such a system. To address these challenges, we design and implement COPE Crowd-powered OPinion Elicitation market. COPE models crowdsourced work as a trading market, where the "workers" behave like "traders" to maximize their profit by presenting their opinion. Among the innovative features in this system, we design COPE updating to combine the multiple elicited distributions following a Bayesian scheme. Also to provide more flexibility while running COPE, we propose a series of efficient algorithms and a slope based strategy to manage the ending condition of COPE. We then demonstrate the implementation of COPE and report experimental results running on real commercial platform to demonstrate the practical value of this system.
【Keywords】: crowdsourcing; human computation; market; social media
【Paper Link】 【Pages】:1077-1086
【Authors】: Weinan Zhang ; Shuai Yuan ; Jun Wang
【Abstract】: In this paper we study bid optimisation for real-time bidding (RTB) based display advertising. RTB allows advertisers to bid on a display ad impression in real time when it is being generated. It goes beyond contextual advertising by motivating the bidding focused on user data and it is different from the sponsored search auction where the bid price is associated with keywords. For the demand side, a fundamental technical challenge is to automate the bidding process based on the budget, the campaign objective and various information gathered in runtime and in history. In this paper, the programmatic bidding is cast as a functional optimisation problem. Under certain dependency assumptions, we derive simple bidding functions that can be calculated in real time; our finding shows that the optimal bid has a non-linear relationship with the impression level evaluation such as the click-through rate and the conversion rate, which are estimated in real time from the impression level features. This is different from previous work that is mainly focused on a linear bidding function. Our mathematical derivation suggests that optimal bidding strategies should try to bid more impressions rather than focus on a small set of high valued impressions because according to the current RTB market data, compared to the higher evaluated impressions, the lower evaluated ones are more cost effective and the chances of winning them are relatively higher. Aside from the theoretical insights, offline experiments on a real dataset and online experiments on a production RTB system verify the effectiveness of our proposed optimal bidding strategies and the functional optimisation framework.
【Keywords】: bid optimisation; demand-side platform; display advertising; real-time bidding
【Paper Link】 【Pages】:1087-1096
【Authors】: Ting Wang ; Dashun Wang ; Fei Wang
【Abstract】: In many diverse settings, aggregated opinions of others play an increasingly dominant role in shaping individual decision making. One key prerequisite of harnessing the "crowd wisdom" is the independency of individuals' opinions, yet in real settings collective opinions are rarely simple aggregations of independent minds. Recent experimental studies document that disclosing prior collective opinions distorts individuals' decision making as well as their perceptions of quality and value, highlighting a fundamental disconnect from current modeling efforts: How to model social influence and its impact on systems that are constantly evolving? In this paper, we develop a mechanistic framework to model social influence of prior collective opinions (e.g., online product ratings) on subsequent individual decision making. We find our method successfully captures the dynamics of rating growth, helping us separate social influence bias from inherent values. Using large-scale longitudinal customer rating datasets, we demonstrate that our model not only effectively assesses social influence bias, but also accurately predicts long-term cumulative growth of ratings solely based on early rating trajectories. We believe our framework will play an increasingly important role as our understanding of social processes deepens. It promotes strategies to untangle manipulations and social biases and provides insights towards a more reliable and effective design of social platforms.
【Keywords】: crowd wisdom; herding effect; social influence
【Paper Link】 【Pages】:1097-1105
【Authors】: Olivier Chapelle
【Abstract】: In performance display advertising a key metric of a campaign effectiveness is its conversion rate -- the proportion of users who take a predefined action on the advertiser website, such as a purchase. Predicting this conversion rate is thus essential for estimating the value of an impression and can be achieved via machine learning. One difficulty however is that the conversions can take place long after the impression -- up to a month -- and this delayed feedback hinders the conversion modeling. We tackle this issue by introducing an additional model that captures the conversion delay. Intuitively, this probabilistic model helps determining whether a user that has not converted should be treated as a negative sample -- when the elapsed time is larger than the predicted delay -- or should be discarded from the training set -- when it is too early to tell. We provide experimental results on real traffic logs that demonstrate the effectiveness of the proposed model.
【Keywords】: conversion prediction; display advertising; machine learning
【Paper Link】 【Pages】:1106-1115
【Authors】: Meng Fang ; Dacheng Tao
【Abstract】:
In this paper, we study networked bandits', a new bandit problem where a set of interrelated arms varies over time and, given the contextual information that selects one arm, invokes other correlated arms. This problem remains under-investigated, in spite of its applicability to many practical problems. For instance, in social networks, an arm can obtain payoffs from both the selected user and its relations since they often share the content through the network. We examine whether it is possible to obtain multiple payoffs from several correlated arms based on the relationships. In particular, we formalize the networked bandit problem and propose an algorithm that considers not only the selected arm, but also the relationships between arms. Our algorithm is
optimism in face of uncertainty' style, in that it decides an arm depending on integrated confidence sets constructed from historical data. We analyze the performance in simulation experiments and on two real-world offline datasets. The experimental results demonstrate our algorithm's effectiveness in the networked bandit setting.
【Keywords】: exploration/exploitation dilemma; networked bandits; social network
【Paper Link】 【Pages】:1116-1125
【Authors】: Zhiyuan Chen ; Bing Liu
【Abstract】: Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. However, in practice, many document collections do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recently years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-link (meaning that two words should be in the same topic) and cannot-link (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
【Keywords】: lifelong learning; opinion aspect extraction; topic model
【Paper Link】 【Pages】:1126-1135
【Authors】: Zhe Chen ; Michael J. Cafarella
【Abstract】: Spreadsheets contain valuable data on many topics. However, spreadsheets are difficult to integrate with other data sources. Converting spreadsheet data to the relational model would allow data analysts to use relational integration tools. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on an undirected graphical model, our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from spreadsheets' graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. Our experiments show that a human can obtain the accurate extraction with just 31% of the manual operations required by a standard classification based technique on two real-world datasets.
【Keywords】: graphical model; information extraction; spreadsheets
【Paper Link】 【Pages】:1136-1145
【Authors】: Moritz Sudhof ; Andrés Goméz Emilsson ; Andrew L. Maas ; Christopher Potts
【Abstract】: Human emotional states are not independent but rather proceed along systematic paths governed by both internal, cognitive factors and external, social ones. For example, anxiety often transitions to disappointment, which is likely to sink to depression before rising to happiness and relaxation, and these states are conditioned by the states of others in our communities. Modeling these complex dependencies can yield insights into human emotion and support more powerful sentiment technologies. We develop a theory of conditional dependencies between emotional states in which emotions are characterized not only by valence (polarity) and arousal (intensity) but also by the role they play in state transitions and social relationships. We implement this theory using conditional random fields (CRFs) that synthesize textual information with information about previous emotional states and the emotional states of others. To assess the power of affective transitions, we evaluate our model in a collection of 'mood' updates from the Experience Project. To assess the power of social factors, we use a corpus of product reviews from a website in which the community dynamics encourage reviewers to be influenced by each other. In both settings, our models yield improvements of statistical and practical significance over ones that classify each text independently of its emotional or social context.
【Keywords】: multidimensional sentiment analysis; sentiment as social; sentiment transitions
【Paper Link】 【Pages】:1146-1155
【Authors】: Furong Li ; Mong-Li Lee ; Wynne Hsu
【Abstract】: The rapid growth of information sources on the Web has intensified the problem of data quality. In particular, the same real world entity may be described by different sources in various ways with overlapping information, and possibly conflicting or even erroneous values. In order to obtain a more complete and accurate picture for a real world entity, we need to collate the data records that refer to the entity, as well as correct any erroneous values. We observe that these two tasks are often tightly coupled: rectifying erroneous values will facilitate data collation, while linking similar records provides us with a clearer view of the data and additional evidence for error correction. In this paper, we present a framework called Comet that interleaves record linkage with error correction, taking into consideration the source reliabilities on various attributes. The proposed framework first utilizes confidence based matching to discriminate records in terms of ambiguity and source reliability. Then it performs adaptive matching to reduce the impact of erroneous values. Experiment results demonstrate that Comet outperforms the state-of-the-art techniques and is able to build complete and accurate profiles for real world entities.
【Keywords】: entity profiling; record linkage; source reliability; truth discovery
【Paper Link】 【Pages】:1156-1165
【Authors】: Anthony Fader ; Luke Zettlemoyer ; Oren Etzioni
【Abstract】: We consider the problem of open-domain question answering (Open QA) over massive knowledge bases (KBs). Existing approaches use either manually curated KBs like Freebase or KBs automatically extracted from unstructured text. In this paper, we present OQA, the first approach to leverage both curated and extracted KBs. A key technical challenge is designing systems that are robust to the high variability in both natural language questions and massive KBs. OQA achieves robustness by decomposing the full Open QA problem into smaller sub-problems including question paraphrasing and query reformulation. OQA solves these sub-problems by mining millions of rules from an unlabeled question corpus and across multiple KBs. OQA then learns to integrate these rules by performing discriminative training on question-answer pairs using a latent-variable structured perceptron algorithm. We evaluate OQA on three benchmark question sets and demonstrate that it achieves up to twice the precision and recall of a state-of-the-art Open QA system.
【Keywords】: algorithms; experimentation
【Paper Link】 【Pages】:1166-1175
【Authors】: Feng Chen ; Daniel B. Neill
【Abstract】: Event detection in social media is an important but challenging problem. Most existing approaches are based on burst detection, topic modeling, or clustering techniques, which cannot naturally model the implicit heterogeneous network structure in social media. As a result, only limited information, such as terms and geographic locations, can be used. This paper presents Non-Parametric Heterogeneous Graph Scan (NPHGS), a new approach that considers the entire heterogeneous network for event detection: we first model the network as a "sensor" network, in which each node senses its "neighborhood environment" and reports an empirical p-value measuring its current level of anomalousness for each time interval (e.g., hour or day). Then, we efficiently maximize a nonparametric scan statistic over connected subgraphs to identify the most anomalous network clusters. Finally, the event represented by each cluster is summarized with information such as type of event, geographical locations, time, and participants. As a case study, we consider two applications using Twitter data, civil unrest event detection and rare disease outbreak detection, and present empirical evaluations illustrating the effectiveness and efficiency of our proposed approach.
【Keywords】: event detection and forecasting; heterogeneous graphs; non-parametric scan statistics; social media
【Paper Link】 【Pages】:1176-1185
【Authors】: Polina Rozenshtein ; Aris Anagnostopoulos ; Aristides Gionis ; Nikolaj Tatti
【Abstract】: With the fast growth of smart devices and social networks, a lot of computing systems collect data that record different types of activities. An important computational challenge is to analyze these data, extract patterns, and understand activity trends. We consider the problem of mining activity networks to identify interesting events, such as a big concert or a demonstration in a city, or a trending keyword in a user community in a social network. We define an event to be a subset of nodes in the network that are close to each other and have high activity levels. We formalize the problem of event detection using two graph-theoretic formulations. The first one captures the compactness of an event using the sum of distances among all pairs of the event nodes. We show that this formulation can be mapped to the maxcut problem, and thus, it can be solved by applying standard semidefinite programming techniques. The second formulation captures compactness using a minimum-distance tree. This formulation leads to the prize-collecting Steiner-tree problem, which we solve by adapting existing approximation algorithms. For the two problems we introduce, we also propose efficient and effective greedy approaches and we prove performance guarantees for one of them. We experiment with the proposed algorithms on real datasets from a public bicycling system and a geolocation-enabled social network dataset collected from twitter. The results show that our methods are able to detect meaningful events.
【Keywords】: event detection; maximum cut; prize-collecting steiner tree; submodular function maximization
【Paper Link】 【Pages】:1186-1195
【Authors】: Meng Jiang ; Peng Cui ; Fei Wang ; Xinran Xu ; Wenwu Zhu ; Shiqiang Yang
【Abstract】: Behavioral pattern discovery is increasingly being studied to understand human behavior and the discovered patterns can be used in many real world applications such as web search, recommender system and advertisement targeting. Traditional methods usually consider the behaviors as simple user and item connections, or represent them with a static model. In real world, however, human behaviors are actually complex and dynamic: they include correlations between user and multiple types of objects and also continuously evolve along time. These characteristics cause severe data sparsity and computational complexity problem, which pose great challenge to human behavioral analysis and prediction. In this paper, we propose a Flexible Evolutionary Multi-faceted Analysis (FEMA) framework for both behavior prediction and pattern mining. FEMA utilizes a flexible and dynamic factorization scheme for analyzing human behavioral data sequences, which can incorporate various knowledge embedded in different object domains to alleviate the sparsity problem. We give approximation algorithms for efficiency, where the bound of approximation loss is theoretically proved. We extensively evaluate the proposed method in two real datasets. For the prediction of human behaviors, the proposed FEMA significantly outperforms other state-of-the-art baseline methods by 17.4%. Moreover, FEMA is able to discover quite a number of interesting multi-faceted temporal patterns on human behaviors with good interpretability. More importantly, it can reduce the run time from hours to minutes, which is significant for industry to serve real-time applications.
【Keywords】: behavior modeling; behavioral pattern; evolutionary analysis; flexible regularizers; tensor factorization
【Paper Link】 【Pages】:1196-1205
【Authors】: Behzad Golshan ; Theodoros Lappas ; Evimaria Terzi
【Abstract】: Team formation has been long recognized as a natural way to acquire a diverse pool of useful skills, by combining experts with complementary talents. This allows organizations to effectively complete beneficial projects from different domains, while also helping individual experts position themselves and succeed in highly competitive job markets. Here, we assume a collection of projects ensuremath{P}, where each project requires a certain set of skills, and yields a different benefit upon completion. We are further presented with a pool of experts ensuremath{X}, where each expert has his own skillset and compensation demands. Then, we study the problem of hiring a cluster of experts T ⊆ X, so that the overall compensation (cost) does not exceed a given budget B, and the total benefit of the projects that this team can collectively cover is maximized. We refer to this as the ClusterHire problem. Our work presents a detailed analysis of the computational complexity and hardness of approximation of the problem, as well as heuristic, yet effective, algorithms for solving it in practice. We demonstrate the efficacy of our approaches through experiments on real datasets of experts, and demonstrate their advantage over intuitive baselines. We also explore additional variants of the fundamental problem formulation, in order to account for constraints and considerations that emerge in realistic cluster-hiring scenarios. All variants considered in this paper have immediate applications in the cluster hiring process, as it emerges in the context of different organizational settings.
【Keywords】: online marketplaces; team formation
【Paper Link】 【Pages】:1206-1215
【Authors】: Keqian Li ; Wei Lu ; Smriti Bhagat ; Laks V. S. Lakshmanan ; Cong Yu
【Abstract】: Online platforms, such as Meetup and Plancast, have recently become popular for planning gatherings and event organization. However, there is a surprising lack of studies on how to effectively and efficiently organize social events for a large group of people through such platforms. In this paper, we study the key computational problem involved in organization of social events, to our best knowledge, for the first time. We propose the Social Event Organization (SEO) problem as one of assigning a set of events for a group of users to attend, where the users are socially connected with each other and have innate levels of interest in those events. As a first step toward Social Event Organization, we introduce a formal definition of a restricted version of the problem and show that it is NP-hard and is hard to approximate. We propose efficient heuristic algorithms that improve upon simple greedy algorithms by incorporating the notion of phantom events and by using look-ahead estimation. Using synthetic datasets and three real datasets including those from the platforms Meetup and Plancast, we experimentally demonstrate that our greedy heuristics are scalable and furthermore outperform the baseline algorithms significantly in terms of achieving superior social welfare.
【Keywords】: assignment problems; event organization; social networks
【Paper Link】 【Pages】:1216-1225
【Authors】: Varun R. Embar ; Rama Kumar Pasumarthi ; Indrajit Bhattacharya
【Abstract】: The analysis of network connections, diffusion processes and cascades requires evaluating properties of the diffusion network. Properties of interest often involve variables that are not explicitly observed in real world diffusions. Connection strengths in the network and diffusion paths of infections over the network are examples of such hidden variables. These hidden variables therefore need to be estimated for these properties to be evaluated. In this paper, we propose and study this novel problem in a Bayesian framework by capturing the posterior distribution of these hidden variables given the observed cascades, and computing the expectation of these properties under this posterior distribution. We identify and characterize interesting network diffusion properties whose expectations can be computed exactly and efficiently, either wholly or in part. For properties that are not `nice' in this sense, we propose a Gibbs Sampling framework for Monte Carlo integration. In detailed experiments using various network diffusion properties over multiple synthetic and real datasets, we demonstrate that the proposed approach is significantly more accurate than a frequentist plug-in baseline. We also propose a map-reduce implementation of our framework and demonstrate that this can analyze cascades with millions of infections in minutes.
【Keywords】: bayesian analysis; gibbs sampling; information cascades; networks of diffusion; social influence analysis
【Paper Link】 【Pages】:1226-1235
【Authors】: Elias Boutros Khalil ; Bistra N. Dilkina ; Le Song
【Abstract】: How can we optimize the topology of a networked system to bring a flu under control, propel a video to popularity, or stifle a network malware in its infancy? Previous work on information diffusion has focused on modeling the diffusion dynamics and selecting nodes to maximize/minimize influence. Only a paucity of recent studies have attempted to address the network modification problems, where the goal is to either facilitate desirable spreads or curtail undesirable ones by adding or deleting a small subset of network nodes or edges. In this paper, we focus on the widely studied linear threshold diffusion model, and prove, for the first time, that the network modification problems under this model have supermodular objective functions. This surprising property allows us to design efficient data structures and scalable algorithms with provable approximation guarantees, despite the hardness of the problems in question. Both the time and space complexities of our algorithms are linear in the size of the network, which allows us to experiment with millions of nodes and edges. We show that our algorithms outperform an array of heuristics in terms of their effectiveness in controlling diffusion processes, often beating the next best by a significant margin.
【Keywords】: approximation; diffusion networks; network optimization; supermodularity
【Paper Link】 【Pages】:1236-1245
【Authors】: Takeshi Kurashima ; Tomoharu Iwata ; Noriko Takaya ; Hiroshi Sawada
【Abstract】: The diffusion of information, rumors, and diseases are assumed to be probabilistic processes over some network structure. An event starts at one node of the network, and then spreads to the edges of the network. In most cases, the underlying network structure that generates the diffusion process is unobserved, and we only observe the times at which each node is altered/influenced by the process. This paper proposes a probabilistic model for inferring the diffusion network, which we call Probabilistic Latent Network Visualization (PLNV); it is based on cascade data, a record of observed times of node influence. An important characteristic of our approach is to infer the network by embedding it into a low-dimensional visualization space. We assume that each node in the network has latent coordinates in the visualization space, and diffusion is more likely to occur between nodes that are placed close together. Our model uses maximum a posteriori estimation to learn the latent coordinates of nodes that best explain the observed cascade data. The latent coordinates of nodes in the visualization space can 1) enable the system to suggest network layouts most suitable for browsing, and 2) lead to high accuracy in inferring the underlying network when analyzing the diffusion process of new or rare information, rumors, and disease.
【Keywords】: diffusion network; network visualization; survival analysis
【Paper Link】 【Pages】:1246-1255
【Authors】: Senzhang Wang ; Xia Hu ; Philip S. Yu ; Zhoujun Li
【Abstract】: Inferring diffusion networks from traces of cascades has been extensively studied to better understand information diffusion in many domains. A widely used assumption in previous work is that the diffusion network is homogenous and diffusion processes of cascades follow the same pattern. However, in social media, users may have various interests and the connections among them are usually multi-faceted. In addition, different cascades normally diffuse at different speeds and spread to diverse scales, and hence show various diffusion patterns. It is challenging for traditional models to capture the heterogeneous user interactions and diverse patterns of cascades in social media. In this paper, we investigate a novel problem of inferring multi-aspect diffusion networks with multi-pattern cascades. In particular, we study the effects of various diffusion patterns on the information diffusion process by analyzing users' retweeting behavior on a microblogging dataset. By incorporating aspect-level user interactions and various diffusion patterns, a new model for inferring Multi-aspect transmission Rates between users using Multi-pattern cascades (MMRate) is proposed. We also provide an Expectation Maximization algorithm to effectively estimate the parameters. Experimental results on both synthetic and microblogging datasets demonstrate the superior performance of our approach over the state-of-the-art methods in inferring multi-aspect diffusion networks.
【Keywords】: graph mining; information diffusion; social network
【Paper Link】 【Pages】:1256-1265
【Authors】: Xinran He ; David Kempe
【Abstract】: The present article serves as an erratum to our paper of the same title, which was presented and published in the KDD 2014 conference. In that article, we claimed falsely that the objective function defined in Section 1.4 is non-monotone submodular. We are deeply indebted to Debmalya Mandal, Jean Pouget-Abadie and Yaron Singer for bringing to our attention a counter-example to that claim. Subsequent to becoming aware of the counter-example, we have shown that the objective function is in fact NP-hard to approximate to within a factor of O(n1-ε) for any ε > 0. In an attempt to fix the record, the present article combines the problem motivation, models, and experimental results sections from the original incorrect article with the new hardness result. We would like readers to only cite and use this version (which will remain an unpublished note) instead of the incorrect conference version.
【Keywords】: influence maximization; noise; robust optimization; submodular optimization; uncertainty
【Paper Link】 【Pages】:1266-1275
【Authors】: Nicola Barbieri ; Francesco Bonchi ; Giuseppe Manco
【Abstract】: User recommender systems are a key component in any on-line social networking platform: they help the users growing their network faster, thus driving engagement and loyalty. In this paper we study link prediction with explanations for user recommendation in social networks. For this problem we propose WTFW ("Who to Follow and Why"), a stochastic topic model for link prediction over directed and nodes-attributed graphs. Our model not only predicts links, but for each predicted link it decides whether it is a "topical" or a "social" link, and depending on this decision it produces a different type of explanation. A topical link is recommended between a user interested in a topic and a user authoritative in that topic: the explanation in this case is a set of binary features describing the topic responsible of the link creation. A social link is recommended between users which share a large social neighborhood: in this case the explanation is the set of neighbors which are more likely to be responsible for the link creation. Our experimental assessment on real-world data confirms the accuracy of WTFW in the link prediction and the quality of the associated explanations.
【Keywords】: link prediction; social networks
【Paper Link】 【Pages】:1276-1285
【Authors】: Yang Zhou ; Ling Liu
【Abstract】: Multi-label classification of heterogeneous information networks has received renewed attention in social network analysis. In this paper, we present an activity-edge centric multi-label classification framework for analyzing heterogeneous information networks with three unique features. First, we model a heterogeneous information network in terms of a collaboration graph and multiple associated activity graphs. We introduce a novel concept of vertex-edge homophily in terms of both vertex labels and edge labels and transform a general collaboration graph into an activity-based collaboration multigraph by augmenting its edges with class labels from each activity graph through activity-based edge classification. Second, we utilize the label vicinity to capture the pairwise vertex closeness based on the labeling on the activity-based collaboration multigraph. We incorporate both the structure affinity and the label vicinity into a unified classifier to speed up the classification convergence. Third, we design an iterative learning algorithm, AEClass, to dynamically refine the classification result by continuously adjusting the weights on different activity-based edge classification schemes from multiple activity graphs, while constantly learning the contribution of the structure affinity and the label vicinity in the unified classifier. Extensive evaluation on real datasets demonstrates that AEClass outperforms existing representative methods in terms of both effectiveness and efficiency.
【Keywords】: activity-based edge classification; collaboration multigraph; heterogeneous network; label vicinity; multi-label classification
【Paper Link】 【Pages】:1286-1295
【Authors】: Jiawei Zhang ; Philip S. Yu ; Zhi-Hua Zhou
【Abstract】: Online social networks offering various services have become ubiquitous in our daily life. Meanwhile, users nowadays are usually involved in multiple online social networks simultaneously to enjoy specific services provided by different networks. Formally, social networks that share some common users are named as partially aligned networks. In this paper, we want to predict the formation of social links in multiple partially aligned social networks at the same time, which is formally defined as the multi-network link (formation) prediction problem. In multiple partially aligned social networks, users can be extensively correlated with each other by various connections. To categorize these diverse connections among users, 7 "intra-network social meta paths" and 4 categories of "inter-network social meta paths" are proposed in this paper. These "social meta paths" can cover a wide variety of connection information in the network, some of which can be helpful for solving the multi-network link prediction problem but some can be not. To utilize useful connection, a subset of the most informative "social meta paths" are picked, the process of which is formally defined as "social meta path selection" in this paper. An effective general link formation prediction framework, Mli (Multi-network Link Identifier), is proposed in this paper to solve the multi-network link (formation) prediction problem. Built with heterogenous topological features extracted based on the selected "social meta paths" in the multiple partially aligned social networks, Mli can help refine and disambiguate the prediction results reciprocally in all aligned networks. Extensive experiments conducted on real-world partially aligned heterogeneous networks, Foursquare and Twitter, demonstrate that Mli can solve the multi-network link prediction problem very well.
【Keywords】: classification; data mining; link prediction; transfer learning
【Paper Link】 【Pages】:1296-1305
【Authors】: Manish Purohit ; B. Aditya Prakash ; Chanhyun Kang ; Yao Zhang ; V. S. Subrahmanian
【Abstract】: Given a social network, can we quickly 'zoom-out' of the graph? Is there a smaller equivalent representation of the graph that preserves its propagation characteristics? Can we group nodes together based on their influence properties? These are important problems with applications to influence analysis, epidemiology and viral marketing applications. In this paper, we first formulate a novel Graph Coarsening Problem to find a succinct representation of any graph while preserving key characteristics for diffusion processes on that graph. We then provide a fast and effective near-linear-time (in nodes and edges) algorithm COARSENET for the same. Using extensive experiments on multiple real datasets, we demonstrate the quality and scalability of COARSENET, enabling us to reduce the graph by 90% in some cases without much loss of information. Finally we also show how our method can help in diverse applications like influence maximization and detecting patterns of propagation at the level of automatically created groups on real cascade data.
【Keywords】: coarsening; diffusion; graph mining; propagation
【Paper Link】 【Pages】:1306-1315
【Authors】: Peng Zhang ; Wei Chen ; Xiaoming Sun ; Yajun Wang ; Jialin Zhang
【Abstract】: A topic propagating in a social network reaches its tipping point if the number of users discussing it in the network exceeds a critical threshold such that a wide cascade on the topic is likely to occur. In this paper, we consider the task of selecting initial seed users of a topic with minimum size so that {\em with a guaranteed probability} the number of users discussing the topic would reach a given threshold. We formulate the task as an optimization problem called {\em seed minimization with probabilistic coverage guarantee (SM-PCG)}. This problem departs from the previous studies on social influence maximization or seed minimization because it considers influence coverage with {\em probabilistic} guarantees instead of guarantees on {\em expected} influence coverage. We show that the problem is not submodular, and thus is harder than previously studied problems based on submodular function optimization. We provide an approximation algorithm and show that it approximates the optimal solution with both a multiplicative ratio and an additive error. The multiplicative ratio is tight while the additive error would be small if influence coverage distributions of certain seed sets are well concentrated. For one-way bipartite graphs we analytically prove the concentration condition and obtain an approximation algorithm with an $O(\log n)$ multiplicative ratio and an $O(\sqrt{n})$ additive error, where $n$ is the total number of nodes in the social graph. Moreover, we empirically verify the concentration condition in real-world networks and experimentally demonstrate the effectiveness of our proposed algorithm comparing to commonly adopted benchmark algorithms.
【Keywords】: independent cascade model; influence diffusion; seed minimization; social networks
【Paper Link】 【Pages】:1316-1325
【Authors】: Francesco Bonchi ; Francesco Gullo ; Andreas Kaltenbrunner ; Yana Volkovich
【Abstract】: Core decomposition has proven to be a useful primitive for a wide range of graph analyses. One of its most appealing features is that, unlike other notions of dense subgraphs, it can be computed linearly in the size of the input graph. In this paper we provide an analogous tool for uncertain graphs, i.e., graphs whose edges are assigned a probability of existence. The fact that core decomposition can be computed efficiently in deterministic graphs does not guarantee efficiency in uncertain graphs, where even the simplest graph operations may become computationally intensive. Here we show that core decomposition of uncertain graphs can be carried out efficiently as well. We extensively evaluate our definitions and methods on a number of real-world datasets and applications, such as influence maximization and task-driven team formation.
【Keywords】: core decomposition; dense subgraph; uncertain graphs
【Paper Link】 【Pages】:1326-1335
【Authors】: Austin R. Benson ; Carlos Riquelme ; Sven Schmit
【Abstract】: Using random graphs to model networks has a rich history. In this paper, we analyze and improve the multifractal network generators (MFNG) introduced by Palla et al. We provide a new result on the probability of subgraphs existing in graphs generated with MFNG. This allows us to quickly compute moments of an important set of graph properties, such as the expected number of edges, stars, and cliques for graphs generated using MFNG. Specifically, we show how to compute these moments in time complexity independent of the size of the graph and the number of recursive levels in the generative model. We leverage this theory to propose a new method of moments algorithm for fitting MFNG to large networks. Empirically, this new approach effectively simulates properties of several social and information networks. In terms of matching subgraph counts, our method outperforms similar algorithms used with the Stochastic Kronecker Graph model. Furthermore, we present a fast approximation algorithm to generate graph instances following the multifractal structure. The approximation scheme is an improvement over previous methods, which ran in time complexity quadratic in the number of vertices. Combined, our method of moments and fast sampling scheme provide the first scalable framework for effectively modeling large networks with MFNG.
【Keywords】: graph mining; graph sampling; method of moments; multifractal; random graphs; real-world networks; stochastic kronecker graph
【Paper Link】 【Pages】:1336-1345
【Authors】: Chuanren Liu ; Kai Zhang ; Hui Xiong ; Geoff Jiang ; Qiang Yang
【Abstract】: Sequential pattern analysis targets on finding statistically relevant temporal structures where the values are delivered in a sequence. With the growing complexity of real-world dynamic scenarios, more and more symbols are often needed to encode a meaningful sequence. This is so-called 'curse of cardinality', which can impose significant challenges to the design of sequential analysis methods in terms of computational efficiency and practical use. Indeed, given the overwhelming scale and the heterogeneous nature of the sequential data, new visions and strategies are needed to face the challenges. To this end, in this paper, we propose a 'temporal skeletonization' approach to proactively reduce the representation of sequences to uncover significant, hidden temporal structures. The key idea is to summarize the temporal correlations in an undirected graph. Then, the 'skeleton' of the graph serves as a higher granularity on which hidden temporal patterns are more likely to be identified. In the meantime, the embedding topology of the graph allows us to translate the rich temporal content into a metric space. This opens up new possibilities to explore, quantify, and visualize sequential data. Our approach has shown to greatly alleviate the curse of cardinality in challenging tasks of sequential pattern mining and clustering. Evaluation on a Business-to-Business (B2B) marketing application demonstrates that our approach can effectively discover critical buying paths from noisy customer event data.
【Keywords】: curse of cardinality; network embedding; sequential pattern mining; temporal skeletonization
【Paper Link】 【Pages】:1346-1355
【Authors】: Bryan Perozzi ; Leman Akoglu ; Patricia Iglesias Sánchez ; Emmanuel Müller
【Abstract】: Graph clustering and graph outlier detection have been studied extensively on plain graphs, with various applications. Recently, algorithms have been extended to graphs with attributes as often observed in the real-world. However, all of these techniques fail to incorporate the user preference into graph mining, and thus, lack the ability to steer algorithms to more interesting parts of the attributed graph. In this work, we overcome this limitation and introduce a novel user-oriented approach for mining attributed graphs. The key aspect of our approach is to infer user preference by the so-called focus attributes through a set of user-provided exemplar nodes. In this new problem setting, clusters and outliers are then simultaneously mined according to this user preference. Specifically, our FocusCO algorithm identifies the focus, extracts focused clusters and detects outliers. Moreover, FocusCO scales well with graph size, since we perform a local clustering of interest to the user rather than global partitioning of the entire graph. We show the effectiveness and scalability of our method on synthetic and real-world graphs, as compared to both existing graph clustering and outlier detection approaches.
【Keywords】: attributed graphs; clustering; distance metric learning; focused graph mining; infer user preference; outlier mining
【Paper Link】 【Pages】:1356-1365
【Authors】: Jingchao Ni ; Hanghang Tong ; Wei Fan ; Xiang Zhang
【Abstract】: Networks are prevalent and have posed many fascinating research questions. How can we spot similar users, e.g., virtual identical twins, in Cleveland for a New Yorker? Given a query disease, how can we prioritize its candidate genes by incorporating the tissue-specific protein interaction networks of those similar diseases? In most, if not all, of the existing network ranking methods, the nodes are the ranking objects with the finest granularity. In this paper, we propose a new network data model, a Network of Networks (NoN), where each node of the main network itself can be further represented as another (domain-specific) network. This new data model enables to compare the nodes in a broader context and rank them at a finer granularity. Moreover, such an NoN model enables much more efficient search when the ranking targets reside in a certain domain-specific network. We formulate ranking on NoN as a regularized optimization problem; propose efficient algorithms and provide theoretical analysis, such as optimality, convergence, complexity and equivalence. Extensive experimental evaluations demonstrate the effectiveness and the efficiency of our methods.
【Keywords】: network of networks; query; ranking
【Paper Link】 【Pages】:1366-1375
【Authors】: Isabel M. Kloumann ; Jon M. Kleinberg
【Abstract】: In many applications we have a social network of people and would like to identify the members of an interesting but unlabeled group or community. We start with a small number of exemplar group members -- they may be followers of a political ideology or fans of a music genre -- and need to use those examples to discover the additional members. This problem gives rise to the seed expansion problem in community detection: given example community members, how can the social graph be used to predict the identities of remaining, hidden community members? In contrast with global community detection (graph partitioning or covering), seed expansion is best suited for identifying communities locally concentrated around nodes of interest. A growing body of work has used seed expansion as a scalable means of detecting overlapping communities. Yet despite growing interest in seed expansion, there are divergent approaches in the literature and there still isn't a systematic understanding of which approaches work best in different domains. Here we evaluate several variants and uncover subtle trade-offs between different approaches. We explore which properties of the seed set can improve performance, focusing on heuristics that one can control in practice. As a consequence of this systematic understanding we have found several opportunities for performance gains. We also consider an adaptive version in which requests are made for additional membership labels of particular nodes, such as one finds in field studies of social communities. This leads to interesting connections and contrasts with active learning and the trade-offs of exploration and exploitation. Finally, we explore topological properties of communities and seed sets that correlate with algorithm performance, and explain these empirical observations with theoretical ones. We evaluate our methods across multiple domains, using publicly available datasets with labeled, ground-truth communities.
【Keywords】: ground-truth communities; seed set expansion
【Paper Link】 【Pages】:1376-1385
【Authors】: Lian Duan ; William Nick Street ; Yanchi Liu ; Haibing Lu
【Abstract】: Community detection is an important task for social networks, which helps us understand the functional modules on the whole network. Among different community detection methods based on graph structures, modularity-based methods are very popular recently, but suffer a well-known resolution limit problem. This paper connects modularity-based methods with correlation analysis by subtly reformatting their math formulas and investigates how to fully make use of correlation analysis to change the objective function of modularity-based methods, which provides a more natural and effective way to solve the resolution limit problem. In addition, a novel theoretical analysis on the upper bound of different objective functions helps us understand their bias to different community sizes, and experiments are conducted on both real life and simulated data to validate our findings.
【Keywords】: community detection; correlation analysis; leverage; likelihood ratio; modularity
【Paper Link】 【Pages】:1386-1395
【Authors】: Kyle Kloster ; David F. Gleich
【Abstract】: The heat kernel is a type of graph diffusion that, like the much-used personalized PageRank diffusion, is useful in identifying a community nearby a starting seed node. We present the first deterministic, local algorithm to compute this diffusion and use that algorithm to study the communities that it produces. Our algorithm is formally a relaxation method for solving a linear system to estimate the matrix exponential in a degree-weighted norm. We prove that this algorithm stays localized in a large graph and has a worst-case constant runtime that depends only on the parameters of the diffusion, not the size of the graph. On large graphs, our experiments indicate that the communities produced by this method have better conductance than those produced by PageRank, although they take slightly longer to compute. On a real-world community identification task, the heat kernel communities perform better than those from the PageRank diffusion.
【Keywords】: heat kernel; local clustering
【Paper Link】 【Pages】:1396-1405
【Authors】: Tanmoy Chakraborty ; Sriram Srinivasan ; Niloy Ganguly ; Animesh Mukherjee ; Sanjukta Bhowmick
【Abstract】: Despite the prevalence of community detection algorithms, relatively less work has been done on understanding whether a network is indeed modular and how resilient the community structure is under perturbations. To address this issue, we propose a new vertex-based metric called "permanence", that can quantitatively give an estimate of the community- like structure of the network. The central idea of permanence is based on the observation that the strength of membership of a vertex to a community depends upon the following two factors: (i) the distribution of external connectivity of the vertex to individual communities and not the total external connectivity, and (ii) the strength of its internal connectivity and not just the total internal edges. In this paper, we demonstrate that compared to other metrics, permanence provides (i) a more accurate estimate of a derived community structure to the ground-truth community and (ii) is more sensitive to perturbations in the network. As a by-product of this study, we have also developed a community detection algorithm based on maximizing permanence. For a modular network structure, the results of our algorithm match well with ground-truth communities.
【Keywords】: community analysis; modularity; permanence
【Paper Link】 【Pages】:1406-1415
【Authors】: Rumi Ghosh ; Shang-Hua Teng ; Kristina Lerman ; Xiaoran Yan
【Abstract】: We study the interplay between a dynamic process and the structure of the network on which it is defined. Specifically, we examine the impact of this interaction on the quality-measure of network clusters and node centrality. This enables us to effectively identify network communities and important nodes participating in the dynamics. As the first step towards this objective, we introduce an umbrella framework for defining and characterizing an ensemble of dynamic processes on a network. This framework generalizes the traditional Laplacian framework to continuous-time biased random walks and also allows us to model some epidemic processes over a network. For each dynamic process in our framework, we can define a function that measures the quality of every subset of nodes as a potential cluster (or community) with respect to this process on a given network. This subset-quality function generalizes the traditional conductance measure for graph partitioning. We partially justify our choice of the quality function by showing that the classic Cheeger's inequality, which relates the conductance of the best cluster in a network with a spectral quantity of its Laplacian matrix, can be extended from the Laplacian-conductance setting to this more general setting.
【Keywords】: Cheeger's inequality; centrality; community detection; graph theory; spectral clustering
【Paper Link】 【Pages】:1416-1425
【Authors】: Yuichi Yoshida
【Abstract】: Betweenness centrality measures the importance of a vertex by quantifying the number of times it acts as a midpoint of the shortest paths between other vertices. This measure is widely used in network analysis. In many applications, we wish to choose the k vertices with the maximum adaptive betweenness centrality, which is the betweenness centrality without considering the shortest paths that have been taken into account by already-chosen vertices. All previous methods are designed to compute the betweenness centrality in a fixed graph. Thus, to solve the above task, we have to run these methods $k$ times. In this paper, we present a method that directly solves the task, with an almost linear runtime no matter how large the value of k. Our method first constructs a hypergraph that encodes the betweenness centrality, and then computes the adaptive betweenness centrality by examining this graph. Our technique can be utilized to handle other centrality measures. We theoretically prove that our method is very accurate, and experimentally confirm that it is three orders of magnitude faster than previous methods. Relying on the scalability of our method, we experimentally demonstrate that strategies based on adaptive betweenness centrality are effective in important applications studied in the network science and database communities.
【Keywords】: adaptive betweenness centrality; adaptive coverage centrality; randomized algorithm
【Paper Link】 【Pages】:1426-1435
【Authors】: Takanori Maehara ; Mitsuru Kusumoto ; Ken-ichi Kawarabayashi
【Abstract】:
【Keywords】:
【Paper Link】 【Pages】:1436-1445
【Authors】: Peter Lofgren ; Siddhartha Banerjee ; Ashish Goel ; Seshadhri Comandur
【Abstract】: We propose a new algorithm, FAST-PPR, for computing personalized PageRank: given start node s and target node t in a directed graph, and given a threshold δ, it computes the Personalized PageRank π_s(t) from s to t, guaranteeing that the relative error is small as long πs(t) > δ. Existing algorithms for this problem have a running-time of Ω(1/δ in comparison, FAST-PPR has a provable average running-time guarantee of O(√d/δ) (where d is the average in-degree of the graph). This is a significant improvement, since δ is often O(1/n) (where n is the number of nodes) for applications. We also complement the algorithm with an Ω(1/√δ) lower bound for PageRank estimation, showing that the dependence on δ cannot be improved. We perform a detailed empirical study on numerous massive graphs, showing that FAST-PPR dramatically outperforms existing algorithms. For example, on the 2010 Twitter graph with 1.5 billion edges, for target nodes sampled by popularity, FAST-PPR has a 20 factor speedup over the state of the art. Furthermore, an enhanced version of FAST-PPR has a 160 factor speedup on the Twitter graph, and is at least 20 times faster on all our candidate graphs.
【Keywords】: personalized pagerank; social search
【Paper Link】 【Pages】:1446-1455
【Authors】: Nesreen K. Ahmed ; Nick G. Duffield ; Jennifer Neville ; Ramana Rao Kompella
【Abstract】: Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g. web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes to estimate certain graph properties (e.g. triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we pro- pose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH), which samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state in memory. We use a Horvitz-Thompson construction in conjunction with a scheme that samples arriving edges without adjacencies to previously sampled edges with probability p and holds edges with adjacencies with probability q. Our sample and hold framework facilitates the accurate estimation of subgraph patterns by enabling the dependence of the sampling process to vary based on previous history. Within our framework, we show how to produce statistically unbiased estimators for various graph properties from the sample. Given that the graph analytics will run on a sample instead of the whole population, the runtime complexity is kept under control. Moreover, given that the estimators are unbiased, the approximation error is also kept under control. Finally, we test the performance of the proposed framework (gSH) on various types of graphs, showing that from a sample with -- 40K edges, it produces estimates with relative errors < 1%.
【Keywords】: graph streams; network sampling; statistical estimation
【Paper Link】 【Pages】:1456-1465
【Authors】: Florian Bourse ; Marc Lelarge ; Milan Vojnovic
【Abstract】: Balanced edge partition has emerged as a new approach to partition an input graph data for the purpose of scaling out parallel computations, which is of interest for several modern data analytics computation platforms, including platforms for iterative computations, machine learning problems, and graph databases. This new approach stands in a stark contrast to the traditional approach of balanced vertex partition, where for given number of partitions, the problem is to minimize the number of edges cut subject to balancing the vertex cardinality of partitions. In this paper, we first characterize the expected costs of vertex and edge partitions with and without aggregation of messages, for the commonly deployed policy of placing a vertex or an edge uniformly at random to one of the partitions. We then obtain the first approximation algorithms for the balanced edge-partition problem which for the case of no aggregation matches the best known approximation ratio for the balanced vertex-partition problem, and show that this remains to hold for the case with aggregation up to factor that is equal to the maximum in-degree of a vertex. We report results of an extensive empirical evaluation on a set of real-world graphs, which quantifies the benefits of edge- vs. vertex-partition, and demonstrates efficiency of natural greedy online assignments for the balanced edge-partition problem with and with no aggregation.
【Keywords】: approximation algorithms; distributed massive computation; graph edge partition; streaming heuristics
【Paper Link】 【Pages】:1466-1475
【Authors】: Stavros Sintos ; Panayiotis Tsaparas
【Abstract】: In the past few years there has been an explosion of social networks in the online world. Users flock these networks, creating profiles and linking themselves to other individuals. Connecting online has a small cost compared to the physical world, leading to a proliferation of connections, many of which carry little value or importance. Understanding the strength and nature of these relationships is paramount to anyone interesting in making use of the online social network data. In this paper, we use the principle of Strong Triadic Closure to characterize the strength of relationships in social networks. The Strong Triadic Closure principle stipulates that it is not possible for two individuals to have a strong relationship with a common friend and not know each other. We consider the problem of labeling the ties of a social network as strong or weak so as to enforce the Strong Triadic Closure property. We formulate the problem as a novel combinatorial optimization problem, and we study it theoretically. Although the problem is NP-hard, we are able to identify cases where there exist efficient algorithms with provable approximation guarantees. We perform experiments on real data, and we show that there is a correlation between the labeling we obtain and empirical metrics of tie strength, and that weak edges act as bridges between different communities in the network. Finally, we study extensions and variations of our problem both theoretically and experimentally.
【Keywords】: approximation algorithms; social networks; strong triadic closure
【Paper Link】 【Pages】:1476-1485
【Authors】: Takuya Akiba ; Takanori Maehara ; Ken-ichi Kawarabayashi
【Abstract】:
【Keywords】:
【Paper Link】 【Pages】:1486-1495
【Authors】: Huan Sun ; Mudhakar Srivatsa ; Shulong Tan ; Yang Li ; Lance M. Kaplan ; Shu Tao ; Xifeng Yan
【Abstract】: Collaborative networks are composed of experts who cooperate with each other to complete specific tasks, such as resolving problems reported by customers. A task is posted and subsequently routed in the network from an expert to another until being resolved. When an expert cannot solve a task, his routing decision (i.e., where to transfer a task) is critical since it can significantly affect the completion time of a task. In this work, we attempt to deduce the cognitive process of task routing, and model the decision making of experts as a generative process where a routing decision is made based on mixed routing patterns. In particular, we observe an interesting phenomenon that an expert tends to transfer a task to someone whose knowledge is neither too similar to nor too different from his own. Based on this observation, an expertise difference based routing pattern is developed. We formalize multiple routing patterns by taking into account both rational and random analysis of tasks, and present a generative model to combine them. For a held-out set of tasks, our model not only explains their real routing sequences very well, but also accurately predicts their completion time. Under three different quality measures, our method significantly outperforms all the alternatives with more than 75% accuracy gain. In practice, with the help of our model, hypotheses on how to improve a collaborative network can be tested quickly and reliably, thereby significantly easing performance improvement of collaborative networks.
【Keywords】: collaborative network; generative model; task routing; user modeling
【Paper Link】 【Pages】:1496-1505
【Authors】: Yuan Yao ; Hanghang Tong ; Feng Xu ; Jian Lu
【Abstract】: Community Question Answering (CQA) sites have become valuable platforms to create, share, and seek a massive volume of human knowledge. How can we spot an insightful question that would inspire massive further discussions in CQA sites? How can we detect a valuable answer that benefits many users? The long-term impact (e.g., the size of the population a post benefits) of a question/answer post is the key quantity to answer these questions. In this paper, we aim to predict the long-term impact of questions/answers shortly after they are posted in the CQA sites. In particular, we propose a family of algorithms for the prediction problem by modeling three key aspects, i.e., non-linearity, question/answer coupling, and dynamics. We analyze our algorithms in terms of optimality, correctness, and complexity. We conduct extensive experimental evaluations on two real CQA data sets to demonstrate the effectiveness and efficiency of our algorithms.
【Keywords】: impact correlation; long-term impact; question answering
【Paper Link】 【Pages】:1506-1515
【Authors】: Bin Bi ; Ben Kao ; Chang Wan ; Junghoo Cho
【Abstract】: With the rapid growth of Web 2.0, a variety of content sharing services, such as Flickr, YouTube, Blogger, and TripAdvisor etc, have become extremely popular over the last decade. On these websites, users have created and shared with each other various kinds of resources, such as photos, video, and travel blogs. The sheer amount of user-generated content varies greatly in quality, which calls for a principled method to identify a set of authorities, who created high-quality resources, from a massive number of contributors of content. Since most previous studies only infer global authoritativeness of a user, there is no way to differentiate the authoritativeness in different aspects of life (topics). In this paper, we propose a novel model of Topic-specific Authority Analysis (TAA), which addresses the limitations of the previous approaches, to identify authorities specific to given query topic(s) on a content sharing service. This model jointly leverages the usage data collected from the sharing log and the favorite log. The parameters in TAA are learned from a constructed training dataset, for which a novel logistic likelihood function is specifically designed. To perform Bayesian inference for TAA with the new logistic likelihood, we extend typical Gibbs sampling by introducing auxiliary variables. Thorough experiments with two real-world datasets demonstrate the effectiveness of TAA in topic-specific authority identification as well as the generalizability of the TAA generative model.
【Keywords】: bayesian model; content sharing services; topic-specific authority analysis
【Paper Link】 【Pages】:1516
【Authors】: Sri Subramaniam
【Abstract】: E-commerce has largely been a 'Pull' model to date. Offline retailers have nailed discovery, delight, serendipity, and impulse purchases in person with greater success than online commerce sites. However, in an always-on, mobile-first world, companies like Groupon have the opportunity to push the frontier even further than offline retailers or comprehensive sites due to the fact that our smartphones are always with us. The challenge is to provide the right deals to the right user at the right time. That involves learning about the users and their locations, their personal preferences, and predicting which deals are likely to delight them, presenting diversity, discovery and engaging UX to gather user preferences and semantic graph approaches for user-deal matching. This presentation will give insight into how Groupon manages to grapple with these challenges via a data-driven system in order to delight and surprise customers.
【Keywords】: e-commerce; geolocation data; mobile technology; recommender systems
【Paper Link】 【Pages】:1517
【Authors】: Tracy De Poalo ; Jeremy Howard
【Abstract】: It's not often we get to peer into the details of how big corporates use predictive modeling in practice. In this talk, Sprint's Head of Predictive Modeling, Tracey De Poalo, will talk about the process she developed using SAS and logistic regression to build a wide range of models. Jeremy Howard will discuss his experience as a consultant at Sprint comparing R and random forests to the existing process, and will show the pros and cons of each approach. The talk will also cover the 'real world' issues that Tracey has dealt with in creating a reusable process, including data mart development, sampling, testing, reporting, and implementation with internal customers. Sprint's existing process is the best that Jeremy has seen in industry, and shows a lot of best practices in data structure, documentation, testing, automation, and customer interaction, and modeling.
【Keywords】: logistic regression; machine learning; random forests
【Paper Link】 【Pages】:1518
【Authors】: Nigam Shah
【Abstract】: In the era of EHRs, it is possible to examine the outcomes of decisions made by doctors during clinical practice to identify patterns of care---generating evidence based on the collective practice of experts. We will discuss methods that use unstructured patient data to monitor for adverse drug events, profile specific drugs, identify off-label drug usage, uncover 'natural experiments' and generate practice-based evidence for difficult-to-test clinical hypotheses. We will describe how to detect associations among drugs and their adverse events several years before an alert is issued as well as compute the true rate of drug-drug interactions. We will present approaches to identify novel off-label uses of drugs using the patient feature matrix along with prior knowledge about drugs, diseases, and known usage. We will review a natural experiment--where a subset of congestive heart failure patients who were prescribed Cilostazol despite its black box warning--and profile its safety. We will discuss the testing of a clinical hypothesis about an association between allergic conditions and chronic uveitis in patients with juvenile idiopathic arthritis.
【Keywords】: clinical data warehouse; learning health system; practice-based evidence; unstructured ehr
【Paper Link】 【Pages】:1519
【Authors】: Cynthia Rudin
【Abstract】: It is extremely important in many application domains to have transparency in predictive modeling. Domain experts do not tend to prefer "black box" predictive model models. They would like to understand how predictions are made, and possibly, prefer models that emulate the way a human expert might make a decision, with a few important variables, and a clear convincing reason to make a particular prediction. I will discuss recent work on interpretable predictive modeling with decision lists and sparse integer linear models. I will describe several approaches, including an algorithm based on discrete optimization, and an algorithm based on Bayesian analysis. I will show examples of interpretable models for stroke prediction in medical patients and prediction of violent crime in young people raised in out-of-home care. Collaborators are Ben Letham, Berk Ustun, Stefano Traca, Siong Thye Goh, Tyler McCormick, and David Madigan.
【Keywords】: comprehensibility; interpretability; machine learning; sparsity; medical calculators; understandability
【Paper Link】 【Pages】:1520
【Authors】: Drew Conway
【Abstract】: In this talk, Drew will examine data science through the lens of the social scientist. He will discuss how the various skills and disciplines combine into data science. Drew will also present a motivating example directly from his work as a senior advisor to NYC's Mayor's Office of Analytics.
【Keywords】: data science; local government; social science
【Paper Link】 【Pages】:1521
【Authors】: Rand Waltzman
【Abstract】: The US Department of Defense Dictionary of Military terms defines the Information Environment as 'the aggregate of individuals, organizations, and systems that collect, process, disseminate, or act on information.' The decisions and actions we take both as individuals and collectively simultaneously shape and are shaped by the information environment in which we live. The nature of our interaction with the information environment is rapidly evolving and old models are becoming irrelevant faster than we can develop new ones. This results in uncertainties that leave us exposed to dangerous influences without proper defenses. The purpose of this talk is to help frame a new science of Information Environment Security (IES) whose goal is to create and apply the tools needed to discover and maintain fundamental models of our ever-changing information environment and to defend us in that environment, both as individuals and collectively, against intentional as well as unintentional attempts to deceive, misinform and otherwise manipulate us. IES is an interdisciplinary science that will require bringing together experts working in areas such as cognitive science, computer science, social science, security, marketing, political campaigning, public policy, and psychology to develop a theoretical as well as an applied engineering methodology for managing the full spectrum of information environment security issues.
【Keywords】: cognitive science; computer science; information environment; marketing; political campaigning; psychology; public policy; security; social science
【Paper Link】 【Pages】:1522
【Authors】: Nathan Eagle
【Abstract】: Petabytes of data about human movements, transactions and communication patterns are being generated by everyday technologies such as mobile phones & credit cards. This unprecedented volume of information facilitates a novel set of research questions applicable to a wide range of development issues. In collaboration involving 237 mobile phone operators across 102 countries, Jana's mobile technology platform can instantly poll and compensate 3.48 billion active mobile subscriptions. This talk will discuss how insights gained from living in Kenya became the genesis of a technology company currently working with global clients in over 50 countries, including P&G, Google, Unilever, Danone, General Mills, Nestle, Johnson & Johnson, Microsoft, the World Bank, and the United Nations. After providing an overview of the mobile and social media landscapes in emerging markets, we discuss a system that implements polls & mobile subscription compensation. The presentation will conclude by emphasizing the value of consumer data in underserved and understudied regions of the world.
【Keywords】: emerging markets; mobile technology; polling systems; social media
【Paper Link】 【Pages】:1523
【Authors】: Robert Munro
【Abstract】: Speakers of more than 5,000 languages have access to internet and communication technologies. The majority of phones, tablets and computers now ship with language-enabled capabilities like speech-recognition and intelligent auto-correction, and people increasingly interact with data-intensive cloud-based language technologies like search-engines and spam-filters. For both personal and large-scale technologies, the service quality drops or disappears entirely outside of a handful of languages. Speakers of low-resource languages correlate with lower access to healthcare, education and higher vulnerability to disasters. Serving the broadest possible range of languages is crucial to ensuring equitable participation in the global information economy. I will present examples of how natural language processing and distributed human computing are improving the lives of speakers of all the world's languages, in areas including education, disaster-response, health and access to employment. When applying natural language processing to the full diversity of the world's communications, we need to go beyond simple keyword analysis and implement complex technologies that require human-in-the-loop processing to ensure usable accuracy. In recent work where more than a million human judgments were collected on unstructured text and imagery data around natural disasters, I will present observations that debunk recent over-optimistic claims about the utility of social media following disasters. On the positive side, I will share results that show how for-profit technologies are improving people's lives by providing sustainable economic growth opportunities when they support more languages, aligning business objectives with global diversity.
【Keywords】: crowdsourcing; disaster-response; education; employment; healthcare; human-computer interaction; natural language processing; social development
【Paper Link】 【Pages】:1524-1533
【Authors】: Acar Tamersoy ; Kevin A. Roundy ; Duen Horng Chau
【Abstract】: The increasing sophistication of malicious software calls for new defensive techniques that are harder to evade, and are capable of protecting users against novel threats. We present AESOP, a scalable algorithm that identifies malicious executable files by applying Aesop's moral that "a man is known by the company he keeps." We use a large dataset voluntarily contributed by the members of Norton Community Watch, consisting of partial lists of the files that exist on their machines, to identify close relationships between files that often appear together on machines. AESOP leverages locality-sensitive hashing to measure the strength of these inter-file relationships to construct a graph, on which it performs large scale inference by propagating information from the labeled files (as benign or malicious) to the preponderance of unlabeled files. AESOP attained early labeling of 99% of benign files and 79% of malicious files, over a week before they are labeled by the state-of-the-art techniques, with a 0.9961 true positive rate at flagging malware, at 0.0001 false positive rate.
【Keywords】: belief propagation; file graph; graph mining; locality sensitive hashing; malware detection
【Paper Link】 【Pages】:1534-1543
【Authors】: Anitha Kannan ; Simon Baker ; Krishnan Ramnath ; Juliet Fiss ; Dahua Lin ; Lucy Vanderwende ; Rizwan Ansary ; Ashish Kapoor ; Qifa Ke ; Matt Uyttendaele ; Xin-Jing Wang ; Lei Zhang
【Abstract】: Images are often used to convey many different concepts or illustrate many different stories. We propose an algorithm to mine multiple diverse, relevant, and interesting text snippets for images on the web. Our algorithm scales to all images on the web. For each image, all webpages that contain it are considered. The top-K text snippet selection problem is posed as combinatorial subset selection with the goal of choosing an optimal set of snippets that maximizes a combination of relevancy, interestingness, and diversity. The relevancy and interestingness are scored by machine learned models. Our algorithm is run at scale on the entire image index of a major search engine resulting in the construction of a database of images with their corresponding text snippets. We validate the quality of the database through a large-scale comparative study. We showcase the utility of the database through two web-scale applications: (a) augmentation of images on the web as webpages are browsed and (b)~an image browsing experience (similar in spirit to web browsing) that is enabled by interconnecting semantically related images (which may not be visually related) through shared concepts in their corresponding text snippets.
【Keywords】: browsing; diversity; interestingness; relevance; semantic image browsing; text mining for images; text snippets; web image augmentation
【Paper Link】 【Pages】:1544-1552
【Authors】: Ashay Tamhane ; Shajith Ikbal ; Bikram Sengupta ; Mayuri Duggirala ; James Appleton
【Abstract】: Poor academic performance in K-12 is often a precursor to unsatisfactory educational outcomes such as dropout, which are associated with significant personal and social costs. Hence, it is important to be able to predict students at risk of poor performance, so that the right personalized intervention plans can be initiated. In this paper, we report on a large-scale study to identify students at risk of not meeting acceptable levels of performance in one state-level and one national standardized assessment in Grade 8 of a major US school district. An important highlight of our study is its scale - both in terms of the number of students included, the number of years and the number of features, which provide a very solid grounding to the research. We report on our experience with handling the scale and complexity of data, and on the relative performance of various machine learning techniques we used for building predictive models. Our results demonstrate that it is possible to predict students at-risk of poor assessment performance with a high degree of accuracy, and to do so well in advance. These insights can be used to pro-actively initiate personalized intervention programs and improve the chances of student success.
【Keywords】: analysis; education; longitudinal data; prediction; risk
【Paper Link】 【Pages】:1553-1562
【Authors】: Bingsheng Wang ; Jinjun Xiong
【Abstract】: This paper addresses geospatial interpolation for meteorological measurements in which we estimate the values of climatic metrics at unsampled sites with existing observations. Providing climatological and meteorological conditions covering a large region is potentially useful in many applications, such as smart grid. However, existing research works on interpolation either cause a large number of complex calculations or are lack of high accuracy. We propose a Bayesian compressed sensing based non-parametric statistical model to efficiently perform the spatial interpolation task. Student-t priors are employed to model the sparsity of unknown signals' coefficients, and the Approximated Variational Inference (AVI) method is provided for effective and fast learning. The presented model has been deployed at IBM, targeting for aiding the intelligent management of smart grid. The evaluations on two real world datasets demonstrate that our algorithm achieves state-of-the-art performance in both effectiveness and efficiency.
【Keywords】: analytics; bayesian inference; geospatial interpolation; meteorological measurements; smart grid
【Paper Link】 【Pages】:1563-1572
【Authors】: Brian Abelson ; Kush R. Varshney ; Joy Sun
【Abstract】: Unconditional cash transfers to the extreme poor via mobile telephony represent a radical, new approach to giving. GiveDirectly is a non-governmental organization (NGO) at the vanguard of delivering this proven and effective approach to reducing poverty. In this work, we streamline an important step in the operations of the NGO by developing and deploying a data-driven system for locating villages with extreme poverty in Kenya and Uganda. Using the type of roof of a home, thatched or metal, as a proxy for poverty, we develop a new remote sensing approach for selecting extremely poor villages to target for cash transfers. We develop an analytics algorithm that estimates housing quality and density in patches of publicly-available satellite imagery by learning a predictive model with sieves of template matching results combined with color histograms as features. We develop and deploy a crowdsourcing interface to obtain labeled training data. We deploy the predictive model to construct a fine-scale heat map of poverty and integrate this discovered knowledge into the processes of GiveDirectly's operations. Aggregating estimates at the village level, we produce a ranked list from which top villages are included in GiveDirectly's planned distribution of cash transfers. The automated approach increases village selection efficiency significantly.
【Keywords】: poverty economics; remote sensing; social good
【Paper Link】 【Pages】:1573-1582
【Authors】: Brian Dalessandro ; Daizhuo Chen ; Troy Raeder ; Claudia Perlich ; Melinda Han Williams ; Foster J. Provost
【Abstract】: Internet display advertising is a critical revenue source for publishers and online content providers, and is supported by massive amounts of user and publisher data. Targeting display ads can be improved substantially with machine learning methods, but building many models on massive data becomes prohibitively expensive computationally. This paper presents a combination of strategies, deployed by the online advertising firm Dstillery, for learning many models from extremely high-dimensional data efficiently and without human intervention. This combination includes: (i)~A method for simple-yet-effective transfer learning where a model learned from data that is relatively abundant and cheap is taken as a prior for Bayesian logistic regression trained with stochastic gradient descent (SGD) from the more expensive target data. (ii)~A new update rule for automatic learning rate adaptation, to support learning from sparse, high-dimensional data, as well as the integration with adaptive regularization. We present an experimental analysis across 100 different ad campaigns, showing that the transfer learning indeed improves performance across a large number of them, especially at the start of the campaigns. The combined "hands-free" method needs no fiddling with the SGD learning rate, and we show that it is just as effective as using expensive grid search to set the regularization parameter for each campaign.
【Keywords】: online advertising; stochastic gradient descent; transfer learning
【Paper Link】 【Pages】:1583-1592
【Authors】: Chen Luo ; Jian-Guang Lou ; Qingwei Lin ; Qiang Fu ; Rui Ding ; Dongmei Zhang ; Zhe Wang
【Abstract】: As online services have more and more popular, incident diagnosis has emerged as a critical task in minimizing the service downtime and ensuring high quality of the services provided. For most online services, incident diagnosis is mainly conducted by analyzing a large amount of telemetry data collected from the services at runtime. Time series data and event sequence data are two major types of telemetry data. Techniques of correlation analysis are important tools that are widely used by engineers for data-driven incident diagnosis. Despite their importance, there has been little previous work addressing the correlation between two types of heterogeneous data for incident diagnosis: continuous time series data and temporal event data. In this paper, we propose an approach to evaluate the correlation between time series data and event data. Our approach is capable of discovering three important aspects of event-timeseries correlation in the context of incident diagnosis: existence of correlation, temporal order, and monotonic effect. Our experimental results on simulation data sets and two real data sets demonstrate the effectiveness of the algorithm.
【Keywords】: correlation; incident diagnosis; two-sample problem
【Paper Link】 【Pages】:1593-1602
【Authors】: Chuanren Liu ; Yong Ge ; Hui Xiong ; Keli Xiao ; Wei Geng ; Matt Perkins
【Abstract】: Advances in real-time location system (RTLS) solutions have enabled us to collect massive amounts of fine-grained semantically rich location traces, which provide unparalleled opportunities for understanding human activities and discovering useful knowledge. This, in turn, delivers intelligence for real-time decision making in various fields, such as workflow management. Indeed, it is a new paradigm for workflow modeling by the knowledge discovery in location traces. To that end, in this paper, we provide a focused study of workflow modeling by the integrated analysis of indoor location traces in the hospital environment. In comparison with conventional workflow modeling based on passive workflow logs, one salient feature of our approach is that it can proactively unravel the workflow patterns hidden in the location traces, by automatically constructing the workflow states and estimating parameters describing the transition patterns of moving objects. Specifically, to determine a meaningful granularity for the model, the workflow states are first constructed as regions associated with specific healthcare activities. Then, we transform the original indoor location traces to the sequences of workflow states and model the workflow transition patterns by finite state machines. Furthermore, we leverage the correlations in the location traces between related types of medical devices to reinforce the modeling performance and enable more applications. The results show that the proposed framework can not only model the workflow patterns effectively, but also have managerial applications in workflow monitoring, auditing, and inspection of workflow compliance, which are critical in the healthcare industry.
【Keywords】: healthcare operation and management; indoor location traces; workflow modeling
【Paper Link】 【Pages】:1603-1612
【Authors】: Deepak Agarwal ; Bee-Chung Chen ; Rupesh Gupta ; Joshua Hartman ; Qi He ; Anand Iyer ; Sumanth Kolar ; Yiming Ma ; Pannagadatta Shivaswamy ; Ajit Singh ; Liang Zhang
【Abstract】: Users on an online social network site generate a large number of heterogeneous activities, ranging from connecting with other users, to sharing content, to updating their profiles. The set of activities within a user's network neighborhood forms a stream of updates for the user's consumption. In this paper, we report our experience with the problem of ranking activities in the LinkedIn homepage feed. In particular, we provide a taxonomy of social network activities, describe a system architecture (with a number of key components open-sourced) that supports fast iteration in model development, demonstrate a number of key factors for effective ranking, and report experimental results from extensive online bucket tests.
【Keywords】: activity ranking; large scale learning; relevance
【Paper Link】 【Pages】:1613-1619
【Authors】: Deepak Agarwal ; Souvik Ghosh ; Kai Wei ; Siyu You
【Abstract】: Targeted online advertising is a prime source of revenue for many Internet companies. It is a common industry practice to use a generalized second price auction mechanism to rank advertisements at every opportunity of an impression. This greedy algorithm is suboptimal for both advertisers and publishers when advertisers have a finite budget. In a greedy mechanism high performing advertisers tend to drop out of the auction marketplace fast and that adversely affects both the advertiser experience and the publisher revenue. We describe a method for improving such ad serving systems by including a budget pacing component that serves ads by being aware of global supply patterns. Such a system is beneficial for both advertisers and publishers. We demonstrate the benefits of this component using experiments we conducted on advertising at LinkedIn.
【Keywords】: budget pacing; generalized second price auction; targeted online advertising
【Paper Link】 【Pages】:1620-1629
【Authors】: Dejan Radosavljevik ; Peter van der Putten
【Abstract】: This paper outlines the approach developed together with the Radio Network Strategy & Design Department of a large European telecom operator in order to forecast the Air-Interface load in their 3G network, which is used for planning network upgrades and budgeting purposes. It is based on large scale intelligent data analysis and modeling at the level of thousands of individual radio cells resulting in 30,000 models per day. It has been embedded into a scenario simulation framework that is used by end users not experienced in data mining for studying and simulating the behavior of this complex networked system, as an example of a systematic approach to the deployment step in the KDD process. This system is already in use for two years in the country where it was developed and it is a part of a standard business process. In the last six months this national operator became a competence center for predictive modeling for micro-simulation of 3G air interface load for four other operators of the same parent company.
【Keywords】: air-interface load; linear regression; mobile network; simulation
【Paper Link】 【Pages】:1630-1639
【Authors】: Derek Lin ; Rashmi Raghu ; Vivek Ramamurthy ; Jin Yu ; Regunathan Radhakrishnan ; Joseph Fernandez
【Abstract】: Large enterprise IT (Information Technology) infrastructure components generate large volumes of alerts and incident tickets. These are manually screened, but it is otherwise difficult to extract information automatically from them to gain insights in order to improve operational efficiency. We propose a framework to cluster alerts and incident tickets based on the text in them, using unsupervised machine learning. This would be a step towards eliminating manual classification of the alerts and incidents, which is very labor intense and costly. Our framework can handle the semi-structured text in alerts generated by IT infrastructure components such as storage devices, network devices, servers etc., as well as the unstructured text in incident tickets created manually by operations support personnel. After text pre-processing and application of appropriate distance metrics, we apply different graph-theoretic approaches to cluster the alerts and incident tickets, based on their semi-structured and unstructured text respectively. For automated interpretation and read-ability on semi-structured text clusters, we propose a method to visualize clusters that preserves the structure and human-readability of the text data as compared to traditional word clouds where the text structure is not preserved; for unstructured text clusters, we find a simple way to define prototypes of clusters for easy interpretation. This framework for clustering and visualization will enable enterprises to prioritize the issues in their IT infrastructure and improve the reliability and availability of their services.
【Keywords】: alerts and incidents management; complete linkage; connected components; graph cut; hierarchical clustering; kd-tree; non-negative matrix factorization; tickets analysis
【Paper Link】 【Pages】:1640-1649
【Authors】: Diane J. Hu ; Rob Hall ; Josh Attenberg
【Abstract】: Purchasing decisions in many product categories are heavily influenced by the shopper's aesthetic preferences. It's insufficient to simply match a shopper with popular items from the category in question; a successful shopping experience also identifies products that match those aesthetics. The challenge of capturing shoppers' styles becomes more difficult as the size and diversity of the marketplace increases. At Etsy, an online marketplace for handmade and vintage goods with over 30 million diverse listings, the problem of capturing taste is particularly important -- users come to the site specifically to find items that match their eclectic styles. In this paper, we describe our methods and experiments for deploying two new style-based recommender systems on the Etsy site. We use Latent Dirichlet Allocation (LDA) to discover trending categories and styles on Etsy, which are then used to describe a user's "interest" profile. We also explore hashing methods to perform fast nearest neighbor search on a map-reduce framework, in order to efficiently obtain recommendations. These techniques have been implemented successfully at very large scale, substantially improving many key business metrics.
【Keywords】: collaborative filtering; recommender systems; topic modeling
【Paper Link】 【Pages】:1650-1659
【Authors】: Enric Junqué de Fortuny ; Marija Stankova ; Julie Moeyersoms ; Bart Minnaert ; Foster J. Provost ; David Martens
【Abstract】: With the globalisation of the world's economies and ever-evolving financial structures, fraud has become one of the main dissipaters of government wealth and perhaps even a major contributor in the slowing down of economies in general. Although corporate residence fraud is known to be a major factor, data availability and high sensitivity have caused this domain to be largely untouched by academia. The current Belgian government has pledged to tackle this issue at large by using a variety of in-house approaches and cooperations with institutions such as academia, the ultimate goal being a fair and efficient taxation system. This is the first data mining application specifically aimed at finding corporate residence fraud, where we show the predictive value of using both structured and fine-grained invoicing data. We further describe the problems involved in building such a fraud detection system, which are mainly data-related (e.g. data asymmetry, quality, volume, variety and velocity) and deployment-related (e.g. the need for explanations of the predictions made).
【Keywords】: corporate residence fraud; fraud detection; structured data; transactional data
【Paper Link】 【Pages】:1660-1669
【Authors】: Fang Jin ; Rupinder Paul Khandpur ; Nathan Self ; Edward R. Dougherty ; Sheng Guo ; Feng Chen ; B. Aditya Prakash ; Naren Ramakrishnan
【Abstract】: Modeling the movement of information within social media outlets, like Twitter, is key to understanding to how ideas spread but quantifying such movement runs into several difficulties. Two specific areas that elude a clear characterization are (i) the intrinsic random nature of individuals to potentially adopt and subsequently broadcast a Twitter topic, and (ii) the dissemination of information via non-Twitter sources, such as news outlets and word of mouth, and its impact on Twitter propagation. These distinct yet inter-connected areas must be incorporated to generate a comprehensive model of information diffusion. We propose a bispace model to capture propagation in the union of (exclusively) Twitter and non-Twitter environments. To quantify the stochastic nature of Twitter topic propagation, we combine principles of geometric Brownian motion and traditional network graph theory. We apply Poisson process functions to model information diffusion outside of the Twitter mentions network. We discuss techniques to unify the two sub-models to accurately model information dissemination. We demonstrate the novel application of these techniques on real Twitter datasets related to mass protest adoption in social communities.
【Keywords】: geometric brownian motion; information diffusion; social networks
【Paper Link】 【Pages】:1670-1678
【Authors】: Gabor Melli
【Abstract】: With billions of database-generated pages on the Web where consumers can readily add priced product offerings to their virtual shopping cart, several opportunities will become possible once we can automatically recognize what exactly is being offered for sale on each page. We present a case study of a deployed data-driven system that first chunks individual titles into semantically classified sub-segments, and then uses this information to improve a hyperlink insertion service. To accomplish this process, we propose an annotation structure that is general enough to apply to offering titles from most e-commerce industries while also being specific enough to identify useful semantics about each offer. To automate the parsing task we apply the best-practices approach of training a supervised conditional random fields model and discover that creating separate prediction models for some of the industries along with the use of model-ensembles achieves the best performance to date. We further report on a real-world application of the trained parser to the task of growing a lexical dictionary of product-related terms which critically provides background knowledge to an affiliate-marketing hyperlink insertion service. On a regular basis we apply the parser to offering titles to produce a large set of labeled terms. From these candidates we select the most confidently predicted novel terms for review by crowd-sourced annotators. The agreed on terms are then added into a dictionary which significantly improves the performance of the link-insertion service. Finally, to continually improve system performance, we retrain the model in an online fashion by performing additional annotations on titles with incorrect predictions on each batch.
【Keywords】: automated terminology extraction; composite crf ensembles; hyperlink insertion; product offer titles; shallow semantic parsing
【Paper Link】 【Pages】:1679-1688
【Authors】: Gergely Ács ; Claude Castelluccia
【Abstract】: With billions of handsets in use worldwide, the quantity of mobility data is gigantic. When aggregated they can help understand complex processes, such as the spread viruses, and built better transportation systems, prevent traffic congestion. While the benefits provided by these datasets are indisputable, they unfortunately pose a considerable threat to location privacy. In this paper, we present a new anonymization scheme to release the spatio-temporal density of Paris, in France, i.e., the number of individuals in 989 different areas of the city released every hour over a whole week. The density is computed from a call-data-record (CDR) dataset, provided by the French Telecom operator Orange, containing the CDR of roughly 2 million users over one week. Our scheme is differential private, and hence, provides provable privacy guarantee to each individual in the dataset. Our main goal with this case study is to show that, even with large dimensional sensitive data, differential privacy can provide practical utility with meaningful privacy guarantee, if the anonymization scheme is carefully designed. This work is part of the national project XData (http://xdata.fr) that aims at combining large (anonymized) datasets provided by different service providers (telecom, electricity, water management, postal service, etc.).
【Keywords】: call-data-record (cdr); differential privacy; spatio-temporal density
【Paper Link】 【Pages】:1689-1698
【Authors】: Herodotos Herodotou ; Bolin Ding ; Shobana Balakrishnan ; Geoff Outhred ; Percy Fitter
【Abstract】: Large-scale data center networks are complex---comprising several thousand network devices and several hundred thousand links---and form the critical infrastructure upon which all higher-level services depend on. Despite the built-in redundancy in data center networks, performance issues and device or link failures in the network can lead to user-perceived service interruptions. Therefore, determining and localizing user-impacting availability and performance issues in the network in near real time is crucial. Traditionally, both passive and active monitoring approaches have been used for failure localization. However, data from passive monitoring is often too noisy and does not effectively capture silent or gray failures, whereas active monitoring is potent in detecting faults but limited in its ability to isolate the exact fault location depending on its scale and granularity. Our key idea is to use statistical data mining techniques on large-scale active monitoring data to determine a ranked list of suspect causes, which we refine with passive monitoring signals. In particular, we compute a failure probability for devices and links in near real time using data from active monitoring, and look for statistically significant increases in the failure probability. We also correlate the probabilistic output with other failure signals from passive monitoring to increase the confidence of the probabilistic analysis. We have implemented our approach in the Windows Azure production environment and have validated its effectiveness in terms of localization accuracy, precision, and time to localization using known network incidents over the past three months. The correlated ranked list of devices and links is surfaced as a report that is used by network operators to investigate current issues and identify probable root causes.
【Keywords】: data center networks; failure localization
【Paper Link】 【Pages】:1699-1708
【Authors】: Jian Xu ; Thanuka L. Wickramarathne ; Nitesh V. Chawla ; Erin K. Grey ; Karsten Steinhaeuser ; Reuben P. Keller ; John M. Drake ; David M. Lodge
【Abstract】: The unintentional transport of invasive species (i.e., non-native and harmful species that adversely affect habitats and native species) through the Global Shipping Network (GSN) causes substantial losses to social and economic welfare (e.g., annual losses due to ship-borne invasions in the Laurentian Great Lakes is estimated to be as high as USD 800 million). Despite the huge negative impacts, management of such invasions remains challenging because of the complex processes that lead to species transport and establishment. Numerous difficulties associated with quantitative risk assessments (e.g., inadequate characterizations of invasion processes, lack of crucial data, large uncertainties associated with available data, etc.) have hampered the usefulness of such estimates in the task of supporting the authorities who are battling to manage invasions with limited resources. We present here an approach for addressing the problem at hand via creative use of computational techniques and multiple data sources, thus illustrating how data mining can be used for solving crucial, yet very complex problems towards social good. By modeling implicit species exchanges as a network that we refer to as the Species Flow Network (SFN), large-scale species flow dynamics are studied via a graph clustering approach that decomposes the SFN into clusters of ports and inter-cluster connections. We then exploit this decomposition to discover crucial knowledge on how patterns in GSN affect aquatic invasions, and then illustrate how such knowledge can be used to devise effective and economical invasive species management strategies. By experimenting on actual GSN traffic data for years 1997-2006, we have discovered crucial knowledge that can significantly aid the management authorities.
【Keywords】: clustering; data mining; data mining for social good; invasive species; networks; risk assessment
【Paper Link】 【Pages】:1709-1718
【Authors】: Kiran Kate ; Sneha Chaudhari ; Andy Prapanca ; Jayant Kalagnanam
【Abstract】: Food safety is an important health issue in Singapore as the number of food poisoning cases have increased significantly over the past few decades. The National Environment Agency of Singapore (NEA) is the primary government agency responsible for monitoring and mitigating the food safety risks. In an effort to pro-actively monitor emerging food safety issues and to stay abreast with developments related to food safety in the world, NEA tracks the World Wide Web as a source of news feeds to identify food safety related articles. However, such information gathering is a difficult and time consuming process due to information overload. In this paper, we present FoodSIS, a system for end-to-end web information gathering for food safety. FoodSIS improves efficiency of such focused information gathering process with the use of machine learning techniques to identify and rank relevant content. We discuss the challenges in building such a system and describe how thoughtful system design and recent advances in machine learning provide a framework that synthesizes interactive learning with classification to provide a system that is used in daily operations. We conduct experiments and demonstrate that our classification approach results in improving the efficiency by average 35% compared to a conventional approach and the ranking approach leads to average 16% improvement in elevating the ranks of relevant articles.
【Keywords】: information gathering; learning; machine learning; ranking; semi-supervised; text classification
【Paper Link】 【Pages】:1719-1728
【Authors】: Komal Kapoor ; Mingxuan Sun ; Jaideep Srivastava ; Tao Ye
【Abstract】: In the competitive environment of the internet, retaining and growing one's user base is of major concern to most web services. Furthermore, the economic model of many web services is allowing free access to most content, and generating revenue through advertising. This unique model requires securing user time on a site rather than the purchase of good which makes it crucially important to create new kinds of metrics and solutions for growth and retention efforts for web services. In this work, we address this problem by proposing a new retention metric for web services by concentrating on the rate of user return. We further apply predictive analysis to the proposed retention metric on a service, as a means for characterizing lost customers. Finally, we set up a simple yet effective framework to evaluate a multitude of factors that contribute to user return. Specifically, we define the problem of return time prediction for free web services. Our solution is based on the Cox's proportional hazard model from survival analysis. The hazard based approach offers several benefits including the ability to work with censored data, to model the dynamics in user return rates, and to easily incorporate different types of covariates in the model. We compare the performance of our hazard based model in predicting the user return time and in categorizing users into buckets based on their predicted return time, against several baseline regression and classification methods and find the hazard based approach to be superior.
【Keywords】: customer relationship management; growth and retention; hazard based methods; online user behavior
【Paper Link】 【Pages】:1729-1738
【Authors】: Kush R. Varshney ; Vijil Chenthamarakshan ; Scott W. Fancher ; Jun Wang ; DongPing Fang ; Aleksandra Mojsilovic
【Abstract】: Strategic planning and talent management in large enterprises composed of knowledge workers requires complete, accurate, and up-to-date representation of the expertise of employees in a form that integrates with business processes. Like other similar organizations operating in dynamic environments, the IBM Corporation strives to maintain such current and correct information, specifically assessments of employees against job roles and skill sets from its expertise taxonomy. In this work, we deploy an analytics-driven solution that infers the expertise of employees through the mining of enterprise and social data that is not specifically generated and collected for expertise inference. We consider job role and specialty prediction and pose them as supervised classification problems. We evaluate a large number of feature sets, predictive models and postprocessing algorithms, and choose a combination for deployment. This expertise analytics system has been deployed for key employee population segments, yielding large reductions in manual effort and the ability to continually and consistently serve up-to-date and accurate data for several business functions. This expertise management system is in the process of being deployed throughout the corporation.
【Keywords】: expertise assessment; human resources; supervised classification; talent planning; workforce analytics
【Paper Link】 【Pages】:1739-1748
【Authors】: Li Zheng ; Chunqiu Zeng ; Lei Li ; Yexi Jiang ; Wei Xue ; Jingxuan Li ; Chao Shen ; Wubai Zhou ; Hongtai Li ; Liang Tang ; Tao Li ; Bing Duan ; Ming Lei ; Pengnian Wang
【Abstract】: Advanced manufacturing such as aerospace, semi-conductor, and flat display device often involves complex production processes, and generates large volume of production data. In general, the production data comes from products with different levels of quality, assembly line with complex flows and equipments, and processing craft with massive controlling parameters. The scale and complexity of data is beyond the analytic power of traditional IT infrastructures. To achieve better manufacturing performance, it is imperative to explore the underlying dependencies of the production data and exploit analytic insights to improve the production process. However, few research and industrial efforts have been reported on providing manufacturers with integrated data analytical solutions to reveal potentials and optimize the production process from data-driven perspectives. In this paper, we design, implement and deploy an integrated solution, named PDP-Miner, which is a data analytics platform customized for process optimization in Plasma Display Panel (PDP) manufacturing. The system utilizes the latest advances in data mining technologies and Big Data infrastructures to create a complete analytical solution. Besides, our proposed system is capable of supporting automatically configuring and scheduling analysis tasks, and balancing heterogeneous computing resources. The system and the analytic strategies can be applied to other advanced manufacturing fields to enable complex data analysis tasks. Since 2013, PDP-Miner has been deployed as the data analysis platform of ChangHong COC. By taking the advantages of our system, the overall PDP yield rate has increased from 91% to 94%. The monthly production is boosted by 10,000 panels, which brings more than 117 million RMB of revenue improvement per year.
【Keywords】: advanced manufacturing; big data; data mining platform; process optimization
【Paper Link】 【Pages】:1749-1758
【Authors】: Marco Avvenuti ; Stefano Cresci ; Andrea Marchetti ; Carlo Meletti ; Maurizio Tesconi
【Abstract】: Social sensing is based on the idea that communities or groups of people can provide a set of information similar to those obtainable from a sensor network. Emergency management is a candidate field of application for social sensing. In this work we describe the design, implementation and deployment of a decision support system for the detection and the damage assessment of earthquakes in Italy. Our system exploits the messages shared in real-time on Twitter, one of the most popular social networks in the world. Data mining and natural language processing techniques are employed to select meaningful and comprehensive sets of tweets. We then apply a burst detection algorithm in order to promptly identify outbreaking seismic events. Detected events are automatically broadcasted by our system via a dedicated Twitter account and by email notifications. In addition, we mine the content of the messages associated to an event to discover knowledge on its consequences. Finally we compare our results with official data provided by the National Institute of Geophysics and Volcanology (INGV), the authority responsible for monitoring seismic events in Italy. The INGV network detects shaking levels produced by the earthquake, but can only model the damage scenario by using empirical relationships. This scenario can be greatly improved with direct information site by site. Results show that the system has a great ability to detect events of a magnitude in the region of 3.5, with relatively low occurrences of false positives. Earthquake detection mostly occurs within seconds of the event and far earlier than the notifications shared by INGV or by other official channels. Thus, we are able to alert interested parties promptly. Information discovered by our system can be extremely useful to all the government agencies interested in mitigating the impact of earthquakes, as well as the news agencies looking for fresh information to publish.
【Keywords】: decision support system; disaster management; event detection; social mining; social sensing
【Paper Link】 【Pages】:1759-1768
【Authors】: Matthew F. Der ; Lawrence K. Saul ; Stefan Savage ; Geoffrey M. Voelker
【Abstract】: We describe an automated system for the large-scale monitoring of Web sites that serve as online storefronts for spam-advertised goods. Our system is developed from an extensive crawl of black-market Web sites that deal in illegal pharmaceuticals, replica luxury goods, and counterfeit software. The operational goal of the system is to identify the affiliate programs of online merchants behind these Web sites; the system itself is part of a larger effort to improve the tracking and targeting of these affiliate programs. There are two main challenges in this domain. The first is that appearances can be deceiving: Web pages that render very differently are often linked to the same affiliate program of merchants. The second is the difficulty of acquiring training data: the manual labeling of Web pages, though necessary to some degree, is a laborious and time-consuming process. Our approach in this paper is to extract features that reveal when Web pages linked to the same affiliate program share a similar underlying structure. Using these features, which are mined from a small initial seed of labeled data, we are able to profile the Web sites of forty-four distinct affiliate programs that account, collectively, for hundreds of millions of dollars in illicit e-commerce. Our work also highlights several broad challenges that arise in the large-scale, empirical study of malicious activity on the Web.
【Keywords】: email spam; web page classification
【Paper Link】 【Pages】:1769-1778
【Authors】: Michael Bendersky ; Lluis Garcia Pueyo ; Jeremiah J. Harmsen ; Vanja Josifovski ; Dima Lepikhin
【Abstract】: The explosive growth in sharing and consumption of the video content on the web creates a unique opportunity for scientific advances in video retrieval, recommendation and discovery. In this paper, we focus on the task of video suggestion, commonly found in many online applications. The current state-of-the-art video suggestion techniques are based on the collaborative filtering analysis, and suggest videos that are likely to be co-viewed with the watched video. In this paper, we propose augmenting the collaborative filtering analysis with the topical representation of the video content to suggest related videos. We propose two novel methods for topical video representation. The first method uses information retrieval heuristics such as tf-idf, while the second method learns the optimal topical representations based on the implicit user feedback available in the online scenario. We conduct a large scale live experiment on YouTube traffic, and demonstrate that augmenting collaborative filtering with topical representations significantly improves the quality of the related video suggestions in a live setting, especially for categories with fresh and topically-rich video content such as news videos. In addition, we show that employing user feedback for learning the optimal topical video representations can increase the user engagement by more than 80% over the standard information retrieval representation, when compared to the collaborative filtering baseline.
【Keywords】: related video suggestion; video representation; video retrieval
【Paper Link】 【Pages】:1779-1788
【Authors】: Mingqiang Xue ; Huayu Wu ; Wei Chen ; Wee Siong Ng ; Gin Howe Goh
【Abstract】: Tourism industry has become a key economic driver for Singapore. Understanding the behaviors of tourists is very important for the government and private sectors, e.g., restaurants, hotels and advertising companies, to improve their existing services or create new business opportunities. In this joint work with Singapore's Land Transport Authority (LTA), we innovatively apply machine learning techniques to identity the tourists among public commuters using the public transportation data provided by LTA. On successful identification, the travelling patterns of tourists are then revealed and thus allow further analyses to be carried out such as on their favorite destinations, region of stay, etc. Technically, we model the tourists identification as a classification problem, and design an iterative learning algorithm to perform inference with limited prior knowledge and labeled data. We show the superiority of our algorithm with performance evaluation and comparison with other state-of-the-art learning algorithms. Further, we build an interactive web-based system for answering queries regarding the moving patterns of the tourists, which can be used by stakeholders to gain insight into tourists' travelling behaviors in Singapore.
【Keywords】: data analytics; ez-link; public transport; tourists
【Paper Link】 【Pages】:1789-1798
【Authors】: Mohammad A. Tayebi ; Martin Ester ; Uwe Glässer ; Patricia L. Brantingham
【Abstract】: Crime reduction and prevention strategies are essential to increase public safety and reduce the crime costs to society. Law enforcement agencies have long realized the importance of analyzing co-offending networks---networks of offenders who have committed crimes together---for this purpose. Although network structure can contribute significantly to co-offence prediction, research in this area is very limited. Here we address this important problem by proposing a framework for co-offence prediction using supervised learning. Considering the available information about offenders, we introduce social, geographic, geo-social and similarity feature sets which are used for classifying potential negative and positive pairs of offenders. Similar to other social networks, co-offending networks also suffer from a highly skewed distribution of positive and negative pairs. To address the class imbalance problem, we identify three types of criminal cooperation opportunities which help to reduce the class imbalance ratio significantly, while keeping half of the co-offences. The proposed framework is evaluated on a large crime dataset for the Province of British Columbia, Canada. Our experimental evaluation of four different feature sets show that the novel geo-social features are the best predictors. Overall, we experimentally show the high effectiveness of the proposed co-offence prediction framework. We believe that our framework will not only allow law enforcement agencies to improve their crime reduction and prevention strategies, but also offers new criminological insights into criminal link formation between offenders.
【Keywords】: co-offence prediction; link prediction; social network
【Paper Link】 【Pages】:1799-1808
【Authors】: Naren Ramakrishnan ; Patrick Butler ; Sathappan Muthiah ; Nathan Self ; Rupinder Paul Khandpur ; Parang Saraf ; Wei Wang ; Jose Cadena ; Anil Vullikanti ; Gizem Korkmaz ; Chris J. Kuhlman ; Achla Marathe ; Liang Zhao ; Ting Hua ; Feng Chen ; Chang-Tien Lu ; Bert Huang ; Aravind Srinivasan ; Khoa Trinh ; Lise Getoor ; Graham Katz ; Andy Doyle ; Chris Ackermann ; Ilya Zavorin ; Jim Ford ; Kristen Maria Summers ; Youssef Fayed ; Jaime Arredondo ; Dipak Gupta ; David Mares
【Abstract】: We describe the design, implementation, and evaluation of EMBERS, an automated, 24x7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources. Unlike retrospective studies, EMBERS has been making forecasts into the future since Nov 2012 which have been (and continue to be) evaluated by an independent T&E team (MITRE). Of note, EMBERS has successfully forecast the June 2013 protests in Brazil and Feb 2014 violent protests in Venezuela. We outline the system architecture of EMBERS, individual models that leverage specific data sources, and a fusion and suppression engine that supports trading off specific evaluation criteria. EMBERS also provides an audit trail interface that enables the investigation of why specific predictions were made along with the data utilized for forecasting. Through numerous evaluations, we demonstrate the superiority of EMBERS over baserate methods and its capability to forecast significant societal happenings.
【Keywords】: civil unrest; event forecasting; open source indicators
【Paper Link】 【Pages】:1809-1818
【Authors】: Nemanja Spasojevic ; Jinyun Yan ; Adithya Rao ; Prantik Bhattacharyya
【Abstract】: Millions of people use social networks everyday to talk about a variety of subjects, publish opinions and share information. Understanding this data to infer user's topical interests is a challenging problem with applications in various data-powered products. In this paper, we present 'LASTA' (Large Scale Topic Assignment), a full production system used at Klout, Inc., which mines topical interests from five social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis. The system continuously collects streams of user data and is reactive to fresh information, updating topics for users as interests shift. LASTA generates over 50 distinct features derived from signals such as user generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags and endorsements, as well as signals based on social graph connections. We show that using this diverse set of features leads to a better representation of a user's topical interests as compared to using only generated text or only graph based features. We also show that using cross-network information for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network. We evaluate LASTA's topic assignment system on an internal labeled corpus of 32,264 user-topic labels generated from real users.
【Keywords】: distributed systems; interest mining; large scale; online social networks; topic assignment; user modeling
【Paper Link】 【Pages】:1819-1828
【Authors】: Onno Zoeter ; Christopher R. Dance ; Stéphane Clinchant ; Jean-Marc Andreoli
【Abstract】: On-street parking, just as any publicly owned utility, is used inefficiently if access is free or priced very far from market rates. This paper introduces a novel demand management solution: using data from dedicated occupancy sensors an iteration scheme updates parking rates to better match demand. The new rates encourage parkers to avoid peak hours and peak locations and reduce congestion and underuse. The solution is deliberately simple so that it is easy to understand, easily seen to be fair and leads to parking policies that are easy to remember and act upon. We study the convergence properties of the iteration scheme and prove that it converges to a reasonable distribution for a very large class of models. The algorithm is in use to change parking rates in over 6000 spaces in downtown Los Angeles since June 2012 as part of the LA Express Park project. Initial results are encouraging with a reduction of congestion and underuse, while in more locations rates were decreased than increased.
【Keywords】: demand management; transportation
【Paper Link】 【Pages】:1829-1836
【Authors】: Paulo Shakarian ; Joseph Salmento ; William R. Pulleyblank ; John Bertetto
【Abstract】: In this paper, we study a variant of the social network maximum influence problem and its application to intelligently approaching individual gang members with incentives to leave a gang. The goal is to identify individuals who when influenced to leave gangs will propagate this action. We study this emerging application by exploring specific facets of the problem that must be addressed when modeling this particular situation. We formulate a new influence maximization variant - the "social incentive influence" (SII) problem and study it both formally and in the context of the law-enforcement domain. Using new techniques from unconstrained submodular maximization, we develop an approximation algorithm for SII and present a suite of experimental results - including tests on real-world police data from Chicago.
【Keywords】: complex networks; network diffusion; propagation in networks
【Paper Link】 【Pages】:1837-1846
【Authors】: Pei Lee ; Laks V. S. Lakshmanan ; Mitul Tiwari ; Sam Shah
【Abstract】: Recommender systems have become very important for many online activities, such as watching movies, shopping for products, and connecting with friends on social networks. User behavioral analysis and user feedback (both explicit and implicit) modeling are crucial for the improvement of any online recommender system. Widely adopted recommender systems at LinkedIn such as "People You May Know" and "Endorsements" are evolving by analyzing user behaviors on impressed recommendation items. In this paper, we address modeling impression discounting of recommended items, that is, how to model user's no-action feedback on impressed recommended items. The main contributions of this paper include (1) large-scale analysis of impression data from LinkedIn and KDD Cup; (2) novel anti-noise regression techniques, and its application to learn four different impression discounting functions including linear decay, inverse decay, exponential decay, and quadratic decay; (3) applying these impression discounting functions to LinkedIn's "People You May Know" and "Endorsements" recommender systems.
【Keywords】: impression discounting; recommender system
【Paper Link】 【Pages】:1847-1856
【Authors】: Richard J. Beckman ; Keith R. Bisset ; Jiangzhuo Chen ; Bryan Lewis ; Madhav V. Marathe ; Paula Elaine Stretz
【Abstract】: We describe ISIS, a high-performance-computing-based application to support computational epidemiology of infectious diseases. ISIS has been developed over the last seven years in close coordination with public health and policy experts. It has been used in a number of important federal planning and response exercises. ISIS grew out of years of experience in developing and using HPC-oriented models of complex socially coupled systems. This identified the guiding principle that complex models will be used by domain experts only if they can do realistic analysis without becoming computing experts. Using ISIS, one can carry out detailed computational experiments as they pertain to planning and response in the event of a pandemic. ISIS is designed to support networked epidemiology -- study of epidemic processes over social contact networks. The current system can handle airborne infectious diseases such as influenza, pertussis, and smallpox. ISIS is comprised of the following basic components: (i) a web app that serves as the user-interface, (ii) a middleware that coordinates user interaction via the web app with backend models and databases, (iii) a backend computational modeling framework that is comprised of highly resolved epidemic simulations combined with highly realistic control strategies that include pharmaceutical as well as non-pharmaceutical interventions and (iv) a backend data management framework that manages complex unstructured and semi-structured data. ISIS has been used in over a dozen case studies defined by the DoD, DHHS, NIH, BARDA and NSC. We describe three recent studies illustrating the use of ISIS in real-world settings:(i) uses of ISIS during the H1N1 pandemic, Cii) supporting a US military planning exercise, and (iii) distribution of limited stockpile of pharmaceuticals using public and private outlets.
【Keywords】: computational epidemiology; hpc; public health; simulation; web app
【Paper Link】 【Pages】:1857-1866
【Authors】: Ron Kohavi ; Alex Deng ; Roger Longbotham ; Ya Xu
【Abstract】: Web site owners, from small web sites to the largest properties that include Amazon, Facebook, Google, LinkedIn, Microsoft, and Yahoo, attempt to improve their web sites, optimizing for criteria ranging from repeat usage, time on site, to revenue. Having been involved in running thousands of controlled experiments at Amazon, Booking.com, LinkedIn, and multiple Microsoft properties, we share seven rules of thumb for experimenters, which we have generalized from these experiments and their results. These are principles that we believe have broad applicability in web optimization and analytics outside of controlled experiments, yet they are not provably correct, and in some cases exceptions are known. To support these rules of thumb, we share multiple real examples, most being shared in a public paper for the first time. Some rules of thumb have previously been stated, such as 'speed matters,' but we describe the assumptions in the experimental design and share additional experiments that improved our understanding of where speed matters more: certain areas of the web page are more critical. This paper serves two goals. First, it can guide experimenters with rules of thumb that can help them optimize their sites. Second, it provides the KDD community with new research challenges on the applicability, exceptions, and extensions to these, one of the goals for KDD's industrial track.
【Keywords】: a/b testing; controlled experiments; randomized experiments
【Paper Link】 【Pages】:1867-1876
【Authors】: Ruben Sipos ; Dmitriy Fradkin ; Fabian Mörchen ; Zhuang Wang
【Abstract】: Success of manufacturing companies largely depends on reliability of their products. Scheduled maintenance is widely used to ensure that equipment is operating correctly so as to avoid unexpected breakdowns. Such maintenance is often carried out separately for every component, based on its usage or simply on some fixed schedule. However, scheduled maintenance is labor-intensive and ineffective in identifying problems that develop between technician's visits. Unforeseen failures still frequently occur. In contrast, predictive maintenance techniques help determine the condition of in-service equipment in order to predict when and what repairs should be performed. The main goal of predictive maintenance is to enable pro-active scheduling of corrective work, and thus prevent unexpected equipment failures.
【Keywords】: crisp-dm; log mining; machine learning; predictive maintenance
【Paper Link】 【Pages】:1877-1886
【Authors】: W. Scott Spangler ; Angela D. Wilkins ; Benjamin J. Bachman ; Meena Nagarajan ; Tajhal Dayaram ; Peter J. Haas ; Sam Regenbogen ; Curtis R. Pickering ; Austin Comer ; Jeffrey N. Myers ; Ioana Stanoi ; Linda Kato ; Ana Lelescu ; Jacques J. Labrie ; Neha Parikh ; Andreas Martin Lisewski ; Lawrence A. Donehower ; Ying Chen ; Olivier Lichtarge
【Abstract】: Keeping up with the ever-expanding flow of data and publications is untenable and poses a fundamental bottleneck to scientific progress. Current search technologies typically find many relevant documents, but they do not extract and organize the information content of these documents or suggest new scientific hypotheses based on this organized content. We present an initial case study on KnIT, a prototype system that mines the information contained in the scientific literature, represents it explicitly in a queriable network, and then further reasons upon these data to generate novel and experimentally testable hypotheses. KnIT combines entity detection with neighbor-text feature analysis and with graph-based diffusion of information to identify potential new properties of entities that are strongly implied by existing relationships. We discuss a successful application of our approach that mines the published literature to identify new protein kinases that phosphorylate the protein tumor suppressor p53. Retrospective analysis demonstrates the accuracy of this approach and ongoing laboratory experiments suggest that kinases identified by our system may indeed phosphorylate p53. These results establish proof of principle for automated hypothesis generation and discovery based on text mining of the scientific literature.
【Keywords】: hypothesis generation; scientific discovery; text mining
【Paper Link】 【Pages】:1887-1896
【Authors】: Shashank Srikant ; Varun Aggarwal
【Abstract】: The automatic evaluation of computer programs is a nascent area of research with a potential for large-scale impact. Extant program assessment systems score mostly based on the number of test-cases passed, providing no insight into the competency of the programmer. In this paper, we present a system to grade computer programs automatically. In addition to grading a program on its programming practices and complexity, the key kernel of the system is a machine-learning based algorithm which determines closeness of the logic of the given program to a correct program. This algorithm uses a set of highly-informative features, derived from the abstract representations of a given program, that capture the program's functionality. These features are then used to learn a model to grade the programs, which are built against evaluations done by experts. We show that the regression models provide much better grading than the ubiquitous test-case-pass based grading and rivals the grading accuracy of other open-response problems such as essay grading . We also show that our novel features add significant value over and above basic keyword/expression count features. In addition to this, we propose a novel way of posing computer-program grading as a one-class modeling problem and report encouraging preliminary results. We show the value of the system through a case study in a real-world industrial deployment. To the best of the authors' knowledge, this is the first time a system using machine learning has been developed and used for grading programs. The work is timely with regard to the recent boom in Massively Online Open Courseware (MOOCs), which promises to produce a significant amount of hand-graded digitized data.
【Keywords】: automatic grading; feature engineering; mooc; one-class learning; recruitment; supervised learning
【Paper Link】 【Pages】:1897-1906
【Authors】: Shuai Yuan ; Jun Wang ; Bowei Chen ; Peter Mason ; Sam Seljan
【Abstract】: In this paper, we report the first empirical study and live test of the reserve price optimisation problem in the context of Real-Time Bidding (RTB) display advertising from an operational environment. A reserve price is the minimum that the auctioneer would accept from bidders in auctions, and in a second price auction it could potentially uplift the auctioneer's revenue by charging winners the reserve price instead of the second highest bids. As such it has been used for sponsored search and been well studied in that context. However, comparing with sponsored search and contextual advertising, this problem in the RTB context is less understood yet more critical for publishers because 1) bidders have to submit a bid for each individual impression, which mostly is associated with user data that is subject to change over time. This, coupled with practical constraints such as the budget, campaigns' life time, etc. makes the theoretical result from optimal auction theory not necessarily applicable and a further empirical study is required to confirm its optimality from the real-world system; 2) in RTB an advertiser is facing nearly unlimited supply and the auction is almost done in "last second", which encourages spending less on the high cost ad placements. This could imply the loss of bid volume over time if a correct reserve price is not in place. In this paper we empirically examine several commonly adopted algorithms for setting up a reserve price. We report our results of a large scale online experiment in a production platform. The results suggest the our proposed game theory based OneShot algorithm performed the best and the superiority is significant in most cases.
【Keywords】: display advertising; online advertising; real-time bidding; reserve price; revenue optimisation
【Paper Link】 【Pages】:1907-1916
【Authors】: Shuang-Hong Yang ; Alek Kolcz ; Andy Schlaikjer ; Pankaj Gupta
【Abstract】: We are interested in organizing a continuous stream of sparse and noisy texts, known as "tweets", in real time into an ontology of hundreds of topics with measurable and stringently high precision. This inference is performed over a full-scale stream of Twitter data, whose statistical distribution evolves rapidly over time. The implementation in an industrial setting with the potential of affecting and being visible to real users made it necessary to overcome a host of practical challenges. We present a spectrum of topic modeling techniques that contribute to a deployed system. These include non-topical tweet detection, automatic labeled data acquisition, evaluation with human computation, diagnostic and corrective learning and, most importantly, high-precision topic inference. The latter represents a novel two-stage training algorithm for tweet text classification and a close-loop inference mechanism for combining texts with additional sources of information. The resulting system achieves 93% precision at substantial overall coverage.
【Keywords】: large-scale machine learning; social media; text classification; topic modeling
【Paper Link】 【Pages】:1925-1934
【Authors】: Vignesh Jagadeesh ; Robinson Piramuthu ; Anurag Bhardwaj ; Wei Di ; Neel Sundaresan
【Abstract】: We describe a completely automated large scale visual recommendation system for fashion. Our focus is to efficiently harness the availability of large quantities of online fashion images and their rich meta-data. Specifically, we propose two classes of data driven models in the Deterministic Fashion Recommenders (DFR) and Stochastic Fashion Recommenders (SFR) for solving this problem. We analyze relative merits and pitfalls of these algorithms through extensive experimentation on a large-scale data set and baseline them against existing ideas from color science. We also illustrate key fashion insights learned through these experiments and show how they can be employed to design better recommendation systems. The industrial applicability of proposed models is in the context of mobile fashion shopping. Finally, we also outline a large-scale annotated data set of fashion images Fashion-136K) that can be exploited for future research in data driven visual fashion.
【Keywords】: color modeling; e-commerce; fashion; user behavior; visual recommenders
【Paper Link】 【Pages】:1935-1944
【Authors】: Wayne Xin Zhao ; Yanwei Guo ; Yulan He ; Han Jiang ; Yuexin Wu ; Xiaoming Li
【Abstract】: Product recommender systems are often deployed by e-commerce websites to improve user experience and increase sales. However, recommendation is limited by the product information hosted in those e-commerce sites and is only triggered when users are performing e-commerce activities. In this paper, we develop a novel product recommender system called METIS, a MErchanT Intelligence recommender System, which detects users' purchase intents from their microblogs in near real-time and makes product recommendation based on matching the users' demographic information extracted from their public profiles with product demographics learned from microblogs and online reviews. METIS distinguishes itself from traditional product recommender systems in the following aspects: 1) METIS was developed based on a microblogging service platform. As such, it is not limited by the information available in any specific e-commerce website. In addition, METIS is able to track users' purchase intents in near real-time and make recommendations accordingly. 2) In METIS, product recommendation is framed as a learning to rank problem. Users' characteristics extracted from their public profiles in microblogs and products' demographics learned from both online product reviews and microblogs are fed into learning to rank algorithms for product recommendation. We have evaluated our system in a large dataset crawled from Sina Weibo. The experimental results have verified the feasibility and effectiveness of our system. We have also made a demo version of our system publicly available and have implemented a live system which allows registered users to receive recommendations in real time.
【Keywords】: e-commerce; microblog; product demographic; product recommender
【Paper Link】 【Pages】:1945-1954
【Authors】: Ye Xu ; Zang Li ; Abhishek Gupta ; Ahmet Bugdayci ; Anmol Bhasin
【Abstract】: For decades large corporations as well as labor placement services have maintained extensive yet static resume databanks. Online professional networks like LinkedIn have taken these resume databanks to a dynamic, constantly updated and massive scale professional profile dataset spanning career records from hundreds of industries, millions of companies and hundreds of millions of people worldwide. Using this professional profile dataset, this paper attempts to model profiles of individuals as a sequence of positions held by them as a time-series of nodes, each of which represents one particular position or job experience in the individual's career trajectory. These career trajectory models can be employed in various utility applications including career trajectory planning for students in schools & universities using knowledge inferred from real world career outcomes. They can also be employed for decoding sequences to uncover paths leading to certain professional milestones from a user's current professional status. We deploy the proposed technique to ascertain professional similarity between two individuals by developing a similarity measure SimCareers (Similar Career Paths). The measure employs sequence alignment between two career trajectories to quantify professional similarity between career paths. To the best of our knowledge, SimCareers is the first framework to model professional similarity between two people taking account their career trajectory information. We posit, that using the temporal and structural features of a career trajectory for modeling profile similarity is a far more superior approach than using similarity measures on semi-structured attribute representation of a profile for this application. We validate our hypothesis by extensive quantitative evaluations on a gold dataset of similar profiles generated from recruiting activity logs from actual recruiters using LinkedIn. In addition, we show significant improvements in engagement by running an A/B test on a real-world application called Similar Profiles on LinkedIn, world's largest online professional network.
【Keywords】: career paths; similarity; social networks; user profiles
【Paper Link】 【Pages】:1955-1964
【Authors】: Yukihiro Tagami ; Toru Hotta ; Yusuke Tanaka ; Shingo Ono ; Koji Tsukamoto ; Akira Tajima
【Abstract】: Contextual advertising is a form of textual advertising usually displayed on third party Web pages. One of the main problems with contextual advertising is determining how to select ads that are relevant to the page content and/or the user information in order to achieve both effective advertising and a positive user experience. Typically, the relevance of an ad to page content is indicated by a tf-idf score that measures the word overlap between the page and the ad content, so this problem is transformed into a similarity search in a vector space. However, such an approach is not useful if the vocabulary used on the page is expected to be different from that in the ad. There have been studies proposing the use of semantic categories or hidden classes to overcome this problem. With these approaches it is necessary to expand the ad retrieval system or build new index to handle the categories or classes, and it is not always easy to maintain the number of categories and classes required for business needs. In this work, we propose a translation method that learns the mapping of the contextual information to the textual features of ads by using past click data. The contextual information includes the user's demographic information and behavioral information as well as page content information. The proposed method is able to retrieve more preferable ads while maintaining the sparsity of the inverted index and the performance of the ad retrieval system. In addition, it is easy to implement and there is no need to modify an existing ad retrieval system. We evaluated this approach offline on a data set based on logs from an ad network. Our method achieved better results than existing methods. We also applied our approach with a real ad serving system and compared the online performance using A/B testing. Our approach achieved an improvement over the existing production system.
【Keywords】: click feedback; contextual advertising; learning-to-rank; modeling
【Paper Link】 【Pages】:1965
【Authors】: Raghu Ramakrishnan ; Geoffrey I. Webb
【Abstract】: When data-driven improvements involve personally identifiable data, or even data that can be used to infer sensitive information about individuals, we face the dilemma that we potentially risk compromising privacy. As we see increased emphasis on using data mining to effect improvements in a range of socially beneficial activities, from improving matching of talented students to opportunities for higher education, or improving allocation of funds across competing school programs, or reducing hospitalization time following surgery, the dilemma can often be especially acute. The data involved often is personally identifiable or revealing and sensitive, and many of the institutions that must be involved in gathering and maintaining custody of the data are not equipped to adequately secure the data, raising the risk of privacy breaches. How should we approach this trade-off? Can we assess the risks? Can we control or mitigate them? Can we develop guidelines for when the risk is or is not worthwhile, and for how best to handle data in different common scenarios? Chairs Raghu Ramakrishnan and Geoffrey I. Webb bring this panel of leading data miners and privacy experts together to address these critical issues.
【Keywords】: privacy; social good
【Paper Link】 【Pages】:1966
【Authors】: Yoshua Bengio
【Abstract】: Deep learning has rapidly moved from a marginal approach in the machine learning community less than ten years ago to one that has strong industrial impact, in particular for high-dimensional perceptual data such as speech and images, but also natural language. The demand for experts in deep learning is growing very fast (faster than we can graduate PhDs), thereby considerably increasing their market value. Deep learning is based on the idea of learning multiple levels of representation, with higher levels computed as a function of lower levels, and corresponding to more abstract concepts automatically discovered by the learner. Deep learning arose out of research on artificial neural networks and graphical models and the literature on that subject has considerably grown in recent years, culminating in the creation of a dedicated conference (ICLR). The tutorial will introduce some of the basic algorithms, both on the supervised and unsupervised sides, as well as discuss some of the guidelines for successfully using them in practice. Finally, it will introduce current research questions regarding the challenge of scaling up deep learning to much larger models that can successfully extract information from huge datasets.
【Keywords】: big data; conditional computing; deep learning; neural networks
【Paper Link】 【Pages】:1967
【Authors】: Antoine Bordes ; Evgeniy Gabrilovich
【Abstract】: Recent years have witnessed a proliferation of large-scale knowledge graphs, such as Freebase, YAGO, Google's Knowledge Graph, and Microsoft's Satori. Whereas there is a large body of research on mining homogeneous graphs, this new generation of information networks are highly heterogeneous, with thousands of entity and relation types and billions of instances of vertices and edges. In this tutorial, we will present the state of the art in constructing, mining, and growing knowledge graphs. The purpose of the tutorial is to equip newcomers to this exciting field with an understanding of the basic concepts, tools and methodologies, available datasets, and open research challenges. A publicly available knowledge base (Freebase) will be used throughout the tutorial to exemplify the different techniques.
【Keywords】: big knowledge
【Paper Link】 【Pages】:1968
【Authors】: Jiawei Han ; Chi Wang ; Ahmed El-Kishky
【Abstract】: Mining phrases, entity concepts, topics, and hierarchies from massive text corpus is an essential problem in the age of big data. Text data in electronic forms are ubiquitous, ranging from scientific articles to social networks, enterprise logs, news articles, social media and general web pages. It is highly desirable but challenging to bring structure to unstructured text data, uncover underlying hierarchies, relationships, patterns and trends, and gain knowledge from such data. In this tutorial, we provide a comprehensive survey on the state-of-the art of data-driven methods that automatically mine phrases, extract and infer latent structures from text corpus, and construct multi-granularity topical groupings and hierarchies of the underlying themes. We study their principles, methodologies, algorithms and applications using several real datasets including research papers and news articles and demonstrate how these methods work and how the uncovered latent entity structures may help text understanding, knowledge discovery and management.
【Keywords】: information networks; phrase mining; text mining; topic model
【Paper Link】 【Pages】:1969
【Authors】: Madhav V. Marathe ; Anil Kumar S. Vullikanti
【Abstract】: As recent pandemics such as SARS and the Swine Flu outbreak have shown, diseases spread very fast in today's interconnected world, making public health an important research area. Some of the basic questions are: How can an outbreak be contained before it becomes an epidemic, and what disease surveillance strategies should be implemented? These problems have been studied traditionally using differential equation methods, which are amenable to analysis and closed form solutions. However, these models are based on complete mixing assumptions, which do not hold for realistic populations, thereby limiting their utility. In this tutorial, we focus on an approach based on diffusion processes on complex networks. This captures more realistic populations, but leads to novel mathematical and computational challenges. The structure of the underlying networks has a significant impact on the dynamical properties, motivating the need for improved network models, and efficient algorithms for computing network and dynamical properties that scale to large networks. We provide an overview of the state of the art in computational epidemiology, which is a multi-disciplinary research area, that overlaps different areas in computer science, including data mining, machine learning, high performance computing and theoretical computer science, as well as mathematics, economics and statistics. Specifically, we will discuss mathematical and computational models, problems of inference, forecasting and state assessment, and epidemic containment.
【Keywords】: agent based simulations; diffusion; epidemiology; social networks
【Paper Link】 【Pages】:1970
【Authors】: Mengling Feng ; Mohammad M. Ghassemi ; Thomas Brennan ; John Ellenberger ; Ishrar Hussain ; Roger G. Mark
【Abstract】: Analyzing Biomedical Big Data (BBD) is computationally expensive due to high dimensionality and large data volume. Performance and scalability issues of traditional database management systems (DBMS) often limit the usage of more sophisticated and complex data queries and analytic models. Moreover, in the conventional setting, data management and analysis use separate software platforms. Exporting and importing large amounts of data across platforms require a significant amount of computational and I/O resources, as well as potentially putting sensitive data at a security risk. In this tutorial, the participants will learn the difference between in-memory DBMS and traditional DBMS through hands-on exercises using SAP's cloud-based HANA in-memory DBMS in conjunction with the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) dataset. MIMIC is an open-access critical care EHR archive (over 4TB in size) and consists of structured, unstructured and waveform data. Furthermore, this tutorial will seek to educate the participants on how a combination of dynamic querying, and in-memory DBMS may enhance the management and analysis of complex clinical data.
【Keywords】: big data; biomedical data; dynamic querying; in-memory dbms
【Paper Link】 【Pages】:1971
【Authors】: Xavier Amatriain ; Bamshad Mobasher
【Abstract】: In 2006, Netflix announced a $1M prize competition to advance recommendation algorithms. The recommendation problem was simplified as the accuracy in predicting a user rating measured by the Root Mean Squared Error. While that formulation helped get the attention of the research community in the area, it may have put an excessive focus on what is simply one of possible approaches to recommendations. In this tutorial we will describe different components of modern recommender systems such as: personalized ranking, similarity, explanations, context-awareness, or search as recommendation. In the first part, we will use the Netflix use case as a driving example of a prototypical industrial-scale recommender system. We will also review the usage of modern algorithmic approaches that include algorithms such as Factorization Machines, Restricted Boltzmann Machines, SimRank, Deep Neural Networks, or Listwise Learning-to-rank. In the second part, we will focus on the area of context-aware recommendations where the two dimensional user-item recommender problem is turned into an n-dimensional space.
【Keywords】: data mining; machine learning; personalization; recommender systems
【Paper Link】 【Pages】:1972
【Authors】: Francesco Bonchi ; David García-Soriano ; Edo Liberty
【Abstract】: Correlation clustering is arguably the most natural formulation of clustering. Given a set of objects and a pairwise similarity measure between them, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. As it just needs a definition of similarity, its broad generality makes it applicable to a wide range of problems in different contexts, and in particular makes it naturally suitable to clustering structured objects for which feature vectors can be difficult to obtain. Despite its simplicity, generality and wide applicability, correlation clustering has so far received much more attention from the algorithmic theory community than from the data mining community. The goal of this tutorial is to show how correlation clustering can be a powerful addition to the toolkit of the data mining researcher and practitioner, and to encourage discussions and further research in the area. In the tutorial we will survey the problem and its most common variants, with an emphasis on the algorithmic techniques and key ideas developed to derive efficient solutions. We will motivate the problems and discuss real-world applications, the scalability issues that may arise, and the existing approaches to handle them.
【Keywords】: clustering
【Paper Link】 【Pages】:1973
【Authors】: Ruslan Salakhutdinov
【Abstract】: Building intelligent systems that are capable of extracting high-level representations from high-dimensional data lies at the core of solving many AI related tasks, including visual object or pattern recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires deep architectures that involve many layers of nonlinear processing. Many existing learning algorithms use shallow architectures, including neural networks with only one hidden layer, support vector machines, kernel logistic regression, and many others. The internal representations learned by such systems are necessarily simple and are incapable of extracting some types of complex structure from high-dimensional input. In the past few years, researchers across many different communities, from applied statistics to engineering, computer science, and neuroscience, have proposed several deep (hierarchical) models that are capable of extracting meaningful, high-level representations. An important property of these models is that they can extract complex statistical dependencies from data and efficiently learn high-level representations by re-using and combining intermediate concepts, allowing these models to generalize well across a wide variety of tasks. The learned high-level representations have been shown to give state-of-the-art results in many challenging learning problems and have been successfully applied in a wide variety of application domains, including visual object recognition, information retrieval, natural language processing, and speech perception. A few notable examples of such models include Deep Belief Networks, Deep Boltzmann Machines, Deep Autoencoders, and sparse coding-based methods. The goal of the tutorial is to introduce the recent developments of various deep learning methods to the KDD community. The core focus will be placed on algorithms that can learn multi-layer hierarchies of representations, emphasizing their applications in information retrieval, object recognition, and speech perception.
【Keywords】: autoencoders; boltzmann machines; deep learning; graphical models; sparse coding
【Paper Link】 【Pages】:1974
【Authors】: Feida Zhu ; Huan Sun ; Xifeng Yan
【Abstract】: The recent blossom of social network and communication services in both public and corporate settings have generated a staggering amount of network data of all kinds. Unlike the bio-networks and the chemical compound graph data often used in traditional network mining and analysis, the new network data grown out of the social applications are characterized by their rich attributes, high heterogeneity, enormous sizes and complex patterns of various semantic meanings, all of which have posed significant research challenges to the graph/network mining community. In this tutorial, we aim to examine some recent advances in network mining and analysis for social applications, covering a diverse collection of methodologies and applications from the perspectives of event, relationship, collaboration, and network pattern. We would present the problem settings, the challenges, the recent research advances and some future directions for each perspective. Topics include but are not limited to correlation mining, iceberg finding, anomaly detection, relationship discovery, information flow, task routing, and pattern mining.
【Keywords】: network analysis; network mining; social applications
【Paper Link】 【Pages】:1975
【Authors】: Graham Cormode ; Nick G. Duffield
【Abstract】: One response to the proliferation of large datasets has been to develop ingenious ways to throw resources at the problem, using massive fault tolerant storage architectures, parallel and graphical computation models such as MapReduce, Pregel and Giraph. However, not all environments can support this scale of resources, and not all queries need an exact response. This motivates the use of sampling to generate summary datasets that support rapid queries, and prolong the useful life of the data in storage. To be effective, sampling must mediate the tensions between resource constraints, data characteristics, and the required query accuracy. The state-of-the-art in sampling goes far beyond simple uniform selection of elements, to maximize the usefulness of the resulting sample. This tutorial reviews progress in sample design for large datasets, including streaming and graph-structured data. Applications are discussed to sampling network traffic and social networks.
【Keywords】: random sampling
【Paper Link】 【Pages】:1976
【Authors】: Wilhelmiina Hämäläinen ; Geoffrey I. Webb
【Abstract】: Pattern discovery is a core data mining activity. Initial approaches were dominated by the frequent pattern discovery paradigm -- only patterns that occur frequently in the data were explored. Having been thoroughly researched and its limitations now well understood, this paradigm is giving way to a new one, which can be called statistically sound pattern discovery. In this paradigm, the main impetus is to discover statistically significant patterns, which are unlikely to have occurred by chance and are likely to hold in future data. Thus, the new paradigm provides a strict control over false discoveries and overfitting. This tutorial covers both classic and cutting-edge research topics on pattern discovery combined to statistical significance testing. We start with an advanced introduction to the relevant forms of statistical significance testing, including different schools and alternative models, their underlying assumptions, practical issues, and limitations. We then discuss their application to data mining specific problems, including evaluation of nested patterns, the multiple testing problem, algorithmic strategies and real-world considerations. We present the current state-of-the art solutions and explore in detail how this approach to pattern discovery can deliver efficient and effective discovery of small sets of interesting patterns.
【Keywords】: association mining; pattern discovery; statistics
【Paper Link】 【Pages】:1977
【Authors】: Jiliang Tang ; Jie Tang ; Huan Liu
【Abstract】: The pervasive use of social media generates massive data in an unprecedented rate and the information overload problem becomes increasingly severe for social media users. Recommendation has been proven to be effective in mitigating the information overload problem, demonstrated its strength in improving the quality of user experience, and positively impacted the success of social media. New types of data introduced by social media not only provide more information to advance traditional recommender systems but also manifest new research possibilities for recommendation. In this tutorial, we aim to provide a comprehensive overview of various recommendation tasks in social media, especially their recent advances and new frontiers. We introduce basic concepts, review state-of-the-art algorithms, and deliberate the emerging challenges and opportunities. Finally we summarize the tutorial with discussions on open issues and challenges about recommendation in social media. Updated information about the tutorial can be found at \url{http://www.public.asu.edu/~jtang20/Recommendation.htm}.
【Keywords】: content recommendation; cross-domain recommendation; friend recommendation; location recommendation; location-aware recommendation; social recommendation; social-aware recommendation; spatio-temporal-social