19. KDD 2013: Chicago, IL, USA

The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013. ACM 【DBLP Link】

Paper Num: 197 || Session Num: 30

Keynotes 4

1. Scale-out beyond map-reduce.

【Paper Link】 【Pages】:1

【Authors】: Raghu Ramakrishnan

【Abstract】: The amount and variety of data being collected in the enterprise is growing at a staggering pace. The default now is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated by instrumenting key customer and systems touch points. Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of large datasets to be stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools, e.g., for machine learning. These new systems are necessarily based on scale-out architectures for both storage and computation. The terms Big Data and data science are often used to refer to this class of systems and applications. Hadoop has become a key building block in the new generation of scale-out systems. Early versions of analytic tools over Hadoop, such as Hive [1] and Pig [2] for SQL-like queries, were implemented by translation into Map-Reduce computations. This approach has inherent limitations, and the emergence of resource managers such as YARN [3] and Mesos [4] has opened the door for newer analytic tools to bypass the Map-Reduce layer. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit. In fact, the website of the machine learning toolkit Apache Mahout [5] explicitly warns about the slow performance of some of the algorithms on Hadoop. 
In this talk, I will examine this architectural trend, and argue that resource managers are a first step in re-factoring the early implementations of Map-Reduce, and that more work is needed if we wish to support a variety of analytic tools on a common scale-out computational fabric. I will then present REEF, which runs on top of resource managers like YARN and provides support for task monitoring and restart, data movement and communications, and distributed state management. Finally, I will illustrate the value of using REEF to implement iterative algorithms for graph analytics and machine learning.

【Keywords】: analytics; big data; data science; hadoop; machine learning; map-reduce; reef; scale-out; sql; yarn

2. The online revolution: education for everyone.

【Paper Link】 【Pages】:2

【Authors】: Andrew Y. Ng ; Daphne Koller

【Abstract】: In 2011, Stanford University offered three online courses, which anyone in the world could enroll in and take for free. Together, these three courses had enrollments of around 350,000 students, making this one of the largest experiments in online education ever performed. Since the beginning of 2012, we have transitioned this effort into a new venture, Coursera, a social entrepreneurship company whose mission is to make high-quality education accessible to everyone by allowing the best universities to offer courses to everyone around the world, for free. Coursera classes provide a real course experience to students, including video content, interactive exercises with meaningful feedback, using both auto-grading and peer-grading, and a rich peer-to-peer interaction around the course materials. Currently, Coursera has 62 university partners, and over 3 million students enrolled in its over 300 courses. These courses span a range of topics including computer science, business, medicine, science, humanities, social sciences, and more. In this talk, I'll report on this far-reaching experiment in education, and why we believe this model can provide both an improved classroom experience for our on-campus students, via a flipped classroom model, as well as a meaningful learning experience for the millions of students around the world who would otherwise never have access to education of this quality.

【Keywords】: coursera; online courses

3. Optimization in learning and data analysis.

【Paper Link】 【Pages】:3

【Authors】: Stephen J. Wright

【Abstract】: Optimization tools are vital to data analysis and learning. The optimization perspective has provided valuable insights, and optimization formulations have led to practical algorithms with good theoretical properties. In turn, the rich collection of problems in learning and data analysis is providing fresh perspectives on optimization algorithms and is driving new fundamental research in the area. We discuss research on several areas in this domain, including signal reconstruction, manifold learning, and regression/classification, describing in each case recent research in which optimization algorithms have been developed and applied successfully. A particular focus is asynchronous parallel algorithms for optimization and linear algebra, and their applications in data analysis and learning.

【Keywords】: learning; optimization

4. Predicting the present with search engine data.

【Paper Link】 【Pages】:4

【Authors】: Hal Varian

【Abstract】: Many businesses now have almost real time data available about their operations. This data can be helpful in contemporaneous prediction ("nowcasting") of various economic indicators. We illustrate how one can use Google search data to nowcast economic metrics of interest, and discuss some of the ramifications for research and policy. Our approach combines three Bayesian techniques: Kalman filtering, spike-and-slab regression, and model averaging. We use Kalman filtering to whiten the time series in question by removing the trend and seasonal behavior. Spike-and-slab regression is a Bayesian method for variable selection that works even in cases where the number of predictors is far larger than the number of observations. Finally, we use Markov Chain Monte Carlo methods to sample from the posterior distribution for our model; the final forecast is an average over thousands of draws from the posterior. An advantage of the Bayesian approach is that it allows us to specify informative priors that affect the number and type of predictors in a flexible way.
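
The three Bayesian ingredients above (whiten, select variables, average over posterior draws) can be miniaturized in a few lines. The sketch below is a toy stand-in, not the authors' system: first differencing plus per-season means replace Kalman whitening, and averaging forecasts over random predictor subsets stands in for sampling inclusion vectors from a spike-and-slab posterior. All function names are illustrative.

```python
import random
import statistics

def whiten(series, period=4):
    """Crude stand-in for Kalman whitening: difference out the trend,
    then subtract per-season means to remove seasonality."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    by_season = {}
    for i, d in enumerate(diffs):
        by_season.setdefault(i % period, []).append(d)
    means = {s: statistics.mean(v) for s, v in by_season.items()}
    return [d - means[i % period] for i, d in enumerate(diffs)]

def ols_slope(xs, ys):
    # single-predictor least squares without intercept (toy)
    den = sum(x * x for x in xs)
    return sum(x * y for x, y in zip(xs, ys)) / den if den else 0.0

def model_average_nowcast(target, predictors, new_obs, n_draws=200, seed=0):
    """Average one-step nowcasts over random predictor subsets: a crude
    stand-in for draws of inclusion vectors from a spike-and-slab
    posterior, followed by model averaging over those draws."""
    rng = random.Random(seed)
    names = list(predictors)
    forecasts = []
    for _ in range(n_draws):
        subset = [n for n in names if rng.random() < 0.5] or [rng.choice(names)]
        parts = [ols_slope(predictors[n], target) * new_obs[n] for n in subset]
        forecasts.append(sum(parts) / len(parts))
    return statistics.mean(forecasts)
```

The point of averaging over draws, as in the talk, is that no single variable-selection decision is trusted; the forecast is an expectation over many plausible models.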

【Keywords】: google data; macroeconomic indicators; nowcasting

Document and topic models 4

5. One theme in all views: modeling consensus topics in multiple contexts.

【Paper Link】 【Pages】:5-13

【Authors】: Jian Tang ; Ming Zhang ; Qiaozhu Mei

【Abstract】: New challenges have been presented to classical topic models when applied to social media, as user-generated content suffers from significant problems of data sparseness. A variety of heuristic adjustments to these models have been proposed, many of which are based on the use of context information to improve the performance of topic modeling. Existing contextualized topic models rely on arbitrary manipulation of the model structure, by incorporating various context variables into the generative process of classical topic models in an ad hoc manner. Such manipulations usually result in much more complicated model structures, sophisticated inference procedures, and low generalizability to accommodate arbitrary types or combinations of contexts. In this paper we explore a different direction. We propose a general solution that is able to exploit multiple types of contexts without arbitrary manipulation of the structure of classical topic models. We formulate different types of contexts as multiple views of the partition of the corpus. A co-regularization framework is proposed to let these views collaborate with each other, vote for the consensus topics, and distinguish them from view-specific topics. Experiments with real-world datasets prove that the proposed method is both effective and flexible to handle arbitrary types of contexts.

【Keywords】: co-regularization; multiple contexts; topic modeling; user-generated content

6. Representing documents through their readers.

【Paper Link】 【Pages】:14-22

【Authors】: Khalid El-Arini ; Min Xu ; Emily B. Fox ; Carlos Guestrin

【Abstract】: From Twitter to Facebook to Reddit, users have become accustomed to sharing the articles they read with friends or followers on their social networks. While previous work has modeled what these shared stories say about the user who shares them, the converse question remains unexplored: what can we learn about an article from the identities of its likely readers? To address this question, we model the content of news articles and blog posts by attributes of the people who are likely to share them. For example, many Twitter users describe themselves in a short profile, labeling themselves with phrases such as "vegetarian" or "liberal." By assuming that a user's labels correspond to topics in the articles he shares, we can learn a labeled dictionary from a training corpus of articles shared on Twitter. Thereafter, we can code any new document as a sparse non-negative linear combination of user labels, where we encourage correlated labels to appear together in the output via a structured sparsity penalty. Finally, we show that our approach yields a novel document representation that can be effectively used in many problem settings, from recommendation to modeling news dynamics. For example, while the top politics stories will change drastically from one month to the next, the "politics" label will still be there to describe them. We evaluate our model on millions of tweeted news articles and blog posts collected between September 2010 and September 2012, demonstrating that our approach is effective.
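
The coding step described above can be sketched as non-negative sparse coding by projected gradient. Two hedges: the paper uses a structured sparsity penalty over correlated labels, while this illustration uses a plain L1 penalty, and the tiny label dictionary is hypothetical.

```python
def sparse_code(doc, dictionary, lam=0.1, lr=0.05, epochs=400):
    """Code `doc` as a non-negative combination of label atoms:
    minimize ||doc - sum_i w_i * atom_i||^2 + lam * ||w||_1, w >= 0,
    via projected gradient descent."""
    labels = list(dictionary)
    atoms = [dictionary[l] for l in labels]
    d = len(doc)
    w = [0.0] * len(atoms)
    for _ in range(epochs):
        recon = [sum(w[i] * atoms[i][j] for i in range(len(atoms))) for j in range(d)]
        resid = [recon[j] - doc[j] for j in range(d)]
        for i in range(len(atoms)):
            grad = 2 * sum(resid[j] * atoms[i][j] for j in range(d)) + lam
            w[i] = max(0.0, w[i] - lr * grad)  # project onto w_i >= 0
    return dict(zip(labels, w))
```

A document about politics then gets weight on the "politics" atom and (thanks to the penalty and the non-negativity constraint) zero weight on unrelated labels.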

【Keywords】: document modeling; structured sparsity; twitter

7. Text-based measures of document diversity.

【Paper Link】 【Pages】:23-31

【Authors】: Kevin Bache ; David Newman ; Padhraic Smyth

【Abstract】: Quantitative notions of diversity have been explored across a variety of disciplines ranging from conservation biology to economics. However, there has been relatively little work on measuring the diversity of text documents via their content. In this paper we present a text-based framework for quantifying how diverse a document is in terms of its content. The proposed approach learns a topic model over a corpus of documents, and computes a distance matrix between pairs of topics using measures such as topic co-occurrence. These pairwise distance measures are then combined with the distribution of topics within a document to estimate each document's diversity relative to the rest of the corpus. The method provides several advantages over existing methods. It is fully data-driven, requiring only the text from a corpus of documents as input, it produces human-readable explanations, and it can be generalized to score diversity of other entities such as authors, academic departments, or journals. We describe experimental results on several large data sets which suggest that the approach is effective and accurate in quantifying how diverse a document is relative to other documents in a corpus.
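
A minimal version of the scoring recipe (topic distances from co-occurrence, combined with a document's topic distribution) has the shape of Rao's quadratic entropy. This is an illustrative reconstruction, not the authors' exact estimator:

```python
def topic_distances(doc_topics, eps=1e-12):
    """Pairwise topic distances from co-occurrence: topics that tend to
    appear together in the same documents get small distances. One of
    several possible distance choices; illustrative only."""
    k = len(doc_topics[0])
    co = [[0.0] * k for _ in range(k)]
    for theta in doc_topics:
        for i in range(k):
            for j in range(k):
                co[i][j] += theta[i] * theta[j]
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            sim = co[i][j] / ((co[i][i] * co[j][j]) ** 0.5 + eps)
            dist[i][j] = 1.0 - min(sim, 1.0)
    return dist

def diversity(theta, dist):
    """Rao-style diversity: expected distance between two topics drawn
    independently from the document's topic distribution theta."""
    k = len(theta)
    return sum(theta[i] * theta[j] * dist[i][j] for i in range(k) for j in range(k))
```

A document concentrated on one topic scores near zero; a document split across two topics that rarely co-occur scores high, matching the intuition in the abstract.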

【Keywords】: diversity; interdisciplinarity

8. Diversity maximization under matroid constraints.

【Paper Link】 【Pages】:32-40

【Authors】: Zeinab Abbassi ; Vahab S. Mirrokni ; Mayur Thakur

【Abstract】: Aggregator websites typically present documents in the form of representative clusters. In order for users to get a broader perspective, it is important to deliver a diversified set of representative documents in those clusters. One approach to diversification is to maximize the average dissimilarity among documents. Another way to capture diversity is to avoid showing several documents from the same category (e.g. from the same news channel). We combine the above two diversification concepts by modeling the latter approach as a (partition) matroid constraint, and study diversity maximization problems under matroid constraints. We present the first constant-factor approximation algorithm for this problem, using a new technique. Our local search 0.5-approximation algorithm is also the first constant-factor approximation for the max-dispersion problem under matroid constraints. Our combinatorial proof technique for maximizing diversity under matroid constraints uses the existence of a family of Latin squares which may also be of independent interest. In order to apply these diversity maximization algorithms in the context of aggregator websites and as a preprocessing step for our diversity maximization tool, we develop greedy clustering algorithms that maximize weighted coverage of a predefined set of topics. Our algorithms are based on computing a set of cluster centers, where clusters are formed around them. We show the better performance of our algorithms for diversity and coverage maximization by running experiments on real (Twitter) and synthetic data in the context of real-time search over micro-posts. Finally we perform a user study validating our algorithms and diversity metrics.
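
The partition-matroid variant (one representative per category, maximizing pairwise dissimilarity) admits the kind of local-search procedure the abstract mentions. A sketch, with single-item parts for brevity and a hypothetical distance matrix; the paper's algorithm and its 0.5-approximation analysis are more general:

```python
from itertools import combinations

def dispersion(selected, dist):
    """Sum of pairwise distances: the max-dispersion objective."""
    return sum(dist[a][b] for a, b in combinations(sorted(selected), 2))

def local_search_diverse(categories, dist):
    """One representative per category (a partition matroid constraint);
    repeatedly swap a representative for another item of the same
    category while the total pairwise distance strictly improves."""
    current = {c: items[0] for c, items in categories.items()}
    improved = True
    while improved:
        improved = False
        for c, items in categories.items():
            others = [v for k, v in current.items() if k != c]
            best, best_val = current[c], sum(dist[current[c]][o] for o in others)
            for cand in items:
                val = sum(dist[cand][o] for o in others)
                if val > best_val:
                    best, best_val, improved = cand, val, True
            current[c] = best
    return set(current.values())
```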

【Keywords】: approximation algorithms; clustering; diversity maximization; local search algorithms; matroid constraints

Social media 4

9. Connecting users across social media sites: a behavioral-modeling approach.

【Paper Link】 【Pages】:41-49

【Authors】: Reza Zafarani ; Huan Liu

【Abstract】: People use various social media for different purposes. The information on an individual site is often incomplete. When sources of complementary information are integrated, a better profile of a user can be built to improve online services such as verifying online information. To integrate these sources of information, it is necessary to identify individuals across social media sites. This paper aims to address the cross-media user identification problem. We introduce a methodology (MOBIUS) for finding a mapping among identities of individuals across social media sites. It consists of three key components: the first component identifies users' unique behavioral patterns that lead to information redundancies across sites; the second component constructs features that exploit information redundancies due to these behavioral patterns; and the third component employs machine learning for effective user identification. We formally define the cross-media user identification problem and show that MOBIUS is effective in identifying users across social media sites. This study paves the way for analysis and mining across social media sites, and facilitates the creation of novel online services across sites.

【Keywords】: cross-media analysis; mobius; user identification

10. Automatic selection of social media responses to news.

【Paper Link】 【Pages】:50-58

【Authors】: Tadej Stajner ; Bart Thomee ; Ana-Maria Popescu ; Marco Pennacchiotti ; Alejandro Jaimes

【Abstract】: Social media responses to news have increasingly gained in importance as they can enhance a consumer's news reading experience, promote information sharing and aid journalists in assessing their readership's response to a story. Given that the number of responses to an online news article may be huge, a common challenge is that of selecting only the most interesting responses for display. This paper addresses this challenge by casting message selection as an optimization problem. We define an objective function which jointly models the messages' utility scores and their entropy. We propose a near-optimal solution to the underlying optimization problem, which leverages the submodularity property of the objective function. Our solution first learns the utility of individual messages in isolation and then produces a diverse selection of interesting messages by maximizing the defined objective function. The intuitions behind our work are that an interesting selection of messages contains diverse, informative, opinionated and popular messages referring to the news article, written mostly by users that have authority on the topic. Our intuitions are embodied by a rich set of content, social and user features capturing the aforementioned aspects. We evaluate our approach through both human and automatic experiments, and demonstrate it outperforms the state of the art. Additionally, we perform an in-depth analysis of the annotated ``interesting'' responses, shedding light on the subjectivity around the selection process and the perception of interestingness.
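
The near-optimal selection rests on greedy maximization of a monotone submodular objective. As an illustration (not the paper's exact utility and entropy terms), a concave topic-coverage bonus added to per-message utility is submodular, so plain greedy enjoys the usual (1 - 1/e) guarantee under a cardinality budget:

```python
import math

def objective(selected, utility, topics_of, lam=1.0):
    """Per-message utility plus a concave (sqrt) topic-coverage bonus.
    The concave coverage term makes the function monotone submodular,
    so adding a redundant message yields diminishing returns."""
    counts = {}
    for m in selected:
        for t in topics_of[m]:
            counts[t] = counts.get(t, 0) + 1
    return (sum(utility[m] for m in selected)
            + lam * sum(math.sqrt(c) for c in counts.values()))

def greedy_select(messages, utility, topics_of, k, lam=1.0):
    """Greedy marginal-gain maximization under a size-k budget."""
    selected = []
    for _ in range(k):
        remaining = [m for m in messages if m not in selected]
        if not remaining:
            break
        gain = lambda m: (objective(selected + [m], utility, topics_of, lam)
                          - objective(selected, utility, topics_of, lam))
        selected.append(max(remaining, key=gain))
    return selected
```

With a large enough diversity weight, the greedy pass skips a high-utility message whose topic is already covered in favor of a fresh one, which is the diversification behavior the abstract describes.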

【Keywords】: microblogging; sampling; social media; summarization

11. Estimating sharer reputation via social data calibration.

【Paper Link】 【Pages】:59-67

【Authors】: Jaewon Yang ; Bee-Chung Chen ; Deepak Agarwal

【Abstract】: Online social networks have become important channels for users to share content with their connections and diffuse information. Although much work has been done to identify socially influential users, the problem of finding "reputable" sharers, who share good content, has received relatively little attention. Availability of such reputation scores can be useful for various applications like recommending people to follow, procuring high quality content in a scalable way, creating a content reputation economy to incentivize high quality sharing, and many more. To estimate sharer reputation, it is intuitive to leverage data that records how recipients respond (through clicking, liking, etc.) to content items shared by a sharer. However, such data is usually biased --- it has a selection bias since the shared items can only be seen and responded to by users connected to the sharer in most social networks, and it has a response bias since the response is usually influenced by the relationship between the sharer and the recipient (which may not indicate whether the shared content is good). To correct for such biases, we propose to utilize an additional data source that provides unbiased goodness estimates for a small set of shared items, and calibrate biased social data through a novel multi-level hierarchical model that describes how the unbiased data and biased data are jointly generated according to sharer reputation scores. The unbiased data also provides the ground truth for quantitative evaluation of different methods. Experiments based on such ground-truth data show that our proposed model significantly outperforms existing methods that estimate social influence using biased social data.

【Keywords】: influential users; sharer reputation

12. Linking named entities in Tweets with knowledge base via user interest modeling.

【Paper Link】 【Pages】:68-76

【Authors】: Wei Shen ; Jianyong Wang ; Ping Luo ; Min Wang

【Abstract】: Twitter has become an increasingly important source of information, with more than 400 million tweets posted per day. The task to link the named entity mentions detected from tweets with the corresponding real world entities in the knowledge base is called tweet entity linking. This task is of practical importance and can facilitate many different tasks, such as personalized recommendation and user interest discovery. The tweet entity linking task is challenging due to the noisy, short, and informal nature of tweets. Previous methods focus on linking entities in Web documents, and largely rely on the context around the entity mention and the topical coherence between entities in the document. However, these methods cannot be effectively applied to the tweet entity linking task due to the insufficient context information contained in a tweet. In this paper, we propose KAURI, a graph-based framework to collectively link all the named entity mentions in all tweets posted by a user via modeling the user's topics of interest. Our assumption is that each user has an underlying topic interest distribution over various named entities. KAURI integrates the intra-tweet local information with the inter-tweet user interest information into a unified graph-based framework. We extensively evaluated the performance of KAURI over a manually annotated tweet corpus, and the experimental results show that KAURI significantly outperforms the baseline methods in terms of accuracy, and KAURI is efficient and scales well to tweet streams.

【Keywords】: knowledge base; tweet entity linking; user interest modeling

Big data frameworks 3

13. TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC.

【Paper Link】 【Pages】:77-85

【Authors】: Wook-Shin Han ; Sangyeon Lee ; Kyungyeol Park ; Jeong-Hoon Lee ; Min-Soo Kim ; Jinha Kim ; Hwanjo Yu

【Abstract】: Graphs are used to model many real objects such as social networks and web graphs. Many real applications in various fields require efficient and effective management of large-scale graph structured data. Although distributed graph engines such as GBase and Pregel handle billion-scale graphs, the user needs to be skilled at managing and tuning a distributed system in a cluster, which is a nontrivial job for the ordinary user. Furthermore, these distributed systems need many machines in a cluster in order to provide reasonable performance. In order to address this problem, a disk-based parallel graph engine called Graph-Chi has recently been proposed. Although Graph-Chi significantly outperforms all representative (disk-based) distributed graph engines, we observe that Graph-Chi still has serious performance problems for many important types of graph queries due to 1) limited parallelism and 2) separate steps for I/O processing and CPU processing. In this paper, we propose a general, disk-based graph engine called TurboGraph to process billion-scale graphs very efficiently by using modern hardware on a single PC. TurboGraph is the first truly parallel graph engine that exploits 1) full parallelism including multi-core parallelism and FlashSSD IO parallelism and 2) full overlap of CPU processing and I/O processing as much as possible. Specifically, we propose a novel parallel execution model, called pin-and-slide. TurboGraph also provides engine-level operators such as BFS which are implemented under the pin-and-slide model. Extensive experimental results with large real datasets show that TurboGraph consistently and significantly outperforms Graph-Chi by up to four orders of magnitude. Our implementation of TurboGraph is available at "http://wshan.net/turbograph" as executable files.

【Keywords】: big data; graph processing; parallelism; pin-and-slide

14. Beyond myopic inference in big data pipelines.

【Paper Link】 【Pages】:86-94

【Authors】: Karthik Raman ; Adith Swaminathan ; Johannes Gehrke ; Thorsten Joachims

【Abstract】: Big Data Pipelines decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task. This modular setup allows re-use of components across several different pipelines. However, the interaction of independently tuned pipeline components yields poor end-to-end performance as errors introduced by one component cascade through the whole pipeline, affecting overall accuracy. We propose a novel model for reasoning across components of Big Data Pipelines in a probabilistically well-founded manner. Our key idea is to view the interaction of components as dependencies on an underlying graphical model. Different message passing schemes on this graphical model provide various inference algorithms to trade-off end-to-end performance and computational cost. We instantiate our framework with an efficient beam search algorithm, and demonstrate its efficiency on two Big Data Pipelines: parsing and relation extraction.
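
The non-myopic idea, keeping several joint hypotheses alive across pipeline stages instead of passing each component's single best output forward, is exactly beam search. A generic sketch (the stage functions and scores below are hypothetical, not the paper's parsing/relation-extraction components):

```python
import heapq

def beam_pipeline(stages, beam_width=2):
    """Each stage maps a partial hypothesis (a tuple of earlier outputs)
    to a list of (output, log_score) candidates. Rather than committing
    to each component's single best output, keep the `beam_width` best
    joint hypotheses after every stage and return the overall best."""
    beam = [((), 0.0)]
    for stage in stages:
        expanded = [(hyp + (out,), score + s)
                    for hyp, score in beam
                    for out, s in stage(hyp)]
        beam = heapq.nlargest(beam_width, expanded, key=lambda x: x[1])
    return beam[0]
```

With `beam_width=1` this degenerates to the myopic pipeline; a wider beam lets a slightly worse early decision survive when it leads to a much better downstream result, which is the cascading-error effect the abstract targets.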

【Keywords】: big data pipelines; modular design; probabilistic inference

15. Big data analytics with small footprint: squaring the cloud.

【Paper Link】 【Pages】:95-103

【Authors】: John Canny ; Huasha Zhao

【Abstract】: This paper describes the BID Data Suite, a collection of hardware, software and design patterns that enable fast, large-scale data mining at very low cost. By co-designing all of these elements we achieve single-machine performance levels that equal or exceed reported cluster implementations for common benchmark problems. A key design criterion is rapid exploration of models, hence the system is interactive and primarily single-user. The elements of the suite are: (i) the data engine, a hardware design pattern that balances storage, CPU and GPU acceleration for typical data mining workloads, (ii) BIDMat, an interactive matrix library that integrates CPU and GPU acceleration and novel computational kernels, (iii) BIDMach, a machine learning system that includes very efficient model optimizers, (iv) Butterfly mixing, a communication strategy that hides the latency of frequent model updates needed by fast optimizers and (v) design patterns to improve performance of iterative update algorithms. We present several benchmark problems to show how the above elements combine to yield multiple orders-of-magnitude improvements for each problem.

【Keywords】: cluster; data mining; gpu; machine learning; toolkit

Graph mining 4

16. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees.

【Paper Link】 【Pages】:104-112

【Authors】: Charalampos E. Tsourakakis ; Francesco Bonchi ; Aristides Gionis ; Francesco Gullo ; Maria A. Tsiarli

【Abstract】: Finding dense subgraphs is an important graph-mining task with many applications. Given that the direct optimization of edge density is not meaningful, as even a single edge achieves maximum density, research has focused on optimizing alternative density functions. A very popular among such functions is the average degree, whose maximization leads to the well-known densest-subgraph notion. Surprisingly enough, however, densest subgraphs are typically large graphs, with small edge density and large diameter. In this paper, we define a novel density function, which gives subgraphs of much higher quality than densest subgraphs: the graphs found by our method are compact, dense, and with smaller diameter. We show that the proposed function can be derived from a general framework, which includes other important density functions as subcases and for which we show interesting general theoretical properties. To optimize the proposed function we provide an additive approximation algorithm and a local-search heuristic. Both algorithms are very efficient and scale well to large graphs. We evaluate our algorithms on real and synthetic datasets, and we also devise several application studies as variants of our original problem. When compared with the method that finds the subgraph of the largest average degree, our algorithms return denser subgraphs with smaller diameter. Finally, we discuss new interesting research directions that our problem leaves open.
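
For intuition, the kind of objective and local-search heuristic described above can be sketched with an edge-surplus function f_alpha(S) = e[S] - alpha * |S|(|S|-1)/2, which rewards internal edges and penalizes size; the paper's exact objective and algorithmic details differ, so treat this as a sketch:

```python
from itertools import combinations

def edge_surplus(nodes, adj, alpha=1.0 / 3):
    """f_alpha(S) = e[S] - alpha * |S|(|S|-1)/2: unlike average degree,
    this favors compact near-cliques over large sparse subgraphs."""
    s = set(nodes)
    e = sum(1 for u, v in combinations(sorted(s), 2) if v in adj.get(u, set()))
    n = len(s)
    return e - alpha * n * (n - 1) / 2

def local_search_quasi_clique(adj, seed, alpha=1.0 / 3, max_iter=100):
    """Greedy local search: from a seed node, add a neighbor or drop a
    member whenever that strictly increases the edge surplus."""
    s = {seed}
    for _ in range(max_iter):
        cur = edge_surplus(s, adj, alpha)
        frontier = sorted({n for u in s for n in adj.get(u, set())} - s)
        moves = [s | {c} for c in frontier]
        if len(s) > 1:
            moves += [s - {m} for m in sorted(s)]
        if not moves:
            return s
        best = max(moves, key=lambda t: edge_surplus(t, adj, alpha))
        if edge_surplus(best, adj, alpha) <= cur:
            return s
        s = best
    return s
```

On a triangle with a pendant node, the search grows the triangle and declines to absorb the pendant, illustrating how the size penalty keeps the result compact and dense.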

【Keywords】: dense subgraph; graph mining; quasi-clique

17. Guided learning for role discovery (GLRD): framework, algorithms, and applications.

【Paper Link】 【Pages】:113-121

【Authors】: Sean Gilpin ; Tina Eliassi-Rad ; Ian N. Davidson

【Abstract】: Role discovery in graphs is an emerging area that allows analysis of complex graphs in an intuitive way. In contrast to community discovery, which finds groups of highly connected nodes, role discovery finds groups of nodes that share similar topological structure in the graph, and hence a common role (or function) such as being a broker or a periphery node. However, existing work so far is completely unsupervised, which is undesirable for a number of reasons. We provide an alternating least squares framework that allows convex constraints to be placed on the role discovery problem, which can provide useful supervision. In particular we explore supervision to enforce i) sparsity, ii) diversity, and iii) alternativeness in the roles. We illustrate the usefulness of this supervision on various data sets and applications.

【Keywords】: constrained clustering; graph mining; role discovery

18. Redundancy-aware maximal cliques.

【Paper Link】 【Pages】:122-130

【Authors】: Jia Wang ; James Cheng ; Ada Wai-Chee Fu

【Abstract】: Recent research efforts have made notable progress in improving the performance of (exhaustive) maximal clique enumeration (MCE). However, existing algorithms still suffer from exploring the huge search space of MCE. Furthermore, their results are often undesirable as many of the returned maximal cliques have large overlapping parts. This redundancy leads to problems in both computational efficiency and usefulness of MCE. In this paper, we aim at providing a concise and complete summary of the set of maximal cliques, which is useful to many applications. We propose the notion of τ-visible MCE to achieve this goal and design algorithms to realize the notion. Based on the refined output space, we further consider applications including an efficient computation of the top-k results with diversity and an interactive clique exploration process. Our experimental results demonstrate that our approach is capable of producing output of high usability and our algorithms achieve superior efficiency over classic MCE algorithms.

【Keywords】: clique concise representation; clique summarization; maximal clique enumeration

19. Selective sampling on graphs for classification.

【Paper Link】 【Pages】:131-139

【Authors】: Quanquan Gu ; Charu C. Aggarwal ; Jialu Liu ; Jiawei Han

【Abstract】: Selective sampling is an active variant of online learning in which the learner is allowed to adaptively query the label of an observed example. The goal of selective sampling is to achieve a good trade-off between prediction performance and the number of queried labels. Existing selective sampling algorithms are designed for vector-based data. In this paper, motivated by the ubiquity of graph representations in real-world applications, we propose to study selective sampling on graphs. We first present an online version of the well-known Learning with Local and Global Consistency method (OLLGC). It is essentially a second-order online learning algorithm, and can be seen as an online ridge regression in the Hilbert space of functions defined on graphs. We prove its regret bound in terms of the structural property (cut size) of a graph. Based on OLLGC, we present a selective sampling algorithm, namely Selective Sampling with Local and Global Consistency (SSLGC), which queries the label of each node based on the confidence of the linear function on graphs. Its bound on the label complexity is also derived. We analyze the low-rank approximation of graph kernels, which enables the online algorithms scale to large graphs. Experiments on benchmark graph datasets show that OLLGC outperforms the state-of-the-art first-order algorithm significantly, and SSLGC achieves comparable or even better results than OLLGC while querying substantially fewer nodes. Moreover, SSLGC is overwhelmingly better than random sampling.
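
The selective-sampling loop, querying a label only when the current predictor is uncertain, can be shown with a plain first-order online learner standing in for the paper's second-order graph-based OLLGC; the margin threshold and learning rate below are illustrative:

```python
def selective_sampling(stream, labels, threshold=0.2, lr=0.5):
    """Online linear learner that queries a label only when the margin
    is small (uncertain). Returns the final weights, how many labels
    were queried, and how many prediction mistakes were made."""
    w = [0.0] * len(stream[0])
    queried = mistakes = 0
    for x, y in zip(stream, labels):  # y in {-1, +1}
        margin = sum(wi * xi for wi, xi in zip(w, x))
        if (1 if margin >= 0 else -1) != y:
            mistakes += 1
        if abs(margin) <= threshold:  # uncertain: pay to see the label
            queried += 1
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w, queried, mistakes
```

On an easy separable stream only the first, zero-margin example triggers a query; afterwards every prediction is confident and correct, which is the label-complexity saving the abstract describes.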

【Keywords】: label complexity; mistake bound; online learning; regret bound; selective sampling on graphs

Classification 4

20. Density-based logistic regression.

【Paper Link】 【Pages】:140-148

【Authors】: Wenlin Chen ; Yixin Chen ; Yi Mao ; Baolong Guo

【Abstract】: This paper introduces a nonlinear logistic regression model for classification. The main idea is to map the data to a feature space based on kernel density estimation. A discriminative model is then learned to optimize the feature weights as well as the bandwidth of a Nadaraya-Watson kernel density estimator. We then propose a hierarchical optimization algorithm for learning the coefficients and kernel bandwidths in an integrated way. Compared to other nonlinear models such as kernel logistic regression (KLR) and SVM, our approach is far more efficient since it solves an optimization problem with a much smaller size. Two other major advantages are that it can cope with categorical attributes in a unified fashion and naturally handle multi-class problems. Moreover, our approach inherits from logistic regression the good interpretability of the model, which is important for clinical applications but not offered by KLR and SVM. Extensive results on real datasets, including a clinical prediction application currently under deployment in a major hospital, show that our approach not only achieves superior classification accuracy, but also drastically reduces the computing time as compared to other leading methods.
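
A minimal sketch of the two-stage idea: map each raw feature through a Nadaraya-Watson estimate of P(y=1 | x_j), then fit an ordinary logistic regression on the mapped features. The paper learns the bandwidths jointly with the weights; this sketch fixes the bandwidth h, and all names are illustrative.

```python
import math

def nw_feature(v, train_col, train_y, h=0.5):
    """Nadaraya-Watson estimate of P(y=1 | x_j = v) with a Gaussian
    kernel: the per-feature density-based map."""
    ws = [math.exp(-((v - t) ** 2) / (2 * h * h)) for t in train_col]
    tot = sum(ws)
    return sum(wi * yi for wi, yi in zip(ws, train_y)) / tot if tot else 0.5

def transform(x, X, y, h=0.5):
    return [nw_feature(x[j], [row[j] for row in X], y, h) for j in range(len(x))]

def fit_logreg(Z, y, lr=0.5, epochs=500):
    """Batch-gradient logistic regression on the mapped features."""
    d = len(Z[0])
    w = [0.0] * (d + 1)  # last entry is the bias
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for z, t in zip(Z, y):
            p = 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + w[-1])))
            err = p - t
            for j in range(d):
                grad[j] += err * z[j]
            grad[-1] += err
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad)]
    return w

def predict(x, w, X, y, h=0.5):
    z = transform(x, X, y, h)
    s = sum(wi * zi for wi, zi in zip(w, z)) + w[-1]
    return 1 / (1 + math.exp(-s))
```

Because the model stays linear in the mapped features, it keeps logistic regression's interpretability while the KDE map supplies the nonlinearity, which is the trade-off the abstract emphasizes.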

【Keywords】: density estimation; logistic regression; medical prediction; nonlinear classification
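The density-based feature map in the abstract can be sketched as follows: each feature is replaced by a Nadaraya-Watson kernel estimate of P(y=1 | x_j), after which an ordinary (linear) logistic regression is fit on the mapped features. This is a minimal sketch under assumed conventions (a fixed Gaussian bandwidth, binary labels), not the paper's integrated bandwidth-learning procedure.

```python
import numpy as np

def nw_feature_map(X, X_train, y_train, bandwidth=1.0):
    # Map each feature j to a Nadaraya-Watson estimate of P(y=1 | x_j),
    # using a Gaussian kernel over the training points.
    n, d = X.shape
    Phi = np.empty((n, d))
    for j in range(d):
        diff = X[:, None, j] - X_train[None, :, j]
        w = np.exp(-0.5 * (diff / bandwidth) ** 2)
        Phi[:, j] = (w * y_train).sum(axis=1) / np.maximum(w.sum(axis=1), 1e-12)
    return Phi
```

The paper additionally learns the bandwidth and feature weights jointly; here the bandwidth is simply fixed.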

21. MI2LS: multi-instance learning from multiple information sources.

Paper Link】 【Pages】:149-157

【Authors】: Dan Zhang ; Jingrui He ; Richard D. Lawrence

【Abstract】: In Multiple Instance Learning (MIL), each entity is normally expressed as a set of instances. Most of the current MIL methods only deal with the case when each instance is represented by one type of features. However, in many real world applications, entities are often described from several different information sources/views. For example, when applying MIL to image categorization, the characteristics of each image can be derived from both its RGB features and SIFT features. Previous research work has shown that, in traditional learning methods, leveraging the consistencies between different information sources could improve the classification performance drastically. Out of a similar motivation, to incorporate the consistencies between different information sources into MIL, we propose a novel research framework -- Multi-Instance Learning from Multiple Information Sources (MI2LS). Based on this framework, an algorithm -- Fast MI2LS (FMI2LS) is designed, which combines the Concave-Convex Constraint Programming (CCCP) method and an adapted Stochastic Gradient Descent (SGD) method. Theoretical analysis is given on the optimality of the adapted SGD method and the generalization error bound of the proposed formulation. Experimental results on document classification and a novel application -- Insider Threat Detection (ITD), clearly demonstrate the superior performance of the proposed method over state-of-the-art MIL methods.

【Keywords】: multi-instance learning; multi-view learning; stochastic gradient descent

22. Querying discriminative and representative samples for batch mode active learning.

Paper Link】 【Pages】:158-166

【Authors】: Zheng Wang ; Jieping Ye

【Abstract】: Empirical risk minimization (ERM) provides a useful guideline for many machine learning and data mining algorithms. Under the ERM principle, one minimizes an upper bound of the true risk, which is approximated by the summation of empirical risk and the complexity of the candidate classifier class. To guarantee a satisfactory learning performance, ERM requires that the training data are i.i.d. sampled from the unknown source distribution. However, this may not be the case in active learning, where one selects the most informative samples to label and these data may not follow the source distribution. In this paper, we generalize the empirical risk minimization principle to the active learning setting. We derive a novel form of upper bound for the true risk in the active learning setting; by minimizing this upper bound we develop a practical batch mode active learning method. The proposed formulation involves a non-convex integer programming optimization problem. We solve it efficiently by an alternating optimization method. Our method is shown to query the most informative samples while preserving the source distribution as much as possible, thus identifying the most uncertain and representative queries. Experiments on benchmark data sets and real-world applications demonstrate the superior performance of our proposed method in comparison with the state-of-the-art methods.

【Keywords】: active learning; empirical risk minimization; maximum mean discrepancy; representative and discriminative
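The "representative" part of the query criterion above is commonly measured with the maximum mean discrepancy (one of the paper's keywords): a batch is representative when its MMD to the unlabeled pool is small. A minimal RBF-kernel sketch, with an assumed fixed bandwidth rather than anything tuned as in the paper:

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    # Squared maximum mean discrepancy with an RBF kernel: near zero when
    # the batch Y is distributed like the pool X, large when it is not.
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

A batch-mode active learner would then trade this term off against an informativeness (uncertainty) term when selecting the batch.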

23. SVMpAUCtight: a new support vector method for optimizing partial AUC based on a tight convex upper bound.

Paper Link】 【Pages】:167-175

【Authors】: Harikrishna Narasimhan ; Shivani Agarwal

【Abstract】: The area under the ROC curve (AUC) is a well known performance measure in machine learning and data mining. In an increasing number of applications, however, ranging from ranking applications to a variety of important bioinformatics applications, performance is measured in terms of the partial area under the ROC curve between two specified false positive rates. In recent work, we proposed a structural SVM based approach for optimizing this performance measure (Narasimhan and Agarwal, 2013). In this paper, we develop a new support vector method, SVMpAUCtight, that optimizes a tighter convex upper bound on the partial AUC loss, which leads to both improved accuracy and reduced computational complexity. In particular, by rewriting the empirical partial AUC risk as a maximum over subsets of negative instances, we derive a new formulation, where a modified form of the earlier optimization objective is evaluated on each of these subsets, leading to a tighter hinge relaxation on the partial AUC loss. As with our previous method, the resulting optimization problem can be solved using a cutting-plane algorithm, but the new method has better run time guarantees. We also discuss a projected subgradient method for solving this problem, which offers additional computational savings in certain settings. We demonstrate on a wide variety of bioinformatics tasks, ranging from protein-protein interaction prediction to drug discovery tasks, that the proposed method does, in many cases, perform significantly better on the partial AUC measure than the previous structural SVM approach. In addition, we also develop extensions of our method to learn sparse and group sparse models, often of interest in biological applications.

【Keywords】: cutting-plane method; partial auc; roc curve; svm
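For readers unfamiliar with the quantity being optimized, the empirical partial AUC between two false positive rates can be computed directly from scores. This is an illustrative computation of the target measure only, not the paper's SVM optimization; the rank-based banding convention below is an assumption.

```python
import numpy as np

def partial_auc(pos_scores, neg_scores, fpr_lo=0.0, fpr_hi=0.1):
    # Empirical partial AUC: the fraction of (positive, negative) pairs that
    # are correctly ranked, counting only the negatives whose rank places
    # them in the false-positive-rate band [fpr_lo, fpr_hi].
    neg = np.sort(np.asarray(neg_scores))[::-1]   # highest-scoring negatives first
    lo = int(np.floor(fpr_lo * len(neg)))
    hi = int(np.ceil(fpr_hi * len(neg)))
    band = neg[lo:hi]
    return (np.asarray(pos_scores)[:, None] > band[None, :]).mean()
```

The paper's formulation rewrites this risk as a maximum over subsets of negatives, which is what yields the tighter hinge relaxation.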

Healthcare and bioinformatics 3

24. Succinct interval-splitting tree for scalable similarity search of compound-protein pairs with property constraints.

Paper Link】 【Pages】:176-184

【Authors】: Yasuo Tabei ; Akihiro Kishimoto ; Masaaki Kotera ; Yoshihiro Yamanishi

【Abstract】: Analyzing functional interactions between small compounds and proteins is indispensable in genomic drug discovery. Since rich information on various compound-protein interactions is available in recent molecular databases, there is a strong demand for powerful methods that make the best use of such databases to help find new functional compound-protein pairs on a large scale. We present the succinct interval-splitting tree algorithm (SITA), which efficiently performs similarity search in databases of compound-protein pairs with respect to both binary fingerprints and real-valued properties. SITA achieves both time and space efficiency by developing a data structure called the interval-splitting tree, which enables efficient pruning of the useless portions of the search space, and by incorporating the ideas behind the wavelet tree, a succinct data structure that compactly represents trees. We experimentally test SITA on the ability to retrieve similar compound-protein pairs/substrate-product pairs for a query from large databases with over 200 million compound-protein pairs/substrate-product pairs and show that SITA performs better than other possible approaches.

【Keywords】: similarity search; succinct data structure; wavelet tree

25. Multi-source learning with block-wise missing data for Alzheimer's disease prediction.

Paper Link】 【Pages】:185-193

【Authors】: Shuo Xiang ; Lei Yuan ; Wei Fan ; Yalin Wang ; Paul M. Thompson ; Jieping Ye

【Abstract】: With the advances and increasing sophistication in data collection techniques, we are faced in many applications with large amounts of data collected from multiple heterogeneous sources. For example, in the study of Alzheimer's Disease (AD), different types of measurements such as neuroimages, gene/protein expression data, and genetic data are often collected and analyzed together for improved predictive power. It is believed that a joint learning of multiple data sources is beneficial as different data sources may contain complementary information, and feature-pruning and data source selection are critical for learning interpretable models from high-dimensional data. Very often the collected data comes with block-wise missing entries; for example, a patient without the MRI scan will have no information in the MRI data block, making his/her overall record incomplete. There has been a growing interest in the data mining community on expanding traditional techniques for single-source complete data analysis to the study of multi-source incomplete data. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of block-wise missing data. In this paper we first investigate the situation of complete data and present a unified "bi-level" learning model for multi-source data. Then we give a natural extension of this model to the more challenging case with incomplete data. Our major contributions are threefold: (1) the proposed models handle both feature-level and source-level analysis in a unified formulation and include several existing feature learning approaches as special cases; (2) the model for incomplete data avoids direct imputation of the missing elements and thus provides superior performance. Moreover, it can be easily generalized to other applications with block-wise missing data sources; (3) efficient optimization algorithms are presented for both the complete and incomplete models.
We have performed comprehensive evaluations of the proposed models on the application of AD diagnosis. Our proposed models compare favorably against existing approaches.

【Keywords】: alzheimer's disease; block-wise missing data; multi-source; optimization

26. Network discovery via constrained tensor analysis of fMRI data.

Paper Link】 【Pages】:194-202

【Authors】: Ian N. Davidson ; Sean Gilpin ; Owen T. Carmichael ; Peter B. Walker

【Abstract】: We pose the problem of network discovery, which involves simplifying spatio-temporal data into cohesive regions (nodes) and relationships between those regions (edges). Such problems naturally exist in fMRI scans of human subjects. These scans record activations of thousands of voxels over time, which we aim to simplify into the underlying cognitive network in use. We propose supervised and semi-supervised variations of this problem and postulate a constrained tensor decomposition formulation and a corresponding alternating least squares solver that is easy to implement. We show this formulation works well in controlled experiments where supervision is incomplete, superfluous and noisy, and is able to recover the underlying ground truth network. We then show that for real fMRI data our approach can reproduce well known results in neurology regarding the default mode network in resting-state healthy and Alzheimer affected individuals. Finally, we show that the reconstruction error of the decomposition provides a useful measure of the network strength and is useful at predicting key cognitive scores both by itself and with clinical information.

【Keywords】: constraints; fmri; tensor decomposition

Recommender systems 3

27. Learning to question: leveraging user preferences for shopping advice.

Paper Link】 【Pages】:203-211

【Authors】: Mahashweta Das ; Gianmarco De Francisci Morales ; Aristides Gionis ; Ingmar Weber

【Abstract】: We present ShoppingAdvisor, a novel recommender system that helps users in shopping for technical products. ShoppingAdvisor leverages both user preferences and technical product attributes in order to generate its suggestions. The system elicits user preferences via a tree-shaped flowchart, where each node is a question to the user. At each node, ShoppingAdvisor suggests a ranking of products matching the user's preferences, which gets progressively refined along the path from the tree's root to one of its leaves. In this paper we show (i) how to learn the structure of the tree, i.e., which questions to ask at each node, and (ii) how to produce a suitable ranking at each node. First, we adapt the classical top-down strategy for building decision trees in order to find the best user attribute to ask at each node. Differently from decision trees, ShoppingAdvisor partitions the user space rather than the product space. Second, we show how to employ a learning-to-rank approach in order to learn, for each node of the tree, a ranking of products appropriate to the users who reach that node. We experiment with two real-world datasets for cars and cameras, and a synthetic one. We use mean reciprocal rank to evaluate ShoppingAdvisor, and show how the performance increases by more than 50% along the path from root to leaf. We also show how collaborative recommendation algorithms such as k-nearest neighbor benefit from feature selection done by the ShoppingAdvisor tree. Our experiments show that ShoppingAdvisor produces good quality interpretable recommendations, while requiring less input from users and being able to handle the cold-start problem.

【Keywords】: collaborative content; learning; ranking; recommendation

28. Active learning and search on low-rank matrices.

Paper Link】 【Pages】:212-220

【Authors】: Dougal J. Sutherland ; Barnabás Póczos ; Jeff G. Schneider

【Abstract】: Collaborative prediction is a powerful technique, useful in domains from recommender systems to guiding the scientific discovery process. Low-rank matrix factorization is one of the most powerful tools for collaborative prediction. This work presents a general approach for active collaborative prediction with the Probabilistic Matrix Factorization model. Using variational approximations or Markov chain Monte Carlo sampling to estimate the posterior distribution over models, we can choose query points to maximize our understanding of the model, to best predict unknown elements of the data matrix, or to find as many "positive" data points as possible. We evaluate our methods on simulated data, and also show their applicability to movie ratings prediction and the discovery of drug-target interactions.

【Keywords】: active learning; active search; cold-start; collaborative filtering; drug discovery; matrix factorization; recommender systems

29. LCARS: a location-content-aware recommender system.

Paper Link】 【Pages】:221-229

【Authors】: Hongzhi Yin ; Yizhou Sun ; Bin Cui ; Zhiting Hu ; Ling Chen

【Abstract】: Newly emerging location-based and event-based social network services provide us with a new platform to understand users' preferences based on their activity history. A user can only visit a limited number of venues/events and most of them are within a limited distance range, so the user-item matrix is very sparse, which creates a big challenge for traditional collaborative filtering-based recommender systems. The problem becomes more challenging when people travel to a new city where they have no activity history. In this paper, we propose LCARS, a location-content-aware recommender system that offers a particular user a set of venues (e.g., restaurants) or events (e.g., concerts and exhibitions) by giving consideration to both personal interest and local preference. This recommender system can facilitate people's travel not only near the area in which they live, but also in a city that is new to them. Specifically, LCARS consists of two components: offline modeling and online recommendation. The offline modeling part, called LCA-LDA, is designed to learn the interest of each individual user and the local preference of each individual city by capturing item co-occurrence patterns and exploiting item contents. The online recommendation part automatically combines the learnt interest of the querying user and the local preference of the querying city to produce the top-k recommendations. To speed up this online process, a scalable query processing technique is developed by extending the classic Threshold Algorithm (TA). We evaluate the performance of our recommender system on two large-scale real data sets, DoubanEvent and Foursquare. The results show the superiority of LCARS in recommending spatial items for users, especially when traveling to new cities, in terms of both effectiveness and efficiency.

【Keywords】: cold start; location-based service; probabilistic generative model; recommender system; ta algorithm

Scalable methods for big data 4

30. Comparing apples to oranges: a scalable solution with heterogeneous hashing.

Paper Link】 【Pages】:230-238

【Authors】: Mingdong Ou ; Peng Cui ; Fei Wang ; Jun Wang ; Wenwu Zhu ; Shiqiang Yang

【Abstract】: Although hashing techniques have been popular for the large scale similarity search problem, most of the existing methods for designing optimal hash functions focus on homogeneous similarity assessment, i.e., the data entities to be indexed are of the same type. Realizing that heterogeneous entities and relationships are also ubiquitous in real world applications, there is an emerging need to retrieve and search similar or relevant data entities from multiple heterogeneous domains, e.g., recommending relevant posts and images to a certain Facebook user. In this paper, we address the problem of "comparing apples to oranges" under the large scale setting. Specifically, we propose a novel Relation-aware Heterogeneous Hashing (RaHH), which provides a general framework for generating hash codes of data entities sitting in multiple heterogeneous domains. Unlike some existing hashing methods that map heterogeneous data into a common Hamming space, the RaHH approach constructs a Hamming space for each type of data entities, and learns optimal mappings between them simultaneously. This makes the learned hash codes flexibly cope with the characteristics of different data domains. Moreover, the RaHH framework encodes both homogeneous and heterogeneous relationships between the data entities to design hash functions with improved accuracy. To validate the proposed RaHH method, we conduct extensive evaluations on two large datasets: one is crawled from a popular social media site, Tencent Weibo, and the other is an open Flickr dataset (NUS-WIDE). The experimental results clearly demonstrate that RaHH outperforms several state-of-the-art hashing methods with significant performance gains.

【Keywords】: big data; heterogeneous hashing; heterogeneous network; heterogeneous similarity search; scalability

31. Fast and scalable polynomial kernels via explicit feature maps.

Paper Link】 【Pages】:239-247

【Authors】: Ninh Pham ; Rasmus Pagh

【Abstract】: Approximation of non-linear kernels using random feature mapping has been successfully employed in large-scale data analysis applications, accelerating the training of kernel machines. While previous random feature mappings run in O(ndD) time for n training samples in d-dimensional space and D random feature maps, we propose a novel randomized tensor product technique, called Tensor Sketching, for approximating any polynomial kernel in O(n(d + D log D)) time. Also, we introduce both absolute and relative error bounds for our approximation to guarantee the reliability of our estimation algorithm. Empirically, Tensor Sketching achieves higher accuracy and often runs orders of magnitude faster than the state-of-the-art approach for large-scale real-world datasets.

【Keywords】: count sketch; fft; polynomial kernel; svm; tensor product
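The core trick combines the paper's keywords, Count Sketch and the FFT: a degree-p polynomial feature map is sketched implicitly by convolving p independent count sketches in the frequency domain. The following is a minimal sketch of that construction under assumed conventions (uniform hash arrays passed in explicitly), not the authors' released implementation.

```python
import numpy as np

def count_sketch(x, h, s, D):
    # Count Sketch of x: coordinate i is added into bin h[i] with sign s[i].
    out = np.zeros(D)
    np.add.at(out, h, s * x)
    return out

def tensor_sketch(x, hashes, D):
    # Tensor Sketch for a degree-p polynomial kernel: multiply the FFTs of p
    # independent count sketches and invert, implicitly sketching the p-fold
    # tensor product of x in O(d + D log D) time per example.
    f = np.ones(D, dtype=complex)
    for h, s in hashes:
        f *= np.fft.fft(count_sketch(x, h, s, D))
    return np.real(np.fft.ifft(f))
```

In expectation (over the hash functions), the dot product of two tensor sketches approximates (x·y)^p, which is what lets a linear learner stand in for a polynomial-kernel machine.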

32. Indexed block coordinate descent for large-scale linear classification with limited memory.

Paper Link】 【Pages】:248-256

【Authors】: Ian En-Hsu Yen ; Chun-Fu Chang ; Ting-Wei Lin ; Shan-Wei Lin ; Shou-De Lin

【Abstract】: Linear classification has achieved complexity linear in the data size. However, in many applications, data contain a large number of samples that do not help improve the quality of the model but still cost much I/O and memory to process. In this paper, we show how a Block Coordinate Descent method based on a Nearest-Neighbor Index can significantly reduce such cost when learning a dual-sparse model. In particular, we employ a truncated loss function to induce a series of convex programs with superior dual sparsity, and solve each dual using Indexed Block Coordinate Descent, which makes use of Approximate Nearest Neighbor (ANN) search to select active dual variables without I/O cost on irrelevant samples. We prove that, despite the bias and weak guarantee from the ANN query, the proposed algorithm has global convergence to the solution defined on the entire dataset, with sublinear complexity per iteration. Experiments in both sufficient and limited memory conditions show that the proposed approach learns many times faster than other state-of-the-art solvers without sacrificing accuracy.

【Keywords】: cccp; classification; dual coordinate descent; indexing; large-scale; limited-memory; nearest-neighbor; ramp-loss

33. Recursive regularization for large-scale classification with hierarchical and graphical dependencies.

Paper Link】 【Pages】:257-265

【Authors】: Siddharth Gopal ; Yiming Yang

【Abstract】: The two key challenges in hierarchical classification are to leverage the hierarchical dependencies between the class-labels for improving performance, and, at the same time, to maintain scalability across large hierarchies. In this paper we propose a regularization framework for large-scale hierarchical classification that addresses both problems. Specifically, we incorporate the hierarchical dependencies between the class-labels into the regularization structure of the parameters, thereby encouraging classes nearby in the hierarchy to share similar model parameters. Furthermore, we extend our approach to scenarios where the dependencies between the class-labels are encoded in the form of a graph rather than a hierarchy. To enable large-scale training, we develop a parallel-iterative optimization scheme that can handle datasets with hundreds of thousands of classes and millions of instances, and learn terabytes of parameters. Our experiments showed a consistent improvement over other competing approaches and achieved state-of-the-art results on benchmark datasets.

【Keywords】: hierarchical classification; large-scale evaluation; parallel optimization; recursive regularization
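The recursive-regularization idea in the abstract reduces to a simple penalty: shrink each class's weight vector toward its parent's rather than toward zero. A minimal sketch of that penalty term (the data-fit loss and parallel optimization are omitted; the dict-based encoding is an illustrative assumption):

```python
import numpy as np

def recursive_penalty(W, parent):
    # Recursive regularization: penalize each class's deviation from its
    # parent's weight vector (the root is shrunk toward zero), so classes
    # nearby in the hierarchy end up with similar parameters.
    total = 0.0
    for c, w in W.items():
        target = W.get(parent.get(c), np.zeros_like(w))
        total += np.sum((w - target) ** 2)
    return total
```

For graph-structured dependencies, the same penalty is summed over graph edges instead of parent-child pairs.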

Temporal/social influence 3

34. Discovering latent influence in online social activities via shared cascade poisson processes.

Paper Link】 【Pages】:266-274

【Authors】: Tomoharu Iwata ; Amar Shah ; Zoubin Ghahramani

【Abstract】: Many people share their activities with others through online communities. These shared activities have an impact on other users' activities. For example, users are likely to become interested in items that are adopted (e.g. liked, bought and shared) by their friends. In this paper, we propose a probabilistic model for discovering latent influence from sequences of item adoption events. An inhomogeneous Poisson process is used for modeling a sequence, in which adoption by a user triggers the subsequent adoption of the same item by other users. For modeling adoption of multiple items, we employ multiple inhomogeneous Poisson processes, which share parameters, such as influence for each user and relations between users. The proposed model can be used for finding influential users, discovering relations between users and predicting item popularity in the future. We present an efficient Bayesian inference procedure for the proposed model based on the stochastic EM algorithm. The effectiveness of the proposed model is demonstrated by using real data sets in a social bookmark sharing service.

【Keywords】: bayesian inference; latent variable models; poisson processes; social community

35. STRIP: stream learning of influence probabilities.

Paper Link】 【Pages】:275-283

【Authors】: Konstantin Kutzkov ; Albert Bifet ; Francesco Bonchi ; Aristides Gionis

【Abstract】: Influence-driven diffusion of information is a fundamental process in social networks. Learning the latent variables of such process, i.e., the influence strength along each link, is a central question towards understanding the structure and function of complex networks, modeling information cascades, and developing applications such as viral marketing. Motivated by modern microblogging platforms, such as Twitter, in this paper we study the problem of learning influence probabilities in a data-stream scenario, in which the network topology is relatively stable and the challenge of a learning algorithm is to keep up with a continuous stream of tweets using a small amount of time and memory. Our contribution is a number of randomized approximation algorithms, categorized according to the available space (superlinear, linear, and sublinear in the number of nodes n) and according to different models (landmark and sliding window). Among several results, we show that we can learn influence probabilities with one pass over the data, using O(n log n) space, in both the landmark model and the sliding-window model, and we further show that our algorithm is within a logarithmic factor of optimal. For truly large graphs, when one needs to operate with sublinear space, we show that we can still learn influence probabilities in one pass, assuming that we restrict our attention to the most active users. Our thorough experimental evaluation on large social graphs demonstrates that the empirical performance of our algorithms agrees with that predicted by the theory.

【Keywords】: randomized approximation algorithms; social influence; social network analysis; streaming

36. Fast structure learning in generalized stochastic processes with latent factors.

Paper Link】 【Pages】:284-292

【Authors】: Mohammad Taha Bahadori ; Yan Liu ; Eric P. Xing

【Abstract】: Understanding and quantifying the impact of unobserved processes is one of the major challenges of analyzing multivariate time series data. In this paper, we analyze a flexible stochastic process model, the generalized linear auto-regressive process (GLARP), and identify the conditions under which the impact of hidden variables appears as an additive term to the evolution matrix estimated with the maximum likelihood. In particular, we examine three examples, including two popular models for count data, i.e., Poisson and Conway-Maxwell Poisson vector auto-regressive processes, and one powerful model for extreme value data, i.e., Gumbel vector auto-regressive processes. We demonstrate that the impact of hidden factors can be separated out via convex optimization in these three models. We also propose a fast greedy algorithm based on the selection of composite atoms in each iteration and provide a performance guarantee for it. Experiments on two synthetic datasets, one social network dataset and one climatology dataset demonstrate the superior performance of our proposed models.

【Keywords】: generalized linear models; latent factors; time series analysis

Sparse learning 3

37. Robust sparse estimation of multiresponse regression and inverse covariance matrix via the L2 distance.

Paper Link】 【Pages】:293-301

【Authors】: Aurelie C. Lozano ; Huijing Jiang ; Xinwei Deng

【Abstract】: We propose a robust framework to jointly perform two key modeling tasks involving high dimensional data: (i) learning a sparse functional mapping from multiple predictors to multiple responses while taking advantage of the coupling among responses, and (ii) estimating the conditional dependency structure among responses while adjusting for their predictors. The traditional likelihood-based estimators lack resilience with respect to outliers and model misspecification. This issue is exacerbated when dealing with high dimensional noisy data. In this work, we propose instead to minimize a regularized distance criterion, which is motivated by the minimum distance functionals used in nonparametric methods for their excellent robustness properties. The proposed estimates can be obtained efficiently by leveraging a sequential quadratic programming algorithm. We provide theoretical justification such as estimation consistency for the proposed estimator. Additionally, we shed light on the robustness of our estimator through its linearization, which yields a combination of weighted lasso and graphical lasso with the sample weights providing an intuitive explanation of the robustness. We demonstrate the merits of our framework through simulation study and the analysis of real financial and genetics data.

【Keywords】: high dimensional data; inverse covariance; l2e; multiresponse regression; robust estimation; sparse learning; variable selection

38. Exact sparse recovery with L0 projections.

Paper Link】 【Pages】:302-310

【Authors】: Ping Li ; Cun-Hui Zhang

【Abstract】: Many applications (e.g., anomaly detection) concern sparse signals. This paper focuses on the problem of recovering a K-sparse signal x ∈ R^{1×N}, i.e., K << N and ∑_{i=1}^{N} 1{x_i ≠ 0} = K. In the mainstream framework of compressed sensing (CS), x is recovered from M linear measurements y = xS ∈ R^{1×M}, where S ∈ R^{N×M} is often a Gaussian (or Gaussian-like) design matrix. In our proposed method, the design matrix S is generated from an α-stable distribution with α ≈ 0. Our decoding algorithm mainly requires one linear scan of the coordinates, followed by a few iterations on a small number of coordinates which are "undetermined" in the previous iteration. Our practical algorithm consists of two estimators. In the first iteration, the (absolute) minimum estimator is able to filter out a majority of the zero coordinates. The gap estimator, which is applied in each iteration, can accurately recover the magnitudes of the nonzero coordinates. Comparisons with linear programming (LP) and orthogonal matching pursuit (OMP) demonstrate that our algorithm can be significantly faster in decoding speed and more accurate in recovery quality, for the task of exact sparse recovery. Our procedure is robust against measurement noise. Even when there are no sufficient measurements, our algorithm can still reliably recover a significant portion of the nonzero coordinates.

【Keywords】: compressed sensing; l0 projections; stable distributions

39. Robust principal component analysis via capped norms.

Paper Link】 【Pages】:311-319

【Authors】: Qian Sun ; Shuo Xiang ; Jieping Ye

【Abstract】: In many applications such as image and video processing, the data matrix often possesses simultaneously a low-rank structure capturing the global information and a sparse component capturing the local information. How to accurately extract the low-rank and sparse components is a major challenge. Robust Principal Component Analysis (RPCA) is a general framework to extract such structures. It is well studied that under certain assumptions, convex optimization using the trace norm and l1-norm can be an effective computational surrogate of the difficult RPCA problem. However, such a convex formulation is based on a strong assumption which may not hold in real-world applications, and the approximation error in these convex relaxations often cannot be neglected. In this paper, we present a novel non-convex formulation for the RPCA problem using the capped trace norm and the capped l1-norm. In addition, we present two algorithms to solve the non-convex optimization: one is based on the Difference of Convex functions (DC) framework and the other attempts to solve the sub-problems via a greedy approach. Our empirical evaluations on synthetic and real-world data show that both of the proposed algorithms achieve higher accuracy than existing convex formulations. Furthermore, between the two proposed algorithms, the greedy algorithm is more efficient than the DC programming, while they achieve comparable accuracy.

【Keywords】: admm; dc programming; image processing; low-rank; non-convex optimization; sparsity; trace norm
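The capped norms named in the abstract are easy to state concretely: each entry (or singular value) contributes at most a threshold θ to the penalty, so large values are not over-penalized the way they are under the plain l1 or trace norm. A small sketch of the two penalties (the optimization algorithms themselves are not shown):

```python
import numpy as np

def capped_l1(X, theta):
    # Capped l1-norm: sum_i min(|x_i|, theta); entries above theta all
    # contribute exactly theta, approximating the l0 "count" penalty.
    return np.minimum(np.abs(X), theta).sum()

def capped_trace(M, theta):
    # Capped trace norm: cap each singular value at theta; for matrices whose
    # nonzero singular values exceed theta this behaves like theta * rank(M).
    s = np.linalg.svd(M, compute_uv=False)
    return np.minimum(s, theta).sum()
```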

Graph clustering 3

40. Flexible and robust co-regularized multi-domain graph clustering.

Paper Link】 【Pages】:320-328

【Authors】: Wei Cheng ; Xiang Zhang ; Zhishan Guo ; Yubao Wu ; Patrick F. Sullivan ; Wei Wang

【Abstract】: Multi-view graph clustering aims to enhance clustering performance by integrating heterogeneous information collected in different domains. Each domain provides a different view of the data instances. Leveraging cross-domain information has been demonstrated to be an effective way to achieve better clustering results. Despite the previous success, existing multi-view graph clustering methods usually assume that different views are available for the same set of instances. Thus instances in different domains can be treated as having a strict one-to-one relationship. In many real-life applications, however, data instances in one domain may correspond to multiple instances in another domain. Moreover, relationships between instances in different domains may be associated with weights based on prior (partial) knowledge. In this paper, we propose a flexible and robust framework, CGC (Co-regularized Graph Clustering), based on non-negative matrix factorization (NMF), to tackle these challenges. CGC has several advantages over the existing methods. First, it supports many-to-many cross-domain instance relationships. Second, it incorporates weights on cross-domain relationships. Third, it allows partial cross-domain mapping so that graphs in different domains may have different sizes. Finally, it provides users with the extent to which the cross-domain instance relationship violates the in-domain clustering structure, and thus enables users to re-evaluate the consistency of the relationship. Extensive experimental results on UCI benchmark data sets, newsgroup data sets and biological interaction networks demonstrate the effectiveness of our approach.

【Keywords】: co-regularization; graph clustering; nonnegative matrix factorization
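
CGC's full co-regularized objective is beyond the abstract, but its NMF building block is standard. A minimal sketch of Lee-Seung multiplicative updates applied to a toy adjacency matrix (the two-block graph and all names here are illustrative, not from the paper):

```python
import numpy as np

def nmf(A, k, n_iter=300, eps=1e-9, seed=0):
    """Factor A ~= W @ H (all nonnegative) with Lee-Seung
    multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # update H, error non-increasing
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # update W, error non-increasing
    return W, H

# Toy graph with two disconnected blocks; rows of W act as soft
# cluster memberships.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
W, H = nmf(A, k=2)
labels = W.argmax(axis=1)  # cluster assignment per node
```

In CGC this in-domain factorization is additionally co-regularized by the cross-domain relationship weights; the sketch shows only the single-graph core.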

41. Graph cluster randomization: network exposure to multiple universes.

【Paper Link】 【Pages】:329-337

【Authors】: Johan Ugander ; Brian Karrer ; Lars Backstrom ; Jon M. Kleinberg

【Abstract】: A/B testing is a standard approach for evaluating the effect of online experiments; the goal is to estimate the 'average treatment effect' of a new feature or condition by exposing a sample of the overall population to it. A drawback with A/B testing is that it is poorly suited for experiments involving social interference, when the treatment of individuals spills over to neighboring individuals along an underlying social network. In this work, we propose a novel methodology using graph clustering to analyze average treatment effects under social interference. To begin, we characterize graph-theoretic conditions under which individuals can be considered to be 'network exposed' to an experiment. We then show how graph cluster randomization admits an efficient exact algorithm to compute the probabilities for each vertex being network exposed under several of these exposure conditions. Using these probabilities as inverse weights, a Horvitz-Thompson estimator can then provide an effect estimate that is unbiased, provided that the exposure model has been properly specified. Given an estimator that is unbiased, we focus on minimizing the variance. First, we develop simple sufficient conditions for the variance of the estimator to be asymptotically small in n, the size of the graph. However, for general randomization schemes, this variance can be lower bounded by an exponential function of the degrees of a graph. In contrast, we show that if a graph satisfies a restricted-growth condition on the growth rate of neighborhoods, then there exists a natural clustering algorithm, based on vertex neighborhoods, for which the variance of the estimator can be upper bounded by a linear function of the degrees. Thus we show that proper cluster randomization can lead to exponentially lower estimator variance when experimentally measuring average treatment effects under interference.

【Keywords】: a/b testing; bucket testing; causal inference; graph clustering; interference; network effects; social networks
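
The Horvitz-Thompson step described above is compact enough to sketch. Below, the exposure indicators and probabilities are made-up stand-ins for the quantities the paper computes exactly under graph cluster randomization:

```python
import numpy as np

def horvitz_thompson_ate(y, exposed_treat, exposed_ctrl, p_treat, p_ctrl):
    """Inverse-probability-weighted estimate of the average treatment
    effect: each network-exposed vertex's outcome is weighted by
    1 / Pr[that exposure], then the two exposure groups are contrasted."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    treat = np.where(exposed_treat, y / p_treat, 0.0).sum() / n
    ctrl = np.where(exposed_ctrl, y / p_ctrl, 0.0).sum() / n
    return treat - ctrl

# Toy example: 4 vertices; outcomes under treatment exposure are higher.
y = [3.0, 3.0, 1.0, 1.0]
exposed_treat = np.array([True, True, False, False])
exposed_ctrl  = np.array([False, False, True, True])
p_treat = np.array([0.5, 0.5, 0.5, 0.5])  # hypothetical exposure probabilities
p_ctrl  = np.array([0.5, 0.5, 0.5, 0.5])
ate = horvitz_thompson_ate(y, exposed_treat, exposed_ctrl, p_treat, p_ctrl)
# Here each group's weighted mean is exact, so ate = 3.0 - 1.0 = 2.0
```

Unbiasedness hinges on the exposure probabilities being the true ones, which is exactly why the paper's exact-probability algorithm matters.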

42. Social influence based clustering of heterogeneous information networks.

【Paper Link】 【Pages】:338-346

【Authors】: Yang Zhou ; Ling Liu

【Abstract】: Social networks continue to grow in size and the type of information hosted. We witness a growing interest in clustering a social network of people based on both their social relationships and their participations in activity based information networks. In this paper, we present a social influence based clustering framework for analyzing heterogeneous information networks with three unique features. First, we introduce a novel social influence based vertex similarity metric in terms of both self-influence similarity and co-influence similarity. We compute self-influence and co-influence based similarity based on social graph and its associated activity graphs and influence graphs respectively. Second, we compute the combined social influence based similarity between each pair of vertices by unifying the self-similarity and multiple co-influence similarity scores through a weight function with an iterative update method. Third, we design an iterative learning algorithm, SI-Cluster, to dynamically refine the K clusters by continuously quantifying and adjusting the weights on self-influence similarity and on multiple co-influence similarity scores towards the clustering convergence. To make SI-Cluster converge fast, we transform a sophisticated nonlinear fractional programming problem of multiple weights into a straightforward nonlinear parametric programming problem of a single variable. Our experiment results show that SI-Cluster not only achieves a better balance between self-influence and co-influence similarities but also scales extremely well for large graph clustering.

【Keywords】: graph clustering; heterogeneous network; social influence

Diffusion in social networks 3

43. Confluence: conformity influence in large social networks.

【Paper Link】 【Pages】:347-355

【Authors】: Jie Tang ; Sen Wu ; Jimeng Sun

【Abstract】: Conformity is a type of social influence involving a change in opinion or behavior in order to fit in with a group. Employing several social networks as the source for our experimental data, we study how the effect of conformity plays a role in changing users' online behavior. We formally define several major types of conformity at the individual, peer, and group levels. We propose the Confluence model to formalize the effects of social conformity into a probabilistic model. Confluence can distinguish and quantify the effects of the different types of conformities. To scale up to large social networks, we propose a distributed learning method that can construct the Confluence model efficiently with near-linear speedup. Our experimental results on four different types of large social networks, i.e., Flickr, Gowalla, Weibo and Co-Author, verify the existence of the conformity phenomena. Leveraging the conformity information, Confluence can accurately predict actions of users. Our experiments show that Confluence significantly improves the prediction accuracy by up to 5-10% compared with several alternative methods.

【Keywords】: conformity; social influence; social network

44. The role of information diffusion in the evolution of social networks.

【Paper Link】 【Pages】:356-364

【Authors】: Lilian Weng ; Jacob Ratkiewicz ; Nicola Perra ; Bruno Gonçalves ; Carlos Castillo ; Francesco Bonchi ; Rossano Schifanella ; Filippo Menczer ; Alessandro Flammini

【Abstract】: Every day millions of users are connected through online social networks, generating a rich trove of data that allows us to study the mechanisms behind human interactions. Triadic closure has been treated as the major mechanism for creating social links: if Alice follows Bob and Bob follows Charlie, Alice will follow Charlie. Here we present an analysis of longitudinal micro-blogging data, revealing a more nuanced view of the strategies employed by users when expanding their social circles. While the network structure affects the spread of information among users, the network is in turn shaped by this communication activity. This suggests a link creation mechanism whereby Alice is more likely to follow Charlie after seeing many messages by Charlie. We characterize users with a set of parameters associated with different link creation strategies, estimated by a Maximum-Likelihood approach. Triadic closure does have a strong effect on link formation, but shortcuts based on traffic are another key factor in interpreting network evolution. However, individual strategies for following other users are highly heterogeneous. Link creation behaviors can be summarized by classifying users in different categories with distinct structural and behavioral characteristics. Users who are popular, active, and influential tend to create traffic-based shortcuts, making the information diffusion process more efficient in the network.

【Keywords】: information diffusion; link creation; network evolution; network structure; shortcut; social media; traffic; user behavior

45. Extracting social events for learning better information diffusion models.

【Paper Link】 【Pages】:365-373

【Authors】: Shuyang Lin ; Fengjiao Wang ; Qingbo Hu ; Philip S. Yu

【Abstract】: Learning of the information diffusion model is a fundamental problem in the study of information diffusion in social networks. Existing approaches learn the diffusion models from events in social networks. However, events in social networks may have different underlying reasons. Some of them may be caused by the social influence inside the network, while others may reflect external trends in the "real world". Most existing work on the learning of diffusion models does not distinguish the events caused by the social influence from those caused by external trends. In this paper, we extract social events from data streams in social networks, and then use the extracted social events to improve the learning of information diffusion models. We propose a LADP (Latent Action Diffusion Path) model to incorporate the information diffusion model with the model of external trends, and then design an EM-based algorithm to infer the diffusion probabilities, the external trends and the sources of events efficiently.

【Keywords】: information diffusion; social event; social influence

Time series and spatial data 3

46. Model selection in markovian processes.

【Paper Link】 【Pages】:374-382

【Authors】: Assaf Hallak ; Dotan Di Castro ; Shie Mannor

【Abstract】: When analyzing data that originated from a dynamical system, a common practice is to encompass the problem in the well-known frameworks of Markov Decision Processes (MDPs) and Reinforcement Learning (RL). The state space in these solutions is usually chosen in some heuristic fashion and the formed MDP can then be used to simulate and predict data, as well as indicate the best possible action in each state. The model chosen to characterize the data affects the complexity and accuracy of any further action we may wish to apply, yet few methods that rely on the dynamic structure to select such a model have been suggested. In this work we address the problem of how to use time series data to choose from a finite set of candidate discrete state spaces, where these spaces are constructed by a domain expert. We formalize the notion of model selection consistency in the proposed setup. We then discuss the difference between our proposed framework and the classical Maximum Likelihood (ML) framework, and give an example where ML fails. Afterwards, we suggest alternative selection criteria and show them to be weakly consistent. We then define weak consistency for a model construction algorithm and show a simple algorithm that is weakly consistent. Finally, we test the performance of the suggested criteria and algorithm on both simulated and real world data.

【Keywords】: dynamic mailing policies; markov decision processes; model selection; reinforcement learning

47. DTW-D: time series semi-supervised learning from a single example.

【Paper Link】 【Pages】:383-391

【Authors】: Yanping Chen ; Bing Hu ; Eamonn J. Keogh ; Gustavo E. A. P. A. Batista

【Abstract】: Classification of time series data is an important problem with applications in virtually every scientific endeavor. The large research community working on time series classification has typically used the UCR Archive to test their algorithms. In this work we argue that the availability of this resource has isolated much of the research community from the following reality: labeled time series data is often very difficult to obtain. The obvious solution to this problem is the application of semi-supervised learning; however, as we shall show, direct applications of off-the-shelf semi-supervised learning algorithms do not typically work well for time series. In this work we explain why semi-supervised learning algorithms typically fail for time series problems, and we introduce a simple but very effective fix. We demonstrate our ideas on diverse real-world problems.

【Keywords】: classification; semi-supervised learning; time series
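
The abstract leaves the fix unnamed; the title's DTW-D is, in the published method, the DTW distance normalized by the Euclidean distance, and the sketch below assumes that ratio. The toy series are illustrative:

```python
import numpy as np

def dtw(x, y):
    """Classic dynamic-time-warping distance (squared-error local cost,
    square root of the accumulated path cost)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def dtw_d(x, y, eps=1e-12):
    """DTW-D: DTW divided by Euclidean distance, so a pair that warping
    genuinely helps scores much lower than a pair it does not."""
    return dtw(x, y) / (np.linalg.norm(np.asarray(x) - np.asarray(y)) + eps)

a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0])   # same shape as a, shifted one step
c = np.array([0.9, 0.9, 0.9, 0.9, 0.9])   # flat line, no shared shape
```

Plain Euclidean distance ranks the flat line c closer to a than the shifted copy b is; the DTW-D ratio reverses that ranking, which is the behavior the fix exploits for semi-supervised self-labeling.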

48. Model-based kernel for efficient time series analysis.

【Paper Link】 【Pages】:392-400

【Authors】: Huanhuan Chen ; Fengzhen Tang ; Peter Tiño ; Xin Yao

【Abstract】: We present novel, efficient, model-based kernels for time series data rooted in the reservoir computation framework. The kernels are implemented by fitting reservoir models sharing the same fixed deterministically constructed state transition part to individual time series. The proposed kernels can naturally handle time series of different length without the need to specify a parametric model class for the time series. Compared with most time series kernels, our kernels are computationally efficient. We show how the model distances used in the kernel can be calculated analytically or efficiently estimated. The experimental results on synthetic and benchmark time series classification tasks confirm the efficiency of the proposed kernel in terms of both generalization accuracy and computational speed. This paper also investigates on-line reservoir kernel construction for extremely long time series.

【Keywords】: kernel methods; reservoir computing; time series

Diffusion in social networks 1

49. Information cascade at group scale.

【Paper Link】 【Pages】:401-409

【Authors】: Milad Eftekhar ; Yashar Ganjali ; Nick Koudas

【Abstract】: Identifying the k most influential individuals in a social network is a well-studied problem. The objective is to detect k individuals in a (social) network who will influence the maximum number of people, if they are independently convinced of adopting a new strategy (product, idea, etc). There are cases in real life, however, where we aim to instigate groups instead of individuals to trigger network diffusion. Such cases abound, e.g., billboards, TV commercials and newspaper ads are utilized extensively to boost popularity and raise awareness. In this paper, we generalize the "influential nodes" problem. Namely, we are interested in locating the most "influential groups" in a network. As the first paper to address this problem, we (1) propose a fine-grained model of information diffusion for the group-based problem, (2) show that the process is submodular and present an algorithm to determine the influential groups under this model (with a precise approximation bound), (3) propose a coarse-grained model that inspects the network at group level (not individuals) significantly speeding up calculations for large networks, (4) show that the diffusion function we design here is submodular in the general case, and propose an approximation algorithm for this coarse-grained model, and finally by conducting experiments on real datasets, (5) demonstrate that seeding members of selected groups to be the first adopters can broaden diffusion (when compared to the influential individuals case). Moreover, we can identify these influential groups much faster (up to 12 million times speedup), delivering a practical solution to this problem.

【Keywords】: group diffusion; influential groups; information cascade; social networks
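
The submodularity result in (2) is what licenses the classic greedy algorithm with its (1 - 1/e) approximation bound. A minimal coverage-style sketch (the group names and reach sets are hypothetical, and real influence spread would be estimated by simulation rather than given as fixed sets):

```python
def greedy_max_coverage(candidates, k):
    """Greedy (1 - 1/e)-approximation for monotone submodular
    maximization: repeatedly pick the group with the largest marginal
    gain in newly influenced nodes."""
    chosen, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0
        for name, reach in candidates.items():
            if name in chosen:
                continue
            gain = len(reach - covered)  # marginal gain of adding this group
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no group adds anything new
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered

# Hypothetical groups and the sets of users each would influence.
groups = {
    "billboard": {1, 2, 3, 4},
    "tv":        {3, 4, 5, 6, 7},
    "newspaper": {1, 2},
}
picked, covered = greedy_max_coverage(groups, k=2)
# Greedy picks "tv" first (gain 5), then "billboard" (marginal gain 2).
```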

Time series and spatial data 1

50. Mining lines in the sand: on trajectory discovery from untrustworthy data in cyber-physical system.

【Paper Link】 【Pages】:410-418

【Authors】: Lu An Tang ; Xiao Yu ; Quanquan Gu ; Jiawei Han ; Alice Leung ; Thomas F. La Porta

【Abstract】: A Cyber-Physical System (CPS) integrates physical (i.e., sensor) devices with cyber (i.e., informational) components to form a context sensitive system that responds intelligently to dynamic changes in real-world situations. The CPS has wide applications in scenarios such as environment monitoring, battlefield surveillance and traffic control. One key research problem of CPS is called "mining lines in the sand". With a large number of sensors (sand) deployed in a designated area, the CPS is required to discover all the trajectories (lines) of passing intruders in real time. There are two crucial challenges that need to be addressed: (1) the collected sensor data are not trustworthy; (2) the intruders do not send out any identification information. The system needs to distinguish multiple intruders and track their movements. In this study, we propose a method called LiSM (Line-in-the-Sand Miner) to discover trajectories from untrustworthy sensor data. LiSM constructs a watching network from sensor data and computes the locations of intruder appearances based on the link information of the network. The system retrieves a cone-model from the historical trajectories and tracks multiple intruders based on this model. Finally the system validates the mining results and updates the sensor's reliability in a feedback process. Extensive experiments on big datasets demonstrate the feasibility and applicability of the proposed methods.

【Keywords】: cyber-physical system; sensor network; trajectory

Unsupervised and topic learning 4

51. A general bootstrap performance diagnostic.

【Paper Link】 【Pages】:419-427

【Authors】: Ariel Kleiner ; Ameet Talwalkar ; Sameer Agarwal ; Ion Stoica ; Michael I. Jordan

【Abstract】: As datasets become larger, more complex, and more available to diverse groups of analysts, it would be quite useful to be able to automatically and generically assess the quality of estimates, much as we are able to automatically train and evaluate predictive models such as classifiers. However, despite the fundamental importance of estimator quality assessment in data analysis, this task has eluded highly automatic solutions. While the bootstrap provides perhaps the most promising step in this direction, its level of automation is limited by the difficulty of evaluating its finite sample performance and even its asymptotic consistency. Thus, we present here a general diagnostic procedure which directly and automatically evaluates the accuracy of the bootstrap's outputs, determining whether or not the bootstrap is performing satisfactorily when applied to a given dataset and estimator. We show that our proposed diagnostic is effective via an extensive empirical evaluation on a variety of estimators and simulated and real datasets, including a real-world query workload from Conviva, Inc. involving 1.7TB of data (i.e., approximately 0.5 billion data points).

【Keywords】: bootstrap; diagnostic; estimator quality assessment; performance
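
For context on what the diagnostic evaluates, here is the basic nonparametric bootstrap itself (this is background, not the paper's diagnostic procedure): resample with replacement, recompute the estimator, and use the spread of the resampled estimates as the quality assessment.

```python
import numpy as np

def bootstrap_se(data, estimator, n_boot=2000, seed=0):
    """Nonparametric bootstrap standard error: the standard deviation of
    the estimator over resamples drawn with replacement from the data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = np.array([
        estimator(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])
    return stats.std(ddof=1)

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=400)
se = bootstrap_se(sample, np.mean)
# For the mean of n=400 unit-variance draws, theory gives SE ~ 1/20 = 0.05.
```

The paper's point is that whether such an output is trustworthy for a given estimator and dataset is itself something that should be checked automatically.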

52. Subsampling for efficient and effective unsupervised outlier detection ensembles.

【Paper Link】 【Pages】:428-436

【Authors】: Arthur Zimek ; Matthew Gaudet ; Ricardo J. G. B. Campello ; Jörg Sander

【Abstract】: Outlier detection and ensemble learning are well established research directions in data mining, yet the application of ensemble techniques to outlier detection has been rarely studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample per se, besides inducing diversity, can, under certain conditions, already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples further improves the results. While the intuition that ensembles improve over single outlier detectors has so far simply been carried over from the classification literature, here we also justify analytically why ensembles can be expected to work in the unsupervised setting of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than just the single outlier detector on the complete data.

【Keywords】: ensemble; outlier detection
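
A hedged illustration of the idea, not the paper's exact experimental setup: a k-nearest-neighbor-distance outlier detector whose ensemble diversity comes solely from random subsamples of the data.

```python
import numpy as np

def knn_scores(points, refs, k=2):
    """Outlier score of each point = distance to its k-th nearest
    reference point (larger means more outlying). A point that is also
    in refs contributes its own zero self-distance as the 1st neighbor."""
    d = np.linalg.norm(points[:, None, :] - refs[None, :, :], axis=2)
    d.sort(axis=1)
    return d[:, k - 1]

def subsample_ensemble(X, n_members=25, rate=0.5, k=2, seed=0):
    """Average kNN outlier scores over detectors built on random
    subsamples; the subsampling alone induces member diversity."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = max(k, int(rate * n))
    scores = np.zeros(n)
    for _ in range(n_members):
        idx = rng.choice(n, size=m, replace=False)
        scores += knn_scores(X, X[idx], k)
    return scores / n_members

rng = np.random.default_rng(1)
inliers = rng.normal(0.0, 0.3, size=(60, 2))   # tight cluster at the origin
outlier = np.array([[5.0, 5.0]])               # one far-away point
X = np.vstack([inliers, outlier])
scores = subsample_ensemble(X)                 # index 60 should score highest
```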

53. A phrase mining framework for recursive construction of a topical hierarchy.

【Paper Link】 【Pages】:437-445

【Authors】: Chi Wang ; Marina Danilevsky ; Nihit Desai ; Yinan Zhang ; Phuong Nguyen ; Thrivikrama Taula ; Jiawei Han

【Abstract】: A high quality hierarchical organization of the concepts in a dataset at different levels of granularity has many valuable applications such as search, summarization, and content browsing. In this paper we propose an algorithm for recursively constructing a hierarchy of topics from a collection of content-representative documents. We characterize each topic in the hierarchy by an integrated ranked list of mixed-length phrases. Our mining framework is based on a phrase-centric view for clustering, extracting, and ranking topical phrases. Experiments with datasets from three different domains illustrate our ability to generate hierarchies of high quality topics represented by meaningful phrases.

【Keywords】: keyphrase extraction; keyphrase ranking; network analysis; ontology learning; topic modeling

54. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation.

【Paper Link】 【Pages】:446-454

【Authors】: James R. Foulds ; Levi Boyles ; Christopher DuBois ; Padhraic Smyth ; Max Welling

【Abstract】: There has been an explosion in the amount of digital text information available in recent years, leading to challenges of scale for traditional inference algorithms for topic models. Recent advances in stochastic variational inference algorithms for latent Dirichlet allocation (LDA) have made it feasible to learn topic models on very large-scale corpora, but these methods do not currently take full advantage of the collapsed representation of the model. We propose a stochastic algorithm for collapsed variational Bayesian inference for LDA, which is simpler and more efficient than the state of the art method. In experiments on large-scale text corpora, the algorithm was found to converge faster and often to a better solution than previous methods. Human-subject experiments also demonstrated that the method can learn coherent topics in seconds on small corpora, facilitating the use of topic models in interactive document analysis software.

【Keywords】: stochastic learning; topic models; variational inference

Social and information networks 4

55. WiseMarket: a new paradigm for managing wisdom of online social users.

【Paper Link】 【Pages】:455-463

【Authors】: Caleb Chen Cao ; Yongxin Tong ; Lei Chen ; H. V. Jagadish

【Abstract】: The benefits of crowdsourcing are well-recognized today for an increasingly broad range of problems. Meanwhile, the rapid development of social media makes it possible to seek the wisdom of a crowd of targeted users. However, it is not trivial to implement a crowdsourcing platform on social media; specifically, to turn social media users into workers, we need to address the following two challenges: 1) how to motivate users to participate in tasks, and 2) how to choose users for a task. In this paper, we present Wise Market as an effective framework for crowdsourcing on social media that motivates users to participate in a task with care and correctly aggregates their opinions on pairwise choice problems. The Wise Market consists of a set of investors, each with an associated individual confidence in his/her prediction, and after the investment, only the ones whose choices agree with the whole market are granted rewards. Therefore, a social media user has to give his/her "best" answer in order to get rewards; as a consequence, careless answers from sloppy users are discouraged. Under the Wise Market framework, we define an optimization problem to minimize the expected cost of paying out rewards while guaranteeing a minimum confidence level, called the Effective Market Problem (EMP). We propose exact algorithms for calculating the market confidence and the expected cost with O(n log^2 n) time cost in a Wise Market with n investors. To deal with the enormous number of users on social media, we design a Central Limit Theorem-based approximation algorithm to compute the market confidence with O(n) time cost, as well as a bounded approximation algorithm to calculate the expected cost with O(n) time cost. Finally, we have conducted extensive experiments to validate the effectiveness of the proposed algorithms on real and synthetic data.

【Keywords】: crowdsourcing; human computation; market; social media

56. Multi-label relational neighbor classification using social context features.

【Paper Link】 【Pages】:464-472

【Authors】: Xi Wang ; Gita Sukthankar

【Abstract】: Networked data, extracted from social media, web pages, and bibliographic databases, can contain entities of multiple classes, interconnected through different types of links. In this paper, we focus on the problem of performing multi-label classification on networked data, where the instances in the network can be assigned multiple labels. In contrast to traditional content-only classification methods, relational learning succeeds in improving classification performance by leveraging the correlation of the labels between linked instances. However, instances in a network can be linked for various causal reasons, hence treating all links in a homogeneous way can limit the performance of relational classifiers. In this paper, we propose a multi-label iterative relational neighbor classifier that employs social context features (SCRN). Our classifier incorporates a class propagation probability distribution obtained from instances' social features, which are in turn extracted from the network topology. This class-propagation probability captures the node's intrinsic likelihood of belonging to each class, and serves as a prior weight for each class when aggregating the neighbors' class labels in the collective inference procedure. Experiments on several real-world datasets demonstrate that our proposed classifier boosts classification performance over common benchmarks on networked multi-label data.

【Keywords】: collective classification; relational learning; social dimensions

57. Scalable text and link analysis with mixed-topic link models.

【Paper Link】 【Pages】:473-481

【Authors】: Yaojia Zhu ; Xiaoran Yan ; Lise Getoor ; Cristopher Moore

【Abstract】: Many data sets contain rich information about objects, as well as pairwise relations between them. For instance, in networks of websites, scientific papers, and other documents, each node has content consisting of a collection of words, as well as hyperlinks or citations to other nodes. In order to perform inference on such data sets, and make predictions and recommendations, it is useful to have models that are able to capture the processes which generate the text at each node and the links between them. In this paper, we combine classic ideas in topic modeling with a variant of the mixed-membership block model recently developed in the statistical physics community. The resulting model has the advantage that its parameters, including the mixture of topics of each document and the resulting overlapping communities, can be inferred with a simple and scalable expectation-maximization algorithm. We test our model on three data sets, performing unsupervised topic classification and link prediction. For both tasks, our model outperforms several existing state-of-the-art methods, achieving higher accuracy with significantly less computation, analyzing a data set with 1.3 million words and 44 thousand links in a few minutes.

【Keywords】: document classification; link prediction; stochastic block model; topic modeling

58. Collaborative boosting for activity classification in microblogs.

【Paper Link】 【Pages】:482-490

【Authors】: Yangqiu Song ; Zhengdong Lu ; Cane Wing-ki Leung ; Qiang Yang

【Abstract】: Users' daily activities, such as dining and shopping, inherently reflect their habits, intents and preferences, thus provide invaluable information for services such as personalized information recommendation and targeted advertising. Users' activity information, although ubiquitous on social media, has largely been unexploited. This paper addresses the task of user activity classification in microblogs, where users can publish short messages and maintain social networks online. We identify the importance of modeling a user's individuality, and that of exploiting opinions of the user's friends for accurate activity classification. In this light, we propose a novel collaborative boosting framework comprising a text-to-activity classifier for each user, and a mechanism for collaboration between classifiers of users having social connections. The collaboration between two classifiers includes exchanging their own training instances and their dynamically changing labeling decisions. We propose an iterative learning procedure that is formulated as gradient descent in learning function space, while opinion exchange between classifiers is implemented with a weighted voting in each learning iteration. We show through experiments that on real-world data from Sina Weibo, our method outperforms existing off-the-shelf algorithms that do not take users' individuality or social connections into account.

【Keywords】: activity classification; boosting; collaborative classification; social regularization

Graph mining and sampling 4

59. Trace complexity of network inference.

【Paper Link】 【Pages】:491-499

【Authors】: Bruno D. Abrahao ; Flavio Chierichetti ; Robert Kleinberg ; Alessandro Panconesi

【Abstract】: The network inference problem consists of reconstructing the edge set of a network given traces representing the chronology of infection times as epidemics spread through the network. This problem is a paradigmatic representative of prediction tasks in machine learning that require deducing a latent structure from observed patterns of activity in a network, which often require an unrealistically large number of resources (e.g., amount of available data, or computational time). A fundamental question is to understand which properties we can predict with a reasonable degree of accuracy with the available resources, and which we cannot. We define the trace complexity as the number of distinct traces required to achieve high fidelity in reconstructing the topology of the unobserved network or, more generally, some of its properties. We give algorithms that are competitive with, while being simpler and more efficient than, existing network inference approaches. Moreover, we prove that our algorithms are nearly optimal, by proving an information-theoretic lower bound on the number of traces that an optimal inference algorithm requires for performing this task in the general case. Given these strong lower bounds, we turn our attention to special cases, such as trees and bounded-degree graphs, and to property recovery tasks, such as reconstructing the degree distribution without inferring the network. We show that these problems require a much smaller (and more realistic) number of traces, making them potentially solvable in practice.

【Keywords】: independent cascade model; network epidemics; network inference; sampling complexity

60. Debiasing social wisdom.

【Paper Link】 【Pages】:500-508

【Authors】: Abhimanyu Das ; Sreenivas Gollapudi ; Rina Panigrahy ; Mahyar Salek

【Abstract】: With the explosive growth of social networks, many applications are increasingly harnessing the pulse of online crowds for a variety of tasks such as marketing, advertising, and opinion mining. An important example is the wisdom of crowd effect that has been well studied for such tasks when the crowd is non-interacting. However, these studies don't explicitly address the network effects in social networks. A key difference in this setting is the presence of social influences that arise from these interactions and can undermine the wisdom of the crowd [17]. Using a natural model of opinion formation, we analyze the effect of these interactions on an individual's opinion and estimate her propensity to conform. We then propose efficient sampling algorithms incorporating these conformity values to arrive at a debiased estimate of the wisdom of a crowd. We analyze the trade-off between the sample size and estimation error and validate our algorithms using both real data obtained from online user experiments and synthetic data.

【Keywords】: opinion formation; social networks; wisdom of crowd
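
The reweighting pipeline above (estimate conformity, then discount conforming voices) can be caricatured in a few lines. The down-weighting scheme below is an illustrative stand-in, not the paper's estimator, and the conformity values are assumed given rather than estimated from an opinion-formation model:

```python
def debiased_crowd_estimate(opinions, conformity):
    """Weight each user's opinion by (1 - conformity), so users whose
    answers mostly echo their neighbors contribute less to the
    aggregate, approximating an independent (non-interacting) crowd."""
    weights = [1.0 - c for c in conformity]
    return sum(o * w for o, w in zip(opinions, weights)) / sum(weights)

# Three highly conforming users echo the network consensus (10);
# one independent user reports 2.
opinions = [10.0, 10.0, 10.0, 2.0]
conformity = [0.9, 0.9, 0.9, 0.0]
naive = sum(opinions) / len(opinions)                  # plain mean = 8.0
debiased = debiased_crowd_estimate(opinions, conformity)
# Down-weighting the conformers pulls the estimate toward the
# independent voice: (3*10*0.1 + 2*1.0) / 1.3 = 5/1.3 ~ 3.85
```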

61. Mining discriminative subgraphs from global-state networks.

【Paper Link】 【Pages】:509-517

【Authors】: Sayan Ranu ; Minh X. Hoang ; Ambuj K. Singh

【Abstract】: Global-state networks provide a powerful mechanism to model the increasing heterogeneity in data generated by current systems. Such a network comprises a series of network snapshots with dynamic local states at nodes, and a global network state indicating the occurrence of an event. Mining discriminative subgraphs from global-state networks allows us to identify the influential sub-networks that have maximum impact on the global state and unearth the complex relationships between the local entities of a network and their collective behavior. In this paper, we explore this problem and design a technique called MINDS to mine minimally discriminative subgraphs from large global-state networks. To combat the exponential subgraph search space, we derive the concept of an edit map and perform Metropolis-Hastings sampling on it to compute the answer set. Furthermore, we formulate the idea of network-constrained decision trees to learn prediction models that adhere to the underlying network structure. Extensive experiments on real datasets demonstrate excellent accuracy in terms of prediction quality. Additionally, MINDS achieves a speed-up of at least four orders of magnitude over baseline techniques.

【Keywords】: discriminative subgraphs; network-constrained decision trees

62. Approximate graph mining with label costs.

【Paper Link】 【Pages】:518-526

【Authors】: Pranay Anchuri ; Mohammed J. Zaki ; Omer Barkol ; Shahar Golan ; Moshe Shamy

【Abstract】: Many real-world graphs have complex labels on the nodes and edges. Mining only exact patterns yields limited insights, since it may be hard to find exact matches. However, in many domains it is relatively easy to define a cost (or distance) between different labels. Using this information, it becomes possible to mine a much richer set of approximate subgraph patterns, which preserve the topology but allow bounded label mismatches. We present novel and scalable methods to efficiently solve the approximate isomorphism problem. We show that approximate mining yields interesting patterns in several real-world graphs ranging from IT and protein interaction networks to protein structures.

【Keywords】: approximate graph mining; approximate subgraph isomorphism; label costs

Rule and pattern mining 3

63. Summarizing probabilistic frequent patterns: a fast approach.

【Paper Link】 【Pages】:527-535

【Authors】: Chunyang Liu ; Ling Chen ; Chengqi Zhang

【Abstract】: Mining probabilistic frequent patterns from uncertain data has received a great deal of attention in recent years due to its wide applications. However, probabilistic frequent pattern mining suffers from the problem that an exponential number of result patterns are generated, which seriously hinders further evaluation and analysis. In this paper, we focus on the problem of mining probabilistic representative frequent patterns (P-RFP), which is the minimal set of patterns with adequately high probability to represent all frequent patterns. Observing the bottleneck in checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of the supports of two patterns, we introduce a novel approximation of the joint probability with both theoretical and empirical proofs. Based on the approximation, we propose an Approximate P-RFP Mining (APM) algorithm, which effectively and efficiently compresses the set of probabilistic frequent patterns. To our knowledge, this is the first attempt to analyze the relationship between two probabilistic frequent patterns through an approximate approach. Our experiments on both synthetic and real-world datasets demonstrate that the APM algorithm accelerates P-RFP mining dramatically, running orders of magnitude faster than an exact solution. Moreover, the error rate of APM is guaranteed to be very small when the database contains hundreds of transactions, which further affirms that APM is a practical solution for summarizing probabilistic frequent patterns.

【Keywords】: pattern summarization; uncertain data

64. Mining high utility episodes in complex event sequences.

【Paper Link】 【Pages】:536-544

【Authors】: Cheng-Wei Wu ; Yu-Feng Lin ; Philip S. Yu ; Vincent S. Tseng

【Abstract】: Frequent episode mining (FEM) is an interesting research topic in data mining with a wide range of applications. However, the traditional framework of FEM treats all events as having the same importance/utility and assumes that the same type of event appears at most once at any time point. These simplifying assumptions do not reflect the characteristics of scenarios in real applications, and thus the useful information of episodes in terms of utilities such as profits is lost. Furthermore, most studies on FEM focused on mining episodes in simple event sequences and few considered the scenario of complex event sequences, where different events can occur simultaneously. To address these issues, in this paper, we incorporate the concept of utility into episode mining and address a new problem of mining high utility episodes from complex event sequences, which has not been explored so far. In the proposed framework, the importance/utility of different events is considered and multiple events can appear simultaneously. Several novel features are incorporated into the proposed framework to resolve the challenges raised by this new problem, such as the absence of anti-monotone property and the huge set of candidate episodes. Moreover, an efficient algorithm named UP-Span (Utility ePisodes mining by Spanning prefixes) is proposed for mining high utility episodes with several strategies incorporated for pruning the search space to achieve high efficiency. Experimental results on real and synthetic datasets show that UP-Span has excellent performance and serves as an effective solution to the new problem of mining high utility episodes from complex event sequences.

【Keywords】: complex event sequences; episode mining; high utility episodes; utility mining

65. Mining frequent graph patterns with differential privacy.

【Paper Link】 【Pages】:545-553

【Authors】: Entong Shen ; Ting Yu

【Abstract】: Discovering frequent graph patterns in a graph database offers valuable information in a variety of applications. However, if the graph dataset contains sensitive data of individuals such as mobile phone-call graphs and web-click graphs, releasing discovered frequent patterns may present a threat to the privacy of individuals. Differential privacy has recently emerged as the de facto standard for private data analysis due to its provable privacy guarantee. In this paper we propose the first differentially private algorithm for mining frequent graph patterns. We first show that previous techniques on differentially private discovery of frequent itemsets cannot apply in mining frequent graph patterns due to the inherent complexity of handling structural information in graphs. We then address this challenge by proposing a Markov Chain Monte Carlo (MCMC) sampling based algorithm. Unlike previous work on frequent itemset mining, our techniques do not rely on the output of a non-private mining algorithm. Instead, we observe that both frequent graph pattern mining and the guarantee of differential privacy can be unified into an MCMC sampling framework. In addition, we establish the privacy and utility guarantee of our algorithm and propose an efficient neighboring pattern counting technique as well. Experimental results show that the proposed algorithm is able to output frequent patterns with good precision.
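
The paper's MCMC sampler for graph patterns is not reproduced here; as background only, the exponential mechanism, the standard differential-privacy primitive for selecting a high-utility candidate, can be sketched as follows (the candidate set and utility function below are toy assumptions):

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity, seed=0):
    """Sample one candidate with P(c) proportional to
    exp(epsilon * utility(c) / (2 * sensitivity)).

    This is the standard exponential mechanism; releasing the sampled
    item satisfies epsilon-differential privacy when `sensitivity`
    bounds how much utility(c) can change between neighboring datasets.
    """
    rng = random.Random(seed)
    scores = [epsilon * utility(c) / (2.0 * sensitivity) for c in candidates]
    m = max(scores)                       # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    r = rng.random() * sum(weights)
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

# toy example: pattern frequencies act as utilities (counts have sensitivity 1)
freq = {"a": 50, "b": 48, "c": 5}
pick = exponential_mechanism(list(freq), freq.get, epsilon=1.0, sensitivity=1.0)
```

High-utility patterns are returned with overwhelming probability, while every candidate retains nonzero probability, which is what provides plausible deniability.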

【Keywords】: differential privacy; graph pattern mining

Web mining 3

66. Statistical quality estimation for general crowdsourcing tasks.

【Paper Link】 【Pages】:554-562

【Authors】: Yukino Baba ; Hisashi Kashima

【Abstract】: One of the biggest challenges for requesters and platform providers of crowdsourcing is quality control, which aims to obtain high-quality results from crowd workers who are not necessarily capable or motivated. A common approach to tackle this problem is to introduce redundancy, that is, to request multiple workers to work on the same tasks. For simple multiple-choice tasks, several statistical methods to aggregate the multiple answers have been proposed. However, these methods cannot always be applied to more general tasks with unstructured response formats such as article writing, program coding, and logo designing, which constitute the majority of tasks on most crowdsourcing marketplaces. In this paper, we propose an unsupervised statistical quality estimation method for such general crowdsourcing tasks. Our method is based on a two-stage procedure: multiple workers are first requested to work on the same tasks in the creation stage, and then another set of workers review and grade each artifact in the review stage. We model the ability of each author and the bias of each reviewer, and propose a two-stage probabilistic generative model using the graded response model in the item response theory. Experiments using several general crowdsourcing tasks show that our method outperforms popular vote aggregation methods, which implies that our method can deliver high quality results with lower costs.
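
The graded response model from item response theory that the review stage builds on is standard; a minimal sketch of its grade probabilities (the parameter values below are illustrative, not fitted):

```python
import math

def grm_probs(theta, a, b):
    """Graded response model (Samejima): probabilities of the ordered
    grades 1..K for latent quality `theta`, discrimination `a`, and
    increasing thresholds b[0] < ... < b[K-2], via
        P(grade > k) = sigmoid(a * (theta - b[k])).
    """
    sig = lambda t: 1.0 / (1.0 + math.exp(-t))
    cum = [1.0] + [sig(a * (theta - bk)) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

# an artifact of above-average quality on a 4-point grading scale
probs = grm_probs(theta=0.5, a=1.2, b=[-1.0, 0.0, 1.0])
```

Higher `theta` shifts probability mass toward the top grades, which is how aggregated reviewer grades identify the best artifact.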

【Keywords】: crowdsourcing; human computation; quality control

【Paper Link】 【Pages】:563-571

【Authors】: Taifeng Wang ; Jiang Bian ; Shusen Liu ; Yuyu Zhang ; Tie-Yan Liu

【Abstract】: Precise click prediction is one of the key components in the sponsored search system. Previous studies usually took advantage of two major kinds of information for click prediction, i.e., relevance information representing the similarity between ads and queries and historical click-through information representing users' previous preferences on the ads. These existing works mainly focused on interpreting ad clicks in terms of what users seek (i.e., relevance information) and how users choose to click (historical click-through information). However, few of them attempted to understand why users click the ads. In this paper, we aim at answering this "why" question. In our opinion, users click those ads that can convince them to take further actions, and the critical factor is whether those ads can trigger users' desires in their hearts. Our data analysis on a commercial search engine reveals that specific text patterns, e.g., "official site", "x% off", and "guaranteed return in x days", are very effective in triggering users' desires, and therefore lead to significant differences in terms of click-through rate (CTR). These observations motivate us to systematically model user psychological desire in order for a precise prediction on ad clicks. To this end, we propose modeling user psychological desire in sponsored search according to Maslow's desire theory, which categorizes psychological desire into five levels, each represented by a set of textual patterns automatically mined from ad texts. We then construct novel features for both ads and users based on our definition of psychological desire and incorporate them into the learning framework of click prediction. Large scale evaluations on the click-through logs from a commercial search engine demonstrate that this approach can result in significant improvement in terms of click prediction accuracy, for both the ads with rich historical data and those with rare data. Further analysis reveals that specific pattern combinations are especially effective in driving click-through rates, which provides a good guideline for advertisers to improve their ad textual descriptions.

【Keywords】: click prediction; online advertising; sponsored search; user psychological desire

68. SIGMa: simple greedy matching for aligning large knowledge bases.

【Paper Link】 【Pages】:572-580

【Authors】: Simon Lacoste-Julien ; Konstantina Palla ; Alex Davies ; Gjergji Kasneci ; Thore Graepel ; Zoubin Ghahramani

【Abstract】: The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm that leverages both the structural information from the relationship graph and flexible similarity measures between entity properties in a greedy local search, which makes it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high accuracy. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.

【Keywords】: alignment; entity; greedy algorithm; knowledge base; large-scale; relationship

Best paper session 2

69. Simple and deterministic matrix sketching.

【Paper Link】 【Pages】:581-588

【Authors】: Edo Liberty

【Abstract】: A sketch of a matrix A is another matrix B which is significantly smaller than A but still approximates it well. Finding such sketches efficiently is an important building block in modern algorithms for approximating, for example, the PCA of massive matrices. This task is made more challenging in the streaming model, where each row of the input matrix can only be processed once and storage is severely limited. In this paper we adapt a well known streaming algorithm for approximating item frequencies to the matrix sketching setting. The algorithm receives the n rows of a large matrix A ∈ ℝ^(n×m) one after the other in a streaming fashion. It maintains a sketch B ∈ ℝ^(ℓ×m) containing only ℓ ≪ n rows but still guarantees that AᵀA ≈ BᵀB. More accurately, for all unit vectors x, 0 ≤ ||Ax||² − ||Bx||² ≤ 2||A||_F²/ℓ; equivalently, BᵀB ≼ AᵀA and ||AᵀA − BᵀB|| ≤ 2||A||_F²/ℓ. This gives a streaming algorithm whose error decays proportionally to 1/ℓ using O(mℓ) space. For comparison, random-projection, hashing or sampling based algorithms produce convergence bounds proportional to 1/√ℓ. Sketch updates per row in A require amortized O(mℓ) operations and the algorithm is perfectly parallelizable. Our experiments corroborate the algorithm's scalability and improved convergence rate. The presented algorithm also stands out in that it is deterministic, simple to implement and elementary to prove.
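
The algorithm described, widely known as Frequent Directions, admits a short implementation. A sketch under the assumption that the sketch size ℓ is at most the number of columns, with shrinkage by the median squared singular value as in the paper:

```python
import numpy as np

def frequent_directions(A, ell):
    """Frequent-Directions sketch: stream the rows of A (n x m) into a
    small matrix B (ell x m) such that, for every unit vector x,
        0 <= ||Ax||^2 - ||Bx||^2 <= 2 * ||A||_F^2 / ell.
    Assumes m >= ell so the SVD below yields ell singular values.
    """
    n, m = A.shape
    B = np.zeros((ell, m))
    for row in A:                              # one pass, row by row
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            # sketch full: shrink every direction by the median energy,
            # which zeroes out at least half of the rows
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = row
    return B
```

The shrinkage step is what makes the guarantee deterministic: it never lets ||Bx||² exceed ||Ax||², and each shrink removes at most 2||A||_F²/ℓ of energy in total.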

【Keywords】: sketching; streaming

70. A space efficient streaming algorithm for triangle counting using the birthday paradox.

【Paper Link】 【Pages】:589-597

【Authors】: Madhav Jha ; C. Seshadhri ; Ali Pinar

【Abstract】: We design a space efficient algorithm that approximates the transitivity (global clustering coefficient) and total triangle count with only a single pass through a graph given as a stream of edges. Our procedure is based on the classic probabilistic result, the birthday paradox. When the transitivity is constant and there are more edges than wedges (common properties for social networks), we can prove that our algorithm requires O(√n) space (n is the number of vertices) to provide accurate estimates. We run a detailed set of experiments on a variety of real graphs and demonstrate that the memory requirement of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 60,000 edges to give accurate results. Being a single pass streaming algorithm, our procedure also maintains a real-time estimate of the transitivity/number of triangles of a graph, by storing a minuscule fraction of edges.
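
The birthday paradox underlying the estimator is simple to verify numerically: among k uniform draws from n bins, a repeat occurs with probability 1 − ∏_{i<k}(1 − i/n), which becomes likely once k is on the order of √n. The following is an illustration of that fact only, not the paper's streaming algorithm:

```python
def collision_probability(k, n):
    """P(at least one repeat) among k independent uniform draws from n bins,
    computed exactly as 1 - prod_{i=0}^{k-1} (1 - i/n)."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= 1.0 - i / n
    return 1.0 - p_distinct

# the classic instance: 23 people, 365 possible birthdays -> p > 1/2
p = collision_probability(23, 365)
```

The same scaling is why storing roughly √n-sized samples suffices for the wedge-based transitivity estimate.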

【Keywords】: streaming algorithms; triangle counting

Research poster session 59

71. Who, where, when and what: discover spatio-temporal topics for twitter users.

【Paper Link】 【Pages】:605-613

【Authors】: Quan Yuan ; Gao Cong ; Zongyang Ma ; Aixin Sun ; Nadia Magnenat-Thalmann

【Abstract】: Micro-blogging services, such as Twitter, and location-based social network applications have generated short text messages associated with geographic information, posting time, and user ids. The availability of such data received from users offers a good opportunity to study users' spatio-temporal behaviors and preferences. In this paper, we propose a probabilistic model W4 (short for Who+Where+When+What) to exploit such data to discover individual users' mobility behaviors from spatial, temporal and activity aspects. To the best of our knowledge, our work offers the first solution to jointly model individual user's mobility behavior from the three aspects. Our model has a variety of applications, such as user profiling and location prediction; it can be employed to answer questions such as "Can we infer the location of a user given a tweet posted by the user and the posting time?" Experimental results on two real-world datasets show that the proposed model is effective in discovering users' spatial-temporal topics, and outperforms state-of-the-art baselines significantly for the task of location prediction for tweets.

【Keywords】: graphical model; prediction and recommendation; spatio-temporal; twitter; user profiling

72. Multi-label classification by mining label and instance correlations from heterogeneous information networks.

【Paper Link】 【Pages】:614-622

【Authors】: Xiangnan Kong ; Bokai Cao ; Philip S. Yu

【Abstract】: Multi-label classification is prevalent in many real-world applications, where each example can be associated with a set of multiple labels simultaneously. The key challenge of multi-label classification comes from the large space of all possible label sets, which is exponential to the number of candidate labels. Most previous work focuses on exploiting correlations among different labels to facilitate the learning process. It is usually assumed that the label correlations are given beforehand or can be derived directly from data samples by counting their label co-occurrences. However, in many real-world multi-label classification tasks, the label correlations are not given and can be hard to learn directly from data samples within a moderate-sized training set. Heterogeneous information networks can provide abundant knowledge about relationships among different types of entities including data samples and class labels. In this paper, we propose to use heterogeneous information networks to facilitate the multi-label classification process. By mining the linkage structure of heterogeneous information networks, multiple types of relationships among different class labels and data samples can be extracted. Then we can use these relationships to effectively infer the correlations among different class labels in general, as well as the dependencies among the label sets of data examples inter-connected in the network. Empirical studies on real-world tasks demonstrate that the performance of multi-label classification can be effectively boosted using heterogeneous information networks.

【Keywords】: data mining; heterogeneous information network; label correlation; multi-label classification

73. Accurate intelligible models with pairwise interactions.

【Paper Link】 【Pages】:623-631

【Authors】: Yin Lou ; Rich Caruana ; Johannes Gehrke ; Giles Hooker

【Abstract】: Standard generalized additive models (GAMs) usually model the dependent variable as a sum of univariate models. Although previous studies have shown that standard GAMs can be interpreted by users, their accuracy is significantly less than more complex models that permit interactions. In this paper, we suggest adding selected terms of interacting pairs of features to standard GAMs. The resulting models, which we call GA2M-models (Generalized Additive Models plus Interactions), consist of univariate terms and a small number of pairwise interaction terms. Since these models only include one- and two-dimensional components, the components of GA2M-models can be visualized and interpreted by users. To explore the huge (quadratic) number of pairs of features, we develop a novel, computationally efficient method called FAST for ranking all possible pairs of features as candidates for inclusion into the model. In a large-scale empirical study, we show the effectiveness of FAST in ranking candidate pairs of features. In addition, we show the surprising result that GA2M-models have almost the same performance as the best full-complexity models on a number of real datasets. Thus this paper postulates that for many problems, GA2M-models can yield models that are both intelligible and accurate.

【Keywords】: classification; interaction detection; regression

74. Spotting opinion spammers using behavioral footprints.

【Paper Link】 【Pages】:632-640

【Authors】: Arjun Mukherjee ; Abhinav Kumar ; Bing Liu ; Junhui Wang ; Meichun Hsu ; Malú Castellanos ; Riddhiman Ghosh

【Abstract】: Opinionated social media such as product reviews are now widely used by individuals and organizations for their decision making. However, due to the reason of profit or fame, people try to game the system by opinion spamming (e.g., writing fake reviews) to promote or to demote some target products. In recent years, fake review detection has attracted significant attention from both the business and research communities. However, due to the difficulty of human labeling needed for supervised learning and evaluation, the problem remains to be highly challenging. This work proposes a novel angle to the problem by modeling spamicity as latent. An unsupervised model, called Author Spamicity Model (ASM), is proposed. It works in the Bayesian setting, which facilitates modeling spamicity of authors as latent and allows us to exploit various observed behavioral footprints of reviewers. The intuition is that opinion spammers have different behavioral distributions than non-spammers. This creates a distributional divergence between the latent population distributions of two clusters: spammers and non-spammers. Model inference results in learning the population distributions of the two clusters. Several extensions of ASM are also considered leveraging from different priors. Experiments on a real-life Amazon review dataset demonstrate the effectiveness of the proposed models which significantly outperform the state-of-the-art competitors.

【Keywords】: abuse; deceptive and fake reviewer detection; opinion spam

75. An efficient ADMM algorithm for multidimensional anisotropic total variation regularization problems.

【Paper Link】 【Pages】:641-649

【Authors】: Sen Yang ; Jie Wang ; Wei Fan ; Xiatian Zhang ; Peter Wonka ; Jieping Ye

【Abstract】: Total variation (TV) regularization has important applications in signal processing including image denoising, image deblurring, and image reconstruction. A significant challenge in the practical use of TV regularization lies in the nondifferentiable convex optimization, which is difficult to solve especially for large-scale problems. In this paper, we propose an efficient alternating direction method of multipliers (ADMM) algorithm to solve total variation regularization problems. The proposed algorithm is applicable for tensors, thus it can solve multidimensional total variation regularization problems. One appealing feature of the proposed algorithm is that it does not need to solve a linear system of equations, which is often the most expensive part in previous ADMM-based methods. In addition, each step of the proposed algorithm involves a set of independent and smaller problems, which can be solved in parallel. Thus, the proposed algorithm scales well to large problems. Furthermore, the global convergence of the proposed algorithm is guaranteed, and the time complexity of the proposed algorithm is O(dN/ε) on a d-mode tensor with N entries for achieving an ε-optimal solution. Extensive experimental results demonstrate the superior performance of the proposed algorithm in comparison with current state-of-the-art methods.
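
The paper's linear-system-free tensor algorithm is not reproduced here; for orientation, the textbook ADMM for the 1-D anisotropic TV denoising special case (which, unlike the proposed method, does solve a small linear system per iteration) can be sketched as:

```python
import numpy as np

def tv_denoise_admm(b, lam, rho=1.0, iters=500):
    """Textbook ADMM for 1-D anisotropic TV denoising:
        minimize 0.5 * ||x - b||^2 + lam * ||D x||_1,
    with D the forward-difference operator and splitting z = D x.
    A sketch only; not the scalable tensor method of the paper.
    """
    n = len(b)
    D = np.diff(np.eye(n), axis=0)             # (n-1) x n difference matrix
    z = np.zeros(n - 1)
    u = np.zeros(n - 1)                        # scaled dual variable
    M = np.eye(n) + rho * (D.T @ D)            # fixed x-update system
    for _ in range(iters):
        x = np.linalg.solve(M, b + rho * D.T @ (z - u))
        Dx = D @ x
        # z-update: soft-thresholding = proximal operator of (lam/rho)*||.||_1
        v = Dx + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        u += Dx - z
    return x

# a noisy step signal should come back as two nearly flat pieces
rng = np.random.default_rng(0)
b = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * rng.standard_normal(100)
x = tv_denoise_admm(b, lam=1.0)
```

The x-update's linear solve is exactly the per-iteration cost the paper's decomposition avoids, which is what makes their approach attractive for large tensors.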

【Keywords】: admm; large scale; multidimensional total variation; parallel computing

76. Speeding up large-scale learning with a social prior.

【Paper Link】 【Pages】:650-658

【Authors】: Deepayan Chakrabarti ; Ralf Herbrich

【Abstract】: Slow convergence and poor initial accuracy are two problems that plague efforts to use very large feature sets in online learning. This is especially true when only a few features are "active" in any training example, and the frequency of activations of different features is skewed. We show how these problems can be mitigated if a graph of relationships between features is known. We study this problem in a fully Bayesian setting, focusing on the problem of using Facebook user-IDs as features, with the social network giving the relationship structure. Our analysis uncovers significant problems with the obvious regularizations, and motivates a two-component mixture-model "social prior" that is provably better. Empirical results on large-scale click prediction problems show that our algorithm can learn as well as the baseline with 12M fewer training examples, and continuously outperforms it for over 60M examples. On a second problem using binned features, our model outperforms the baseline even after the latter sees 5x as much data.

【Keywords】: mixture model; social prior

77. FISM: factored item similarity models for top-N recommender systems.

【Paper Link】 【Pages】:659-667

【Authors】: Santosh Kabbur ; Xia Ning ; George Karypis

【Abstract】: The effectiveness of existing top-N recommendation methods decreases as the sparsity of the datasets increases. To alleviate this problem, we present an item-based method for generating top-N recommendations that learns the item-item similarity matrix as the product of two low dimensional latent factor matrices. These matrices are learned using a structural equation modeling approach, wherein the value being estimated is not used for its own estimation. A comprehensive set of experiments on multiple datasets at three different sparsity levels indicate that the proposed methods can handle sparse datasets effectively and outperform other state-of-the-art top-N recommendation methods. The experimental results also show that the relative performance gains compared to competing methods increase as the data gets sparser.
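
The core scoring rule, item-item similarity factored as the product of two low-rank matrices and aggregated over a user's rated items, can be sketched as follows (the factor values here are random placeholders rather than learned parameters, and bias terms are omitted):

```python
import numpy as np

def fism_score(rated_items, target, P, Q, alpha=0.5):
    """FISM-style score for recommending `target` to a user who rated
    `rated_items`: the item-item similarity sim(j, i) = p_j . q_i is
    realized as a product of two low-rank factor matrices P and Q,
    summed over the user's rated items (excluding the target itself)
    and normalized by (number of rated items)^alpha.
    """
    others = [j for j in rated_items if j != target]
    if not others:
        return 0.0
    agg = P[others].sum(axis=0) / (len(others) ** alpha)
    return float(agg @ Q[target])

# toy setup: 6 items in a rank-4 latent space (placeholder factors)
rng = np.random.default_rng(0)
P = rng.standard_normal((6, 4))
Q = rng.standard_normal((6, 4))
scores = {i: fism_score([0, 2, 5], i, P, Q) for i in range(6) if i not in (0, 2, 5)}
top_n = sorted(scores, key=scores.get, reverse=True)   # ranked recommendations
```

Excluding the target from its own aggregation mirrors the structural-equation-modeling constraint the abstract describes: the value being estimated is never used for its own estimation.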

【Keywords】: item similarity; recommender systems; sparse data; topn

78. Nonparametric hierarchal bayesian modeling in non-contractual heterogeneous survival data.

【Paper Link】 【Pages】:668-676

【Authors】: Shouichi Nagano ; Yusuke Ichikawa ; Noriko Takaya ; Tadasu Uchiyama ; Makoto Abe

【Abstract】: An important problem in the non-contractual marketing domain is discovering the customer lifetime and assessing the impact of customer's characteristic variables on the lifetime. Unfortunately, the conventional hierarchical Bayes model cannot discern the impact of customer's characteristic variables for each customer. To overcome this problem, we present a new survival model using a non-parametric Bayes paradigm with MCMC. The assumption of a conventional model, logarithm of purchase rate and dropout rate with linear regression, is extended to include our assumption of the Dirichlet Process Mixture of regression. The extension assumes that each customer belongs probabilistically to different mixtures of regression, thereby permitting us to estimate a different impact of customer characteristic variables for each customer. Our model creates several customer groups to mirror the structure of the target data set. The effectiveness of our proposal is confirmed by a comparison involving a real e-commerce transaction dataset and an artificial dataset; it generally achieves higher predictive performance. In addition, we show that preselecting the actual number of customer groups does not always lead to higher predictive performance.

【Keywords】: crm; mcmc; model choice; non-parametric bayes

79. Cross-task crowdsourcing.

【Paper Link】 【Pages】:677-685

【Authors】: Kaixiang Mo ; ErHeng Zhong ; Qiang Yang

【Abstract】: Crowdsourcing is an effective method for collecting labeled data for various data mining tasks. It is critical to ensure the veracity of the produced data because responses collected from different users may be noisy and unreliable. Previous works solve this veracity problem by estimating both the user ability and question difficulty based on the knowledge in each task individually. In this case, each single task needs large amounts of data to provide accurate estimations. However, in practice, budgets provided by customers for a given target task may be limited, and hence each question can be presented to only a few users where each user can answer only a few questions. This data sparsity problem can cause previous approaches to perform poorly due to the overfitting problem on rare data and eventually damage the data veracity. Fortunately, in real-world applications, users can answer questions from multiple historical tasks. For example, one can annotate images as well as label the sentiment of a given title. In this paper, we employ transfer learning, which borrows knowledge from auxiliary historical tasks to improve the data veracity in a given target task. The motivation is that users have stable characteristics across different crowdsourcing tasks and thus data from different tasks can be exploited collectively to estimate users' abilities in the target task. We propose a hierarchical Bayesian model, TLC (Transfer Learning for Crowdsourcing), to implement this idea by considering the overlapping users as a bridge. In addition, to avoid possible negative impact, TLC introduces task-specific factors to model task differences. The experimental results show that TLC significantly improves the accuracy over several state-of-the-art non-transfer-learning approaches under very limited budget in various labeling tasks.

【Keywords】: crowdsourcing; transfer learning

80. Evaluating the crowd with confidence.

【Paper Link】 【Pages】:686-694

【Authors】: Manas Joglekar ; Hector Garcia-Molina ; Aditya G. Parameswaran

【Abstract】: Worker quality control is a crucial aspect of crowdsourcing systems, typically occupying a large fraction of the time and money invested in crowdsourcing. In this work, we devise techniques to generate confidence intervals for worker error rate estimates, thereby enabling a better evaluation of worker quality. We show that our techniques generate correct confidence intervals on a range of real-world datasets, and demonstrate wide applicability by using them to evict poorly performing workers, and provide confidence intervals on the accuracy of the answers.
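
The paper derives intervals tailored to crowdsourced evaluation; as a generic point of comparison only, a standard Wilson score interval for a worker's error rate over n graded answers looks like:

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score confidence interval for an error rate, given `errors`
    mistakes in `n` graded answers (z = 1.96 for ~95% coverage). This is
    the generic binomial interval, not the paper's technique.
    """
    if n == 0:
        return (0.0, 1.0)
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))

lo, hi = wilson_interval(errors=8, n=40)   # observed error rate 0.20
```

An eviction policy built on such intervals would act on the upper bound `hi` rather than the point estimate, so that workers with few graded answers are not judged prematurely.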

【Keywords】: confidence; crowdsourcing

81. Inferring social roles and statuses in social networks.

【Paper Link】 【Pages】:695-703

【Authors】: Yuchen Zhao ; Guan Wang ; Philip S. Yu ; Shaobo Liu ; Simon Zhang

【Abstract】: Users in online social networks play a variety of social roles and statuses. For example, users on Twitter can act as advertisers, content contributors, information receivers, etc.; users on LinkedIn can be in different professional roles, such as engineer, salesperson and recruiter. Previous research work mainly focuses on using categorical and textual information to predict the attributes of users. However, it cannot be applied to a large number of users in real social networks, since much of such information is missing, outdated and non-standard. In this paper, we investigate the social roles and statuses that people assume in online social networks from the perspective of network structures, since the uniqueness of social networks is connecting people. We quantitatively analyze a number of key social principles and theories that correlate with social roles and statuses. We systematically study how the network characteristics reflect the social situations of users in an online society. We discover patterns of homophily, the tendency of users to connect with users with similar social roles and statuses. In addition, we observe that different factors in social theories influence the social role/status of an individual user to various extent, since these social principles represent different aspects of the network. We then introduce an optimization framework based on Factor Conditioning Symmetry, and we propose a probabilistic model to integrate the optimization framework on local structural information as well as network influence to infer the unknown social roles and statuses of online users. We present experimental results to show the effectiveness of the inference.

【Keywords】: linkedin; network inference; social networks; social role; social status; user modeling

82. Adaptive collective routing using gaussian process dynamic congestion models.

【Paper Link】 【Pages】:704-712

【Authors】: Siyuan Liu ; Yisong Yue ; Ramayya Krishnan

【Abstract】: We consider the problem of adaptively routing a fleet of cooperative vehicles within a road network in the presence of uncertain and dynamic congestion conditions. To tackle this problem, we first propose a Gaussian Process Dynamic Congestion Model that can effectively characterize both the dynamics and the uncertainty of congestion conditions. Our model is efficient and thus facilitates real-time adaptive routing in the face of uncertainty. Using this congestion model, we develop an efficient algorithm for non-myopic adaptive routing to minimize the collective travel time of all vehicles in the system. A key property of our approach is the ability to efficiently reason about the long-term value of exploration, which enables collectively balancing the exploration/exploitation trade-off for entire fleets of vehicles. We validate our approach based on traffic data from two large Asian cities. We show that our congestion model is effective in modeling dynamic congestion conditions. We also show that our routing algorithm generates significantly faster routes compared to standard baselines, and achieves near-optimal performance compared to an omniscient routing algorithm. We also present the results from a preliminary field study, which showcases the efficacy of our approach.
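
The dynamic congestion model itself is not reproduced here; the standard Gaussian process posterior computation it builds on can be sketched as follows (the RBF kernel and all hyperparameters are chosen purely for illustration):

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, length=1.0, noise=0.1):
    """Posterior mean and variance of a zero-mean Gaussian process with
    an RBF kernel: the standard regression machinery that GP-based
    congestion models start from.
    """
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)

    K = rbf(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    Ks = rbf(X_train, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha
    cov = rbf(X_test, X_test) - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)

# 1-D toy data: observe sin(x) at four points, predict in between
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.sin(X).ravel()
mean, var = gp_posterior(X, y, np.array([[1.5]]))
```

The posterior variance is what enables the exploration/exploitation reasoning the abstract describes: far from observed conditions, the variance reverts to the prior, flagging road segments worth probing.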

【Keywords】: collective routing; dynamic congestion model; gaussian process
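As a toy illustration of Gaussian-process regression for congestion, the following minimal 1-D sketch predicts travel time by hour of day with an assumed RBF kernel and noise level. It is not the authors' model, which additionally captures congestion dynamics and coordinates whole fleets; it only shows the basic GP posterior-mean computation.

```python
import math

def rbf(a, b, length=2.0, var=1.0):
    """Squared-exponential kernel over time of day (hours); assumed hyperparameters."""
    return var * math.exp(-((a - b) ** 2) / (2 * length ** 2))

def solve(A, b):
    """Tiny Gaussian elimination with partial pivoting, solving A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_predict(xs, ys, x_star, noise=0.1):
    """GP posterior mean of travel time at x_star given observations (xs, ys)."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve(K, ys)                      # alpha = (K + noise*I)^-1 y
    return sum(rbf(x_star, xs[i]) * alpha[i] for i in range(n))

# Hypothetical observed travel times (minutes) by hour; congestion peaks at 8am.
hours = [6.0, 7.0, 8.0, 9.0, 10.0]
times = [10.0, 18.0, 30.0, 22.0, 12.0]
est = gp_predict(hours, times, 8.5)
```

A routing layer would then query such predictions (with uncertainty) when scoring alternative roads.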

83. Maximizing acceptance probability for active friending in online social networks.

Paper Link】 【Pages】:713-721

【Authors】: De-Nian Yang ; Hui-Ju Hung ; Wang-Chien Lee ; Wei Chen

【Abstract】: Friending recommendation has successfully contributed to the explosive growth of online social networks. Most friending recommendation services today aim to support passive friending, where a user passively selects friending targets from the recommended candidates. In this paper, we advocate a recommendation support for active friending, where a user actively specifies a friending target. To the best of our knowledge, a recommendation designed to provide guidance for a user to systematically approach his friending target has not been explored for existing online social networking services. To maximize the probability that the friending target would accept an invitation from the user, we formulate a new optimization problem, namely, Acceptance Probability Maximization (APM), and develop a polynomial time algorithm, called Selective Invitation with Tree and In-Node Aggregation (SITINA), to find the optimal solution. We implement an active friending service with SITINA on Facebook to validate our idea. Our user study and experimental results reveal that SITINA efficiently outperforms manual selection and the baseline approach in solution quality.

【Keywords】: friending; social influence; social network

84. Mining evolutionary multi-branch trees from text streams.

Paper Link】 【Pages】:722-730

【Authors】: Xiting Wang ; Shixia Liu ; Yangqiu Song ; Baining Guo

【Abstract】: Understanding topic hierarchies in text streams and their evolution patterns over time is very important in many applications. In this paper, we propose an evolutionary multi-branch tree clustering method for streaming text data. We build evolutionary trees in a Bayesian online filtering framework. The tree construction is formulated as an online posterior estimation problem, which considers both the likelihood of the current tree and conditional prior given the previous tree. We also introduce a constraint model to compute the conditional prior of a tree in the multi-branch setting. Experiments on real world news data demonstrate that our algorithm can better incorporate historical tree information and is more efficient and effective than the traditional evolutionary hierarchical clustering algorithm.

【Keywords】: clustering; multi-branch tree; time series data; topic evolution; visualization

85. Active search on graphs.

Paper Link】 【Pages】:731-738

【Authors】: Xuezhi Wang ; Roman Garnett ; Jeff G. Schneider

【Abstract】: Active search is an increasingly important learning problem in which we use a limited budget of label queries to discover as many members of a certain class as possible. Numerous real-world applications may be approached in this manner, including fraud detection, product recommendation, and drug discovery. Active search has model learning and exploration/exploitation features similar to those encountered in active learning and bandit problems, but algorithms for those problems do not fit active search. Previous work on the active search problem [5] showed that the optimal algorithm requires a lookahead evaluation of expected utility that is exponential in the number of selections to be made and proposed a truncated lookahead heuristic. Inspired by the success of myopic methods for active learning and bandit problems, we propose a myopic method for active search on graphs. We suggest selecting points by maximizing a score considering the potential impact of selecting a node, meant to emulate lookahead while avoiding exponential search. We test the proposed algorithm empirically on real-world graphs and show that it outperforms popular approaches for active learning and bandit problems as well as truncated lookahead of a few steps.

【Keywords】: active learning; graph search

86. Fast rank-2 nonnegative matrix factorization for hierarchical document clustering.

Paper Link】 【Pages】:739-747

【Authors】: Da Kuang ; Haesun Park

【Abstract】: Nonnegative matrix factorization (NMF) has been successfully used as a clustering method especially for flat partitioning of documents. In this paper, we propose an efficient hierarchical document clustering method based on a new algorithm for rank-2 NMF. When the two block coordinate descent framework of nonnegative least squares is applied to computing rank-2 NMF, each subproblem requires a solution for nonnegative least squares with only two columns in the matrix. We design the algorithm for rank-2 NMF by exploiting the fact that an exhaustive search for the optimal active set can be performed extremely fast when solving these NNLS problems. In addition, we design a measure based on the results of rank-2 NMF for determining which leaf node should be further split. On a number of text data sets, our proposed method produces high-quality tree structures in significantly less time compared to other methods such as hierarchical K-means, standard NMF, and latent Dirichlet allocation.

【Keywords】: active-set algorithm; hierarchical document clustering; nonnegative matrix factorization; rank-2 nmf
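The key observation above, that NNLS with only two columns admits an exhaustive active-set search, can be sketched in a few lines. This is a stand-alone illustrative solver, not the authors' optimized routine: the four candidate active sets are enumerated and the best feasible one wins.

```python
def nnls_two_cols(B, g):
    """Exact NNLS for a two-column matrix B: min ||Bx - g|| s.t. x >= 0.

    Candidate active sets: both coordinates free, x2 = 0, x1 = 0, both zero.
    The optimum of this tiny problem is the best feasible candidate.
    """
    # Entries of the 2x2 normal equations B^T B x = B^T g.
    a11 = sum(b[0] * b[0] for b in B)
    a12 = sum(b[0] * b[1] for b in B)
    a22 = sum(b[1] * b[1] for b in B)
    g1 = sum(b[0] * y for b, y in zip(B, g))
    g2 = sum(b[1] * y for b, y in zip(B, g))

    def residual(x):
        return sum((b[0] * x[0] + b[1] * x[1] - y) ** 2 for b, y in zip(B, g))

    candidates = [(0.0, 0.0)]
    det = a11 * a22 - a12 * a12
    if det != 0:                       # both coordinates free
        candidates.append(((a22 * g1 - a12 * g2) / det,
                           (a11 * g2 - a12 * g1) / det))
    if a11 > 0:                        # x2 clamped to zero
        candidates.append((g1 / a11, 0.0))
    if a22 > 0:                        # x1 clamped to zero
        candidates.append((0.0, g2 / a22))

    feasible = [x for x in candidates if x[0] >= 0 and x[1] >= 0]
    return min(feasible, key=residual)
```

Each rank-2 NMF block-coordinate step reduces to many such two-column solves, which is why the exhaustive search pays off.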

87. A "semi-lazy" approach to probabilistic path prediction.

Paper Link】 【Pages】:748-756

【Authors】: Jingbo Zhou ; Anthony K. H. Tung ; Wei Wu ; Wee Siong Ng

【Abstract】: Path prediction is useful in a wide range of applications. Most of the existing solutions, however, are based on eager learning methods where models and patterns are extracted from historical trajectories and then used for future prediction. Since such approaches are committed to a set of statistically significant models or patterns, problems can arise in dynamic environments where the underlying models change quickly or where the regions are not covered with statistically significant models or patterns. We propose a "semi-lazy" approach to path prediction that builds prediction models on the fly using dynamically selected reference trajectories. Such an approach has several advantages. First, the target trajectories to be predicted are known before the models are built, which allows us to construct models that are deemed relevant to the target trajectories. Second, unlike the lazy learning approaches, we use sophisticated learning algorithms to derive accurate prediction models with acceptable delay based on a small number of selected reference trajectories. Finally, our approach can be continuously self-correcting since we can dynamically reconstruct new models if the predicted movements do not match the actual ones. Our prediction model can construct a probabilistic path whose probability of occurrence is larger than a threshold and which is furthest ahead in terms of time. Users can control the confidence of the path prediction by setting a probability threshold. We conducted a comprehensive experimental study on real-world and synthetic datasets to show the effectiveness and efficiency of our approach.

【Keywords】: lazy learning; spatial-temporal data mining; trajectory analysis

88. Optimizing parallel belief propagation in junction trees using regression.

Paper Link】 【Pages】:757-765

【Authors】: Lu Zheng ; Ole J. Mengshoel

【Abstract】: The junction tree approach, with applications in artificial intelligence, computer vision, machine learning, and statistics, is often used for computing posterior distributions in probabilistic graphical models. One of the key challenges associated with junction trees is computational, and several parallel computing technologies - including many-core processors - have been investigated to meet this challenge. Many-core processors (including GPUs) are now programmable; unfortunately, their complexity makes it hard to manually tune their parameters in order to optimize software performance. In this paper, we investigate a machine learning approach to minimize the execution time of parallel junction tree algorithms implemented on a GPU. By carefully allocating a GPU's threads to different parallel computing opportunities in a junction tree, and treating this thread allocation problem as a machine learning problem, we find in experiments that regression - specifically support vector regression - can substantially outperform manual optimization.

【Keywords】: belief propagation; parallel computing; regression

89. Multi-source deep learning for information trustworthiness estimation.

Paper Link】 【Pages】:766-774

【Authors】: Liang Ge ; Jing Gao ; Xiaoyi Li ; Aidong Zhang

【Abstract】: In recent years, information trustworthiness has become a serious issue when user-generated contents prevail in our information world. In this paper, we investigate the important problem of estimating information trustworthiness from the perspective of correlating and comparing multiple data sources. To a certain extent, the consistency degree is an indicator of information reliability: information unanimously agreed upon by all the sources is more likely to be reliable. Based on this principle, we develop an effective computational approach to identify consistent information from multiple data sources. Particularly, we analyze vast amounts of information collected from multiple review platforms (multiple sources) in which people can rate and review the items they have purchased. The major challenge is that different platforms attract diverse sets of users, and thus information cannot be compared directly at the surface. However, latent reasons hidden in user ratings are mostly shared by multiple sources, and thus inconsistency about an item only appears when some source provides ratings deviating from the common latent reasons. Therefore, we propose a novel two-step procedure to calculate information consistency degrees for a set of items which are rated by multiple sets of users on different platforms. We first build a Multi-Source Deep Belief Network (MSDBN) to identify the common reasons hidden in multi-source rating data, and then calculate a consistency score for each item by comparing individual sources with the reconstructed data derived from the latent reasons. We conduct experiments on real user ratings collected from Orbitz, Priceline and TripAdvisor on all the hotels in Las Vegas and New York City. Experimental results demonstrate that the proposed approach successfully finds the hotels that receive inconsistent, and possibly unreliable, ratings.

【Keywords】: deep learning; information trustworthiness; multiple-source

90. Unsupervised link prediction using aggregative statistics on heterogeneous social networks.

Paper Link】 【Pages】:775-783

【Authors】: Tsung-Ting Kuo ; Rui Yan ; Yu-Yang Huang ; Perng-Hwa Kung ; Shou-De Lin

【Abstract】: The concern of privacy has become an important issue for online social networks. In services such as Foursquare.com, whether a person likes an article is considered private and therefore not disclosed; only the aggregative statistics of articles (i.e., how many people like this article) are revealed. This paper tries to answer a question: can we predict the opinion holder in a heterogeneous social network without any labeled data? This question can be generalized to a link prediction with aggregative statistics problem. This paper devises a novel unsupervised framework to solve this problem, including two main components: (1) a three-layer factor graph model and three types of potential functions; (2) a ranked-margin learning and inference algorithm. Finally, we evaluate our method on four diverse prediction scenarios using four datasets: preference (Foursquare), repost (Twitter), response (Plurk), and citation (DBLP). We further exploit nine unsupervised models to solve this problem as baselines. Our approach not only wins out in all scenarios, but on the average achieves 9.90% AUC and 12.59% NDCG improvement over the best competitors. The resources are available at http://www.csie.ntu.edu.tw/~d97944007/aggregative/

【Keywords】: heterogeneous social network; link prediction; probabilistic graphical model; social network mining

91. Link prediction with social vector clocks.

Paper Link】 【Pages】:784-792

【Authors】: Conrad Lee ; Bobo Nick ; Ulrik Brandes ; Pádraig Cunningham

【Abstract】: State-of-the-art link prediction utilizes combinations of complex features derived from network panel data. We here show that computationally less expensive features can achieve the same performance in the common scenario in which the data is available as a sequence of interactions. Our features are based on social vector clocks, an adaptation of the vector-clock concept introduced in distributed computing to social interaction networks. In fact, our experiments suggest that by taking into account the order and spacing of interactions, social vector clocks exploit different aspects of link formation so that their combination with previous approaches yields the most accurate predictor to date.

【Keywords】: link prediction; online algorithms; social networks; vector clocks
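One plausible reading of the vector-clock idea above can be sketched as follows. This is an assumed simplification, not the paper's exact definition: for each user v we track the latest time at which information originating from every other user could have reached v, updating clocks as interactions stream in.

```python
def social_vector_clocks(users, interactions):
    """For each user v, track the latest time at which information from
    every other user could have reached v. An interaction (u, v, t) lets
    everything u currently knows flow to v. Illustrative simplification."""
    NEG = float("-inf")
    clock = {v: {w: NEG for w in users} for v in users}
    for v in users:
        clock[v][v] = 0.0
    for u, v, t in sorted(interactions, key=lambda e: e[2]):
        clock[u][u] = t                      # u's own state is fresh at time t
        for w in users:
            # v now knows whatever u knew as of this contact.
            clock[v][w] = max(clock[v][w], clock[u][w])
    return clock

users = ["a", "b", "c"]
clocks = social_vector_clocks(users, [("a", "b", 1.0), ("b", "c", 2.0)])
```

Link-prediction features such as recency (how fresh is v's view of u, i.e. clocks[v][u]) can then be read off the clocks cheaply, which is the computational advantage the abstract claims.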

92. Geo-spotting: mining online location-based services for optimal retail store placement.

Paper Link】 【Pages】:793-801

【Authors】: Dmytro Karamshuk ; Anastasios Noulas ; Salvatore Scellato ; Vincenzo Nicosia ; Cecilia Mascolo

【Abstract】: The problem of identifying the optimal location for a new retail store has been the focus of past research, especially in the field of land economy, due to its importance in the success of a business. Traditional approaches to the problem have factored in demographics, revenue and aggregated human flow statistics from nearby or remote areas. However, the acquisition of relevant data is usually expensive. With the growth of location-based social networks, fine grained data describing user mobility and popularity of places has recently become attainable. In this paper we study the predictive power of various machine learning features on the popularity of retail stores in the city through the use of a dataset collected from Foursquare in New York. The features we mine are based on two general signals: geographic, where features are formulated according to the types and density of nearby places, and user mobility, which includes transitions between venues or the incoming flow of mobile users from distant areas. Our evaluation suggests that the best performing features are common across the three different commercial chains considered in the analysis, although variations may exist too, as explained by heterogeneities in the way retail facilities attract users. We also show that performance improves significantly when combining multiple features in supervised learning algorithms, suggesting that the retail success of a business may depend on multiple factors.

【Keywords】: location-based services; machine learning; optimal retail location
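As a toy illustration of the geographic signal described above (the paper also mines user-mobility features), density features for a candidate site might count nearby venues of each type within a radius. The coordinates and venue types here are hypothetical, and distances are assumed to already be in meters.

```python
import math

def density_features(candidate, venues, radius=200.0):
    """Count venues of each type within `radius` meters of the candidate
    site -- a sketch of the 'density of nearby places' geographic signal."""
    feats = {}
    for (vx, vy), vtype in venues:
        if math.hypot(vx - candidate[0], vy - candidate[1]) <= radius:
            feats[vtype] = feats.get(vtype, 0) + 1
    return feats

# Hypothetical venues around a candidate retail site.
venues = [((0, 0), "cafe"), ((50, 50), "office"), ((120, 0), "cafe"),
          ((900, 900), "gym")]
feats = density_features((10.0, 10.0), venues)
```

Such feature dictionaries would then feed the supervised learners whose combination the paper evaluates.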

93. Location-aware publish/subscribe.

Paper Link】 【Pages】:802-810

【Authors】: Guoliang Li ; Yang Wang ; Ting Wang ; Jianhua Feng

【Abstract】: Location-based services have become widely available on mobile devices. Existing methods employ a pull model or user-initiated model, where a user issues a query to a server which replies with location-aware answers. To provide users with instant replies, a push model or server-initiated model is becoming an inevitable computing model in the next-generation location-based services. In the push model, subscribers register spatio-textual subscriptions to capture their interests, and publishers post spatio-textual messages. This calls for a high-performance location-aware publish/subscribe system to deliver publishers' messages to relevant subscribers. In this paper, we address the research challenges that arise in designing a location-aware publish/subscribe system. We propose an R-tree based index structure by integrating textual descriptions into R-tree nodes. We devise efficient filtering algorithms and develop effective pruning techniques to improve filtering efficiency. Experimental results show that our method achieves high performance. For example, our method can filter 500 tweets in a second for 10 million registered subscriptions on a commodity computer.

【Keywords】: location-aware publish/subscribe
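The matching predicate itself is simple; the paper's contribution is indexing it with an R-tree so that millions of subscriptions can be pruned at once. A naive linear-scan version of the same spatio-textual match (rectangular regions and keyword sets are assumed here for illustration) looks like:

```python
def matches(message, sub):
    """A message matches a subscription if its location falls inside the
    subscription's rectangle and it contains all subscribed keywords."""
    (x, y), words = message
    (x1, y1, x2, y2), wanted = sub
    return x1 <= x <= x2 and y1 <= y <= y2 and wanted <= words

def deliver(message, subscriptions):
    """Brute-force delivery; an R-tree index would prune most candidates."""
    return [sid for sid, sub in subscriptions.items() if matches(message, sub)]

subscriptions = {
    "alice": ((0, 0, 10, 10), {"coffee"}),
    "bob":   ((5, 5, 20, 20), {"coffee", "wifi"}),
    "carol": ((50, 50, 60, 60), {"coffee"}),
}
msg = ((7.0, 8.0), {"coffee", "wifi", "quiet"})
hit = deliver(msg, subscriptions)
```

The reported throughput (500 tweets/second over 10 million subscriptions) is precisely what this brute-force scan cannot deliver, motivating the integrated spatial-textual index.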

94. Quadratic optimization to identify highly heritable quantitative traits from complex phenotypic features.

Paper Link】 【Pages】:811-819

【Authors】: Jiangwen Sun ; Jinbo Bi ; Henry R. Kranzler

【Abstract】: Identifying genetic variation underlying a complex disease is important. Many complex diseases have heterogeneous phenotypes and are products of a variety of genetic and environmental factors acting in concert. Deriving highly heritable quantitative traits of a complex disease can improve the identification of genetic risk of the disease. The most sophisticated methods so far perform unsupervised cluster analysis on phenotypic features; and then a quantitative trait is derived based on each resultant cluster. Heritability is estimated to assess the validity of the derived quantitative traits. However, none of these methods explicitly maximize the heritability of the derived traits. We propose a quadratic optimization approach that directly utilizes heritability as an objective during the derivation of quantitative traits of a disease. This method maximizes an objective function that is formulated by decomposing the traditional maximum likelihood method for estimating heritability of a quantitative trait. We demonstrate the effectiveness of the proposed method on both synthetic data and real-world problems. We apply our algorithm to identify highly heritable traits of complex human-behavior disorders including opioid and cocaine use disorders, and highly heritable traits of dairy cattle that are economically important. Our approach outperforms standard cluster analysis and several previous methods.

【Keywords】: heritability; quadratic optimization; quantitative trait

Paper Link】 【Pages】:820-828

【Authors】: Dóra Erdös ; Vatche Ishakian ; Azer Bestavros ; Evimaria Terzi

【Abstract】: Arguably, the most effective technique to ensure wide adoption of a concept (or product) is by repeatedly exposing individuals to messages that reinforce the concept (or promote the product). Recognizing the role of repeated exposure to a message, in this paper we propose a novel framework for the effective placement of content: Given the navigational patterns of users in a network, e.g., web graph, hyperlinked corpus, or road network, and given a model of the relationship between content-adoption and frequency of exposition, we define the repetition-aware content-placement (RACP) problem as that of identifying the set of B nodes on which content should be placed so that the expected number of users adopting that content is maximized. The key contribution of our work is the introduction of memory into the navigation process, by making user conversion dependent on the number of her exposures to that content. This dependency is captured using a conversion model that is general enough to capture arbitrary dependencies. Our solution to this general problem builds upon the notion of absorbing random walks, which we extend appropriately in order to address the technicalities of our definitions. Although we show the RACP problem to be NP-hard, we propose a general and efficient algorithmic solution. Our experimental results demonstrate the efficacy and the efficiency of our methods in multiple real-world datasets obtained from different application domains.

【Keywords】: markov chains; navigational networks; optimization
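The repetition-aware objective can be illustrated with a Monte-Carlo sketch. This is not the authors' absorbing-random-walk algorithm (which reasons about conversions exactly and optimizes the placement); it only simulates the conversion model, here assumed to be a simple threshold on the number of exposures.

```python
import random

def expected_conversions(adj, placed, f_needed=2, walks=2000, steps=6, seed=0):
    """Monte-Carlo estimate of the conversion rate: each simulated user
    random-walks over the graph and converts once she has seen the content
    (hosted on the nodes in `placed`) at least `f_needed` times."""
    rng = random.Random(seed)
    nodes = list(adj)
    converted = 0
    for _ in range(walks):
        v = rng.choice(nodes)
        exposures = 0
        for _ in range(steps):
            if v in placed:
                exposures += 1
                if exposures >= f_needed:
                    converted += 1
                    break
            v = rng.choice(adj[v])
    return converted / walks

# Toy star graph: every peripheral page links back to a hub page 'h'.
adj = {"h": ["a", "b", "c"], "a": ["h"], "b": ["h"], "c": ["h"]}
rate_hub = expected_conversions(adj, placed={"h"})
rate_leaf = expected_conversions(adj, placed={"c"})
```

Placing content on the hub guarantees repeated exposure on every short walk, while a peripheral placement converts only a fraction of users, which is the intuition the RACP objective formalizes.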

96. Scalable all-pairs similarity search in metric spaces.

Paper Link】 【Pages】:829-837

【Authors】: Ye Wang ; Ahmed Metwally ; Srinivasan Parthasarathy

【Abstract】: Given a set of entities, the all-pairs similarity search aims at identifying all pairs of entities that have similarity greater than (or distance smaller than) some user-defined threshold. In this article, we propose a parallel framework for solving this problem in metric spaces. Novel elements of our solution include: i) flexible support for multiple metrics of interest; ii) an autonomic approach to partition the input dataset with minimal redundancy to achieve good load-balance in the presence of limited computing resources; iii) an on-the-fly lossless compression strategy to reduce both the running time and the final output size. We validate the utility, scalability and the effectiveness of the approach on hundreds of machines using real and synthetic datasets.

【Keywords】: all-pairs similarity search; similarity joins

97. Massively parallel expectation maximization using graphics processing units.

Paper Link】 【Pages】:838-846

【Authors】: Muzaffer Can Altinigneli ; Claudia Plant ; Christian Böhm

【Abstract】: Composed of several hundreds of processors, the Graphics Processing Unit (GPU) has become a very interesting platform for computationally demanding tasks on massive data. A special hierarchy of processors and fast memory units allows very powerful and efficient parallelization but also demands novel parallel algorithms. Expectation Maximization (EM) is a widely used technique for maximum likelihood estimation. In this paper, we propose an innovative EM clustering algorithm particularly suited for the GPU platform on NVIDIA's Fermi architecture. The central idea of our algorithm is to allow the parallel threads to exchange their local information asynchronously and thus update their cluster representatives on demand, by a technique called Asynchronous Model Updates (Async-EM). Async-EM enables our algorithm not only to accelerate convergence but also to reduce the overhead induced by memory bandwidth limitations and synchronization requirements. We demonstrate (1) how to reformulate the EM algorithm to be able to exchange information using Async-EM and (2) how to exploit the special memory and processor architecture of a modern GPU in order to share this information among threads in an optimal way. As a perspective, Async-EM is not limited to EM but can be applied to a variety of algorithms.

【Keywords】: cuda; expectation maximization; fermi; graphics processing unit
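For reference, the EM updates being parallelized above are the standard ones. The sequential sketch below fits a two-component 1-D Gaussian mixture; the paper's contribution is the asynchronous GPU parallelization of these E- and M-steps, not the updates themselves.

```python
import math

def em_gmm_1d(data, iters=50):
    """Standard (sequential) EM for a two-component 1-D Gaussian mixture."""
    mu = [min(data), max(data)]          # crude but effective initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities r[i][k] = P(component k | x_i).
        r = []
        for x in data:
            w = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = sum(w)
            r.append([wk / s for wk in w])
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
            var[k] = sum(ri[k] * (x - mu[k]) ** 2
                         for ri, x in zip(r, data)) / nk + 1e-6
            pi[k] = nk / len(data)
    return mu, var, pi

data = [0.1, -0.2, 0.3, 0.0, 5.1, 4.8, 5.3, 5.0]
mu, var, pi = em_gmm_1d(data)
```

In the GPU setting, each thread block would handle a slice of `data` in the E-step, and Async-EM lets blocks refresh `mu`, `var`, `pi` without a global synchronization barrier.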

98. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms.

Paper Link】 【Pages】:847-855

【Authors】: Chris Thornton ; Frank Hutter ; Holger H. Hoos ; Kevin Leyton-Brown

【Abstract】: Many different machine learning algorithms exist; taking into account each algorithm's hyperparameters, there is a staggeringly large number of possible alternatives overall. We consider the problem of simultaneously selecting a learning algorithm and setting its hyperparameters, going beyond previous work that attacks these issues separately. We show that this problem can be addressed by a fully automated approach, leveraging recent innovations in Bayesian optimization. Specifically, we consider a wide range of feature selection techniques (combining 3 search and 8 evaluator methods) and all classification approaches implemented in WEKA's standard distribution, spanning 2 ensemble methods, 10 meta-methods, 27 base classifiers, and hyperparameter settings for each classifier. On each of 21 popular datasets from the UCI repository, the KDD Cup 09, variants of the MNIST dataset and CIFAR-10, we show classification performance often much better than using standard selection and hyperparameter optimization methods. We hope that our approach will help non-expert users to more effectively identify machine learning algorithms and hyperparameter settings appropriate to their applications, and hence to achieve improved performance.

【Keywords】: hyperparameter optimization; model selection; weka

99. Direct optimization of ranking measures for learning to rank models.

Paper Link】 【Pages】:856-864

【Authors】: Ming Tan ; Tian Xia ; Lily Guo ; Shaojun Wang

【Abstract】: We present a novel learning algorithm, DirectRank, which directly and exactly optimizes ranking measures without resorting to any upper bounds or approximations. Our approach is essentially an iterative coordinate ascent method. In each iteration, we choose one coordinate and only update the corresponding parameter, with all others remaining fixed. Since the ranking measure is a stepwise function of a single parameter, we propose a novel line search algorithm that can locate the interval with the best ranking measure along this coordinate quite efficiently. In order to stabilize our system in small datasets, we construct a probabilistic framework for document-query pairs to maximize the likelihood of the objective permutation of top-$\tau$ documents. This iterative procedure ensures convergence. Furthermore, we integrate regression trees as our weak learners in order to consider the correlation between the different features. Experiments on LETOR datasets and two large datasets, Yahoo challenge data and Microsoft 30K web data, show an improvement over state-of-the-art systems.

【Keywords】: direct optimization; learning to rank; ranking measures; supervised learning
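The line search described above exploits the fact that, as one weight varies, the induced ranking changes only at score-crossing breakpoints, so the ranking measure is stepwise and can be optimized exactly along that coordinate. A minimal sketch of this idea (illustrative; DirectRank adds the probabilistic top-tau framework and regression-tree weak learners):

```python
def best_weight_for_coordinate(base, feat, labels, measure):
    """Exact 1-D line search: scores are base[i] + t * feat[i]; the ranking
    changes only where two scores cross, so evaluate the measure once per
    interval between breakpoints and return the best t."""
    n = len(base)
    cross = set()
    for i in range(n):
        for j in range(i + 1, n):
            if feat[i] != feat[j]:
                cross.add((base[j] - base[i]) / (feat[i] - feat[j]))
    bps = sorted(cross)
    # One representative t per interval (plus points beyond the extremes).
    probes = [bps[0] - 1.0] if bps else [0.0]
    probes += [(a + b) / 2 for a, b in zip(bps, bps[1:])]
    if bps:
        probes.append(bps[-1] + 1.0)

    def score_at(t):
        order = sorted(range(n), key=lambda i: -(base[i] + t * feat[i]))
        return measure([labels[i] for i in order])

    return max(probes, key=score_at)

def precision_at_1(ranked_labels):
    return float(ranked_labels[0])

# Two documents: the relevant one outranks the other only once t is large enough.
base, feat, labels = [1.0, 0.0], [0.0, 1.0], [0, 1]
t = best_weight_for_coordinate(base, feat, labels, precision_at_1)
```

Iterating this search over coordinates gives the coordinate-ascent loop the abstract describes, with convergence because each step can only improve the measure.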

100. Multi-space probabilistic sequence modeling.

Paper Link】 【Pages】:865-873

【Authors】: Shuo Chen ; Jiexun Xu ; Thorsten Joachims

【Abstract】: Learning algorithms that embed objects into Euclidean space have become the methods of choice for a wide range of problems, ranging from recommendation and image search to playlist prediction and language modeling. Probabilistic embedding methods provide elegant approaches to these problems, but can be expensive to train and store as a large monolithic model. In this paper, we propose a method that trains not one monolithic model, but multiple local embeddings for a class of pairwise conditional models especially suited for sequence and co-occurrence modeling. We show that computation and memory for training these multi-space models can be efficiently parallelized over many nodes of a cluster. Focusing on sequence modeling for music playlists, we show that the method substantially speeds up training while maintaining high model quality.

【Keywords】: embedding; music playlists; parallel computing; recommendation; sequences

101. Towards never-ending learning from time series streams.

Paper Link】 【Pages】:874-882

【Authors】: Yuan Hao ; Yanping Chen ; Jesin Zakaria ; Bing Hu ; Thanawin Rakthanmanon ; Eamonn J. Keogh

【Abstract】: Time series classification has been an active area of research in the data mining community for over a decade, and significant progress has been made in the tractability and accuracy of learning. However, virtually all work assumes a one-time training session in which labeled examples of all the concepts to be learned are provided. This assumption may be valid in a handful of situations, but it does not hold in most medical and scientific applications where we initially may have only the vaguest understanding of what concepts can be learned. Based on this observation, we propose a never-ending learning framework for time series in which an agent examines an unbounded stream of data and occasionally asks a teacher (which may be a human or an algorithm) for a label. We demonstrate the utility of our ideas with experiments in domains as diverse as medicine, entomology, wildlife monitoring, and human behavior analyses.

【Keywords】: classification; data streams; never-ending learning; time series

102. Constrained stochastic gradient descent for large-scale least squares problem.

Paper Link】 【Pages】:883-891

【Authors】: Yang Mu ; Wei Ding ; Tianyi Zhou ; Dacheng Tao

【Abstract】: The least squares problem is one of the most important regression problems in statistics, machine learning and data mining. In this paper, we present the Constrained Stochastic Gradient Descent (CSGD) algorithm to solve the large-scale least squares problem. CSGD improves Stochastic Gradient Descent (SGD) by imposing a provable constraint that the linear regression line passes through the mean point of all the data points. It results in the best regret bound $O(\log{T})$ and the fastest convergence speed among all first-order approaches. Empirical studies justify the effectiveness of CSGD by comparing it with SGD and other state-of-the-art approaches. An example is also given to show how to use CSGD to optimize SGD-based least squares problems to achieve a better performance.

【Keywords】: large-scale least squares; online learning; stochastic optimization
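The mean-point constraint is easy to see in one dimension: force the fitted line through (x̄, ȳ) by construction, so only the slope is learned stochastically. The sketch below is an illustrative 1-D instance (with an assumed decaying step size), not the paper's CSGD with its regret analysis.

```python
import random

def constrained_sgd(xs, ys, epochs=200, lr0=0.1):
    """SGD for 1-D least squares with the line constrained to pass through
    the mean point (x_bar, y_bar): fit y = y_bar + w * (x - x_bar), so the
    constraint holds by construction and only the slope w is learned."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    w, t = 0.0, 0
    rng = random.Random(0)
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            xc = xs[i] - x_bar
            err = y_bar + w * xc - ys[i]
            w -= (lr0 / (1 + 0.01 * t)) * err * xc   # decaying step size
    return w, y_bar - w * x_bar                       # slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1
slope, intercept = constrained_sgd(xs, ys)
```

Eliminating the intercept this way removes one source of stochastic noise, which is the intuition behind the improved convergence the abstract claims.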

103. Making recommendations from multiple domains.

Paper Link】 【Pages】:892-900

【Authors】: Wei Chen ; Wynne Hsu ; Mong-Li Lee

【Abstract】: Given the vast amount of information on the World Wide Web, recommender systems are increasingly being used to help filter irrelevant data and suggest information that would interest users. Traditional systems make recommendations based on a single domain, e.g., the movie or book domain. Recent work has examined the correlations in different domains and designed models that exploit user preferences on a source domain to predict user preferences on a target domain. However, these methods are based on matrix factorization and can only be applied to two-dimensional data. Transferring high dimensional data from one domain to another requires decomposing the high dimensional data to binary relations, which results in information loss. Furthermore, this decomposition creates a large number of matrices that need to be transferred, and combining them in the target domain is non-trivial. Separately, researchers have looked into using social network information to improve recommendation. However, this social network information has not been explored in cross domain collaborative filtering. In this work, we propose a generalized cross domain collaborative filtering framework that integrates social network information seamlessly with cross domain data. This is achieved by utilizing tensor factorization with topic based social regularization. This framework is able to transfer high dimensional data without the need for decomposition by finding a shared implicit cluster-level tensor from multiple domains. Extensive experiments conducted on real world datasets indicate that the proposed framework outperforms state-of-the-art algorithms for item recommendation, user recommendation and tag recommendation.

【Keywords】: collaborative filtering; personalization; recommendation; social trust

104. Cascading outbreak prediction in networks: a data-driven approach.

Paper Link】 【Pages】:901-909

【Authors】: Peng Cui ; Shifei Jin ; Linyun Yu ; Fei Wang ; Wenwu Zhu ; Shiqiang Yang

【Abstract】: Cascades are ubiquitous in various network environments such as epidemic networks, traffic networks, water distribution networks and social networks. The outbreaks of cascades will often bring bad or even devastating effects. How to accurately predict cascading outbreaks at an early stage is of paramount importance for avoiding these bad effects. Although there have been some pioneering works on cascading outbreak detection, how to predict, rather than detect, cascading outbreaks is still an open problem. In this paper, we attempt to harness historical cascade data, propose a novel data-driven approach to select important nodes as sensors, and predict outbreaks based on the cascading behaviors of these sensors. In particular, we propose the Orthogonal Sparse LOgistic Regression (OSLOR) method to jointly optimize node selection and outbreak prediction, where the prediction loss is combined with an orthogonal regularizer and an L1 regularizer to guarantee good prediction accuracy, as well as the sparsity and low redundancy of selected sensors. We evaluate the proposed method on a real online social network dataset including 182.7 million information cascades. The experimental results show that the proposed OSLOR significantly and consistently outperforms the topological-measure-based method and other data-driven methods in prediction performance.

【Keywords】: data driven approach; information cascades; outbreak prediction; social network

105. Combining latent factor model with location features for event-based group recommendation.

Paper Link】 【Pages】:910-918

【Authors】: Wei Zhang ; Jianyong Wang ; Wei Feng

【Abstract】: Groups play an essential role in many social websites which promote users' interactions and accelerate the diffusion of information. Recommending groups that users are really interested in joining is significant for both users and social media. While the traditional group recommendation problem has been extensively studied, we focus on a new type of the problem, i.e., event-based group recommendation. Unlike the other forms of groups, users join this type of group mainly for participating in offline events organized by group members or for inviting other users to attend events sponsored by them. These characteristics determine that previously proposed approaches for group recommendation cannot be adapted to the new problem easily as they ignore the geographical influence and other explicit features of groups and users. In this paper, we propose a method called Pairwise Tag enhAnced and featuRe-based Matrix factorIzation for Group recommendAtioN (PTARMIGAN), which considers location features, social features, and implicit patterns simultaneously in a unified model. More specifically, we exploit matrix factorization to model interactions between users and groups. Meanwhile, we incorporate their profile information into pairwise enhanced latent factors respectively. We also utilize the linear model to capture explicit features. Due to the reinforcement between explicit features and implicit patterns, our approach can provide better group recommendations. We conducted a comprehensive performance evaluation on real-world data sets and the experimental results demonstrate the effectiveness of our method.

【Keywords】: event-based group recommendation; latent factor model; location feature

106. Cost-sensitive online active learning with application to malicious URL detection.

【Paper Link】 【Pages】:919-927

【Authors】: Peilin Zhao ; Steven C. H. Hoi

【Abstract】: Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining that plays a critical role in internet security. In the literature, many existing studies have formulated the problem as a regular supervised binary classification task, which typically aims to optimize prediction accuracy. However, in a real-world malicious URL detection task, the ratio between the number of malicious URLs and legitimate URLs is highly imbalanced, making it inappropriate to simply optimize prediction accuracy. Besides, another key limitation of existing work is the assumption that a large amount of training data is available, which is impractical because human labeling can be quite expensive. To address these issues, we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of the training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. In particular, we propose two CSOAL algorithms and analyze their theoretical performance in terms of cost-sensitive bounds. We conduct an extensive set of experiments to examine the empirical performance of the proposed algorithms on a large-scale, challenging malicious URL detection task. The encouraging results show that, by querying an extremely small amount of labeled data (about 0.5% of 1 million instances), the proposed technique achieves classification performance that is better than or highly comparable to state-of-the-art cost-insensitive and cost-sensitive online classification algorithms trained on a huge amount of labeled data.

【Keywords】: active learning; cost-sensitive learning; malicious url detection; online learning
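The two ideas this abstract combines, margin-based label querying and cost-sensitive updates, can be illustrated with a toy online learner. The perceptron-style update, the query rule, and the cost values below are assumptions for illustration, not the CSOAL algorithms themselves:

```python
import random
import numpy as np

def cs_online_active_learn(stream, d, c_pos=9.0, c_neg=1.0, delta=1.0, seed=0):
    """Toy cost-sensitive online active learner (illustrative, not CSOAL).

    Queries a label with probability delta / (delta + |margin|), so confident
    predictions are rarely labeled, and scales mistake-driven updates by a
    class-dependent cost to counter class imbalance.
    """
    rng = random.Random(seed)
    w = np.zeros(d)
    n_queried = 0
    for x, y in stream:                      # y in {-1, +1}
        margin = float(w @ x)
        if rng.random() < delta / (delta + abs(margin)):
            n_queried += 1                   # pay the labeling cost
            if y * margin <= 0:              # mistake on the queried example
                cost = c_pos if y > 0 else c_neg
                w += cost * y * x            # cost-weighted perceptron step
    return w, n_queried

# toy stream: the sign of the first feature determines the label
data = [(np.array([2.0, 0.0]) * s, s) for s in [1, -1] * 100]
w, n_queried = cs_online_active_learn(data, d=2)
```

Once the margin is large, the query probability collapses, so only a small fraction of the stream ends up labeled.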

107. The bang for the buck: fair competitive viral marketing from the host perspective.

【Paper Link】 【Pages】:928-936

【Authors】: Wei Lu ; Francesco Bonchi ; Amit Goyal ; Laks V. S. Lakshmanan

【Abstract】: The key algorithmic problem in viral marketing is to identify a set of influential users (called seeds) in a social network who, when convinced to adopt a product, will influence other users in the network, leading to a large number of adoptions. When two or more players compete with similar products on the same network, we speak of competitive viral marketing, which so far has been studied exclusively from the perspective of one of the competing players. In this paper we propose and study the novel problem of competitive viral marketing from the perspective of the host, i.e., the owner of the social network platform. The host sells viral marketing campaigns as a service to its customers, keeping control of the selection of seeds. Each company specifies its budget and the host allocates the seeds accordingly. From the host's perspective, it is important not only to choose the seeds to maximize the collective expected spread, but also to assign seeds to companies so that the "bang for the buck" is guaranteed to be nearly identical for all companies, which we formalize as the fair seed allocation problem. We propose a new propagation model capturing the competitive nature of viral marketing. Our model is intuitive and retains the desired properties of monotonicity and submodularity. We show that the fair seed allocation problem is NP-hard, and develop an efficient algorithm called Needy Greedy. We run experiments on three real-world social networks, showing that our algorithm is effective and scalable.

【Keywords】: influence propagation; social networks; viral marketing
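For context, the classical building block behind seed selection under monotone submodular objectives is the greedy rule, which gives the well-known (1-1/e) approximation guarantee. The sketch below applies it to a toy coverage objective; it is not the paper's Needy Greedy algorithm, and the influence sets are assumed inputs:

```python
def greedy_seeds(influence_sets, k):
    """Greedy seed selection for a coverage-style submodular objective.

    influence_sets: {candidate: set of nodes it would influence}.
    Repeatedly picks the candidate with the largest marginal gain.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best, best_gain = None, -1
        for node, reach in influence_sets.items():
            if node in chosen:
                continue
            gain = len(reach - covered)      # marginal coverage gain
            if gain > best_gain:
                best, best_gain = node, gain
        chosen.append(best)
        covered |= influence_sets[best]
    return chosen, covered

influence_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6}}
chosen, covered = greedy_seeds(influence_sets, k=2)
```

Fair seed allocation then has to split such a chosen seed set across competing budgets, which is the part the paper shows to be NP-hard.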

108. Modeling the dynamics of composite social networks.

【Paper Link】 【Pages】:937-945

【Authors】: ErHeng Zhong ; Wei Fan ; Yin Zhu ; Qiang Yang

【Abstract】: Modeling the dynamics of online social networks over time not only helps us understand the evolution of network structures and user behaviors, but also improves the performance of other analysis tasks, such as link prediction and community detection. Nowadays, users engage in multiple networks and form a "composite social network" with common users acting as the bridge. State-of-the-art network-dynamics analysis is performed in isolation for individual networks, but users' interactions in one network can influence their behaviors in other networks, and within an individual network, different types of user interactions also affect each other. Without considering the influences across networks, one may not be able to model the dynamics in a given network correctly due to the lack of information. In this paper, we study the problem of modeling the dynamics of composite networks, where the evolution processes of different networks are jointly considered. However, due to differences in network properties, simply merging multiple networks into a single one is not ideal, because individual evolution patterns may be ignored and network differences may have negative impacts. The proposed solution is a nonparametric Bayesian model, which models each user's common latent features to extract the cross-network influences, and uses network-specific factors to describe different networks' evolution patterns. Empirical studies on large-scale dynamic composite social networks demonstrate that the proposed approach improves the performance of link prediction over several state-of-the-art baselines and unfolds the network evolution accurately.

【Keywords】: social network analysis; transfer learning

109. A time-dependent enhanced support vector machine for time series regression.

【Paper Link】 【Pages】:946-954

【Authors】: Goce Ristanoski ; Wei Liu ; James Bailey

【Abstract】: Support Vector Machines (SVMs) are a leading tool in machine learning and have been used with considerable success for the task of time series forecasting. However, a key challenge when using SVMs for time series is the question of how to deeply integrate time elements into the learning process. To address this challenge, we investigated the distribution of errors in the forecasts delivered by standard SVMs. Once we identified the samples that produced the largest errors, we observed their correlation with distribution shifts that occur in the time series. This motivated us to propose a time-dependent loss function which allows the inclusion of the information about the distribution shifts in the series directly into the SVM learning process. We present experimental results which indicate that using a time-dependent loss function is highly promising, reducing the overall variance of the errors, as well as delivering more accurate predictions.

【Keywords】: loss function; support vector machine; time series
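The idea of folding time information into the loss can be illustrated with per-sample loss weights. The subgradient solver and the hinge loss below (classification rather than the paper's regression setting) are illustrative assumptions, with the time-dependent weighting left as a plug-in:

```python
import numpy as np

def weighted_hinge_svm(X, y, sample_weight, lam=0.01, lr=0.01, n_iter=500):
    """Linear SVM with per-sample loss weights, via subgradient descent.

    A time-dependent loss for series data can be emulated by passing larger
    weights for samples near detected distribution shifts (or recent ones).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)
        active = margins < 1                 # samples violating the margin
        grad = lam * w - (sample_weight[active, None]
                          * y[active, None] * X[active]).sum(axis=0) / n
        w -= lr * grad
    return w

# separable toy data with uniform weights
X = np.array([[2.0], [3.0], [-2.0], [-3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = weighted_hinge_svm(X, y, sample_weight=np.ones(4))
```

The paper's contribution is in how those weights are derived from distribution shifts in the series, not in the weighted solver itself.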

110. A new collaborative filtering approach for increasing the aggregate diversity of recommender systems.

【Paper Link】 【Pages】:955-963

【Authors】: Katja Niemann ; Martin Wolpers

【Abstract】: In order to satisfy and positively surprise its users, a recommender system needs to recommend items the users will like and most probably would not have found on their own. This requires the recommender system to recommend a broader range of items, including niche items. Such an approach also supports online stores, which often offer more items than traditional stores and need recommender systems that enable users to find the less popular items as well. However, popular items with a lot of usage data are easier to recommend, and thus niche items are often excluded from the recommendations. In this paper, we propose a new collaborative filtering approach that is based on the items' usage contexts. The approach increases the rating predictions for niche items with less available usage data and improves the aggregate diversity of the recommendations.

【Keywords】: aggregate diversity; item-item similarity; long tail; niche items; recommender systems; usage context

111. Scalable inference in max-margin topic models.

【Paper Link】 【Pages】:964-972

【Authors】: Jun Zhu ; Xun Zheng ; Li Zhou ; Bo Zhang

【Abstract】: Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. By incorporating advanced learning techniques, the predictive strength of STMs has been dramatically enhanced; max-margin supervised topic models are state-of-the-art methods that integrate max-margin learning with topic models. Though powerful, max-margin STMs pose a hard, non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems in an EM-type procedure, which can be too slow to be applicable to large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. Our approach builds on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple "augment-and-collapse" Gibbs sampling algorithm that makes no restrictive assumptions on the posterior distributions; 3) an efficient parallel implementation that can easily tackle data sets with hundreds of categories and millions of documents. Furthermore, our algorithm does not need to solve SVM subproblems. Though our methods jointly perform topic discovery and learn predictive models, which significantly improves classification performance, their scalability is comparable to that of state-of-the-art parallel algorithms for standard LDA topic models, which perform only topic discovery. Finally, an open-source implementation is provided at: http://www.ml-thu.net/~jun/medlda.

【Keywords】: inference; large-scale systems; max-margin learning; topic models

112. A data-driven method for in-game decision making in MLB: when to pull a starting pitcher.

【Paper Link】 【Pages】:973-979

【Authors】: Gartheeban Ganeshapillai ; John V. Guttag

【Abstract】: Professional sports is roughly a $500 billion industry that is increasingly data-driven. In this paper we show how machine learning can be applied to generate a model that could lead to better on-field decisions by managers of professional baseball teams. Specifically, we show how to use regularized linear regression to learn pitcher-specific predictive models that can be used to help decide when a starting pitcher should be replaced. A key step in the process is our method of converting categorical variables (e.g., the venue in which a game is played) into continuous variables suitable for regression. Another key step is dealing with situations in which there is an insufficient amount of data to compute measures such as the effectiveness of a pitcher against specific batters. For each season, we trained on the first 80% of the games and tested on the rest. The results suggest that using our model could have led to better decisions than those made by major league managers. Applying our model would have led to a different decision 48% of the time. For those games in which a manager left in a pitcher that our model would have removed, the pitcher ended up performing poorly 60% of the time.

【Keywords】: mlb; predictive modeling
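One common way to convert a categorical variable into a continuous one for regression, which also copes with categories that have little data, is smoothed target (mean) encoding. The scheme below is a generic illustration, not necessarily the authors' method, and the venue example is hypothetical:

```python
def target_encode(values, targets, prior_weight=2.0):
    """Smoothed target (mean) encoding of a categorical variable.

    Each category maps to a blend of its own mean target and the global
    mean; categories with few observations shrink toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0.0) + t
        counts[v] = counts.get(v, 0) + 1
    return {
        v: (sums[v] + prior_weight * global_mean) / (counts[v] + prior_weight)
        for v in sums
    }

# e.g., encoding a game-venue variable by historical pitcher performance
enc = target_encode(["home", "home", "away", "away"], [1.0, 1.0, 0.0, 0.0])
```

The `prior_weight` knob controls how strongly sparse categories are pulled toward the global mean, which is exactly the insufficient-data situation the abstract mentions.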

113. Exploiting user clicks for automatic seed set generation for entity matching.

【Paper Link】 【Pages】:980-988

【Authors】: Xiao Bai ; Flavio Paiva Junqueira ; Srinivasan H. Sengamedu

【Abstract】: Matching entities from different information sources is a very important problem in data analysis and data integration. It is, however, challenging due to the number and diversity of information sources involved, and the significant editorial effort required to collect sufficient training data. In this paper, we present an approach that leverages user clicks during Web search to automatically generate training data for entity matching. The key insight of our approach is that Web pages clicked for a given query are likely to be about the same entity. We use random walk with restart to reduce data sparseness, rely on co-clustering to group queries and Web pages, and exploit page similarity to improve matching precision. Experimental results show that: (i) with 360K pages from 6 major travel websites, we obtain 84K matchings (covering 179K pages) that refer to the same entities, with an average precision of 0.826; (ii) the quality of matching obtained from a classifier trained on the resulting seed data is promising: its performance matches that of editorial data at small sizes and improves with size.

【Keywords】: co-clustering; entity matching; random walk; user clicks
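Random walk with restart, the sparseness-reduction step mentioned above, has a compact power-iteration form. The restart probability and the toy graph below are illustrative assumptions:

```python
import numpy as np

def random_walk_with_restart(A, start, alpha=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart (RWR) proximity scores.

    The walker follows column-normalized edges of adjacency matrix A and,
    with probability alpha, jumps back to the start node; the stationary
    vector gives a smoothed relevance of every node to the start node.
    """
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)  # column-stochastic
    r = np.zeros(n)
    r[start] = 1.0
    p = r.copy()
    for _ in range(max_iter):
        p_new = (1 - alpha) * (P @ p) + alpha * r
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p

# triangle graph: all nodes connected; scores concentrate on the start node
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
p = random_walk_with_restart(A, start=0)
```

In a click graph, running RWR from a query spreads relevance to pages that were never directly clicked for it, which is what alleviates sparsity.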

114. Silence is also evidence: interpreting dwell time for recommendation from psychological perspective.

【Paper Link】 【Pages】:989-997

【Authors】: Peifeng Yin ; Ping Luo ; Wang-Chien Lee ; Min Wang

【Abstract】: Social media is a platform for people to share and vote on content. From analysis of social media data we found that users are quite inactive in rating/voting. For example, a user on average votes on only 2 out of 100 accessed items. Traditional recommendation methods are mostly based on users' votes and thus cannot cope with this situation. Based on the observation that the dwell time on an item may reflect the opinion of a user, we aim to enrich the user-vote matrix by converting the dwell time on items into users' "pseudo votes" and thereby help improve recommendation performance. However, it is challenging to correctly interpret dwell time, since many subjective human factors, e.g., user expectation, sensitivity to various item qualities, and reading speed, are involved in the casual behavior of online reading. In psychology, it is assumed that people have a choice threshold in decision making, and the time spent making a decision reflects the decision maker's threshold. This idea inspires us to develop a View-Voting (VV) model, which can estimate how much a user likes a viewed item according to her dwell time, and thus make recommendations even if no voting data is available. Finally, our experimental evaluation shows that the performance of traditional rating-based recommendation is greatly improved with the support of the VV model.

【Keywords】: dwell time; psychological; recommendation

115. Efficient single-source shortest path and distance queries on large graphs.

【Paper Link】 【Pages】:998-1006

【Authors】: Andy Diwen Zhu ; Xiaokui Xiao ; Sibo Wang ; Wenqing Lin

【Abstract】: This paper investigates two types of graph queries: single source distance (SSD) queries and single source shortest path (SSSP) queries. Given a node v in a graph G, an SSD query from v asks for the distance from v to any other node in G, while an SSSP query retrieves the shortest path from v to any other node. These two types of queries find important applications in graph analysis, especially in the computation of graph measures. Most of the existing solutions for SSD and SSSP queries, however, require that the input graph fits in the main memory, which renders them inapplicable for the massive disk-resident graphs commonly used in web and social applications. There are several techniques that are designed to be I/O efficient, but they all focus on undirected and/or unweighted graphs, and they only offer sub-optimal query efficiency. To address the deficiency of existing work, this paper presents Highways-on-Disk (HoD), a disk-based index that supports both SSD and SSSP queries on directed and weighted graphs. The key idea of HoD is to augment the input graph with a set of auxiliary edges, and exploit them during query processing to reduce I/O and computation costs. We experimentally evaluate HoD on both directed and undirected real-world graphs with up to billions of nodes and edges, and we demonstrate that HoD significantly outperforms alternative solutions in terms of query efficiency.

【Keywords】: algorithm; distance queries; graph; shortest path queries
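For reference, the in-memory baseline that a disk-based index like HoD competes with is Dijkstra's algorithm, which answers an SSD query and yields the shortest-path tree at the same time. A standard textbook sketch:

```python
import heapq

def dijkstra(graph, source):
    """In-memory single-source distances and shortest-path tree (Dijkstra).

    graph: {node: [(neighbor, weight), ...]} with non-negative weights.
    Returns (dist, parent); follow `parent` back from any node to recover
    its shortest path from the source.
    """
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                          # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], parent[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, parent

graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
dist, parent = dijkstra(graph, "a")
```

The random-access pattern of the priority queue is exactly what becomes expensive once the graph no longer fits in memory, motivating disk-aware designs.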

116. On community detection in real-world networks and the importance of degree assortativity.

【Paper Link】 【Pages】:1007-1015

【Authors】: Marek Ciglan ; Michal Laclavik ; Kjetil Nørvåg

【Abstract】: Graph clustering, often addressed as community detection, is a prominent task in the domain of graph data mining with dozens of algorithms proposed in recent years. In this paper, we focus on several popular community detection algorithms with low computational complexity and with decent performance on the artificial benchmarks, and we study their behaviour on real-world networks. Motivated by the observation that there is a class of networks for which the community detection methods fail to deliver good community structure, we examine the assortativity coefficient of ground-truth communities and show that assortativity of a community structure can be very different from the assortativity of the original network. We then examine the possibility of exploiting the latter by weighting edges of a network with the aim to improve the community detection outputs for networks with assortative community structure. The evaluation shows that the proposed weighting can significantly improve the results of community detection methods on networks with assortative community structure.

【Keywords】: community detection; edge weighting; network assortativity
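The degree assortativity coefficient discussed here is the Pearson correlation between the degrees at the two endpoints of each edge. A self-contained sketch (the star-graph example, whose assortativity is exactly -1, is added for illustration):

```python
import numpy as np

def degree_assortativity(edges):
    """Degree assortativity of an undirected graph.

    Pearson correlation between the degrees found at the two endpoints of
    each edge, counting both orientations of every undirected edge.
    """
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    x = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
    y = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]
    return float(np.corrcoef(x, y)[0, 1])

# a star graph is maximally disassortative: the hub only touches leaves
r_star = degree_assortativity([(0, 1), (0, 2), (0, 3)])
```

The textbook definition uses the remaining degree (degree minus one); since Pearson correlation is invariant to a constant shift, the coefficient is the same.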

117. Trial and error in influential social networks.

【Paper Link】 【Pages】:1016-1024

【Authors】: Xiaohui Bei ; Ning Chen ; Liyu Dou ; Xiangru Huang ; Ruixin Qiang

【Abstract】: In this paper, we introduce a trial-and-error model to study information diffusion in a social network. Specifically, in every discrete period, all individuals in the network concurrently try a new technology or product with certain respective probabilities. If it turns out that an individual observes a better utility, he will then adopt the trial; otherwise, the individual continues to choose his prior selection. We first demonstrate that the trial and error behavior of individuals characterizes certain global community structures of a social network, from which we are able to detect macro-communities through the observation of micro-behavior of individuals. We run simulations on classic benchmark testing graphs, and quite surprisingly, the results show that the trial and error dynamics even outperforms the Louvain method (a popular modularity maximization approach) if individuals have dense connections within communities. This gives a solid justification of the model. We then study the influence maximization problem in the trial-and-error dynamics. We give a heuristic algorithm based on community detection and provide experiments on both testing and large scale collaboration networks. Simulation results show that our algorithm significantly outperforms several well-studied heuristics including degree centrality and distance centrality in almost all of the scenarios. Our results reveal the relation between the budget that an advertiser invests and marketing strategies, and indicate that the mixing parameter, a benchmark evaluating network community structures, plays a critical role for information diffusion.

【Keywords】: social networks; trial and error

118. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions.

【Paper Link】 【Pages】:1025-1033

【Authors】: Xiaodong Zheng ; Hao Ding ; Hiroshi Mamitsuka ; Shanfeng Zhu

【Abstract】: We address the problem of predicting new drug-target interactions from three inputs: known interactions, similarities over drugs, and similarities over targets. This setting has been considered by many methods, which, however, share a common limitation: they allow only one similarity matrix over drugs and one over targets. The key idea of our approach is to use more than one similarity matrix over drugs as well as over targets, where weights over the multiple similarity matrices are estimated from data to automatically select the similarities that are effective for improving the performance of predicting drug-target interactions. We propose a factor model, named Multiple Similarities Collaborative Matrix Factorization (MSCMF), which projects drugs and targets into a common low-rank feature space that is further made consistent with the weighted similarity matrices over drugs and over targets. The two low-rank matrices and the weights over similarity matrices are estimated by an alternating least squares algorithm. Our approach allows drug-target interactions to be predicted by the two low-rank matrices collaboratively and detects similarities that are important for predicting drug-target interactions. The approach is general and applicable to any binary relation with similarities over its elements, as found in many applications, such as recommender systems. In fact, MSCMF is an extension of weighted low-rank approximation for one-class collaborative filtering. We extensively evaluated the performance of MSCMF using both synthetic and real datasets. Experimental results showed that MSCMF selects similarities that are useful for improving predictive performance, and that it outperforms six state-of-the-art methods for predicting drug-target interactions.

【Keywords】: chemoinformatics; drug-target interaction; multiple types of similarities over drug and targets; weighted low-rank approximation
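The alternating least squares backbone of such factor models can be sketched for the plain factorization Y ≈ AB^T. The ridge term, rank, and toy matrix below are illustrative assumptions; the similarity-matrix coupling and weight estimation that define MSCMF are omitted:

```python
import numpy as np

def als_factorize(Y, k=2, lam=0.1, n_iter=50, seed=0):
    """Plain alternating least squares for Y ~= A @ B.T (rank k, ridge lam).

    Each step solves a ridge-regularized least-squares problem for one
    factor with the other held fixed; both updates are in closed form.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    A = rng.standard_normal((n, k)) * 0.1
    B = rng.standard_normal((m, k)) * 0.1
    I = lam * np.eye(k)
    for _ in range(n_iter):
        A = Y @ B @ np.linalg.inv(B.T @ B + I)   # update drug factors
        B = Y.T @ A @ np.linalg.inv(A.T @ A + I) # update target factors
    return A, B

# rank-1 toy interaction matrix is recovered almost exactly
Y = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])
A, B = als_factorize(Y)
```

MSCMF adds terms tying A and B to the weighted drug/target similarity matrices, but each of its alternating updates keeps this same closed-form least-squares shape.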

119. FeaFiner: biomarker identification from medical data through feature generalization and selection.

【Paper Link】 【Pages】:1034-1042

【Authors】: Jiayu Zhou ; Zhaosong Lu ; Jimeng Sun ; Lei Yuan ; Fei Wang ; Jieping Ye

【Abstract】: Traditionally, feature construction and feature selection are two important but separate processes in data mining. However, many real-world applications require an integrated approach for creating, refining and selecting features. To address this problem, we propose FeaFiner (short for Feature Refiner), an efficient formulation that simultaneously generalizes low-level features into higher level concepts and then selects relevant concepts based on the target variable. Specifically, we formulate a double sparsity optimization problem that identifies groups in the low-level features, generalizes higher level features using the groups, and performs feature selection. Since non-overlapping groups are preferred in many clinical studies for better interpretability, we further improve the formulation to generalize features using mutually exclusive feature groups. The proposed formulation is challenging to solve due to the orthogonality constraints, the non-convex objective and the non-smooth penalties. We apply a recently developed augmented Lagrangian method to solve this formulation, in which each subproblem is solved by a non-monotone spectral projected gradient method. Our numerical experiments show that this approach is computationally efficient and capable of producing solutions of high quality. We also present a generalization bound showing the consistency and the asymptotic behavior of the learning process of our proposed formulation. Finally, the proposed FeaFiner method is validated on the Alzheimer's Disease Neuroimaging Initiative dataset, where low-level biomarkers are automatically generalized into robust higher level concepts which are then selected for predicting disease status as measured by the Mini Mental State Examination and the Alzheimer's Disease Assessment Scale cognitive subscore. Compared to existing predictive modeling methods, FeaFiner provides intuitive and robust feature concepts and competitive predictive accuracy.

【Keywords】: augmented lagrangian; biomarkers; feature generalization; feature selection; sparse learning; spectral gradient descent

120. Learning geographical preferences for point-of-interest recommendation.

【Paper Link】 【Pages】:1043-1051

【Authors】: Bin Liu ; Yanjie Fu ; Zijun Yao ; Hui Xiong

【Abstract】: The problem of point-of-interest (POI) recommendation is to provide personalized recommendations of places of interest, such as restaurants, for mobile users. Owing to its complexity and its connection to location-based social networks (LBSNs), the decision process by which a user chooses a POI is complex and can be influenced by various factors, such as user preferences, geographical influences, and user mobility behaviors. While there are some studies on POI recommendation, they lack an integrated analysis of the joint effect of multiple factors. To this end, in this paper, we propose a novel geographical probabilistic factor analysis framework which strategically takes the various factors into consideration. Specifically, this framework captures the geographical influences on a user's check-in behavior. Also, user mobility behaviors can be effectively exploited in the recommendation model. Moreover, the recommendation model can effectively make use of user check-in count data as implicit user feedback for modeling user preferences. Finally, experimental results on real-world LBSN data show that the proposed recommendation method outperforms state-of-the-art latent factor models by a significant margin.

【Keywords】: human mobility; location-based social networks; point-of-interest; recommender systems; user profiling

121. Learning mixed kronecker product graph models with simulated method of moments.

【Paper Link】 【Pages】:1052-1060

【Authors】: Sebastián Moreno ; Jennifer Neville ; Sergey Kirshner

【Abstract】: There has recently been a great deal of work focused on developing statistical models of graph structure---with the goal of modeling probability distributions over graphs from which new, similar graphs can be generated by sampling from the estimated distributions. Although current graph models can capture several important characteristics of social network graphs (e.g., degree, path lengths), many of them do not generate graphs with sufficient variation to reflect the natural variability in real world graph domains. One exception is the mixed Kronecker Product Graph Model (mKPGM), a generalization of the Kronecker Product Graph Model, which uses parameter tying to capture variance in the underlying distribution [10]. The enhanced representation of mKPGMs enables them to match both the mean graph statistics and their spread as observed in real network populations, but unfortunately to date, the only method to estimate mKPGMs involves an exhaustive search over the parameters. In this work, we present the first learning algorithm for mKPGMs. The O(|E|) algorithm searches over the continuous parameter space using constrained line search and is based on simulated method of moments, where the objective function minimizes the distance between the observed moments in the training graph and the empirically estimated moments of the model. We evaluate the mKPGM learning algorithm by comparing it to several different graph models, including KPGMs. We use multi-dimensional KS distance to compare the generated graphs to the observed graphs and the results show mKPGMs are able to produce a closer match to real-world graphs (10-90% reduction in KS distance), while still providing natural variation in the generated graphs.

【Keywords】: kronecker models; link analysis; method of moments estimation; statistical graph models

122. Measuring spontaneous devaluations in user preferences.

【Paper Link】 【Pages】:1061-1069

【Authors】: Komal Kapoor ; Nisheeth Srivastava ; Jaideep Srivastava ; Paul R. Schrater

【Abstract】: Spontaneous devaluation in preferences is ubiquitous: yesterday's hit is today's affliction. Despite technological advances facilitating access to a wide range of media commodities, finding engaging content is a major enterprise with few principled solutions. Systems tracking spontaneous devaluation in user preferences can predict the onset of boredom in users and potentially cater to their changed needs. In this work, we study the music listening histories of Last.fm users, focusing on the changes in their preferences based on their choices of different artists at different points in time. A hazard function, commonly used in statistics for survival analysis, captures the rate at which a user returns to an artist as a function of exposure to the artist. The analysis provides the first evidence of spontaneous devaluation in the preferences of music listeners. Better understanding of the temporal dynamics of this phenomenon can inform solutions to the similarity-diversity dilemma of recommender systems.

【Keywords】: dynamic preferences; recommender systems; temporal models; user behavior modeling
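An empirical discrete-time hazard, the kind of quantity the abstract uses to capture the rate of returning to an artist, can be estimated directly from observed return gaps. This is a generic estimator, not the paper's exact model, and the gap data is hypothetical:

```python
from collections import Counter

def empirical_hazard(gaps):
    """Empirical discrete-time hazard from observed return gaps.

    hazard[t] = (returns that happened exactly at gap t) /
                (users still "away" at the start of gap t),
    i.e., of those who had not yet returned, the fraction returning at t.
    """
    events = Counter(gaps)          # gap length -> number of returns at it
    n_at_risk = len(gaps)
    hazard = {}
    for t in sorted(events):
        hazard[t] = events[t] / n_at_risk
        n_at_risk -= events[t]
    return hazard

# toy gaps (in listening sessions) before users returned to an artist
h = empirical_hazard([1, 1, 2, 3])
```

A hazard that falls with exposure would indicate devaluation: the more a user has recently played an artist, the less likely an imminent return.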

123. Mining evidences for named entity disambiguation.

【Paper Link】 【Pages】:1070-1078

【Authors】: Yang Li ; Chi Wang ; Fangqiu Han ; Jiawei Han ; Dan Roth ; Xifeng Yan

【Abstract】: Named entity disambiguation is the task of disambiguating named entity mentions in natural language text and linking them to their corresponding entries in a knowledge base such as Wikipedia. Such disambiguation can help enhance readability and add semantics to plain text. It is also a central step in constructing a high-quality information network or knowledge graph from unstructured text. Previous research has tackled this problem by making use of various textual and structural features from a knowledge base. Most of the proposed algorithms assume that a knowledge base can provide enough explicit and useful information to help disambiguate a mention to the right entity. However, existing knowledge bases are rarely complete (and likely never will be), leading to poor performance on short queries whose contexts are not well known. In such cases, we need to collect additional evidences scattered across internal and external corpora to augment the knowledge bases and enhance their disambiguation power. In this work, we propose a generative model and an incremental algorithm to automatically mine useful evidences across documents. With specific modeling of "background topics" and "unknown entities", our model is able to harvest useful evidences out of noisy information. Experimental results show that our proposed method outperforms the state-of-the-art approaches significantly: boosting the disambiguation accuracy from 43% (baseline) to 86% on short queries derived from tweets.

【Keywords】: entity disambiguation; evidence mining; generative model; knowledge expansion; semi-supervised learning

124. Privacy-preserving data exploration in genome-wide association studies.

【Paper Link】 【Pages】:1079-1087

【Authors】: Aaron Johnson ; Vitaly Shmatikov

【Abstract】: Genome-wide association studies (GWAS) have become a popular method for analyzing sets of DNA sequences in order to discover the genetic basis of disease. Unfortunately, statistics published as the result of GWAS can be used to identify individuals participating in the study. To prevent privacy breaches, even previously published results have been removed from public databases, impeding researchers' access to the data and hindering collaborative research. Existing techniques for privacy-preserving GWAS focus on answering specific questions, such as correlations between a given pair of SNPs (DNA sequence variations). This does not fit the typical GWAS process, where the analyst may not know in advance which SNPs to consider and which statistical tests to use, how many SNPs are significant for a given dataset, etc. We present a set of practical, privacy-preserving data mining algorithms for GWAS datasets. Our framework supports exploratory data analysis, where the analyst does not know a priori how many and which SNPs to consider. We develop privacy-preserving algorithms for computing the number and location of SNPs that are significantly associated with the disease, the significance of any statistical test between a given SNP and the disease, any measure of correlation between SNPs, and the block structure of correlations. We evaluate our algorithms on real-world datasets and demonstrate that they produce significantly more accurate results than prior techniques while guaranteeing differential privacy.

【Keywords】: differential privacy; genome-wide association studies

125. Synthetic review spamming and defense.

Paper Link】 【Pages】:1088-1096

【Authors】: Huan Sun ; Alex Morales ; Xifeng Yan

【Abstract】: Online reviews have been popularly adopted in many applications. Since they can either promote or harm the reputation of a product or a service, buying and selling fake reviews becomes a profitable business and a big threat. In this paper, we introduce a very simple, but powerful review spamming technique that can easily defeat existing feature-based detection algorithms. It uses one truthful review as a template, and replaces its sentences with those from other reviews in a repository. Fake reviews generated by this mechanism are extremely hard to detect: both the state-of-the-art computational approaches and human readers incur an error rate of 35%-48%, only slightly better than random guessing. While it is challenging to detect such fake reviews, we have made solid progress in suppressing them. A novel defense method that leverages the difference of semantic flows between synthetic and truthful reviews is developed, which is able to reduce the detection error rate to approximately 22%, a significant improvement over the performance of existing approaches. Nevertheless, it is still a challenging research task to further decrease the error rate. Synthetic Review Spamming Demo: www.cs.ucsb.edu/~alex_morales/reviewspam/

【Keywords】: classification; review spam; spam detection

126. Information cartography: creating zoomable, large-scale maps of information.

Paper Link】 【Pages】:1097-1105

【Authors】: Dafna Shahaf ; Jaewon Yang ; Caroline Suen ; Jeff Jacobs ; Heidi Wang ; Jure Leskovec

【Abstract】: In an era of information overload, many people struggle to make sense of complex stories, such as presidential elections or economic reforms. We propose a methodology for creating structured summaries of information, which we call zoomable metro maps. Just as cartographic maps have been relied upon for centuries to help us understand our surroundings, metro maps can help us understand the information landscape. Given a large collection of news documents, our proposed algorithm generates a map of connections that explicitly captures story development. As different users might be interested in different levels of granularity, the maps are zoomable, with each level of zoom showing finer details and interactions. In this paper, we formalize characteristics of good zoomable maps and formulate their construction as an optimization problem. We provide efficient, scalable methods with theoretical guarantees for generating maps. Pilot user studies over real-world datasets demonstrate that our method helps users comprehend complex stories better than prior work.

【Keywords】: information; summarization; zoomable metro maps

127. Restreaming graph partitioning: simple versatile algorithms for advanced balancing.

Paper Link】 【Pages】:1106-1114

【Authors】: Joel Nishimura ; Johan Ugander

【Abstract】: Partitioning large graphs is difficult, especially when performed in the limited models of computation afforded to modern large scale computing systems. In this work we introduce restreaming graph partitioning and develop algorithms that scale similarly to streaming partitioning algorithms yet empirically perform as well as fully offline algorithms. In streaming partitioning, graphs are partitioned serially in a single pass. Restreaming partitioning is motivated by scenarios where approximately the same dataset is routinely streamed, making it possible to transform streaming partitioning algorithms into an iterative procedure. This combination of simplicity and powerful performance allows restreaming algorithms to be easily adapted to efficiently tackle more challenging partitioning objectives. In particular, we consider the problem of stratified graph partitioning, where each of many node-attribute strata is balanced simultaneously. As such, stratified partitioning is well suited for the study of network effects on social networks, where it is desirable to isolate disjoint dense subgraphs with representative user demographics. To demonstrate, we partition a large social network such that each partition exhibits the same degree distribution as the original graph --- a novel achievement for non-regular graphs. As part of our results, we also observe a fundamental difference in the ease with which social graphs are partitioned when compared to web graphs. Namely, the modular structure of web graphs appears to motivate full offline optimization, whereas the locally dense structure of social graphs precludes significant gains from global manipulations.
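As a rough illustration of the restreaming idea (a sketch under assumptions, not the authors' exact algorithm), one can repeat a streaming heuristic such as linear deterministic greedy over multiple passes, letting each pass see the previous pass's assignments:

```python
from collections import defaultdict

def restream_ldg(adj, k, passes=5, eps=0.1):
    """Restreaming linear-deterministic-greedy sketch.

    adj: {node: set of neighbours}; k: number of parts.
    Each pass streams the nodes in a fixed order; a node joins the part
    holding most of its neighbours, weighted by remaining capacity.
    Neighbour assignments fall back to the previous pass when the current
    pass has not yet reached them."""
    cap = (1 + eps) * len(adj) / k           # per-part capacity
    part = {}                                 # assignments from previous pass
    for _ in range(passes):
        size = defaultdict(int)
        new = {}
        for v in sorted(adj):
            best, best_score = 0, float("-inf")
            for p in range(k):
                nbrs = sum(1 for u in adj[v]
                           if new.get(u, part.get(u)) == p)
                score = (1 - size[p] / cap) * nbrs
                if score > best_score and size[p] < cap:
                    best, best_score = p, score
            new[v] = best
            size[best] += 1
        part = new
    return part

# toy graph: two triangles joined by a single edge
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
part = restream_ldg(adj, k=2)
```

After the first pass the cut can land awkwardly; subsequent passes, seeded by the previous assignment, settle on separating the two triangles.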

【Keywords】: balanced partitioning; graph clustering; multi-constraint balance; social networks; stratified partitioning

128. Understanding evolution of research themes: a probabilistic generative model for citations.

Paper Link】 【Pages】:1115-1123

【Authors】: Xiaolong Wang ; Chengxiang Zhai ; Dan Roth

【Abstract】: Understanding how research themes evolve over time in a research community is useful in many ways (e.g., revealing important milestones and discovering emerging major research trends). In this paper, we propose a novel way of analyzing literature citation to explore the research topics and the theme evolution by modeling article citation relations with a probabilistic generative model. The key idea is to represent a research paper by a "bag of citations" and model such a "citation document" with a probabilistic topic model. We explore the extension of a particular topic model, i.e., Latent Dirichlet Allocation (LDA), for citation analysis, and show that such a Citation-LDA can facilitate the discovery of individual research topics as well as the theme evolution from multiple related topics, both of which in turn lead to the construction of evolution graphs for characterizing research themes. We test the proposed Citation-LDA on two datasets: the ACL Anthology Network (AAN) of natural language research literature and the PubMed Central (PMC) archive of biomedical and life sciences literature, and demonstrate that Citation-LDA can effectively discover the evolution of research themes, with better formed topics than (conventional) Content-LDA.
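Since a "citation document" is just a bag of cited-paper ids, any standard LDA inference applies to it. A toy sketch with collapsed Gibbs sampling (hypothetical data; the paper's model may differ in details):

```python
import numpy as np

def lda_gibbs(docs, V, K, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over 'citation documents':
    each doc is a list of cited-paper ids in [0, V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))        # document-topic counts
    nkw = np.zeros((K, V))                # topic-citation counts
    nk = np.zeros(K)                      # topic totals
    z = []
    for d, doc in enumerate(docs):        # random initial topic assignments
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]               # remove current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional for the topic of this citation
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# hypothetical corpus: each paper is represented by the bag of ids it cites
docs = [[0, 1, 2], [0, 1, 2, 2], [3, 4, 5], [3, 3, 4, 5]]
doc_topic, topic_cite = lda_gibbs(docs, V=6, K=2, iters=30)
```

The resulting topic-citation counts play the role of "research themes" whose prominence can then be tracked over publication years.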

【Keywords】: citation analysis; theme evolution

129. On the equivalent of low-rank linear regressions and linear discriminant analysis based regressions.

Paper Link】 【Pages】:1124-1132

【Authors】: Xiao Cai ; Chris H. Q. Ding ; Feiping Nie ; Heng Huang

【Abstract】: The low-rank regression model has been studied and applied to capture the underlying classes/tasks correlation patterns, such that the regression/classification results can be enhanced. In this paper, we will prove that the low-rank regression model is equivalent to doing linear regression in the linear discriminant analysis (LDA) subspace. Our new theory reveals the learning mechanism of low-rank regression, and shows that the low-rank structures extracted from classes/tasks are connected to the LDA projection results. Thus, low-rank regression works efficiently for high-dimensional data. Moreover, we will propose new discriminant low-rank ridge regression and sparse low-rank regression methods. Both of them are equivalent to doing regularized regression in the regularized LDA subspace. These new regularized objectives provide better data mining results than existing low-rank regression in both theoretical and empirical validations. We evaluate our discriminant low-rank regression methods on six benchmark datasets. In all empirical results, our discriminant low-rank models consistently show better results than the corresponding full-rank methods.
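The equivalence concerns low-rank regression; as a generic illustration (not the authors' discriminant variants), a rank-constrained least-squares fit can be obtained by ordinary least squares followed by projection onto the top singular directions of the fitted values:

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Rank-r least squares: OLS fit, then projection of the fitted
    values onto their top-r right singular subspace (Eckart-Young)."""
    W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ W_ols, full_matrices=False)
    return W_ols @ Vt[:r].T @ Vt[:r]        # rank-r coefficient matrix

# synthetic example with a rank-1 ground-truth coefficient matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
W_true = rng.normal(size=(8, 1)) @ rng.normal(size=(1, 5))
Y = X @ W_true + 0.01 * rng.normal(size=(100, 5))
W_hat = reduced_rank_regression(X, Y, r=1)
```

The rank constraint ties the output tasks together, which is exactly the classes/tasks correlation structure the abstract refers to.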

【Keywords】: linear discriminant analysis; low-rank regression; low-rank ridge regression; sparse low-rank regression

Industry practice expo invited presentations 8

130. To buy or not to buy: that is the question.

Paper Link】 【Pages】:1133

【Authors】: Oren Etzioni

【Abstract】: Shopping can be decomposed into three basic questions: what, where, and when to buy? In this talk, I'll describe how we utilize advanced data-mining and text-mining techniques at Decide.com (and earlier at Farecast) to solve these problems for on-line shoppers. Our algorithms have predicted prices utilizing billions of data points, and ranked products based on millions of reviews.

【Keywords】: data mining; retail products

131. Mining the digital universe of data to develop personalized cancer therapies.

Paper Link】 【Pages】:1134

【Authors】: Eric E. Schadt

【Abstract】: The development of a personalized approach to medical care is now well recognized as an urgent priority. This approach is particularly important in oncology, where it is well understood that each cancer diagnosis is unique at the molecular level, arising from a particular and specific collection of genetic alterations. Furthermore, taking a personalized approach to oncology may expedite the treatment process, pre-empting therapeutic decisions based on fewer data in favor of treatments targeted to an individual's tumor. This directed course may be key to survival for many patients who are terminal or have failed standard therapies.

【Keywords】: dna sequencing; personalized cancer therapies; rna sequencing

132. The business impact of deep learning.

Paper Link】 【Pages】:1135

【Authors】: Jeremy Howard

【Abstract】: In the last year deep learning has gone from being a special purpose machine learning technique used mainly for image and speech recognition, to becoming a general purpose machine learning tool. This has broad implications for all organizations that rely on data analysis. It represents the latest development in a general trend towards more automated algorithms, and away from domain specific knowledge. For organizations that rely on domain expertise for their competitive advantage, this trend could be extremely disruptive. For start-ups interested in entering established markets, this trend could be a major opportunity. This talk will be a non-technical introduction to general-purpose deep learning, and its potential business impact.

【Keywords】: deep learning; machine learning

133. Adaptive adversaries: building systems to fight fraud and cyber intruders.

Paper Link】 【Pages】:1136

【Authors】: Ari Gesher

【Abstract】: Statistical machine learning / knowledge discovery techniques tend to fail when faced with an adaptive adversary attempting to evade detection in the data. Humans do an excellent job of correctly spotting adaptive adversaries given a good way to digest the data. On the other hand, humans are glacially slow and error-prone when it comes to moving through very large volumes of data, a task best left to the machines. Fighting complex fraud and cyber-security threats requires a symbiosis between the computers and teams of human analysts. The computers use algorithmic analysis, heuristics, and/or statistical characterization to find interesting 'simple' patterns in the data. These candidate events are then queued for in-depth human analysis in rich, expressive, interactive analysis environments. In this talk, we'll take a look at case studies of three different systems, using a partnership of automation and human analysis on large scale data to find the clandestine human behavior that these datasets hold, including a discussion of the backend systems architecture and a demo of the interactive analysis environment. The backend systems architecture is a mix of open source technologies, like Cassandra, Lucene, and Hadoop, and some new components that bind them all together. The interactive analysis environment allows seamless pivoting between semantic, geospatial, and temporal analysis with a powerful GUI interface that's usable by non-data scientists. The systems are real systems currently in use by commercial banks, pharmaceutical companies, and governments.

【Keywords】: cyber security; intelligence augmentation; knowledge discovery; machine learning

134. Targeting and influencing at scale: from presidential elections to social good.

Paper Link】 【Pages】:1137

【Authors】: Rayid Ghani

【Abstract】: If you're still recovering from the barrage of ads, news, emails, Facebook posts, and newspaper articles that were giving you the latest poll numbers, asking you to volunteer, donate money, and vote, this talk will give you a look behind the scenes on why you were seeing what you were seeing. I will talk about how machine learning and data mining along with randomized experiments were used to target and influence tens of millions of people. Beyond the presidential elections, these methodologies for targeting and influence have the power to solve big problems in education, healthcare, energy, transportation, and related areas. I will talk about some recent work we're doing at the University of Chicago Data Science for Social Good summer fellowship program working with non-profits and government organizations to tackle some of these challenges.

【Keywords】: randomized experimentation; social networks; targeting

135. Hadoop: a view from the trenches.

Paper Link】 【Pages】:1138

【Authors】: Milind Bhandarkar

【Abstract】: From its beginnings as a framework for building web crawlers for small-scale search engines to being one of the most promising technologies for building datacenter-scale distributed computing and storage platforms, Apache Hadoop has come far in the last seven years. In this talk I will reminisce about the early days of Hadoop, give an overview of the current state of the Hadoop ecosystem, and present some real-world use cases of this open source platform. I will conclude with some crystal gazing into the future of Hadoop and associated technologies.

【Keywords】: hadoop; open source

136. Cyber security: how visual analytics unlock insight.

Paper Link】 【Pages】:1139

【Authors】: Raffael Marty

【Abstract】: In the Cyber Security domain, we have been collecting 'big data' for almost two decades. The volume and variety of our data is extremely large, but understanding and capturing the semantics of the data is even more of a challenge. Finding the needle in the proverbial haystack has been attempted from many different angles. In this talk we will have a look at what approaches have been explored, what has worked, and what has not. We will see that there is still a large amount of work to be done and data mining is going to play a central role. We'll try to motivate that in order to successfully find bad guys, we will have to embrace a solution that not only leverages clever data mining, but employs the right mix between human computer interfaces, data mining, and scalable data platforms. Traditionally, cyber security has had its challenges with data mining. We are different. We will explore how to adopt data mining algorithms to the security domain. Some approaches like predictive analytics are extremely hard, if not impossible. How would you predict the next cyber attack? Others need to be tailored to the security domain to make them work. Visualization and visual analytics seem to be extremely promising to solve cyber security issues. Situational awareness, large-scale data exploration, knowledge capture, and forensic investigations are four top use-cases we will discuss. Visualization alone, however, does not solve security problems. We need algorithms that support the visualizations, for example to reduce the amount of data, in both volume and semantics, so an analyst can deal with it.

【Keywords】: cyber security; data visualization

137. Using "big data" to solve "small data" problems.

Paper Link】 【Pages】:1140

【Authors】: Chris Neumann

【Abstract】: The brief history of knowledge discovery is filled with products that promised to bring "BI to the masses". But how do you build a product that truly bridges the gap between the conceptual simplicity of "questions and answers" and the structure needed to query traditional data stores? In this talk, Chris Neumann will discuss how DataHero applied the principles of user-centric design and development over a year and a half to create a product with which more than 95% of new users can get answers on their first attempt. He'll demonstrate the process DataHero uses to determine the best combination of algorithms and user interface concepts needed to create intuitive solutions to potentially complex interactions, including: determining the structure of files uploaded by users; accurately identifying data types within files; presenting users with an optimal visualization for any combination of data; and helping users to ask questions of data when they don't know what to do. Chris will also talk about what it's like to start a "Big Data" company and how he applied lessons from his time as the first engineer at Aster Data Systems to DataHero.

【Keywords】: analytics; big data; data mining

Industrial and government deployed 14

138. Financing lead triggers: empowering sales reps through knowledge discovery and fusion.

Paper Link】 【Pages】:1141-1149

【Authors】: Kareem S. Aggour ; Bethany Hoogs

【Abstract】: Sales representatives must have access to meaningful and actionable intelligence about potential customers to be effective in their roles. Historically, GE Capital Americas sales reps identified leads by manually searching through news reports and financial statements either in print or online. Here we describe a system built to automate the collection and aggregation of information on companies, which is then mined to identify actionable sales leads. The Financing Lead Triggers system is comprised of three core components that perform information fusion, knowledge discovery and information visualization. Together these components extract raw data from disparate sources, fuse that data into information, and then automatically mine that information for actionable sales leads driven by a combination of expert-defined and statistically derived triggers. A web-based interface provides sales reps access to the company information and sales leads in a single location. The use of the Lead Triggers system has significantly improved the performance of the sales reps, providing them with actionable intelligence that has improved their productivity by 30-50%. In 2010, Lead Triggers provided leads on opportunities that represented over $44B in new deal commitments for GE Capital.

【Keywords】: commercial finance; information fusion; knowledge discovery; lead generation; sales leads

Paper Link】 【Pages】:1150-1158

【Authors】: Ye Chen ; Weiguo Liu ; Jeonghee Yi ; Anton Schwaighofer ; Tak W. Yan

【Abstract】: In sponsored search auctions, the auctioneer operates the marketplace by setting a number of auction parameters such as reserve prices for the task of auction optimization. The auction parameters may be set for each individual keyword, but the optimization problem becomes intractable since the number of keywords is in the millions. To reduce the dimensionality and generalize well, one wishes to cluster keywords or queries into meaningful groups, and set parameters at the keyword-cluster level. For auction optimization, keywords shall be deemed interchangeable commodities with respect to their valuations from advertisers, represented as bid distributions or landscapes. Clustering keywords for auction optimization shall thus be based on their bid distributions. In this paper we present a formalism of clustering probability distributions, and its application to query clustering where each query is represented as a probability density of click-through rate (CTR) weighted bid and distortion is measured by KL divergence. We first derive a k-means variant for clustering Gaussian densities, which have a closed-form KL divergence. We then develop an algorithm for clustering Gaussian mixture densities, which generalize a single Gaussian and are typically a more realistic parametric assumption for real-world data. The KL divergence between Gaussian mixture densities is no longer analytically tractable; hence we derive a variational EM algorithm that minimizes an upper bound of the total within-cluster KL divergence. The clustering algorithm has been deployed successfully into production, yielding significant improvement in revenue and clicks over the existing production system. While motivated by the specific setting of query clustering, the proposed clustering method is generally applicable to many real-world applications where an example is better characterized by a distribution than a finite-dimensional feature vector in Euclidean space as in the classical k-means.
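For the univariate Gaussian case the KL divergence has the closed form log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2, and the centroid minimizing the within-cluster KL(member || centroid) is moment-matched. A sketch with hypothetical bid landscapes (the paper's Gaussian-mixture and variational extensions are not shown):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def kmeans_gauss(params, k, iters=20):
    """k-means-style clustering of Gaussian densities under KL distortion.
    params: list of (mu, sigma) per query; first k densities seed the centroids."""
    cents = list(params[:k])
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: kl_gauss(mu, s, *cents[j]))
                  for mu, s in params]
        for j in range(k):
            members = [p for p, l in zip(params, labels) if l == j]
            if members:
                # exact centroid under sum of KL(member || centroid):
                # mean of means; variance = mean of (sigma^2 + (mu - mu_c)^2)
                mu = sum(m for m, _ in members) / len(members)
                s2 = sum(s * s + (m - mu) ** 2 for m, s in members) / len(members)
                cents[j] = (mu, math.sqrt(s2))
    return labels, cents

# hypothetical CTR-weighted bid landscapes: two obvious groups of queries
params = [(1.0, 0.2), (5.0, 0.3), (1.1, 0.2),
          (5.2, 0.3), (0.9, 0.25), (4.9, 0.35)]
labels, cents = kmeans_gauss(params, k=2)
```

The centroid update follows from setting the derivatives of the summed KL with respect to the centroid's mean and variance to zero.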

【Keywords】: auction; bayesian methods; clustering; optimization; sponsored search

140. Analysis of advanced meter infrastructure data of water consumption in apartment buildings.

Paper Link】 【Pages】:1159-1167

【Authors】: Einat Kermany ; Hanna Mazzawi ; Dorit Baras ; Yehuda Naveh ; Hagai Michaelis

【Abstract】: We present our experience of using machine learning techniques over data originating from advanced meter infrastructure (AMI) systems for water consumption in a medium-size city. We focus on two new use cases that are of special importance to city authorities. One use case is the automatic identification of malfunctioning meters, with a focus on distinguishing them from legitimate non-consumption such as during periods when the household residents are on vacation. The other use case is the identification of leaks or theft in the unmetered common areas of apartment buildings. These two use cases are highly important to city authorities both because of the lost revenue they imply and because of the hassle to the residents in cases of delayed identification. Both cases are inherently complex to analyze and require advanced data mining techniques in order to achieve high levels of correct identification. Our results provide for faster and more accurate detection of malfunctioning meters as well as leaks in the common areas. This results in significant tangible value to the authorities in terms of increase in technician efficiency and a decrease in the amount of wasted, non-revenue, water.

【Keywords】: advanced meter infrastructure; leaks; machine learning; malfunction; water

141. Online controlled experiments at large scale.

Paper Link】 【Pages】:1168-1176

【Authors】: Ron Kohavi ; Alex Deng ; Brian Frasca ; Toby Walker ; Ya Xu ; Nils Pohlmann

【Abstract】: Web-facing companies, including Amazon, eBay, Etsy, Facebook, Google, Groupon, Intuit, LinkedIn, Microsoft, Netflix, Shop Direct, StumbleUpon, Yahoo, and Zynga use online controlled experiments to guide product development and accelerate innovation. At Microsoft's Bing, the use of controlled experiments has grown exponentially over time, with over 200 concurrent experiments now running on any given day. Running experiments at large scale requires addressing multiple challenges in three areas: cultural/organizational, engineering, and trustworthiness. On the cultural and organizational front, the larger organization needs to learn the reasons for running controlled experiments and the tradeoffs between controlled experiments and other methods of evaluating ideas. We discuss why negative experiments, which degrade the user experience short term, should be run, given the learning value and long-term benefits. On the engineering side, we architected a highly scalable system, able to handle data at massive scale: hundreds of concurrent experiments, each containing millions of users. Classical testing and debugging techniques no longer apply when there are billions of live variants of the site, so alerts are used to identify issues rather than relying on heavy up-front testing. On the trustworthiness front, we address the high occurrence of false positives, and we alert experimenters to statistical interactions between experiments. The Bing Experimentation System is credited with having accelerated innovation and increased annual revenues by hundreds of millions of dollars, by allowing us to find and focus on key ideas evaluated through thousands of controlled experiments. A 1% improvement to revenue equals more than $10M annually in the US, yet many ideas impact key metrics by 1% and are not well estimated a priori. The system has also identified many negative features that we avoided deploying, despite key stakeholders' early excitement, saving us similar large amounts.
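To see why 1% metric movements are hard to detect, consider a standard two-proportion z-test (an illustrative calculation, not Bing's actual analysis pipeline): even with a million users per arm, a 1% relative lift on a 5% conversion rate falls short of the usual 1.96 significance threshold.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for H0: p_a == p_b, using the pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# a 1% relative lift on a 5% conversion rate, one million users per arm
z = two_proportion_z(50_000, 1_000_000, 50_500, 1_000_000)   # z ≈ 1.62 < 1.96
```

This is one reason running thousands of experiments, with alerting for false positives, matters more than trying to estimate each idea's impact up front.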

【Keywords】: a/b testing; controlled experiments; randomized experiments

142. iHR: an online recruiting system for Xiamen Talent Service Center.

Paper Link】 【Pages】:1177-1185

【Authors】: Wenxing Hong ; Lei Li ; Tao Li ; Wenfu Pan

【Abstract】: Online recruiting systems have gained immense attention in the wake of more and more job seekers searching jobs and enterprises finding candidates on the Internet. A critical problem in a recruiting system is how to maximally satisfy the desires of both job seekers and enterprises with reasonable recommendations or search results. In this paper, we investigate and compare various online recruiting systems from a product perspective. We then point out several key functions that help achieve a win-win situation between job seekers and enterprises for a successful recruiting system. Based on the observations and key functions, we design, implement and deploy a web-based recruiting system, named iHR, for Xiamen Talent Service Center. The system utilizes the latest advances in data mining and recommendation technologies to create a user-oriented service for a broad audience in the job-market community. Empirical evaluation and online user studies demonstrate the efficacy and effectiveness of our proposed system. Currently, iHR has been deployed at http://i.xmrc.com.cn/XMRCIntel.

【Keywords】: bilateral recommendation; job matching system; job recommendation; reciprocal recommendation

143. Dynamic memory allocation policies for postings in real-time Twitter search.

Paper Link】 【Pages】:1186-1194

【Authors】: Nima Asadi ; Jimmy Lin ; Michael Busch

【Abstract】: We explore a real-time Twitter search application where tweets are arriving at a rate of several thousands per second. Real-time search demands that they be indexed and searchable immediately, which leads to a number of implementation challenges. In this paper, we focus on one aspect: dynamic postings allocation policies for index structures that are completely held in main memory. The core issue can be characterized as a "Goldilocks Problem". Because memory remains a scarce resource today, an allocation policy that is too aggressive leads to inefficient utilization, while a policy that is too conservative is slow and leads to fragmented postings lists. We present a dynamic postings allocation policy that allocates memory in increasingly-larger "slices" from a small number of large, fixed pools of memory. With an analytical model and experiments, we explore different settings that balance time (query evaluation speed) and space (memory utilization).
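The "increasingly-larger slices" policy can be sketched as follows (the slice sizes here are assumed for illustration; the paper explores the actual settings):

```python
SLICE_SIZES = [2, 16, 128, 1024]   # postings per slice, one pool per size (assumed)

class PostingsList:
    """Grow a term's postings through progressively larger fixed-size slices:
    rare terms stay in tiny slices, frequent terms graduate to big ones."""
    def __init__(self):
        self.slices = []            # each slice is a fixed-capacity list
        self.level = -1             # index into SLICE_SIZES

    def add(self, posting):
        if not self.slices or len(self.slices[-1]) == SLICE_SIZES[self.level]:
            # current slice is full: allocate the next, larger slice size
            self.level = min(self.level + 1, len(SLICE_SIZES) - 1)
            self.slices.append([])
        self.slices[-1].append(posting)

    def __iter__(self):
        for s in self.slices:
            yield from s

pl = PostingsList()
for i in range(21):                 # 21 postings fill a 2-slice, a 16-slice,
    pl.add(i)                       # and part of a 128-slice
```

The trade-off the abstract describes is visible here: small first slices avoid wasting memory on rare terms, while the growth schedule keeps frequent terms from fragmenting across many slices.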

【Keywords】: inverted indexing; memory allocation

Paper Link】 【Pages】:1195-1203

【Authors】: Luo Jie ; Sudarshan Lamkhede ; Rochit Sapra ; Evans Hsu ; Helen Song ; Yi Chang

【Abstract】: Today's popular web search engines expand the search process beyond crawled web pages to specialized corpora ("verticals") like images, videos, news, local, sports, finance, shopping etc., each with its own specialized search engine. Search federation deals with problems of the selection of search engines to query and merging of their results into a single result set. Despite a few recent advances, the problem is still very challenging. First, due to the heterogeneous nature of different verticals, how the system merges the vertical results with the web documents to serve the user's information need is still an open problem. Moreover, the scale of the search engine and the increasing number of vertical properties require a solution that is efficient and scalable. In this paper, we propose a unified framework for the search federation problem. We model search federation as a contextual bandit problem. The system uses reward as a proxy for user satisfaction. Given a query, our system predicts the expected reward for each vertical, then organizes the search result page (SERP) in a way which maximizes the total reward. Instead of relying on human judges, our system leverages implicit user feedback to learn the model. The method is efficient to implement and can be applied to verticals of different nature. We have successfully deployed the system to three different markets, and it handles multiple verticals in each market. The system is now serving hundreds of millions of queries live each day, and has improved user metrics considerably.
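As a minimal sketch of the bandit framing (hypothetical vertical names; the deployed system predicts reward from query context rather than keeping a global mean), an epsilon-greedy selector over verticals might look like:

```python
import random

class VerticalSelector:
    """Epsilon-greedy sketch: track a running mean reward per vertical,
    exploit the best one, and explore uniformly with probability eps."""
    def __init__(self, verticals, eps=0.1, seed=0):
        self.eps = eps
        self.rng = random.Random(seed)
        self.n = {v: 0 for v in verticals}       # times each vertical was shown
        self.mean = {v: 0.0 for v in verticals}  # running mean observed reward

    def choose(self):
        if self.rng.random() < self.eps:
            return self.rng.choice(sorted(self.n))   # explore
        return max(self.mean, key=self.mean.get)     # exploit

    def update(self, v, reward):
        self.n[v] += 1
        self.mean[v] += (reward - self.mean[v]) / self.n[v]

# simulation: implicit feedback (e.g. clicks) rewards only the "news" vertical
sel = VerticalSelector(["news", "images", "local"], eps=0.1, seed=0)
for _ in range(1000):
    v = sel.choose()
    sel.update(v, 1.0 if v == "news" else 0.0)
```

The exploration term plays the role of the implicit-feedback collection the abstract mentions: without it, a vertical that starts with a low estimate would never be shown again.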

【Keywords】: information retrieval; machine learning; search federation

145. Amplifying the voice of youth in Africa via text analytics.

Paper Link】 【Pages】:1204-1212

【Authors】: Prem Melville ; Vijil Chenthamarakshan ; Richard D. Lawrence ; James Powell ; Moses Mugisha ; Sharad Sapra ; Rajesh Anandan ; Solomon Assefa

【Abstract】: U-report is an open-source SMS platform operated by UNICEF Uganda, designed to give community members a voice on issues that impact them. Data received by the system are either SMS responses to a poll conducted by UNICEF, or unsolicited reports of a problem occurring within the community. There are currently 200,000 U-report participants, and they send up to 10,000 unsolicited text messages a week. The objective of the program in Uganda is to understand the data in real-time, and have issues addressed by the appropriate department in UNICEF in a timely manner. Given the high volume and velocity of the data streams, manual inspection of all messages is no longer sustainable. This paper describes an automated message-understanding and routing system deployed by IBM at UNICEF. We employ recent advances in data mining to get the most out of labeled training data, while incorporating domain knowledge from experts. We discuss the trade-offs, design choices and challenges in applying such techniques in a real-world deployment.

【Keywords】: machine learning; text classification

146. Scalable supervised dimensionality reduction using clustering.

Paper Link】 【Pages】:1213-1221

【Authors】: Troy Raeder ; Claudia Perlich ; Brian Dalessandro ; Ori Stitelman ; Foster J. Provost

【Abstract】: The automated targeting of online display ads at scale requires the simultaneous evaluation of a single prospect against many independent models. When deciding which ad to show to a user, one must calculate likelihood-to-convert scores for that user across all potential advertisers in the system. For modern machine-learning-based targeting, as conducted by Media6Degrees (M6D), this can mean scoring against thousands of models in a large, sparse feature space. Dimensionality reduction within this space is useful, as it decreases scoring time and model storage requirements. To meet this need, we develop a novel algorithm for scalable supervised dimensionality reduction across hundreds of simultaneous classification tasks. The algorithm performs hierarchical clustering in the space of model parameters from historical models in order to collapse related features into a single dimension. This allows us to implicitly incorporate feature and label data across all tasks without operating directly in a massive space. We present experimental results showing that for this task our algorithm outperforms other popular dimensionality-reduction algorithms across a wide variety of ad campaigns, as well as production results that showcase its performance in practice.
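The core idea, clustering features by how their coefficients behave across historical models, can be sketched as follows (hypothetical data; the paper uses hierarchical clustering, while this sketch substitutes a small k-means with deterministic farthest-point initialisation):

```python
import numpy as np

def _farthest_init(X, k):
    """Deterministic farthest-point initialisation for the centroids."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)

def cluster_features(coef, k, iters=20):
    """coef: (n_tasks, n_features) coefficients from historical models.
    Cluster the columns (features) by their behaviour across tasks; each
    cluster of features can then be collapsed into a single dimension."""
    X = coef.T.astype(float)                 # one row per feature
    centers = _farthest_init(X, k)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# hypothetical coefficients for 2 historical campaigns over 6 features:
# features 0-2 behave alike, features 3-5 behave alike
coef = np.array([[1.0, 1.1, 0.9, -1.0, -1.1, -0.9],
                 [1.0, 0.9, 1.1, -1.0, -0.9, -1.1]])
labels = cluster_features(coef, k=2)
```

Collapsing each feature cluster (e.g. summing its raw values) then yields a k-dimensional representation shared by all campaigns, which is the scoring-time saving the abstract describes.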

【Keywords】: clustering; supervised dimensionality reduction

147. Ad click prediction: a view from the trenches.

Paper Link】 【Pages】:1222-1230

【Authors】: H. Brendan McMahan ; Gary Holt ; David Sculley ; Michael Young ; Dietmar Ebner ; Julian Grady ; Lan Nie ; Todd Phillips ; Eugene Davydov ; Daniel Golovin ; Sharat Chikkerur ; Dan Liu ; Martin Wattenberg ; Arnar Mar Hrafnkelsson ; Tom Boulos ; Jeremy Kubica

【Abstract】: Predicting ad click-through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system. These include improvements in the context of traditional supervised learning based on an FTRL-Proximal online learning algorithm (which has excellent sparsity and convergence properties) and the use of per-coordinate learning rates. We also explore some of the challenges that arise in a real-world system that may appear at first to be outside the domain of traditional machine learning research. These include useful tricks for memory savings, methods for assessing and visualizing performance, practical methods for providing confidence estimates for predicted probabilities, calibration methods, and methods for automated management of features. Finally, we also detail several directions that did not turn out to be beneficial for us, despite promising results elsewhere in the literature. The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system.

【Keywords】: data mining; large-scale learning; online advertising
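A minimal per-coordinate FTRL-Proximal learner for logistic loss, sketching the algorithm family named in the abstract; the hyperparameters (alpha, beta, l1, l2) are illustrative choices, not the production settings:

```python
import numpy as np

class FTRLProximal:
    def __init__(self, d, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(d)    # accumulated adjusted gradients
        self.n = np.zeros(d)    # accumulated squared gradients (per-coordinate rates)

    def weights(self):
        w = np.zeros_like(self.z)
        active = np.abs(self.z) > self.l1      # the L1 term induces exact zeros
        denom = (self.beta + np.sqrt(self.n)) / self.alpha + self.l2
        w[active] = -(self.z[active] - np.sign(self.z[active]) * self.l1) / denom[active]
        return w

    def update(self, x, y):                    # y in {0, 1}
        w = self.weights()
        p = 1.0 / (1.0 + np.exp(-x @ w))
        g = (p - y) * x                        # gradient of the logistic loss
        sigma = (np.sqrt(self.n + g * g) - np.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g * g
        return p

# Train on a synthetic click stream where only the first two features matter.
rng = np.random.default_rng(1)
w_true = np.array([3.0, -3.0] + [0.0] * 8)
model = FTRLProximal(d=10)
for _ in range(2000):
    x = rng.normal(size=10)
    y = float(rng.random() < 1.0 / (1.0 + np.exp(-x @ w_true)))
    model.update(x, y)
print(model.weights())
```

The `sigma` term is what makes the update equivalent to FTRL with a per-coordinate adaptive learning rate, while the L1 threshold keeps the stored model sparse.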

148. Modeling and probabilistic reasoning of population evacuation during large-scale disaster.

【Paper Link】 【Pages】:1231-1239

【Authors】: Xuan Song ; Quanshi Zhang ; Yoshihide Sekimoto ; Teerayut Horanont ; Satoshi Ueyama ; Ryosuke Shibasaki

【Abstract】: The Great East Japan Earthquake and the Fukushima nuclear accident caused large human population movements and evacuations. Understanding and predicting these movements is critical for planning effective humanitarian relief, disaster management, and long-term societal reconstruction. In this paper, we construct a large human mobility database that stores and manages GPS records from mobile devices used by approximately 1.6 million people throughout Japan from 1 August 2010 to 31 July 2011. By mining this enormous set of Auto-GPS mobile sensor data, the short-term and long-term evacuation behaviors of individuals throughout Japan during this disaster can be automatically discovered. To better understand and simulate human mobility during disasters, we develop a probabilistic model that can be effectively trained on the discovered evacuations via machine learning techniques. Based on our trained model, population mobility in various cities impacted by the disasters throughout the country can be automatically simulated or predicted. On the basis of the whole database, the developed model, and the experimental results, we can identify new features and population mobility patterns that emerged after the recent severe earthquake, tsunami, and release of radioactivity in Japan, which are likely to play a vital role in future disaster relief and management worldwide.

【Keywords】: data mining; disaster informatics; human mobility

149. Using co-visitation networks for detecting large scale online display advertising exchange fraud.

【Paper Link】 【Pages】:1240-1248

【Authors】: Ori Stitelman ; Claudia Perlich ; Brian Dalessandro ; Rod Hook ; Troy Raeder ; Foster J. Provost

【Abstract】: Data generated by observing the actions of web browsers across the internet is being used at an ever increasing rate for both building models and making decisions. In fact, a quarter of the industry-track papers for KDD in 2012 were based on data generated by online actions. The models, analytics and decisions they inform all stem from the assumption that observed data captures the intent of users. However, a large portion of these observed actions are not intentional, and are effectively polluting the models. Much of this observed activity is either generated by robots traversing the internet or the result of unintended actions of real users. These non-intentional actions observed in the web logs severely bias both analytics and the models created from the data. In this paper, we will show examples of how non-intentional traffic that is produced by fraudulent activities adversely affects both general analytics and predictive models, and propose an approach using co-visitation networks to identify sites that have large amounts of this fraudulent traffic. We will then show how this approach, along with a second stage classifier that identifies non-intentional traffic at the browser level, is deployed in production at Media6Degrees (m6d), a targeting technology company for display advertising. This deployed product acts both to filter out the fraudulent traffic from the input data and to ensure that we don't serve ads during unintended website visits.

【Keywords】: advertising exchanges; fraud detection
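A simplified sketch of the co-visitation idea (not the paper's exact statistic): sites that share an abnormally high fraction of their visitors are suspicious, since bot networks drag the same browsers through many colluding sites.

```python
from itertools import combinations

visits = {   # site -> set of browser cookie IDs (synthetic data)
    "news.example":   {f"b{i}" for i in range(100)},
    "blog.example":   {f"b{i}" for i in range(50, 150)},
    "fraudA.example": {f"bot{i}" for i in range(40)},
    "fraudB.example": {f"bot{i}" for i in range(40)},
}

def covisit(a, b):
    """Fraction of the smaller site's visitors also seen on the other site."""
    return len(visits[a] & visits[b]) / min(len(visits[a]), len(visits[b]))

flagged = {s for a, b in combinations(visits, 2) if covisit(a, b) > 0.9
           for s in (a, b)}
print(sorted(flagged))
```

Legitimate sites overlap only partially (the two `.example` content sites share 50% of visitors), while the two synthetic fraud sites share an identical bot population and are flagged.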

150. An integrated framework for optimizing automatic monitoring systems in large IT infrastructures.

【Paper Link】 【Pages】:1249-1257

【Authors】: Liang Tang ; Tao Li ; Larisa Shwartz ; Florian Pinel ; Genady Grabarnik

【Abstract】: The competitive business climate and the complexity of IT environments dictate efficient and cost-effective service delivery and support of IT services. These are largely achieved by automating routine maintenance procedures, including problem detection, determination and resolution. System monitoring provides an effective and reliable means for problem detection. Coupled with automated ticket creation, it ensures that a degradation of the vital signs, defined by acceptable thresholds or monitoring conditions, is flagged as a problem candidate and sent to supporting personnel as an incident ticket. This paper describes an integrated framework for minimizing false positive tickets and maximizing the monitoring coverage for system faults. In particular, the integrated framework defines monitoring conditions and the optimal corresponding delay times based on an off-line analysis of historical alerts and incident tickets. Potential monitoring conditions are built on a set of predictive rules which are automatically generated by a rule-based learning algorithm with coverage, confidence and rule complexity criteria. These conditions and delay times are propagated as configurations into run-time monitoring systems. Moreover, some misconfigured monitoring conditions can be corrected based on false negative tickets discovered by a text classification algorithm in the framework. This paper also provides implementation details of a program product that uses this framework and shows some illustrative examples of successful results.

【Keywords】: system monitoring; ticket analysis
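The "delay time" idea above can be sketched in a few lines: an alert condition must persist for `delay` consecutive polls before a ticket is opened, which suppresses transient spikes that would otherwise become false positive tickets. (Illustrative only; the paper learns the conditions and delays from historical data.)

```python
def should_ticket(readings, threshold, delay):
    """Open a ticket only if the threshold is violated `delay` polls in a row."""
    run = 0
    for r in readings:
        run = run + 1 if r > threshold else 0
        if run >= delay:
            return True
    return False

transient  = [10, 95, 10, 10, 96, 10]   # brief CPU spikes
persistent = [10, 95, 96, 97, 98, 99]   # sustained violation
print(should_ticket(transient, 90, 3),  # spikes never persist long enough
      should_ticket(persistent, 90, 3))
```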

151. Improving quality control by early prediction of manufacturing outcomes.

【Paper Link】 【Pages】:1258-1266

【Authors】: Sholom M. Weiss ; Amit Dhurandhar ; Robert J. Baseman

【Abstract】: We describe methods for continual prediction of manufactured product quality prior to final testing. In our most expansive modeling approach, an estimated final characteristic of a product is updated after each manufacturing operation. Our initial application is for the manufacture of microprocessors, and we predict final microprocessor speed. Using these predictions, early corrective manufacturing actions may be taken to increase the speed of expected slow wafers (a collection of microprocessors) or reduce the speed of fast wafers. Such predictions may also be used to initiate corrective supply chain management actions. Developing statistical learning models for this task has many complicating factors: (a) a temporally unstable population, (b) missing data resulting from sparsely sampled measurements, and (c) relatively few available measurements prior to corrective action opportunities. In a real manufacturing pilot application, our automated models selected 125 fast wafers in real time. As predicted, those wafers were significantly faster than average. During manufacture, downstream corrective processing restored 25 nominally unacceptable wafers to normal operation.

【Keywords】: manufacturing; prediction; quality control

Industrial and government discovery 4

152. A data mining driven risk profiling method for road asset management.

【Paper Link】 【Pages】:1267-1275

【Authors】: Daniel Emerson ; Justin Weligamage ; Richi Nayak

【Abstract】: Road surface skid resistance has been shown to have a strong relationship to road crash risk; however, the current method of using investigatory levels to identify crash-prone roads is problematic, as it may fail to identify risky roads outside of the norm. The proposed method analyses a complex and formerly impenetrable volume of road and crash data using data mining. This method rapidly identifies roads with elevated crash rates, potentially due to skid resistance deficit, for investigation. A hypothetical skid resistance/crash risk curve is developed for each road segment, driven by the model deployed in a novel regression tree extrapolation method. The method potentially solves the problem of missing skid resistance values that occurs during network-wide crash analysis, and allows risk assessment of the major proportion of roads without skid resistance values.

【Keywords】: data mining; missing data; model deployment; risk management; road asset management; skid resistance

153. Why people hate your app: making sense of user feedback in a mobile app store.

【Paper Link】 【Pages】:1276-1284

【Authors】: Bin Fu ; Jialiu Lin ; Lei Li ; Christos Faloutsos ; Jason I. Hong ; Norman M. Sadeh

【Abstract】: User reviews are a crucial component of open mobile app markets such as the Google Play Store. How do we automatically summarize millions of user reviews and make sense of them? Unfortunately, beyond simple summaries such as histograms of user ratings, there are few analytic tools that can provide insights into user reviews. In this paper, we propose Wiscom, a system that can analyze tens of millions of user ratings and comments in mobile app markets at three different levels of detail. Our system is able to (a) discover inconsistencies in reviews; (b) identify reasons why users like or dislike a given app, and provide an interactive, zoomable view of how users' reviews evolve over time; and (c) provide valuable insights into the entire app market, identifying users' major concerns and preferences across different types of apps. Results using our techniques are reported on a 32GB dataset consisting of over 13 million user reviews of 171,493 Android apps in the Google Play Store. We discuss how the techniques presented herein can be deployed to help a mobile app market operator such as Google, as well as individual app developers and end-users.

【Keywords】: mobile app market; sentiment analysis; text mining; topic model; user rating and comments

154. Towards long-lead forecasting of extreme flood events: a data mining framework for precipitation cluster precursors identification.

【Paper Link】 【Pages】:1285-1293

【Authors】: Dawei Wang ; Wei Ding ; Kui Yu ; Xindong Wu ; Ping Chen ; David L. Small ; Shafiqul Islam

【Abstract】: The development of disastrous flood forecasting techniques able to provide warnings at a long lead time (5-15 days) is of great importance to society. An extreme flood is usually a consequence of a sequence of precipitation events occurring over several days to several weeks. Though precise short-term forecasting of the magnitude and extent of individual precipitation events is still beyond our reach, long-term forecasting of precipitation clusters can be attempted by identifying persistent atmospheric regimes that are conducive to the precipitation clusters. However, such forecasting suffers from an overwhelming number of relevant features and high imbalance of sample sets. In this paper, we propose an integrated data mining framework for identifying the precursors to precipitation event clusters and use this information to predict extended periods of extreme precipitation and subsequent floods. We synthesize a representative feature set that describes the motion of the atmosphere, and apply a streaming feature selection algorithm to identify precipitation precursors online from the enormous feature space. A hierarchical re-sampling approach is embedded in the framework to deal with the imbalance problem. An extensive empirical study is conducted on historical precipitation and associated flood data collected in the State of Iowa. Utilizing our framework, a few physically meaningful precipitation cluster precursor sets are identified from millions of features. More than 90% of extreme precipitation events are captured by the proposed prediction model using precipitation cluster precursors with a lead time of more than 5 days.

【Keywords】: flood forecasting; online streaming feature selection; spatial-temporal data mining

155. Predictive model performance: offline and online evaluations.

【Paper Link】 【Pages】:1294-1302

【Authors】: Jeonghee Yi ; Ye Chen ; Jie Li ; Swaraj Sett ; Tak W. Yan

【Abstract】: We study the accuracy of evaluation metrics used to estimate the efficacy of predictive models. Offline evaluation metrics are indicators of the expected model performance on real data. However, in practice we often experience substantial discrepancy between the offline and online performance of the models. We investigate the characteristics and behaviors of the evaluation metrics in offline and online testing, both analytically and empirically, by experimenting with them on online advertising data from the Bing search engine. One of our findings is that some offline metrics, like AUC (the Area Under the Receiver Operating Characteristic Curve) and RIG (Relative Information Gain), that summarize the model performance over the entire spectrum of operating points can sometimes be quite misleading, resulting in significant discrepancies between offline and online metrics. For example, for click prediction models for search advertising, errors in predictions in the very low range of predicted click scores impact online performance much more negatively than errors in other regions. Most of the offline metrics we studied, including AUC and RIG, however, are insensitive to such model behavior. We designed a new model evaluation paradigm that simulates the online behavior of predictive models. For a set of ads selected by a new prediction model, the online user behavior is estimated from the historic user behavior in the search logs. The experimental results on a click prediction model for search advertising are highly promising.

【Keywords】: auc; click prediction; log-likelihood; model evaluation metric; offline evaluation; online advertising; online evaluation; prediction error; rig; simulated metric; sponsored search
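The AUC insensitivity the abstract describes is easy to demonstrate: AUC depends only on the ranking of scores, so two models that order ads identically get exactly the same AUC even if one grossly miscalibrates the low range of predicted click probabilities. (Toy data, not Bing's.)

```python
import numpy as np

def auc(y, s):
    """Mann-Whitney form of the ROC AUC."""
    pos, neg = s[y == 1], s[y == 0]
    diffs = pos[:, None] - neg[None, :]
    return ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / (len(pos) * len(neg))

y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p = np.array([0.01, 0.02, 0.10, 0.20, 0.30, 0.60, 0.70, 0.90])
q = p ** 3          # same ranking, badly miscalibrated low scores
print(auc(y, p), auc(y, q))   # identical, despite very different calibration
```

Any metric that is invariant under monotone transforms of the scores, as AUC is, cannot detect calibration errors concentrated in one score region.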

Industrial and government emerging 16

156. Uncertainty in online experiments with dependent data: an evaluation of bootstrap methods.

【Paper Link】 【Pages】:1303-1311

【Authors】: Eytan Bakshy ; Dean Eckles

【Abstract】: Many online experiments exhibit dependence between users and items. For example, in online advertising, observations that have a user or an ad in common are likely to be associated. Because of this, even in experiments involving millions of subjects, the difference in mean outcomes between control and treatment conditions can have substantial variance. Previous theoretical and simulation results demonstrate that not accounting for this kind of dependence structure can result in confidence intervals that are too narrow, leading to inaccurate hypothesis tests. We develop a framework for understanding how dependence affects uncertainty in user-item experiments and evaluate how bootstrap methods that account for differing levels of dependence perform in practice. We use three real datasets describing user behaviors on Facebook - user responses to ads, search results, and News Feed stories - to generate data for synthetic experiments in which there is no effect of the treatment on average by design. We then estimate empirical Type I error rates for each bootstrap method. Accounting for dependence within a single type of unit (i.e., within-user dependence) is often sufficient to get reasonable error rates. But when experiments have effects, as one might expect in the field, accounting for multiple units with a multiway bootstrap can be necessary to get close to the advertised Type I error rates. This work provides guidance to practitioners evaluating large-scale experiments, and highlights the importance of analysis of inferential methods for complex dependence structures common to online experiments.

【Keywords】: a/a tests; a/b testing; bootstrapping; field experiments; random effects; statistical inference; user-item data
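A one-way (per-user) bootstrap can be sketched as follows: resample whole users rather than individual observations, so within-user dependence is preserved. (Synthetic data; the paper's multiway bootstrap additionally resamples items.)

```python
import numpy as np

rng = np.random.default_rng(7)
n_users, obs_per_user = 200, 20
user_effect = rng.normal(0, 1, n_users)        # strong within-user dependence
Y = user_effect[:, None] + rng.normal(0, 0.1, (n_users, obs_per_user))
y = Y.ravel()

def boot_se(per_user, reps=500):
    means = []
    for _ in range(reps):
        if per_user:                           # cluster (user-level) bootstrap
            means.append(Y[rng.integers(0, n_users, n_users)].mean())
        else:                                  # naive iid bootstrap
            means.append(rng.choice(y, size=y.size).mean())
    return np.std(means)

se_naive, se_user = boot_se(False), boot_se(True)
print(se_naive, se_user)   # the naive SE badly understates the uncertainty
```

Because observations within a user are nearly identical here, the naive iid bootstrap treats 4,000 dependent observations as independent and produces confidence intervals that are far too narrow, exactly the failure mode the paper studies.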

157. Knowledge discovery from massive healthcare claims data.

【Paper Link】 【Pages】:1312-1320

【Authors】: Varun Chandola ; Sreenivas R. Sukumar ; Jack C. Schryver

【Abstract】: The role of big data in addressing the needs of the present healthcare system in the US and the rest of the world has been echoed by government, private, and academic sectors. There has been a growing emphasis on exploring the promise of big data analytics in tapping the potential of the massive healthcare data emanating from private and government health insurance providers. While the domain implications of such collaboration are well known, this type of data has been explored to a limited extent in the data mining community. The objective of this paper is twofold: first, we introduce the emerging domain of "big" healthcare claims data to the KDD community, and second, we describe the successes and challenges that we encountered in analyzing this data using state-of-the-art analytics for massive data. Specifically, we translate the problem of analyzing healthcare data into some of the most well-known analysis problems in the data mining community (social network analysis, text mining, temporal analysis, and higher-order feature construction), and describe how advances within each of these areas can be leveraged to understand the domain of healthcare. Each case study illustrates a unique intersection of data mining and healthcare with a common objective of improving the cost-care ratio by mining for opportunities to improve healthcare operations and reduce what appears to fall under fraud, waste, and abuse.

【Keywords】: fraud detection; healthcare analytics

158. Palette power: enabling visual search through colors.

【Paper Link】 【Pages】:1321-1329

【Authors】: Anurag Bhardwaj ; Atish Das Sarma ; Wei Di ; Raffay Hamid ; Robinson Piramuthu ; Neel Sundaresan

【Abstract】: With the explosion of mobile devices with cameras, online search has moved beyond text to other modalities like images, voice, and writing. For many applications like fashion, image-based search offers a compelling interface compared to text forms by better capturing the visual attributes. In this paper we present a simple and fast search algorithm that uses color as the main feature for building visual search. We show that low-level cues such as color can be used to quantify image similarity and also to discriminate among products with different visual appearances. We demonstrate the effectiveness of our approach through a mobile shopping application (the eBay Fashion App, available at https://itunes.apple.com/us/app/ebay-fashion/id378358380?mt=8; eBay image swatch is the feature indexing millions of real-world fashion images). Our approach outperforms several other state-of-the-art image retrieval algorithms on large-scale image data.

【Keywords】: e-commerce; image search; search engine; visual search
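The general idea behind color-based visual search can be sketched with quantized color histograms compared by histogram intersection (an illustration only, not eBay's production feature pipeline):

```python
import numpy as np

def color_hist(img, bins=4):
    """Quantize each pixel's RGB into a bins^3 histogram, normalized to sum 1."""
    idx = (img // (256 // bins)).reshape(-1, 3)
    h = np.zeros((bins, bins, bins))
    for r, g, b in idx:
        h[r, g, b] += 1
    return h.ravel() / h.sum()

def similarity(a, b):
    """Histogram intersection: 1.0 for identical color distributions."""
    return np.minimum(color_hist(a), color_hist(b)).sum()

red1 = np.full((8, 8, 3), (200, 10, 10), dtype=np.uint8)
red2 = np.full((8, 8, 3), (210, 20, 20), dtype=np.uint8)
blue = np.full((8, 8, 3), (10, 10, 200), dtype=np.uint8)
print(similarity(red1, red2), similarity(red1, blue))
```

Coarse quantization makes the descriptor robust to small shade differences (the two reds land in the same bin) while still discriminating clearly different products.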

159. Heat pump detection from coarse grained smart meter data with positive and unlabeled learning.

【Paper Link】 【Pages】:1330-1338

【Authors】: Hongliang Fei ; Younghun Kim ; Sambit Sahu ; Milind R. Naphade ; Sanjay K. Mamidipalli ; John Hutchinson

【Abstract】: Recent advances in smart metering technology enable utility companies to access a tremendous amount of smart meter data, from which they are eager to gain more insight about their customers. In this paper, we aim to detect electric heat pumps from coarse-grained smart meter data for a heat pump marketing campaign. However, appliance detection is a challenging task, especially given very low granularity and partially labeled or even unlabeled data. Traditional methods install either a high-granularity smart meter or sensors at every appliance, which is either too expensive or requires technical expertise. We propose a novel approach to detect heat pumps that utilizes low-granularity smart meter data, prior sales data, and weather data. In particular, motivated by the characteristics of heat pump consumption patterns, we extract novel features that are highly relevant to heat pump usage from smart meter data and weather data. Under the constraint that only a subset of heat pump users is known, we formalize the problem as positive and unlabeled (PU) classification and apply a biased Support Vector Machine (BSVM) to our extracted features. Our empirical study on a real-world data set demonstrates the effectiveness of our method. Furthermore, our method has been deployed in a real-life setting where the partner electric company runs a targeted campaign for 292,496 customers. Based on the initial feedback, our detection algorithm can successfully detect a substantial number of non-heat-pump users who had been identified as heat pump users by the company's prior algorithm.

【Keywords】: feature extraction; heat pump detection; positive and unlabeled learning; smart meter data mining
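The biased (cost-sensitive) treatment of positive-and-unlabeled data can be sketched as follows: labeled positives get a large misclassification cost, while the unlabeled pool is treated as noisily negative with a small cost. (This sketch uses weighted logistic regression for self-containment; the paper uses a biased SVM, and all data here is synthetic.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=(n, 2))
true_pos = x[:, 0] + x[:, 1] > 0               # hidden ground truth
labeled = true_pos & (rng.random(n) < 0.3)     # only 30% of positives are labeled
s = labeled.astype(float)                      # observed PU labels
c = np.where(labeled, 10.0, 1.0)               # biased per-example costs

w = np.zeros(2)
for _ in range(300):                           # weighted logistic regression
    p = 1 / (1 + np.exp(-x @ w))
    w -= 0.01 * x.T @ (c * (p - s)) / n

scores = x @ w
print(scores[true_pos].mean(), scores[~true_pos].mean())
```

Despite 70% of true positives sitting unlabeled in the "negative" pool, the asymmetric costs let the model rank hidden positives above true negatives.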

160. Empirical bayes model to combine signals of adverse drug reactions.

【Paper Link】 【Pages】:1339-1347

【Authors】: Rave Harpaz ; William DuMouchel ; Paea LePendu ; Nigam H. Shah

【Abstract】: Data mining is a crucial tool for identifying risk signals of potential adverse drug reactions (ADRs). However, mining of ADR signals is currently limited to leveraging a single data source at a time. It is widely believed that combining ADR evidence from multiple data sources will result in a more accurate risk identification system. We present a methodology based on empirical Bayes modeling to combine ADR signals mined from ~5 million adverse event reports collected by the FDA and from healthcare data corresponding to 46 million patients, the two main types of information sources currently employed for signal detection. Based on four sets of test cases (gold standard), we demonstrate that our method leads to a statistically significant and substantial improvement in signal detection accuracy, averaging 40% over the use of each source independently, and an area under the ROC curve of 0.87. We also compare the method with alternative supervised learning approaches, and argue that our approach is preferable as it does not require labeled (training) samples, whose availability is currently limited. To our knowledge, this is the first effort to combine signals from these two complementary data sources, and to demonstrate the benefits of a computationally integrative strategy for drug safety surveillance.

【Keywords】: empirical bayes; pharmacovigilance; signal detection

161. Efficiently rewriting large multimedia application execution traces with few event sequences.

【Paper Link】 【Pages】:1348-1356

【Authors】: Christiane Kamdem Kengne ; Leon Constantin Fopa ; Alexandre Termier ; Noha Ibrahim ; Marie-Christine Rousset ; Takashi Washio ; Miguel Santana

【Abstract】: The analysis of multimedia application traces can reveal important information to enhance program execution comprehension. However, the typical size of traces can be in the gigabytes, which hinders their effective exploitation by application developers. In this paper, we study the problem of finding a set of sequences of events that allows a reduced-size rewriting of the original trace. These sequences of events, which we call blocks, can simplify the exploration of large execution traces by allowing application developers to see an abstraction instead of low-level events. The problem of computing such a set of blocks is NP-hard, and naive approaches lead to prohibitive running times that prevent analysing real-world traces. We propose a novel algorithm that directly mines the set of blocks. Our experiments show that our algorithm can analyse real traces of up to two hours of video. We also show experimentally the quality of the set of blocks proposed, and the interest of the rewriting for understanding actual trace data.

【Keywords】: combinatorial optimization; execution traces; multimedia applications; pattern mining

162. Discriminant malware distance learning on structural information for automated malware classification.

【Paper Link】 【Pages】:1357-1365

【Authors】: Deguang Kong ; Guanhua Yan

【Abstract】: The voluminous malware variants that appear on the Internet pose severe threats to its security. In this work, we explore techniques that can automatically classify malware variants into their corresponding families. We present a generic framework that extracts structural information from malware programs as attributed function call graphs, in which rich malware features are encoded as attributes at the function level. Our framework further learns discriminant malware distance metrics that evaluate the similarity between the attributed function call graphs of two malware programs. To combine various types of malware attributes, our method adaptively learns the confidence level associated with the classification capability of each attribute type and then adopts an ensemble of classifiers for automated malware classification. We evaluate our approach with a number of Windows-based malware instances belonging to 11 families, and experimental results show that our automated malware classification method is able to achieve high classification accuracy.

【Keywords】: attribution; distance learning; function call graph; graph matching; malware; metric learning; optimization; structure

163. Assessing team strategy using spatiotemporal data.

【Paper Link】 【Pages】:1366-1374

【Authors】: Patrick Lucey ; Dean Oliver ; G. Peter K. Carr ; Joe Roth ; Iain A. Matthews

【Abstract】: The "Moneyball" revolution coincided with a shift in the way professional sporting organizations handle and utilize data in their decision-making processes. Due to the demand for better sports analytics and improvements in sensor technology, there has been a plethora of ball and player tracking information generated within professional sports for analytical purposes. However, due to the continuous nature of the data and the lack of associated high-level labels to describe it, this rich set of information has had very limited use, especially in the analysis of a team's tactics and strategy. In this paper, we give an overview of the types of analysis currently performed, mostly with hand-labeled event data, and highlight the problems associated with the influx of spatiotemporal data. By way of example, we present an approach which uses an entire season of ball tracking data from the English Premier League (2010-2011 season) to reinforce the commonly held belief that teams should aim to "win home games and draw away ones". We do this by: i) forming a representation of team behavior by chunking the incoming spatiotemporal signal into a series of quantized bins, and ii) generating an expectation model of team behavior based on a code-book of past performances. We show that home advantage in soccer is partly due to the conservative strategy of the away team. We also show that our approach can flag anomalous team behavior, which has many potential applications.

【Keywords】: representation

164. Exploratory analysis of highly heterogeneous document collections.

【Paper Link】 【Pages】:1375-1383

【Authors】: Arun S. Maiya ; John P. Thompson ; Francisco Loaiza-Lemos ; Robert M. Rolfe

【Abstract】: We present an effective multifaceted system for exploratory analysis of highly heterogeneous document collections. Our system is based on intelligently tagging individual documents in a purely automated fashion and exploiting these tags in a powerful faceted browsing framework. Tagging strategies employed include both unsupervised and supervised approaches based on machine learning and natural language processing. As one of our key tagging strategies, we introduce the KERA algorithm (Keyword Extraction for Reports and Articles). KERA extracts topic-representative terms from individual documents in a purely unsupervised fashion and is shown to be significantly more effective than state-of-the-art methods. Finally, we evaluate our system in its ability to help users locate documents pertaining to military critical technologies buried deep in a large heterogeneous sea of information.

【Keywords】: faceted browsing; faceted navigation; keyphrase extraction; machine learning; tag clouds; topic modeling
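A toy tf-idf tagger illustrates the kind of per-document term scoring that automated tagging builds on (a stand-in illustration only; KERA itself uses a different, unsupervised keyphrase scoring described in the paper):

```python
import math
from collections import Counter

docs = [
    "radar radar antenna gain antenna".split(),
    "laser optics laser beam optics".split(),
    "radar jamming countermeasure jamming".split(),
]

def top_terms(i, k=2):
    """Rank document i's terms by tf-idf and return the top k."""
    tf = Counter(docs[i])
    def idf(t):
        df = sum(t in d for d in docs)
        return math.log(len(docs) / df)
    return [t for t, _ in sorted(tf.items(),
            key=lambda kv: kv[1] * idf(kv[0]), reverse=True)[:k]]

print(top_terms(0), top_terms(1))
```

Note how "radar", though frequent in document 0, is down-weighted because it also appears in document 2, so the more topic-representative "antenna" and "gain" win.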

165. Experience from hosting a corporate prediction market: benefits beyond the forecasts.

【Paper Link】 【Pages】:1384-1392

【Authors】: Thomas A. Montgomery ; Paul M. Stieg ; Michael J. Cavaretta ; Paul E. Moraal

【Abstract】: Prediction markets are virtual stock markets used to gain insight and forecast events by leveraging the wisdom of crowds. Popularly applied in the public to cultural questions (election results, box-office returns), they have recently been applied by corporations to leverage employee knowledge and forecast answers to business questions (sales volumes, products and features, release timing). Determining whether to run a prediction market requires practical experience that is rarely described. Over the last few years, Ford Motor Company obtained practical experience by deploying one of the largest corporate prediction markets known. Business partners in the US, Europe, and South America provided questions on new vehicle features, sales volumes, take rates, pricing, and macroeconomic trends. We describe our experience, including both the strong and weak correlations found between predictions and real world results. Evaluating this methodology goes beyond prediction accuracy, however, since there are many side benefits. In addition to the predictions, we discuss the value of comments, stock price changes over time, the ability to overcome bureaucratic limits, and flexibly filling holes in corporate knowledge, enabling better decision making. We conclude with advice on running prediction markets, including writing good questions, market duration, motivating traders and protecting confidential information.

【Keywords】: artificial markets; forecasting; organizational knowledge; prediction markets; social media
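The standard automated market maker for prediction markets is the logarithmic market scoring rule (LMSR); the sketch below shows its mechanics (the paper does not specify Ford's internal market mechanism, so this is illustrative background):

```python
import math

class LMSR:
    def __init__(self, n_outcomes, b=10.0):
        self.b = b                        # liquidity parameter
        self.q = [0.0] * n_outcomes       # outstanding shares per outcome

    def cost(self, q):
        return self.b * math.log(sum(math.exp(qi / self.b) for qi in q))

    def price(self, i):
        """Instantaneous price = market's current probability of outcome i."""
        z = sum(math.exp(qi / self.b) for qi in self.q)
        return math.exp(self.q[i] / self.b) / z

    def buy(self, i, shares):
        """Charge for moving outcome i's share count up by `shares`."""
        before = self.cost(self.q)
        self.q[i] += shares
        return self.cost(self.q) - before

m = LMSR(2)
print(m.price(0))           # 0.5 before any trading
paid = m.buy(0, 10)
print(m.price(0), paid)     # buying pushes the price above 0.5
```

Prices always form a probability distribution over outcomes, and each trade moves them, which is how trader knowledge is aggregated into a forecast.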

166. Detecting insider threats in a real corporate database of computer usage activity.

【Paper Link】 【Pages】:1393-1401

【Authors】: Ted E. Senator ; Henry G. Goldberg ; Alex Memory ; William T. Young ; Brad Rees ; Robert Pierce ; Daniel Huang ; Matthew Reardon ; David A. Bader ; Edmond Chow ; Irfan A. Essa ; Joshua Jones ; Vinay Bettadapura ; Duen Horng Chau ; Oded Green ; Oguz Kaya ; Anita Zakrzewska ; Erica Briscoe ; Rudolph L. Mappus IV ; Robert McColl ; Lora Weiss ; Thomas G. Dietterich ; Alan Fern ; Weng-Keen Wong ; Shubhomoy Das ; Andrew Emmott ; Jed Irvine ; Jay Yoon Lee ; Danai Koutra ; Christos Faloutsos ; Daniel D. Corkill ; Lisa Friedland ; Amanda Gentzel ; David Jensen

【Abstract】: This paper reports on methods and results of an applied research project by a team consisting of SAIC and four universities to develop, integrate, and evaluate new approaches to detect the weak signals characteristic of insider threats on organizations' information systems. Our system combines structural and semantic information from a real corporate database of monitored activity on their users' computers to detect independently developed red team inserts of malicious insider activities. We have developed and applied multiple algorithms for anomaly detection based on suspected scenarios of malicious insider behavior, indicators of unusual activities, high-dimensional statistical patterns, temporal sequences, and normal graph evolution. Algorithms and representations for dynamic graph processing provide the ability to scale as needed for enterprise-level deployments on real-time data streams. We have also developed a visual language for specifying combinations of features, baselines, peer groups, time periods, and algorithms to detect anomalies suggestive of instances of insider threat behavior. We defined over 100 data features in seven categories based on approximately 5.5 million actions per day from approximately 5,500 users. We have achieved area under the ROC curve values of up to 0.979 and lift values of 65 on the top 50 user-days identified on two months of real data.

【Keywords】: anomaly detection; insider threat

167. Mining for geographically disperse communities in social networks by leveraging distance modularity.

Paper Link】 【Pages】:1402-1409

【Authors】: Paulo Shakarian ; Patrick Roos ; Devon Callahan ; Cory Kirk

【Abstract】: Social networks whose actors occupy geospatial locations are prevalent in military, intelligence, and policing operations such as counter-terrorism, counter-insurgency, and combating organized crime. These networks are often derived from a variety of intelligence sources. The discovery of communities that are geographically dispersed stems from the requirement to identify higher-level organizational structures, such as a logistics group that provides support to various geographically dispersed terrorist cells. To this problem we apply a variant of Newman-Girvan modularity known as distance modularity. To find geographically dispersed communities, we modify the well-known Louvain algorithm to find partitions of networks that provide near-optimal values of this quantity. We apply this algorithm to numerous samples from two real-world social networks and a terrorism network dataset whose nodes have associated geospatial locations. Our experiments show this to be an effective approach and highlight various practical considerations when applying the algorithm to distance modularity maximization. Several military, intelligence, and law-enforcement organizations are working with us to further test and field software for this emerging application.

【Keywords】: complex networks; geospatial reasoning
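The abstract does not give the objective in closed form; as a rough illustration, a distance-aware variant of Newman-Girvan modularity can down-weight the null-model term for geographically distant node pairs, so that communities whose members are linked but far apart are not penalized. A minimal pure-Python sketch, assuming a simple exponential distance decay on the null model (the decay form, the toy graph, and the coordinates are illustrative, not the paper's):

```python
import math
from itertools import combinations

def distance_modularity(nodes, edges, coords, community, decay=0.1):
    """Newman-Girvan-style modularity whose null-model term is damped by
    an exponential distance-decay factor, so geographically dispersed
    communities score higher. The exact decay form is an assumption."""
    m = len(edges)
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {n: len(adj[n]) for n in nodes}

    def dist(u, v):
        (x1, y1), (x2, y2) = coords[u], coords[v]
        return math.hypot(x1 - x2, y1 - y2)

    q = 0.0
    for u, v in combinations(nodes, 2):
        if community[u] != community[v]:
            continue
        a_uv = 1.0 if v in adj[u] else 0.0
        null = deg[u] * deg[v] / (2 * m) * math.exp(-decay * dist(u, v))
        q += a_uv - null
    # sum over unordered pairs, self-pair terms omitted for simplicity
    return q / m
```

Maximizing such a quantity with a Louvain-style greedy pass then favors partitions that group well-connected but geographically spread-out actors.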

168. An integrated framework for suicide risk prediction.

Paper Link】 【Pages】:1410-1418

【Authors】: Truyen Tran ; Dinh Q. Phung ; Wei Luo ; Richard Harvey ; Michael Berk ; Svetha Venkatesh

【Abstract】: Suicide is a major concern in society. Despite the great attention paid by the community, with very substantive medico-legal implications, there has been no satisfying method that can reliably predict future attempted or completed suicide. We present an integrated machine learning framework to tackle this challenge. Our proposed framework consists of a novel feature extraction scheme, an embedded feature selection process, a set of risk classifiers and, finally, a risk calibration procedure. For temporal feature extraction, we cast the patient's clinical history into a temporal image to which a bank of one-sided filters is applied. The responses are then partly transformed into mid-level features and selected in an l1-norm framework under extreme value theory. A set of probabilistic ordinal risk classifiers is then applied to compute the risk probabilities and further re-rank the features. Finally, the predicted risks are calibrated. Together with our Australian partner, we performed a comprehensive study on data collected for a mental health cohort, and the experiments validate that our proposed framework outperforms risk assessment instruments used by medical practitioners.

【Keywords】: filter bank; machine learning; medical data analysis; one-sided convolutional kernels; risk modelling; risk prediction; suicide
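The "temporal image" and one-sided filter bank can be pictured concretely. A minimal sketch, assuming the image is a per-event-code matrix of daily counts and the bank consists of causal moving averages at several widths (both assumptions made for illustration; the abstract does not specify the filter shapes):

```python
def one_sided_filter(series, width):
    """Causal (one-sided) rectangular filter: the response at time t
    averages only the `width` most recent values up to t, so no future
    information leaks into the feature."""
    out = []
    for t in range(len(series)):
        window = series[max(0, t - width + 1): t + 1]
        out.append(sum(window) / len(window))
    return out

def temporal_image_features(history, widths=(2, 4, 8)):
    """`history` is a dict: event_code -> list of daily counts (the
    'temporal image', one row per event type). Applying filters of
    several widths gives multi-scale responses; here we keep only the
    most recent response per (event, width) as the feature value."""
    features = {}
    for code, series in history.items():
        for w in widths:
            features[(code, w)] = one_sided_filter(series, w)[-1]
    return features
```

The one-sided (causal) shape is what makes the scheme usable for prediction: a feature computed at assessment time never looks past it.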

169. Gaussian multiple instance learning approach for mapping the slums of the world using very high resolution imagery.

Paper Link】 【Pages】:1419-1426

【Authors】: Ranga Raju Vatsavai

【Abstract】: In this paper, we present a computationally efficient algorithm based on multiple instance learning for mapping informal settlements (slums) using very high-resolution remote sensing imagery. From a remote sensing perspective, informal settlements share unique spatial characteristics that distinguish them from other urban structures like industrial, commercial, and formal residential settlements. However, regular pattern recognition and machine learning methods, which are predominantly single-instance or per-pixel classifiers, often fail to accurately map informal settlements because they do not capture the complex spatial patterns. To overcome these limitations we employ a multiple-instance machine learning approach, where groups of contiguous pixels (image patches) are modeled as generated by a Gaussian distribution. We have conducted several experiments on very high-resolution satellite imagery representing four unique geographic regions across the world. Our method shows consistent improvement in accurately identifying informal settlements.

【Keywords】: mil; remote sensing; spatial data mining
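The bag-level idea can be sketched as follows: each image patch is a bag of pixel feature vectors, each class is summarized by a Gaussian fitted to its training pixels, and a patch is labeled as a whole by whichever class Gaussian best explains its pixels. This is a simplified diagonal-Gaussian stand-in for the paper's model, with made-up class names:

```python
import math

def fit_diag_gaussian(pixels):
    """Fit a diagonal Gaussian to a bag of pixel feature vectors
    (one patch = one bag of contiguous pixels)."""
    d, n = len(pixels[0]), len(pixels)
    mean = [sum(p[j] for p in pixels) / n for j in range(d)]
    var = [max(sum((p[j] - mean[j]) ** 2 for p in pixels) / n, 1e-6)
           for j in range(d)]
    return mean, var

def bag_log_likelihood(pixels, model):
    """Log-likelihood of all pixels in the bag under one class model."""
    mean, var = model
    ll = 0.0
    for p in pixels:
        for j, x in enumerate(p):
            ll += -0.5 * (math.log(2 * math.pi * var[j])
                          + (x - mean[j]) ** 2 / var[j])
    return ll

def classify_patch(pixels, class_models):
    """Label a patch by the class whose Gaussian explains its pixels
    best -- a bag-level decision rather than per-pixel voting."""
    return max(class_models,
               key=lambda c: bag_log_likelihood(pixels, class_models[c]))
```

The per-patch decision is what distinguishes this from a per-pixel classifier: neighboring pixels are judged jointly, which is how the spatial texture of a settlement type enters the model.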

170. A privacy preserving framework for managing vehicle data in road pricing systems.

Paper Link】 【Pages】:1427-1435

【Authors】: Huayu Wu ; Wee Siong Ng ; Kian-Lee Tan ; Wei Wu ; Shili Xiang ; Mingqiang Xue

【Abstract】: The Electronic Road Pricing (ERP) system was implemented in 1998 by the Land Transport Authority of Singapore to control traffic through road pricing. To better understand traffic conditions and improve the pricing scheme, the government initiated the next-generation ERP (ERP 2) project, which aims to use the Global Navigation Satellite System (GNSS) to collect positional data from vehicles for analysis. However, most drivers fear being monitored once the government installs the devices in their vehicles to collect GPS data. Existing data stream management systems (DSMS) centralize both data management and privacy control at the server side. This framework assumes the DSMS server is secure and trusted, and protects providers' data from illegal access by data users. In ERP 2, the DSMS server is maintained by the government, i.e., the data user. Thus, the existing framework is not suitable. We propose a novel framework in which privacy protection is pushed to the data provider side. By doing this, the system can be safer and more efficient. Our framework can be used in situations such as ERP 2, i.e., where data providers would like to control their own privacy policies and/or the workload of the DSMS server needs to be reduced.

【Keywords】: decentralized framework; hippocratic data stream systems; privacy preservation; road pricing

171. U-Air: when urban air quality inference meets big data.

Paper Link】 【Pages】:1436-1444

【Authors】: Yu Zheng ; Furui Liu ; Hsun-Ping Hsieh

【Abstract】: Information about urban air quality, e.g., the concentration of PM2.5, is of great importance for protecting human health and controlling air pollution. While there are only a limited number of air-quality monitoring stations in a city, air quality varies across urban spaces non-linearly and depends on multiple factors, such as meteorology, traffic volume, and land use. In this paper, we infer real-time and fine-grained air quality information throughout a city, based on the (historical and real-time) air quality data reported by existing monitoring stations and a variety of data sources observed in the city, such as meteorology, traffic flow, human mobility, the structure of road networks, and points of interest (POIs). We propose a semi-supervised learning approach based on a co-training framework that consists of two separate classifiers. One is a spatial classifier based on an artificial neural network (ANN), which takes spatially related features (e.g., the density of POIs and length of highways) as input to model the spatial correlation between the air quality of different locations. The other is a temporal classifier based on a linear-chain conditional random field (CRF), involving temporally related features (e.g., traffic and meteorology) to model the temporal dependency of air quality at a location. We evaluated our approach with extensive experiments based on five real data sources obtained in Beijing and Shanghai. The results show the advantages of our method over four categories of baselines, including linear/Gaussian interpolation, classical dispersion models, well-known classification models like decision trees and CRFs, and ANNs.

【Keywords】: air quality; city dynamics; human mobility; spatial trajectories
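The co-training loop itself is generic and can be sketched independently of the ANN and CRF: two learners trained on disjoint feature views take turns labeling the unlabeled points they are most confident about, growing a shared labeled pool. The nearest-centroid learners and margin-based confidence below are stand-ins for the paper's classifiers, and views 0/1 stand in for the spatial/temporal feature slices:

```python
def centroid_classifier(labeled, view):
    """Nearest-centroid stand-in for one co-training learner; `view`
    selects which feature slice of each example this learner sees."""
    sums, counts = {}, {}
    for x, y in labeled:
        v = x[view]
        if y not in sums:
            sums[y], counts[y] = [0.0] * len(v), 0
        sums[y] = [s + a for s, a in zip(sums[y], v)]
        counts[y] += 1
    cents = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        v = x[view]
        scored = sorted((sum((a - c) ** 2 for a, c in zip(v, cent)), y)
                        for y, cent in cents.items())
        best_dist, best_y = scored[0]
        margin = (scored[1][0] - best_dist) if len(scored) > 1 else 1.0
        return best_y, margin  # margin as a crude confidence
    return predict

def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """Each round, each view's learner labels its most confident
    unlabeled example and adds it to the shared labeled pool."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            clf = centroid_classifier(labeled, view)
            ranked = sorted(((clf(x), x) for x in pool),
                            key=lambda t: -t[0][1])
            for (y, _), x in ranked[:per_round]:
                labeled.append((x, y))
                pool.remove(x)
    final = centroid_classifier(labeled, 0)
    return lambda x: final(x)[0]
```

The point of the disjoint views is that each learner's confident mistakes look like noise to the other, so pseudo-labels from one view can still teach the other.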

Panel 1

172. Panel: a data scientist's guide to making money from start-ups.

Paper Link】 【Pages】:1445

【Authors】: Foster J. Provost ; Geoffrey I. Webb

【Abstract】:

【Keywords】: data science; money; start-ups

Demonstrations 19

173. LAICOS: an open source platform for personalized social web search.

Paper Link】 【Pages】:1446-1449

【Authors】: Mohamed Reda Bouadjenek ; Hakim Hacid ; Mokrane Bouzeghoub

【Abstract】: In this paper, we introduce LAICOS, a social Web search engine, as a contribution to the growing area of Social Information Retrieval (SIR). Social information and personalization are at the heart of LAICOS. On the one hand, the social context of documents is added as a layer to the textual content traditionally used for indexing, to provide Personalized Social Document Representations. On the other hand, the social context of users is used for the query expansion process through the Personalized Social Query Expansion framework (PSQE) proposed in our earlier work. We describe the different components of the system while relying on social bookmarking systems as a source of social information for personalizing and enhancing the IR process. We show how the internal structure of the indexes, as well as the query expansion process, operates using social information.

【Keywords】: information retrieval; personalization; social information retrieval; social networks

174. JobMiner: a real-time system for mining job-related patterns from social media.

Paper Link】 【Pages】:1450-1453

【Authors】: Yu Cheng ; Yusheng Xie ; Zhengzhang Chen ; Ankit Agrawal ; Alok N. Choudhary ; Songtao Guo

【Abstract】: The various kinds of booming social media not only provide a platform where people can communicate with each other, but also spread useful domain information, such as career and job market information. For example, LinkedIn publishes a large amount of messages either about people who want to seek jobs or companies who want to recruit new members. By collecting information, we can have a better understanding of the job market and provide insights to job-seekers, companies and even decision makers. In this paper, we analyze the job information from the social network point of view. We first collect the job-related information from various social media sources. Then we construct an inter-company job-hopping network, with the vertices denoting companies and the edges denoting flow of personnel between companies. We subsequently employ graphmining techniques to mine influential companies and related company groups based on the job-hopping network model. Demonstration on LinkedIn data shows that our system JobMiner can provide a better understanding of the dynamic processes and a more accurate identification of important entities in the job market.

【Keywords】: graph mining; influence analysis; job market; social media; temporal network

175. Inferring distant-time location in low-sampling-rate trajectories.

Paper Link】 【Pages】:1454-1457

【Authors】: Meng-Fen Chiang ; Yung-Hsiang Lin ; Wen-Chih Peng ; Philip S. Yu

【Abstract】: With the growth of location-based services and social services, low-sampling-rate trajectories from check-in data or photos with geo-tag information become ubiquitous. In general, most detailed moving information in low-sampling-rate trajectories is lost. Prior works have elaborated on distant-time location prediction in high-sampling-rate trajectories. However, existing prediction models are pattern-based and thus not applicable due to the sparsity of data points in low-sampling-rate trajectories. To address the sparsity in low-sampling-rate trajectories, we develop a Reachability-based prediction model on a Time-constrained Mobility Graph (RTMG) to predict locations for distant-time queries. Specifically, we design an adaptive temporal exploration approach to extract effective supporting trajectories that are temporally close to the query time. Based on the supporting trajectories, a Time-constrained mobility Graph (TG) is constructed to capture mobility information at the given query time. In light of TG, we further derive the reachability probabilities among locations in TG. Thus, the location with maximum reachability from the current location, among all possible locations in the supporting trajectories, is taken as the prediction result. To efficiently process queries, we propose the index structure Sorted Interval-Tree (SOIT) to organize location records. Extensive experiments with real data demonstrate the effectiveness and efficiency of RTMG. First, RTMG with adaptive temporal exploration significantly outperforms the existing pattern-based prediction model HPM [2] over varying data sparsity in terms of higher accuracy and higher coverage. Also, the proposed index structure SOIT can efficiently speed up RTMG on large-scale trajectory datasets. In the future, we could extend RTMG by considering more factors (e.g., staying durations at locations, application usage on smart phones) to further improve the prediction accuracy.

【Keywords】: location prediction; reachability; sparsity
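A minimal sketch of the reachability idea: restrict attention to transitions observed near the query time, row-normalize the transition counts into a mobility graph, and score each candidate location by its probability of being reached within a few hops. The trajectory format (lists of (hour, location) pairs), the time window, and the max-over-path-lengths score are illustrative simplifications of RTMG, not the paper's exact construction:

```python
from collections import defaultdict

def build_mobility_graph(trajectories, query_hour, window=2):
    """Count transitions whose start time falls within `window` hours
    of the query time; rows normalize to transition probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for (h1, a), (h2, b) in zip(traj, traj[1:]):
            if abs(h1 - query_hour) <= window:
                counts[a][b] += 1
    graph = {}
    for a, nbrs in counts.items():
        total = sum(nbrs.values())
        graph[a] = {b: c / total for b, c in nbrs.items()}
    return graph

def reachability(graph, start, steps=2):
    """Probability of being at each location after up to `steps` hops,
    taking the max over path lengths as a simple reachability score;
    the argmax over candidate locations is the predicted location."""
    probs = {start: 1.0}
    best = dict(probs)
    for _ in range(steps):
        nxt = defaultdict(float)
        for loc, p in probs.items():
            for b, q in graph.get(loc, {}).items():
                nxt[b] += p * q
        probs = dict(nxt)
        for loc, p in probs.items():
            best[loc] = max(best.get(loc, 0.0), p)
    return best
```

Restricting counts to a window around the query time is what makes the graph "time-constrained": evening transitions do not pollute a morning query.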

176. AMETHYST: a system for mining and exploring topical hierarchies of heterogeneous data.

Paper Link】 【Pages】:1458-1461

【Authors】: Marina Danilevsky ; Chi Wang ; Fangbo Tao ; Son Nguyen ; Gong Chen ; Nihit Desai ; Lidan Wang ; Jiawei Han

【Abstract】: In this demo we present AMETHYST, a system for exploring and analyzing a topical hierarchy constructed from a heterogeneous information network (HIN). HINs, composed of multiple types of entities and links, are very common in the real world. Many have a text component, and thus can benefit from a high-quality hierarchical organization of the topics in the network dataset. By organizing the topics into a hierarchy, AMETHYST helps users understand search results in the context of an ontology and explain entity relatedness at different granularities. The automatically constructed topical hierarchy reflects a domain-specific ontology, interacts with multiple types of linked entities, and can be tailored for both free-text and OLAP queries.

【Keywords】: entity mining; heterogeneous network; network analysis; topic modeling

177. A tool for collecting provenance data in social media.

Paper Link】 【Pages】:1462-1465

【Authors】: Pritam Gundecha ; Suhas Ranganath ; Zhuo Feng ; Huan Liu

【Abstract】: In recent years, social media sites have provided a large amount of information. Recipients of such information need mechanisms to know more about the received information, including the provenance. Previous research has shown that some attributes related to the received information provide additional context, so that a recipient can assess the amount of value, trust, and validity to be placed in the received information. Personal attributes of a user, including name, location, education, ethnicity, gender, and political and religious affiliations, can be found in social media sites. In this paper, we present a novel web-based tool for collecting the attributes of interest associated with a particular social media user related to the received information. This tool provides a way to combine different attributes available at different social media sites into a single user profile. Using different types of Twitter users, we also evaluate the performance of the tool in terms of number of attribute values collected, validity of these values, and total amount of retrieval time.

【Keywords】: information provenance; provenance attributes; social media

178. STED: semi-supervised targeted-interest event detection in twitter.

Paper Link】 【Pages】:1466-1469

【Authors】: Ting Hua ; Feng Chen ; Liang Zhao ; Chang-Tien Lu ; Naren Ramakrishnan

【Abstract】: Social microblogs such as Twitter and Weibo are experiencing explosive growth, with billions of global users sharing their daily observations and thoughts. Beyond public interests (e.g., sports, music), microblogs can provide highly detailed information for those interested in public health, homeland security, and financial analysis. However, the language used in Twitter is heavily informal, ungrammatical, and dynamic. Existing data mining algorithms require extensive manual labeling to build and maintain a supervised system. This paper presents STED, a semi-supervised system that helps users automatically detect and interactively visualize events of a targeted type from Twitter, such as crimes, civil unrest, and disease outbreaks. Our model first applies transfer learning and label propagation to automatically generate labeled data, then learns a customized text classifier based on mini-clustering, and finally applies fast spatial scan statistics to estimate the locations of events. We demonstrate STED's usage and benefits using Twitter data collected from Latin American countries, and show how our system helps to detect and track example events such as civil unrest and crimes.

【Keywords】: event detection; text mining

179. Forex-foreteller: currency trend modeling using news articles.

Paper Link】 【Pages】:1470-1473

【Authors】: Fang Jin ; Nathan Self ; Parang Saraf ; Patrick Butler ; Wei Wang ; Naren Ramakrishnan

【Abstract】: Financial markets are quite sensitive to unanticipated news and events. Identifying the effect of news on the market is a challenging task. In this demo, we present Forex-foreteller (FF) which mines news articles and makes forecasts about the movement of foreign currency markets. The system uses a combination of language models, topic clustering, and sentiment analysis to identify relevant news articles. These articles along with the historical stock index and currency exchange values are used in a linear regression model to make forecasts. The system has an interactive visualizer designed specifically for touch-sensitive devices which depicts forecasts along with the chronological news events and financial data used for making the forecasts.

【Keywords】: currency markets; sentiment analysis; topic discovery

180. Real-time disease surveillance using Twitter data: demonstration on flu and cancer.

Paper Link】 【Pages】:1474-1477

【Authors】: Kathy Lee ; Ankit Agrawal ; Alok N. Choudhary

【Abstract】: Social media is producing massive amounts of data on an unprecedented scale. Here people share their experiences and opinions on various topics, including personal health issues, symptoms, treatments, side-effects, and so on. This makes publicly available social media data an invaluable resource for mining interesting and actionable healthcare insights. In this paper, we describe a novel real-time flu and cancer surveillance system that uses spatial, temporal, and text mining on Twitter data. The real-time analysis results are reported visually in terms of US disease surveillance maps, distribution and timelines of disease types, symptoms, and treatments, in addition to overall disease activity timelines on our project website. Our surveillance system can be very useful not only for early prediction of seasonal disease outbreaks such as flu, but also for monitoring distribution of cancer patients with different cancer types and symptoms in each state and the popularity of treatments used. The resulting insights are expected to help facilitate faster response to and preparation for epidemics and also be very useful for both patients and doctors to make more informed decisions.

【Keywords】: cancer; disease detection; disease surveillance; epidemics; influenza; public health; social media; twitter

181. KeySee: supporting keyword search on evolving events in social streams.

Paper Link】 【Pages】:1478-1481

【Authors】: Pei Lee ; Laks V. S. Lakshmanan ; Evangelos E. Milios

【Abstract】: Online social streams such as Twitter/Facebook timelines and forum discussions have emerged as prevalent channels for information dissemination. As these social streams surge quickly, information overload has become a huge problem. Existing keyword search engines on social streams like Twitter Search are not successful in overcoming the problem, because they merely return an overwhelming list of posts, with little aggregation or semantics. In this demo, we provide a new solution called KeySee by grouping posts into events, and track the evolution patterns of events as new posts stream in and old posts fade out. Noise and redundancy problems are effectively addressed in our system. Our demo supports refined keyword queries on evolving events by allowing users to specify the time span and a designated evolution pattern. For each event result, we provide various analytic views such as frequency curves, word clouds, and GPS distributions. We deploy KeySee on real Twitter streams, and the results show that our demo outperforms existing keyword search engines in both quality and usability.

【Keywords】: event evolution; keysee; keyword search; social stream

182. Understanding Twitter data with TweetXplorer.

Paper Link】 【Pages】:1482-1485

【Authors】: Fred Morstatter ; Shamanth Kumar ; Huan Liu ; Ross Maciejewski

【Abstract】: In the era of big data it is increasingly difficult for an analyst to extract meaningful knowledge from a sea of information. We present TweetXplorer, a system for analysts with little information about an event to gain knowledge through the use of effective visualization techniques. Using tweets collected during Hurricane Sandy as an example, we will lead the reader through a workflow that exhibits the functionality of the system.

【Keywords】: big data; geospatial analysis; retweet network; twitter visualization

183. An online system with end-user services: mining novelty concepts from tv broadcast subtitles.

Paper Link】 【Pages】:1486-1489

【Authors】: Mika Rautiainen ; Jouni Sarvanko ; Arto Heikkinen ; Mika Ylianttila ; Vassilis Kostakos

【Abstract】: Better tools for content-based access to video are needed to improve access to time-continuous video data. In particular, information about linear TV broadcast programs has been available only in the form of program guides that provide short, manually described overviews of the program content. Recent developments in the digitalization of TV broadcasting and the emergence of web-based services for catch-up and on-demand viewing bring new possibilities for accessing data. In this paper we introduce our data mining system and accompanying services for summarizing Finnish DVB broadcast streams from seven national channels. We describe how novelty concepts can be mined from DVB subtitles to augment the web-based "Catch-Up TV Guide" and "Novelty Cloud" TV services. Furthermore, our system allows accessing media fragments as Picture Quotes via generated word lists and provides content-based recommendations to find new programs whose content is similar to the user-selected programs. Our index consists of over 180,000 programs that are used to recommend relevant programs. The service has been under development and available online since 2010, and has registered over 5000 user sessions.

【Keywords】: broadcast data mining; novelty concept detection; online tv; video analysis

184. When TEDDY meets GrizzLY: temporal dependency discovery for triggering road deicing operations.

Paper Link】 【Pages】:1490-1493

【Authors】: Céline Robardet ; Vasile-Marian Scuturici ; Marc Plantevit ; Antoine Fraboulet

【Abstract】: Temporal dependencies between multiple sensor data sources link two types of events if the occurrence of one is repeatedly followed by the appearance of the other within a certain time interval. The TEDDY algorithm aims at discovering such dependencies, identifying the statistically significant time intervals with a chi-squared test. We present how these dependencies can be used within the GrizzLY project to tackle an environmental and technical issue: the deicing of roads. This project aims to wisely organize the deicing operations of an urban area, based on several sensor network measurements of local atmospheric phenomena. A spatial and temporal dependency-based model is built from these data to predict freezing alerts.

【Keywords】: temporal dependency discovery
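The significance test can be illustrated with a discretized version: slot time into unit intervals and build the 2x2 contingency table of "slot contains an A event" against "a B event occurs within the candidate interval afterwards". The unit-slot discretization and table construction below are assumptions made for this sketch; TEDDY's actual search over candidate intervals is more involved:

```python
def chi2_dependency(a_times, b_times, n_slots, interval):
    """Pearson chi-squared statistic on the 2x2 table (slot contains an
    A event or not) x (a B event occurs within `interval` slots after
    it or not), computed over `n_slots` discrete unit time slots."""
    a_set, b_set = set(a_times), set(b_times)
    table = [[0, 0], [0, 0]]  # rows: A present?  cols: B follows?
    for t in range(n_slots):
        followed = any((t + d) in b_set for d in range(1, interval + 1))
        table[0 if t in a_set else 1][0 if followed else 1] += 1
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in (0, 1)]
    chi2 = 0.0
    for i in (0, 1):
        for j in (0, 1):
            exp = row[i] * col[j] / n_slots
            if exp > 0:
                chi2 += (table[i][j] - exp) ** 2 / exp
    # compare to the 1-dof critical value, e.g. 3.84 at alpha = 0.05
    return chi2
```

A large statistic for some interval flags that interval as a statistically significant temporal dependency between the two event types.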

185. EventCube: multi-dimensional search and mining of structured and text data.

Paper Link】 【Pages】:1494-1497

【Authors】: Fangbo Tao ; Kin Hou Lei ; Jiawei Han ; Chengxiang Zhai ; Xiao Cheng ; Marina Danilevsky ; Nihit Desai ; Bolin Ding ; Jing Ge ; Heng Ji ; Rucha Kanade ; Anne Kao ; Qi Li ; Yanen Li ; Cindy Xide Lin ; Jialu Liu ; Nikunj C. Oza ; Ashok N. Srivastava ; Rodney Tjoelker ; Chi Wang ; Duo Zhang ; Bo Zhao

【Abstract】: A large portion of real world data is either text or structured (e.g., relational) data. Moreover, such data objects are often linked together (e.g., structured specification of products linking with the corresponding product descriptions and customer comments). Even for text data such as news data, typed entities can be extracted with entity extraction tools. The EventCube project constructs TextCube and TopicCube from interconnected structured and text data (or from text data via entity extraction and dimension building), and performs multidimensional search and analysis on such datasets, in an informative, powerful, and user-friendly manner. This proposed EventCube demo will show the power of the system not only on the originally designed ASRS (Aviation Safety Report System) data sets, but also on news datasets collected from multiple news agencies, and academic datasets constructed from the DBLP and web data. The system has high potential to be extended in many powerful ways and serve as a general platform for search, OLAP (online analytical processing) and data mining on integrated text and structured data. After the system demo in the conference, the system will be put on the web for public access and evaluation.

【Keywords】: data cube system; multidimensional data

186. SEA: a system for event analysis on chinese tweets.

Paper Link】 【Pages】:1498-1501

【Authors】: Yaqiong Wang ; Hongfu Liu ; Hao Lin ; Junjie Wu ; Zhiang Wu ; Jie Cao

【Abstract】: Recent years have witnessed the explosive growth of online social media. Weibo, a famous "Chinese Twitter", has attracted over 0.5 billion users in less than four years, with more than 1000 tweets generated every second. These tweets are informative but very fragmented, and are thus better archived from an event perspective, as done by Weibo itself in the "Micro-Topic" program. This effort, however, is still far from satisfactory, as it does not provide enough analytical power over events. In light of this, in this demo paper, we propose SEA, a System for Event Analysis on Chinese tweets. In general, SEA is an event-centric, multi-functional platform that conducts panoramic analysis of Weibo events from various aspects, including the semantic information of the events, the temporal and spatial trends, the public sentiments, the hidden sub-events, the key users in the event diffusion and their preferences, etc. These functions are enabled by the integration of various analytical models and by the noSQL techniques adopted purposefully for massive tweet management. Finally, a case study on the "Spring Festival" event demonstrates the effectiveness of SEA. To the best of our knowledge, SEA is the first third-party system that provides panoramic analysis of Weibo events.

【Keywords】: chinese tweets; event analysis; public sentiments; weibo

187. SAE: social analytic engine for large networks.

Paper Link】 【Pages】:1502-1505

【Authors】: Yang Yang ; Jianfei Wang ; Yutao Zhang ; Wei Chen ; Jing Zhang ; Honglei Zhuang ; Zhilin Yang ; Bo Ma ; Zhanpeng Fang ; Sen Wu ; Xiaoxiao Li ; Debing Liu ; Jie Tang

【Abstract】: Online social networks have become a bridge connecting our physical daily life and the virtual Web space, which not only provides rich data for mining, but also brings many new challenges. In this paper, we present a novel Social Analytic Engine (SAE) for large online social networks. The key issues we pursue in the analytic engine concern the following problems: 1) at the micro level, how do people form different types of social ties and how do people influence each other? 2) at the meso level, how do people group into communities? 3) at the macro level, what are the hottest topics in a social network and how do the topics evolve over time? We propose methods to address the above questions. The methods are general and can be applied to various social networking data. We have deployed the proposed analytic engine over multiple different networks and validated the effectiveness and efficiency of the proposed methods.

【Keywords】: social analytic engine; social influence; social network

188. FIU-Miner: a fast, integrated, and user-friendly system for data mining in distributed environment.

Paper Link】 【Pages】:1506-1509

【Authors】: Chunqiu Zeng ; Yexi Jiang ; Li Zheng ; Jingxuan Li ; Lei Li ; Hongtai Li ; Chao Shen ; Wubai Zhou ; Tao Li ; Bing Duan ; Ming Lei ; Pengnian Wang

【Abstract】: The advent of the Big Data era is driving data analysts from different domains to use data mining techniques for data analysis. However, performing data analysis in a specific domain is not trivial; it often requires complex task configuration, onerous integration of algorithms, and efficient execution in distributed environments. Few efforts have been devoted to developing effective tools to facilitate data analysts in conducting complex data analysis tasks. In this paper, we design and implement FIU-Miner, a Fast, Integrated, and User-friendly system to ease data analysis. FIU-Miner allows users to rapidly configure a complex data analysis task without writing a single line of code. It also helps users conveniently import and integrate different analysis programs. Further, it significantly balances resource utilization and task execution in heterogeneous environments. A case study of a real-world application demonstrates the efficacy and effectiveness of our proposed system.

【Keywords】: data mining; distributed environment; hadoop; workflow

189. LAFT-Explorer: inferring, visualizing and predicting how your social network expands.

Paper Link】 【Pages】:1510-1513

【Authors】: Jun Zhang ; Chaokun Wang ; Yuanchi Ning ; Yichi Liu ; Jianmin Wang ; Philip S. Yu

【Abstract】: The study of social network evolution has attracted much attention from both industry and academia. In this paper we demonstrate LaFT-Explorer, a general toolkit for explaining and reproducing the network growth process based on friendship propagation. LaFT-Explorer presents multiple perspectives for analyzing the network evolution process and structure, including LaFT-Tree, LaFT-Trace and LaFT-Flow. Upon these we build LaFT-Rec, a new visual, interactive friend recommendation service based on friendship propagation. LaFT-Rec not only shows whom one may make friends with, but also tells the user why they should befriend that person and how they can reach him or her. We demonstrate our system built upon the academic social network of DBLP.

【Keywords】: friendship propagation; laft-explorer; laft-tree; social networks evolution; transitivity of friendship

190. A transfer learning based framework of crowd-selection on twitter.

Paper Link】 【Pages】:1514-1517

【Authors】: Zhou Zhao ; Da Yan ; Wilfred Ng ; Shi Gao

【Abstract】: Crowd selection is essential to crowdsourcing applications, since choosing the right workers with particular expertise to carry out crowdsourced tasks is extremely important. The central problem is simple but tricky: given a crowdsourced task, who are the most knowledgeable users to ask? In this demo, we show our framework that tackles the problem of crowdsourced task assignment on Twitter according to the social activities of its users. Since user profiles on Twitter do not reveal user interests and skills, we transfer knowledge from categorized Yahoo! Answers datasets to learn user expertise. Then, we select the right crowd for given tasks based on user expertise. We study the effectiveness of our system using extensive user evaluation. We further engage attendees to participate in a game called "Whom to Ask on Twitter?", which helps convey our ideas in an interactive manner. Our crowd selection can be accessed at the following url: http://webproject2.cse.ust.hk:8034/tcrowd/.

【Keywords】: crowdsourcing; tweet

191. Risk-O-Meter: an intelligent clinical risk calculator.

Paper Link】 【Pages】:1518-1521

【Authors】: Kiyana Zolfaghar ; Jayshree Agarwal ; Deepthi Sistla ; Si-Chi Chin ; Senjuti Basu Roy ; Nele Verbiest

【Abstract】: We present a system called Risk-O-Meter to predict and analyze clinical risk via data imputation, visualization, predictive modeling, and association rule exploration. Clinical risk calculators provide information about a person's chance of having a disease or encountering a clinical event. Such tools can be highly useful for educating patients to understand and monitor their health conditions. Unlike existing risk calculators that are primarily designed for domain experts, Risk-O-Meter is useful to patients who are unfamiliar with medical terminologies, or to providers who have limited information about a patient. Risk-O-Meter is designed to be flexible enough to accept limited or incomplete data inputs and still predict clinical risk efficiently and effectively. The current version of Risk-O-Meter evaluates 30-day risk of hospital readmission; however, the proposed system framework is applicable to general clinical risk prediction. In this demonstration paper, we describe the different components of Risk-O-Meter and the intelligent algorithms associated with each of these components to evaluate risk of readmission using incomplete patient data inputs.

【Keywords】: association rule mining; clinical risk calculator; clustering; risk of hospital readmission prediction

Tutorials 6

192. Algorithmic techniques for modeling and mining large graphs (AMAzING).

Paper Link】 【Pages】:1523

【Authors】: Alan M. Frieze ; Aristides Gionis ; Charalampos E. Tsourakakis

【Abstract】: Network science has emerged over the last years as an interdisciplinary area spanning traditional domains including mathematics, computer science, sociology, biology and economics. Since complexity in social, biological and economic systems, and more generally in complex systems, arises through pairwise interactions, there is surging interest in understanding networks. In this tutorial, we will provide an in-depth presentation of the most popular random-graph models used for modeling real-world networks. We will then discuss efficient algorithmic techniques for mining large graphs, with emphasis on the problems of extracting graph sparsifiers, partitioning graphs into densely connected components, and finding dense subgraphs. We will motivate the problems we will discuss and the algorithms we will present with real-world applications. Our aim is to survey important results in the areas of modeling and mining large graphs, to uncover the intuition behind the key ideas, and to present future research directions.
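One of the problems the abstract highlights, finding dense subgraphs, has a classic greedy 2-approximation due to Charikar (repeatedly peel the minimum-degree vertex, tracking the best average degree density seen). The sketch below is an illustration of that general technique, not code from the tutorial itself; the function and variable names are our own.

```python
def densest_subgraph(edges, nodes):
    """Greedy 2-approximation for the densest subgraph (Charikar):
    repeatedly remove the minimum-degree node, remembering the
    intermediate subgraph with the highest density |E|/|V|."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(nodes)
    m = len(edges)
    best_density, best_set = 0.0, set(remaining)
    while remaining:
        density = m / len(remaining)
        if density > best_density:
            best_density, best_set = density, set(remaining)
        v = min(remaining, key=lambda x: len(adj[x]))  # min-degree node
        m -= len(adj[v])                               # drop its edges
        for u in adj[v]:
            adj[u].discard(v)
        adj[v].clear()
        remaining.discard(v)
    return best_set, best_density

# Toy graph: a 4-clique (density 1.5) plus a pendant node 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
best_set, density = densest_subgraph(edges, nodes=[0, 1, 2, 3, 4])
```

On the toy graph the pendant node is peeled first, leaving the 4-clique as the densest intermediate subgraph.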

【Keywords】:

193. Mining data from mobile devices: a survey of smart sensing and analytics.

Paper Link】 【Pages】:1524

【Authors】: Spiros Papadimitriou ; Tina Eliassi-Rad

【Abstract】: Mobile connected devices, and smartphones in particular, are rapidly emerging as a dominant computing and sensing platform. This poses several unique opportunities for data collection and analysis, as well as new challenges. In this tutorial, we survey the state-of-the-art in mining data from mobile devices across different application areas such as ads, healthcare, geosocial, public policy, etc. Our tutorial has three parts. In part one, we summarize data collection in terms of various sensing modalities. In part two, we present cross-cutting challenges such as real-time analysis and security, and we outline cross-cutting methods for mobile data mining such as network inference, streaming algorithms, etc. In the last part, we specifically overview emerging and fast-growing application areas, as noted above. Concluding, we briefly highlight the opportunities for joint design of new data collection techniques and analysis methods, suggesting additional directions for future research.

【Keywords】:

194. Big data analytics for healthcare.

Paper Link】 【Pages】:1525

【Authors】: Jimeng Sun ; Chandan K. Reddy

【Abstract】: Large amounts of heterogeneous medical data have become available in various healthcare organizations (payers, providers, pharmaceuticals). These data could be an enabling resource for deriving insights to improve care delivery and reduce waste. The enormity and complexity of these datasets present great challenges in analysis and subsequent application to a practical clinical environment. In this tutorial, we introduce the characteristics of big medical data and the related mining challenges. Many of the relevant insights come from the medical informatics community, which is closely related to data mining but focuses on biomedical specifics. We survey related papers from data mining venues as well as medical informatics venues to share with the audience key problems and trends in healthcare analytics research, with applications ranging from clinical text mining, predictive modeling, survival analysis, patient similarity, and genetic data analysis to public health. The tutorial will include several case studies dealing with some of the important healthcare applications.

【Keywords】:

195. Entity resolution for big data.

Paper Link】 【Pages】:1527

【Authors】: Lise Getoor ; Ashwin Machanavajjhala

【Abstract】: Entity resolution (ER), the problem of extracting, matching and resolving entity mentions in structured and unstructured data, is a long-standing challenge in database management, information retrieval, machine learning, natural language processing and statistics. Accurate and fast entity resolution has huge practical implications in a wide variety of commercial, scientific and security domains. Despite the long history of work on entity resolution, there is still a surprising diversity of approaches and a lack of guiding theory. Meanwhile, in the age of big data, the need for high-quality entity resolution is growing, as we are inundated with more and more data, all of which needs to be integrated, aligned and matched before further utility can be extracted. In this tutorial, we bring together perspectives on entity resolution from a variety of fields, including databases, information retrieval, natural language processing and machine learning, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and theoretical underpinnings of ER. We describe existing solutions, current challenges and open research problems. In addition to giving attendees a thorough understanding of existing ER models, algorithms and evaluation methods, the tutorial will cover important research topics such as scalable ER, active and lightly supervised ER, and query-driven ER.

【Keywords】:

196. Network sampling.

Paper Link】 【Pages】:1528

【Authors】: Lise Getoor ; Ashwin Machanavajjhala

【Abstract】: Network data appears in various domains, including the social, communication, and information sciences. Analysis of such data is crucial for making inferences and predictions about these networks and, moreover, for understanding the different processes that drive their evolution. However, a major bottleneck to performing such analysis is the massive size of real-life networks, which makes modeling and analyzing them simply infeasible. Further, many networks, specifically those in the social and communication domains, are not visible to the public due to privacy concerns, and other networks, such as the Web, are only accessible via crawling. To overcome these challenges, researchers overwhelmingly use network sampling as a key statistical approach to select a sub-population of interest that can be studied thoroughly. In this tutorial, we aim to cover a diverse collection of methodologies and applications of network sampling. We will begin with a discussion of the problem setting in terms of objectives (such as sampling a representative subgraph, sampling graphlets, etc.), population of interest (vertices, edges, motifs), and sampling methodologies (such as Metropolis-Hastings, random walk, and snowball sampling). We will then present a number of applications of these methods, and will outline both the resulting opportunities and possible biases of different methods in each application.
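Among the sampling methodologies the abstract lists, random-walk sampling is the simplest to state. The sketch below illustrates a random walk with restarts over an adjacency-list graph; it is only a generic illustration of the idea (the restart probability, seed handling, and names are our own, not from the tutorial).

```python
import random

def random_walk_sample(adj, start, n_steps, restart_p=0.15, seed=0):
    """Sample nodes by a random walk with restarts.

    adj: dict mapping node -> list of neighbour nodes.
    At each step, with probability restart_p (or at a dead end)
    the walk jumps back to the start node; otherwise it moves to
    a uniformly random neighbour. Returns the set of visited nodes.
    """
    rng = random.Random(seed)
    current = start
    visited = {start}
    for _ in range(n_steps):
        if rng.random() < restart_p or not adj.get(current):
            current = start                    # restart at the seed node
        else:
            current = rng.choice(adj[current]) # move to a random neighbour
        visited.add(current)
    return visited

# Toy graph: a path 0-1-2-3 with a chord between 1 and 3.
graph = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
nodes = random_walk_sample(graph, start=0, n_steps=50)
```

Note that a plain random walk is biased toward high-degree nodes; Metropolis-Hastings sampling, also mentioned in the abstract, corrects this by rejecting some moves to recover a uniform node distribution.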

【Keywords】:

197. The dataminer's guide to scalable mixed-membership and nonparametric bayesian models.

Paper Link】 【Pages】:1529

【Authors】: Amr Ahmed ; Alexander J. Smola

【Abstract】: Large amounts of data arise in a multitude of situations, ranging from bioinformatics to astronomy, manufacturing, and medical applications. For concreteness our tutorial focuses on data obtained in the context of the internet, such as user-generated content (microblogs, e-mails, messages), behavioral data (locations, interactions, clicks, queries), and graphs. Due to its magnitude, much of the challenge is to extract structure and interpretable models without the need for additional labels, i.e. to design effective unsupervised techniques. We present design patterns for hierarchical nonparametric Bayesian models, efficient inference algorithms, and modeling tools to describe salient aspects of the data.
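A common building block of the nonparametric Bayesian models this tutorial covers is the Chinese restaurant process, which lets the number of clusters grow with the data. The sketch below is a minimal illustrative simulation of that prior, not code from the tutorial; the function name and parameters are our own.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Draw one seating arrangement from a CRP with concentration alpha.

    Customer i joins an existing table with probability proportional to
    the table's size, or opens a new table with probability proportional
    to alpha. Returns (per-customer table assignment, table sizes).
    """
    rng = random.Random(seed)
    tables = []        # tables[k] = number of customers at table k
    assignment = []
    for i in range(n_customers):
        r = rng.random() * (i + alpha)
        cum = 0.0
        for k, size in enumerate(tables):
            cum += size
            if r < cum:
                tables[k] += 1          # join existing table k
                assignment.append(k)
                break
        else:
            tables.append(1)            # open a new table
            assignment.append(len(tables) - 1)
    return assignment, tables

assignment, tables = chinese_restaurant_process(100, alpha=1.0)
```

The "rich get richer" dynamic (larger tables attract more customers) is what yields the power-law cluster sizes often seen in the internet data the abstract describes.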

【Keywords】: